What would you like to be added?
When running distributed training with TFJob, there are multiple worker pods. Some workers have already completed successfully while others are still running. To save compute resources we want to drain and remove nodes, but evicting a node can delete worker pods that have already succeeded. When that happens, new worker pods are restarted in their place, which can break the job. How can we handle this situation?
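For context, here is a minimal sketch of how the situation can be observed with the Kubernetes Python client. The label keys (`training.kubeflow.org/job-name`, `training.kubeflow.org/replica-type`) follow the training-operator's usual pod labels, and the job and namespace names are placeholders; all of these are assumptions that may differ in your cluster.

```python
# Sketch: list a TFJob's worker pods and group them by phase, to see which
# workers have already Succeeded and which are still Running.
# Assumes the training-operator labels pods with training.kubeflow.org/*;
# adjust the selector, job name, and namespace for your setup.
from collections import defaultdict
from kubernetes import client, config

JOB_NAME = "my-tfjob"   # hypothetical TFJob name
NAMESPACE = "default"   # hypothetical namespace

def worker_pods_by_phase():
    config.load_kube_config()
    core = client.CoreV1Api()
    selector = (
        f"training.kubeflow.org/job-name={JOB_NAME},"
        "training.kubeflow.org/replica-type=worker"
    )
    pods = core.list_namespaced_pod(NAMESPACE, label_selector=selector).items
    by_phase = defaultdict(list)
    for pod in pods:
        by_phase[pod.status.phase].append(pod.metadata.name)
    return by_phase

if __name__ == "__main__":
    for phase, names in worker_pods_by_phase().items():
        print(phase, names)
```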
Why is this needed?
Handling successful worker pods during node eviction in distributed training: we want to be able to delete the succeeded worker pods without affecting the overall execution of the job.
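One possible mitigation, assuming the nodes are being removed by the cluster autoscaler: annotate the still-running worker pods with `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` so scale-down only touches nodes whose workers have already finished. This is a sketch of a workaround under those assumptions, not a fix inside the training-operator itself; the job name, namespace, and label keys are the same assumptions as in the sketch above.

```python
# Sketch of a workaround: keep the cluster autoscaler away from nodes that
# still host Running workers, so only nodes with finished workers are removed.
# Assumes autoscaler-driven scale-down and training.kubeflow.org/* pod labels.
from kubernetes import client, config

JOB_NAME = "my-tfjob"   # hypothetical TFJob name
NAMESPACE = "default"   # hypothetical namespace

def protect_running_workers():
    config.load_kube_config()
    core = client.CoreV1Api()
    selector = (
        f"training.kubeflow.org/job-name={JOB_NAME},"
        "training.kubeflow.org/replica-type=worker"
    )
    for pod in core.list_namespaced_pod(NAMESPACE, label_selector=selector).items:
        if pod.status.phase == "Running":
            patch = {"metadata": {"annotations": {
                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
            }}}
            core.patch_namespaced_pod(pod.metadata.name, NAMESPACE, patch)
            print(f"marked {pod.metadata.name} as not safe to evict")

if __name__ == "__main__":
    protect_running_workers()
```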
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.