
Managing Pod Lifecycle in Distributed Training with TFJob #2454

Open
mnmhouse opened this issue Feb 27, 2025 · 0 comments
What would you like to be added?

When running distributed training with TFJob, multiple worker pods run in parallel. Some workers finish successfully while others are still in progress. To save compute resources we want to evict and reclaim nodes, but draining a node can delete worker pods that have already succeeded. When that happens, new worker pods are restarted in their place, which can cause anomalies in the training job. How can we handle this situation?
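For reference, here is a minimal TFJob manifest sketch showing where worker-pod lifecycle behavior is configured today (the job name, image, and resource values are placeholders). The `runPolicy.cleanPodPolicy` and per-replica `restartPolicy` fields control cleanup after job completion and restarts on failure, but neither covers deleting an already-succeeded worker mid-job without the controller recreating it, which is the case described above.

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-train-example        # placeholder job name
spec:
  runPolicy:
    # On job completion, delete only pods that are still running;
    # pods that ended in Succeeded are left in place.
    cleanPodPolicy: Running
    # Optionally garbage-collect all job resources after completion.
    ttlSecondsAfterFinished: 600
  tfReplicaSpecs:
    Worker:
      replicas: 4
      # OnFailure restarts only failed workers; a worker that exits
      # successfully is counted as complete and not restarted.
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-registry/dist-train:latest   # placeholder image
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi
```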

Why is this needed?

Handling successful worker pods during node eviction in distributed training: we want to be able to delete pods that have already succeeded without affecting the overall execution of the job.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
