
Managing Pod Lifecycle in Distributed Training with TFJob #2454

Open
mnmhouse opened this issue Feb 27, 2025 · 0 comments
What would you like to be added?

When running distributed training with TFJob, multiple worker pods run in parallel. Some workers finish successfully while others are still in progress. To save compute resources we want to evict and reclaim nodes, but draining a node can delete worker pods that have already succeeded. When that happens, new worker pods are restarted in their place, which can cause anomalies in the training job. How can we handle this situation?
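For reference, here is a minimal TFJob manifest sketch showing where worker-pod lifecycle behavior is configured today (the job name, image, and resource values are placeholders). The `runPolicy.cleanPodPolicy` and per-replica `restartPolicy` fields control cleanup after job completion and restarts on failure, but neither covers deleting an already-succeeded worker mid-job without the controller recreating it, which is the case described above.

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-train-example        # placeholder job name
spec:
  runPolicy:
    # On job completion, delete only pods that are still running;
    # pods that ended in Succeeded are left in place.
    cleanPodPolicy: Running
    # Optionally garbage-collect all job resources after completion.
    ttlSecondsAfterFinished: 600
  tfReplicaSpecs:
    Worker:
      replicas: 4
      # OnFailure restarts only failed workers; a worker that exits
      # successfully is counted as complete and not restarted.
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-registry/dist-train:latest   # placeholder image
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi
```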

Why is this needed?

Handling successful worker pods during node eviction in distributed training: we want to be able to delete pods that have already succeeded without affecting the overall execution of the job.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
