I want to run distributed training in a Kubernetes environment with `kubectl apply -f train.yaml`.
Which version of Kubeflow supports the `torchrun` command for distributed training across multiple pods, with multiple GPUs in each pod?
Please provide a working example, including sample code and YAML files, with a focus on how to write the YAML file.
Thank you very much!
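For reference, here is a minimal sketch of what such a manifest could look like with the Kubeflow Training Operator's `PyTorchJob` (API group `kubeflow.org/v1`). The operator injects `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` into each pod, which `torchrun` can consume for rendezvous. The image tag, script path, GPU count, and replica count below are placeholders, and exact `torchrun`/`nprocPerNode` behavior varies by Training Operator version, so please verify against the docs for the release you install:

```yaml
# Sketch only: a 3-node (1 master + 2 workers) PyTorchJob, 4 GPUs per pod.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: torchrun-multinode-example
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch            # container must be named "pytorch"
              image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime  # placeholder image
              command:
                - torchrun
                - --nnodes=3                                # total pods (master + workers)
                - --nproc_per_node=4                        # GPUs per pod
                - --node_rank=$(RANK)                       # injected by the operator
                - --rdzv_endpoint=$(MASTER_ADDR):$(MASTER_PORT)
                - /workspace/train.py                       # placeholder script path
              resources:
                limits:
                  nvidia.com/gpu: 4
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
              command:
                - torchrun
                - --nnodes=3
                - --nproc_per_node=4
                - --node_rank=$(RANK)
                - --rdzv_endpoint=$(MASTER_ADDR):$(MASTER_PORT)
                - /workspace/train.py
              resources:
                limits:
                  nvidia.com/gpu: 4
```

The training script itself would initialize the process group with `torch.distributed.init_process_group(backend="nccl")` and rely on the environment variables that `torchrun` sets per local process.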
githubthunder changed the title from "Distributed training with mutli-pod with multi-gpu in each pod" to "Distributed training with mutliple pods, with multi-gpu in each pod" on Feb 28, 2025.