add xpu monitor for dlrover #1290

Open
majieyue opened this issue Oct 12, 2024 · 2 comments
Labels
Hacktoberfest todo issue or pr with 'todo' will ignore expiration

Comments

@majieyue
Collaborator

Background

DLRover is an elastic deep learning framework that tolerates faults such as process failures and lost Pods. Because LLM training runs at large scale and for long periods, many errors do not crash a process at all; the job simply hangs for a long time. xPU metrics and logs collected during the hang may help detect such errors.

Requirement

We need an xPU metrics monitor that runs either inside the elastic agent or as a DaemonSet on each node. The monitor collects xPU metrics such as utilization, memory usage, temperature, tensor core usage, and internal traffic such as NVLink and PCIe.
Although there are many xPU vendors on the market, we can start with Nvidia...
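
To make the requirement a bit more concrete, here is a minimal sketch of a vendor-neutral sample record and one possible hang heuristic built on it. Everything named here (`XpuSample`, `looks_hung`, the thresholds) is hypothetical and not part of DLRover; the "idle for N consecutive samples" rule is only an assumption about what a hang might look like, and real hangs may show a different pattern.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class XpuSample:
    """One reading from one device (hypothetical schema, not a DLRover type)."""
    device_index: int
    utilization_pct: float   # compute utilization, 0-100
    memory_used_bytes: int
    temperature_c: float


def looks_hung(samples: Sequence[XpuSample],
               idle_pct: float = 1.0,
               min_samples: int = 5) -> bool:
    """Return True if the last `min_samples` readings are all at or below `idle_pct`.

    Assumption: a device that stays idle while the training process is still
    alive is a hint that the job may be hanging. Both thresholds are placeholders.
    """
    if len(samples) < min_samples:
        return False  # not enough history to decide
    recent = samples[-min_samples:]
    return all(s.utilization_pct <= idle_pct for s in recent)


if __name__ == "__main__":
    idle = [XpuSample(0, 0.0, 1 << 30, 40.0) for _ in range(5)]
    busy = idle[:-1] + [XpuSample(0, 97.0, 1 << 30, 65.0)]
    print(looks_hung(idle), looks_hung(busy))
```

A monitor running in the elastic agent or in a DaemonSet could feed periodic samples into a check like this, but how alerts are reported back to the job is left open here.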

@aqwertaqwert

Is there a specific usage document for xpu_timer? Does it depend strongly on the DLRover framework, or can other training frameworks borrow from it?

@majieyue
Collaborator Author

majieyue commented Nov 5, 2024

Hi @aqwertaqwert
xPU here is an umbrella term for the GPGPUs on the market; it does not refer to xpu_timer at all :) We recommend starting with Nvidia GPUs, e.g. adding some code to collect metrics from Nvidia DCGM or PyNVML.
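
As a rough illustration of the PyNVML route, the sketch below reads utilization, memory usage, and temperature for each visible Nvidia GPU. The function name `collect_nvidia_metrics` and the returned dict keys are made up for this example; the `pynvml` calls are the standard NVML bindings. It assumes the `pynvml` module is installed and returns an empty list when the module or the Nvidia driver is missing.

```python
from typing import Dict, List

try:
    import pynvml  # NVML Python bindings; optional dependency
except ImportError:
    pynvml = None


def collect_nvidia_metrics() -> List[Dict[str, float]]:
    """Return one dict per visible GPU, or [] if NVML cannot be used."""
    if pynvml is None:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # driver or NVML library not available on this node
    samples: List[Dict[str, float]] = []
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            temp = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            )
            samples.append({
                "index": index,
                "gpu_util_pct": util.gpu,
                "mem_used_bytes": mem.used,
                "mem_total_bytes": mem.total,
                "temperature_c": temp,
            })
    except pynvml.NVMLError:
        pass  # keep whatever was collected before the failing query
    finally:
        pynvml.nvmlShutdown()
    return samples


if __name__ == "__main__":
    for sample in collect_nvidia_metrics():
        print(sample)
```

Tensor core usage and NVLink/PCIe traffic are not covered by this sketch; those would need additional queries, for example through DCGM.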

@BalaBalaYi BalaBalaYi added the todo issue or pr with 'todo' will ignore expiration label Nov 19, 2024