Read data directly from GPU APIs #87
Comments
This may be the same issue or a separate one: on a multi-card node the usual failure is that a single card goes down, but when that happens nvidia-smi hangs or errors out for all the cards. Going directly to the API might fix that problem too. (For NVIDIA, at least, the problem may also be fixable while staying with nvidia-smi: we can enumerate devices via /dev/nvidia*, then use nvidia-smi -i to probe each card individually; a rough sketch follows. But eight invocations for eight cards is not a situation we'd like to find ourselves in.)
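A hedged sketch of that per-card probing workaround, assuming the cards show up as /dev/nvidia0, /dev/nvidia1, … and that `nvidia-smi -i <index>` with the usual `--query-gpu` flags is acceptable; the specific query fields and error handling are illustrative, not a proposed format:

```c
/* Illustrative only: enumerate /dev/nvidia[0-9]* and probe each card with a
 * separate nvidia-smi invocation, so one wedged card cannot hang the whole
 * query.  Timeout handling is omitted. */
#include <glob.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    glob_t g;
    if (glob("/dev/nvidia[0-9]*", 0, NULL, &g) != 0)
        return 1;                                   /* no NVIDIA device nodes found */
    for (size_t i = 0; i < g.gl_pathc; i++) {
        const char *idx = g.gl_pathv[i] + strlen("/dev/nvidia");
        char cmd[160];
        snprintf(cmd, sizeof cmd,
                 "nvidia-smi -i %s --query-gpu=index,utilization.gpu,memory.used "
                 "--format=csv,noheader", idx);
        if (system(cmd) != 0)
            fprintf(stderr, "card %s did not respond\n", idx);
    }
    globfree(&g);
    return 0;
}
```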
A related issue: currently on ML9, sonar data (and nvidia-smi) report that one card is at 100% utilization and the others are idle, but nvtop says two cards are running at 100%. It would be useful to try to reduce this discrepancy.
Re the output format: on the very new node gpu-13.fox,
ml1: NVIDIA System Management Interface -- v545.23.08
gpu-13.fox: NVIDIA System Management Interface -- v550.54.14
It could look like the sensible thing to do here would be to decode the version number and adapt to it.
Going to fork that off as its own bug, and leave this bug to be about the original subject matter.
As noted here, we would need to build against the NVIDIA library called "nvml" to do this (on NVIDIA). It is poorly documented and part of a larger SDK; it is unclear whether that SDK is needed on every machine or only at build time.
From last week's Slurm conference: Slurm has been using nvml (and something related for AMD) to talk to the GPUs, but they are finding this hard to manage; discrepancies between the build system and the deploy system are problematic. Also, NVIDIA have reportedly been changing the API even after promising not to do so. They are finding that they can get what they need from the /sys filesystem instead, and in Slurm 24.11 the GPU monitoring will be via /sys. We should investigate this for the same reasons; a sketch of what /sys-based sampling might look like follows.
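A minimal sketch of /sys-based sampling, assuming the sysfs files the amdgpu driver exposes (gpu_busy_percent under /sys/class/drm/card*/device/); whether Slurm 24.11 reads exactly these files, and what the NVIDIA-side equivalent would be, is not established here:

```c
/* Illustrative only: read the amdgpu driver's gpu_busy_percent for each DRM
 * card node that exposes it.  NVIDIA's proprietary driver does not publish
 * an equivalent file, so this covers the AMD side only. */
#include <stdio.h>

int main(void) {
    for (int card = 0; card < 64; card++) {
        char path[128];
        snprintf(path, sizeof path,
                 "/sys/class/drm/card%d/device/gpu_busy_percent", card);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                /* no such card, or not an amdgpu device */
        int busy = -1;
        if (fscanf(f, "%d", &busy) == 1)
            printf("card%d gpu_busy_percent=%d\n", card, busy);
        fclose(f);
    }
    return 0;
}
```

Reading plain files needs no extra libraries at build or deploy time, which is exactly the manageability argument reported from the conference.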
Related to #86. Currently we run nvidia-smi and rocm-smi to obtain GPU data; this is bad for several reasons.
It would probably be much better to use the cards' programmatic APIs directly.
On the other hand, needing to link against these C libraries adds to the complexity of sonar and creates a situation where the same sonar binary may not be usable on all systems. A compromise would therefore be to create small (probably C) programs that wrap the programmatic APIs and are invoked from sonar. These would need to be run once, and would have a defined and compact output format; a minimal sketch follows.
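A minimal sketch of such a wrapper for the NVIDIA side, using NVML and linked with -lnvidia-ml; the key=value output format is invented here purely for illustration and is not a proposed sonar format:

```c
/* Illustrative only: query every GPU once via NVML and print one compact
 * line per card.  A dead card yields an error line instead of hanging the
 * whole report. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlReturn_t rc = nvmlInit_v2();
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit: %s\n", nvmlErrorString(rc));
        return 1;
    }
    unsigned int n = 0;
    nvmlDeviceGetCount_v2(&n);
    for (unsigned int i = 0; i < n; i++) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex_v2(i, &dev) != NVML_SUCCESS) {
            printf("gpu=%u error=no-response\n", i);
            continue;
        }
        char name[NVML_DEVICE_NAME_BUFFER_SIZE] = "unknown";
        nvmlUtilization_t util = {0};
        nvmlMemory_t mem = {0};
        nvmlDeviceGetName(dev, name, sizeof name);
        nvmlDeviceGetUtilizationRates(dev, &util);
        nvmlDeviceGetMemoryInfo(dev, &mem);
        printf("gpu=%u name=\"%s\" util%%=%u mem_used_kib=%llu mem_total_kib=%llu\n",
               i, name, util.gpu, mem.used / 1024, mem.total / 1024);
    }
    nvmlShutdown();
    return 0;
}
```

An equivalent wrapper for AMD could sit on top of the ROCm SMI library, keeping the same one-line-per-card output so sonar's parser stays trivial.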