When I run a Fortran code on, for example, 16 GPUs, the traces show the GPU sampling values: GPU Busy, GPU Memory, Power, etc. However, when I run the same code on 512 MI250X GPUs, those metrics disappear from the trace and I only get the HIP streams/queues. I use the same cfg file and the same run command; the only thing I change is the number of GPUs.
Operating System
SLES 15-SP5
CPU
AMD EPYC 7A53 64-Core Processor
GPU
AMD Instinct MI250X
ROCm Version
ROCm 6.0.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
In the cfg file I have:
OMNITRACE_USE_PROCESS_SAMPLING = true
OMNITRACE_USE_ROCM_SMI = true
OMNITRACE_SAMPLING_CPUS = none
OMNITRACE_SAMPLING_GPUS = all
I have also tried setting OMNITRACE_SAMPLING_GPUS to a few specific GPUs instead of all, just in case, but that didn't work either.
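To rule out a configuration difference between the 16-GPU and 512-GPU runs, the settings above can be checked with a small script. This is only a sketch: `parse_cfg` and `mismatches` are hypothetical helpers, not part of Omnitrace, and the parser assumes the plain `KEY = VALUE` layout shown above, with `#` treated as a comment marker.

```python
# Hypothetical helper (not part of Omnitrace): check that a cfg file
# contains the sampling settings listed in this report.

EXPECTED = {
    "OMNITRACE_USE_PROCESS_SAMPLING": "true",
    "OMNITRACE_USE_ROCM_SMI": "true",
    "OMNITRACE_SAMPLING_CPUS": "none",
    "OMNITRACE_SAMPLING_GPUS": "all",
}


def parse_cfg(text):
    """Parse KEY = VALUE lines; blank lines and '#' comments are skipped (assumed format)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def mismatches(settings, expected=EXPECTED):
    """Return {key: (expected, found)} for every setting that differs or is missing."""
    return {
        key: (want, settings.get(key))
        for key, want in expected.items()
        if settings.get(key) != want
    }


SAMPLE = """
OMNITRACE_USE_PROCESS_SAMPLING = true
OMNITRACE_USE_ROCM_SMI = true
OMNITRACE_SAMPLING_CPUS = none
OMNITRACE_SAMPLING_GPUS = all
"""

if __name__ == "__main__":
    # An empty dict means the file matches the settings in this report.
    print(mismatches(parse_cfg(SAMPLE)))
```

Running the same check against the cfg file that each job actually reads would confirm that both runs start from identical sampling settings, which leaves the GPU count as the only variable.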
xaguilar changed the title from "[Issue]: Not seeing GPU sampling values (busy, power, temp, etc.) with high node counts" to "[Issue]: Not seeing GPU sampling values (busy, power, temp, etc.) with high GPU counts" on Dec 20, 2024