[Issue]: Not seeing GPU sampling values (busy, power, temp, etc.) with high GPU counts #428

Open
xaguilar opened this issue Dec 20, 2024 · 1 comment

xaguilar commented Dec 20, 2024

Problem Description

When I run a Fortran code on 16 GPUs, the traces show the GPU sampling values: GPU Busy, GPU Memory, Power, etc. However, when I run the same code on 512 MI250X GPUs, those metrics disappear from the trace and I only get the HIP streams/queues. I use the same cfg file and the same run command; the only thing I change is the number of GPUs.

Operating System

SLES 15-SP5

CPU

AMD EPYC 7A53 64-Core Processor

GPU

AMD Instinct MI250X

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

In the cfg file I have:
OMNITRACE_USE_PROCESS_SAMPLING = true
OMNITRACE_USE_ROCM_SMI = true
OMNITRACE_SAMPLING_CPUS = none
OMNITRACE_SAMPLING_GPUS = all

I have also tried restricting the sampled GPUs to a few instead of all, just in case, but that didn't work either.
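
For anyone trying to reproduce this, here is a minimal sketch of a sanity check that the sampling settings really are identical between the 16-GPU and 512-GPU runs. It assumes the cfg file uses plain `KEY = VALUE` lines with optional `#` comments; `parse_cfg` and `diff_settings` are hypothetical helper names for illustration, not part of Omnitrace.

```python
def parse_cfg(text):
    """Parse KEY = VALUE lines into a dict, skipping blanks and '#' comments.

    Assumption: the cfg file is a flat list of KEY = VALUE pairs.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# The four settings quoted in this issue.
CFG_TEXT = """\
OMNITRACE_USE_PROCESS_SAMPLING = true
OMNITRACE_USE_ROCM_SMI = true
OMNITRACE_SAMPLING_CPUS = none
OMNITRACE_SAMPLING_GPUS = all
"""

small_run = parse_cfg(CFG_TEXT)  # stand-in for the cfg used on 16 GPUs
large_run = parse_cfg(CFG_TEXT)  # stand-in for the cfg used on 512 GPUs
print(diff_settings(small_run, large_run))
```

In practice the two inputs would be read from the cfg files actually used by each job; an empty result means the configuration is not what differs between the runs.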

xaguilar changed the title from "[Issue]: Not seeing GPU sampling values (busy, power, temp, etc.) with high node counts" to "[Issue]: Not seeing GPU sampling values (busy, power, temp, etc.) with high GPU counts" on Dec 20, 2024
@ppanchad-amd

Hi @xaguilar. An internal ticket has been created to investigate your issue. Thanks!
