torch.cuda.is_available() aborts after module loading omnitrace #336

Open
R0n12 opened this issue Apr 10, 2024 · 2 comments

Comments

R0n12 commented Apr 10, 2024

Before loading omnitrace:

(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.hip
'5.6.31061-8c743ae5d'
>>> torch.cuda.is_available()
True

After loading omnitrace/1.10.4:

(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> module load omnitrace/1.10.4
Using ROCm installation: /opt/rocm-5.6.0
(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> module list

Currently Loaded Modules:
  1) craype-x86-trento                       7) cce/15.0.0             13) darshan-runtime/3.4.0
  2) libfabric/1.15.2.0                      8) craype/2.7.19          14) hsi/default
  3) craype-network-ofi                      9) cray-dsmml/0.2.2       15) DefApps/default
  4) perftools-base/22.12.0                 10) cray-mpich/8.1.23      16) tmux/3.2a
  5) xpmem/2.6.2-2.5_2.22__gd067c3f.shasta  11) cray-libsci/22.12.1.1  17) rocm/5.6.0
  6) cray-pmi/6.1.8                         12) PrgEnv-cray/8.3.3      18) omnitrace/1.10.4

(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.hip
'5.6.31061-8c743ae5d'
>>> torch.cuda.is_available()
Aborted

PyTorch version: 2.1.2+rocm5.6.0
Omnitrace: 1.10.4

Is there something that needs to be checked first?

I have attached my rocminfo output. Note that since MI250X does not support the get_power_avg() function, it is reported as an unsupported feature; however, outputting functions.json still hangs at the end.
rocminfo.log

Thanks in advance!

@jrmadsen (Collaborator) commented:

> I have attached my rocminfo output. Note that since MI250X does not support the get_power_avg() function, it is reported as an unsupported feature; however, outputting functions.json still hangs at the end.
> rocminfo.log

This was fixed in #331 and included in the v1.11.1 release.

However, I don’t think this is related to your problem whatsoever. Could you do a module show for that omnitrace module, and maybe compare the env before/after loading it? I’m thinking something is being changed with regard to LD_LIBRARY_PATH and PYTHONPATH when that module gets loaded.
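
For reference, a minimal sketch of that comparison (module name taken from the session above; assumes a fresh shell where omnitrace is not yet loaded):

module show omnitrace/1.10.4          # inspect what the module sets
env | sort > env_before.txt           # snapshot the environment first
module load omnitrace/1.10.4
env | sort > env_after.txt            # snapshot again after loading
diff env_before.txt env_after.txt     # look for LD_LIBRARY_PATH / PYTHONPATH changes

If the diff shows omnitrace paths being prepended to LD_LIBRARY_PATH or PYTHONPATH, that could explain PyTorch picking up different ROCm/HIP libraries at import time.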

@ppanchad-amd commented:

Hi @R0n12, do you still need assistance with this ticket? If not, please close it. Thanks!
