cupy wheel and torch wheel link to different NCCL shared libraries in RAPIDS CI containers #4465
To reproduce:
---

I feel something is wrong in your container in a way that I haven't fully understood. CuPy lazy-loads all CUDA libraries, so a plain `import cupy` should not pull in libnccl by itself. The only CuPy module that links to libnccl is the `nccl` extension module:

```
root@marie:/# for f in $(find / -type f,l -regex '/pyenv/**/.*.so'); do readelf -d $f | grep "nccl.so"; if [[ $? -eq 0 ]]; then echo $f; fi; done
0x0000000000000001 (NEEDED) Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so
0x0000000000000001 (NEEDED) Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy_backends/cuda/libs/nccl.cpython-39-x86_64-linux-gnu.so
```

Now, if you monitor the loaded DSOs you'll see that this module is not even loaded during a plain `import cupy`, yet the system libnccl still gets pulled in:

```
root@marie:/# LD_DEBUG=libs python -c "import cupy" 2>&1 | grep nccl
6505: find library=libnccl.so.2.16.2 [0]; searching
6505: trying file=/pyenv/versions/3.9.19/lib/libnccl.so.2.16.2
6505: trying file=/lib/x86_64-linux-gnu/libnccl.so.2.16.2
6505: calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
6505: calling fini: /lib/x86_64-linux-gnu/libnccl.so.2.16.2 [0]
```

What's worse, when the import order is swapped, two distinct copies of libnccl are loaded: one from the system (as shown above) and the other from the nccl wheel:

```
root@marie:/# LD_DEBUG=libs python -c "import torch; import cupy" 2>&1 | grep "calling init:.*nccl"
6787: calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
6787: calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
```

So I am not sure what I'm looking at 🤷 Question for @tingyu66: If this container is owned/controlled by RAPIDS, can't you just remove the system NCCL?
---

@leofang I just took another look and think I found the culprit. For wheel builds, CuPy records the exact NCCL version it was built against (2.16.2 here) in its preload configuration. That explains why the runtime linker was trying to find exactly version 2.16.2 during import and was not satisfied by any other libnccl, even when RPATH and LD_LIBRARY_PATH are tweaked. After changing that, we finally have one nccl loaded. 🙃
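For anyone checking their own container, the sketch below is one way to look for the wheel's preload metadata and see which NCCL version it pins; the metadata file name and JSON layout are assumptions here and may differ across CuPy versions.

```bash
# Sketch: search CuPy's install directory for wheel preload metadata and print
# any NCCL entry it contains. File names and keys are assumptions; adjust for
# your CuPy version if nothing is found.
python - <<'EOF'
import json, os
import cupy

pkg_dir = os.path.dirname(cupy.__file__)
for root, _dirs, files in os.walk(pkg_dir):
    for name in files:
        if name.endswith(".json") and "wheel" in name.lower():
            path = os.path.join(root, name)
            with open(path) as f:
                data = json.load(f)
            print(path)
            # Print the NCCL section if present, otherwise the whole file.
            if isinstance(data, dict):
                data = data.get("nccl", data)
            print(json.dumps(data, indent=2))
EOF
```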
---

Ah, good find, I forgot there's the preload logic... Could you file a bug in CuPy's issue tracker? I think the preload logic needs to happen as part of the lazy loading, not before it. Now you have two ways to hack around this in your CI workflow :D
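One loader-level possibility (a sketch only, not a CuPy API, and not necessarily one of the two ways meant above): give the wheel's newer libnccl an extra name that satisfies the exact-version lookup seen in the LD_DEBUG output. The paths below are illustrative, taken from the container layout shown earlier.

```bash
# Sketch: make the lookup for libnccl.so.2.16.2 resolve to the nvidia-nccl
# wheel's copy instead of the system one. The glibc loader recognizes the same
# file loaded under two names, so only one NCCL ends up mapped.
SITE=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
mkdir -p /opt/nccl-shim
ln -sf "$SITE/nvidia/nccl/lib/libnccl.so.2" /opt/nccl-shim/libnccl.so.2.16.2
export LD_LIBRARY_PATH=/opt/nccl-shim:$LD_LIBRARY_PATH
python -c "import cupy; import torch"
```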
---

Update: The fix from CuPy is expected to be released in version 13.2.0 sometime this week.
---

We have experienced the same issue several times in CI wheel-tests workflows when using `cupy` and `torch>=2.2` together.

The root cause is that `cupy` points to the built-in `libnccl.so` (2.16.2) in the container, while `pytorch` links to `libnccl.so` (2.20.5) from the `nvidia-nccl-cu11` wheel. The older NCCL version is often incompatible with the latest PyTorch releases, which causes problems when coupling with other PyTorch-derived libraries. When cupy is imported before torch, the older nccl on the system path shadows the version needed by pytorch, resulting in the undefined symbol error mentioned above.

One less-than-ideal solution is to always `import torch` first, but this approach is rather error-prone for users. We'd like to hear your suggestions, @leofang, as a core cupy dev, on potential workarounds. For example, is there a way to modify environment variables so that CuPy loads NCCL from a non-system path? Thank you!

CC: @alexbarghi-nv @VibhuJawa @naimnv