cupy wheel and torch wheel link to different NCCL shared libraries in RAPIDS CI containers #4465

Open
tingyu66 opened this issue Jun 5, 2024 · 5 comments

tingyu66 (Member) commented Jun 5, 2024

We have experienced the same issue several times in CI wheel-tests workflows when using cupy and torch>=2.2 together:

    torch = import_optional("torch")
/pyenv/versions/3.9.19/lib/python3.9/site-packages/cugraph/utilities/utils.py:455: in import_optional
    return importlib.import_module(mod)
/pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/__init__.py:237: in <module>
    from torch._C import *  # noqa: F403
E   ImportError: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister

The root cause is that cupy points to the built-in libnccl.so (2.16.2) in the container, while torch links to libnccl.so (2.20.5) from the nvidia-nccl-cu11 wheel. The older NCCL version is often incompatible with the latest PyTorch releases, which causes problems when coupling with other PyTorch-derived libraries. When cupy is imported before torch, the older NCCL on the system path shadows the version PyTorch needs, resulting in the undefined-symbol error above.
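For reference, here is one quick, hedged way (not part of the original report) to see which libnccl each import actually pulls into the process; it just greps the process's own memory map, so it assumes a Linux container:

python -c "import cupy; print(sorted({l.split()[-1] for l in open('/proc/self/maps') if 'libnccl' in l}))"
python -c "import torch; print(sorted({l.split()[-1] for l in open('/proc/self/maps') if 'libnccl' in l}))"

On this container, the first command should show the system copy and the second the copy bundled in the nvidia-nccl-cu11 wheel.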

One less-than-ideal solution is to always import torch first, but that approach is rather error-prone for users. We'd like to hear your suggestions, @leofang, as a core CuPy dev, on potential workarounds. For example, is there a way to modify environment variables so that CuPy loads NCCL from a non-system path? Thank you!

CC: @alexbarghi-nv @VibhuJawa @naimnv

alexbarghi-nv self-assigned this Jun 5, 2024
alexbarghi-nv added the bug (Something isn't working) label Jun 5, 2024
tingyu66 (Member, Author) commented Jun 5, 2024

To reproduce:
docker run --gpus all --rm -it --network=host rapidsai/citestwheel:cuda11.8.0-ubuntu20.04-py3.9 bash
pip install cupy-cuda11x
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu118
python -c "import cupy; import torch"

leofang (Member) commented Jun 6, 2024

I feel something is wrong in your container in a way that I haven't fully understood. CuPy lazy-loads all CUDA libraries so import cupy does not trigger the loading of libnccl. Something else does (but I can't tell why).

The only CuPy module that links to libnccl is cupy_backends.cuda.libs.nccl, which can be confirmed as follows:

root@marie:/# for f in $(find / -type f,l -regex '/pyenv/**/.*.so'); do readelf -d $f | grep "nccl.so"; if [[ $? -eq 0 ]]; then echo $f; fi; done
 0x0000000000000001 (NEEDED)             Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so
 0x0000000000000001 (NEEDED)             Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy_backends/cuda/libs/nccl.cpython-39-x86_64-linux-gnu.so

Now, if you monitor the loaded DSOs, you'll see that this module (nccl.cpython-39-x86_64-linux-gnu.so) is in fact not loaded (by design), yet libnccl still gets loaded:

root@marie:/# LD_DEBUG=libs python -c "import cupy" 2>&1 | grep nccl
      6505:	find library=libnccl.so.2.16.2 [0]; searching
      6505:	  trying file=/pyenv/versions/3.9.19/lib/libnccl.so.2.16.2
      6505:	  trying file=/lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling fini: /lib/x86_64-linux-gnu/libnccl.so.2.16.2 [0]

What's worse, when the import order is swapped, two distinct copies of libnccl are loaded, one from the system (as shown above) and the other from the nccl wheel:

root@marie:/# LD_DEBUG=libs python -c "import torch; import cupy" 2>&1 | grep "calling init:.*nccl"
      6787:	calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
      6787:	calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2

So I am not sure what I'm looking at 🤷

Question for @tingyu66: If this container is owned/controlled by RAPIDS, can't you just remove the system NCCL?
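In case it helps, a minimal sketch of that removal, assuming the system copy is the one at the path shown in the LD_DEBUG trace above and that nothing else in the image relies on it:

# drop the container's own NCCL (and its soname symlinks, if any) so only the wheel copy can be found
rm -f /lib/x86_64-linux-gnu/libnccl.so*
# refresh the dynamic loader cache
ldconfig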

tingyu66 (Member, Author) commented Jun 6, 2024

@leofang I just took another look and think I found the culprit.
https://github.com/cupy/cupy/blob/a54b7abfed668e52de7f3eee7b3fe8ccaef34874/cupy/_environment.py#L270-L274

For wheel builds, cupy._environment preloads the specific CUDA library versions defined in the .data/_wheel.json file.

root@1cc5aab-lcedt:~# cat /pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy/.data/_wheel.json

{"cuda": "11.x", "packaging": "pip", "cutensor": {"version": "2.0.1", "filenames": ["libcutensor.so.2.0.1"]}, "nccl": {"version": "2.16.2", "filenames": ["libnccl.so.2.16.2"]}, "cudnn": {"version": "8.8.1", "filenames": ["libcudnn.so.8.8.1", "libcudnn_ops_infer.so.8.8.1", "libcudnn_ops_train.so.8.8.1", "libcudnn_cnn_infer.so.8.8.1", "libcudnn_cnn_train.so.8.8.1", "libcudnn_adv_infer.so.8.8.1", "libcudnn_adv_train.so.8.8.1"]}}
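Side note, not from the original comment: the same file can be located from Python without hard-coding the pyenv path, e.g.

python -c "import cupy, json, os; p = os.path.join(os.path.dirname(cupy.__file__), '.data', '_wheel.json'); print(json.dumps(json.load(open(p)), indent=2))"

which pretty-prints the same JSON shown above.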

That explains why the loader was asked for the exact filename libnccl.so.2.16.2 during import and was not satisfied with any other libnccl, even when RPATH and LD_LIBRARY_PATH were tweaked.

root@marie:/# LD_DEBUG=libs python -c "import cupy" 2>&1 | grep nccl
      6505:	find library=libnccl.so.2.16.2 [0]; searching
      6505:	  trying file=/pyenv/versions/3.9.19/lib/libnccl.so.2.16.2
      6505:	  trying file=/lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling fini: /lib/x86_64-linux-gnu/libnccl.so.2.16.2 [0]

After changing "nccl": {"version": "2.16.2", "filenames": ["libnccl.so.2.16.2"]} to {"version": "2.20.5", "filenames": ["libnccl.so.2"]} in _wheel.json to match PyTorch's requirement and updating LD_LIBRARY_PATH:

root@1cc5aab-lcedt:~# LD_DEBUG=libs python -c "import cupy; import torch" 2>&1 | grep "calling init:.*nccl"
      4041:	calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
root@1cc5aab-lcedt:~# LD_DEBUG=libs python -c "import torch; import cupy" 2>&1 | grep "calling init:.*nccl"
      4182:	calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2

we finally have a single NCCL loaded. 🙃
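For anyone who wants to replicate this workaround before an upstream fix lands, here is a rough sketch of the edit described above; the paths and the 2.20.5 version string are specific to this container and should be treated as assumptions:

python - <<'EOF'
# hedged sketch: rewrite the "nccl" entry of cupy's _wheel.json so the preload
# logic asks for the unversioned soname instead of the exact 2.16.2 file
import cupy, json, os
path = os.path.join(os.path.dirname(cupy.__file__), ".data", "_wheel.json")
with open(path) as f:
    cfg = json.load(f)
cfg["nccl"] = {"version": "2.20.5", "filenames": ["libnccl.so.2"]}
with open(path, "w") as f:
    json.dump(cfg, f)
EOF
# make the NCCL shipped in the nvidia-nccl-cu11 wheel visible to the loader
export LD_LIBRARY_PATH=/pyenv/versions/3.9.19/lib/python3.9/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH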

leofang (Member) commented Jun 6, 2024

Ah, good find, I forgot there's the preload logic...

Could you file a bug in CuPy's issue tracker? I think the preload logic needs to happen as part of the lazy loading, not before it.

Now you have two ways to hack in your CI workflow :D

tingyu66 (Member, Author) commented:
Update: The fix from CuPy is expected to be released in version 13.2.0 sometime this week.
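(Once that release is available, the CI images should be able to pick it up with something like pip install -U "cupy-cuda11x>=13.2.0", reusing the cu11 wheel name from the reproduction above.)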
