cupy wheel and torch wheel link to different NCCL shared libraries in RAPIDS CI containers #4465

Open
tingyu66 opened this issue Jun 5, 2024 · 5 comments

tingyu66 (Member) commented Jun 5, 2024

We have experienced the same issue several times in CI wheel-tests workflows when using cupy and torch>=2.2 together:

    torch = import_optional("torch")
/pyenv/versions/3.9.19/lib/python3.9/site-packages/cugraph/utilities/utils.py:455: in import_optional
    return importlib.import_module(mod)
/pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/__init__.py:237: in <module>
    from torch._C import *  # noqa: F403
E   ImportError: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister

The root cause is that cupy points to the built-in libnccl.so (2.16.2) in the container, while torch links to libnccl.so (2.20.5) from the nvidia-nccl-cu11 wheel. The older NCCL version is often incompatible with the latest PyTorch releases, which causes problems when coupling with other PyTorch-derived libraries. When cupy is imported before torch, the older NCCL on the system path shadows the version PyTorch needs, resulting in the undefined-symbol error above.
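For reference, here is one quick, hedged way (not part of the original report) to see which libnccl each import actually pulls into the process; it just greps the process's own memory map, so it assumes a Linux container:

python -c "import cupy; print(sorted({l.split()[-1] for l in open('/proc/self/maps') if 'libnccl' in l}))"
python -c "import torch; print(sorted({l.split()[-1] for l in open('/proc/self/maps') if 'libnccl' in l}))"

On this container, the first command should show the system copy and the second the copy bundled in the nvidia-nccl-cu11 wheel.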

One less-than-ideal solution is to always import torch first, but that approach is rather error-prone for users. We'd like to hear your suggestions, @leofang, as a core CuPy dev, on potential workarounds. For example, is there a way to modify environment variables so that CuPy loads NCCL from a non-system path? Thank you!

CC: @alexbarghi-nv @VibhuJawa @naimnv

alexbarghi-nv self-assigned this Jun 5, 2024
alexbarghi-nv added the bug (Something isn't working) label Jun 5, 2024
tingyu66 (Member, Author) commented Jun 5, 2024

To reproduce:
docker run --gpus all --rm -it --network=host rapidsai/citestwheel:cuda11.8.0-ubuntu20.04-py3.9 bash
pip install cupy-cuda11x
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu118
python -c "import cupy; import torch"

leofang (Member) commented Jun 6, 2024

I feel something is wrong in your container in a way that I haven't fully understood. CuPy lazy-loads all CUDA libraries so import cupy does not trigger the loading of libnccl. Something else does (but I can't tell why).

The only CuPy module that links to libnccl is cupy_backends.cuda.libs.nccl, which can be confirmed as follows:

root@marie:/# for f in $(find / -type f,l -regex '/pyenv/**/.*.so'); do readelf -d $f | grep "nccl.so"; if [[ $? -eq 0 ]]; then echo $f; fi; done
 0x0000000000000001 (NEEDED)             Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so
 0x0000000000000001 (NEEDED)             Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy_backends/cuda/libs/nccl.cpython-39-x86_64-linux-gnu.so

Now, if you monitor the loaded DSOs, you'll see that this module (nccl.cpython-39-x86_64-linux-gnu.so) is in fact not loaded (by design), yet libnccl still gets loaded:

root@marie:/# LD_DEBUG=libs python -c "import cupy" 2>&1 | grep nccl
      6505:	find library=libnccl.so.2.16.2 [0]; searching
      6505:	  trying file=/pyenv/versions/3.9.19/lib/libnccl.so.2.16.2
      6505:	  trying file=/lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling fini: /lib/x86_64-linux-gnu/libnccl.so.2.16.2 [0]

What's worse, when the import order is swapped, two distinct copies of libnccl are loaded, one from the system (as shown above) and the other from the nccl wheel:

root@marie:/# LD_DEBUG=libs python -c "import torch; import cupy" 2>&1 | grep "calling init:.*nccl"
      6787:	calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
      6787:	calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2

So I am not sure what I'm looking at 🤷

Question for @tingyu66: If this container is owned/controlled by RAPIDS, can't you just remove the system NCCL?
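In case it helps, a minimal sketch of that removal, assuming the system copy is the one at the path shown in the LD_DEBUG trace above and that nothing else in the image relies on it:

# drop the container's own NCCL (and its soname symlinks, if any) so only the wheel copy can be found
rm -f /lib/x86_64-linux-gnu/libnccl.so*
# refresh the dynamic loader cache
ldconfig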

tingyu66 (Member, Author) commented Jun 6, 2024

@leofang I just took another look and think I found the culprit.
https://github.com/cupy/cupy/blob/a54b7abfed668e52de7f3eee7b3fe8ccaef34874/cupy/_environment.py#L270-L274

For wheel builds, cupy._environment preloads the specific CUDA library versions defined in the .data/_wheel.json file.

root@1cc5aab-lcedt:~# cat /pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy/.data/_wheel.json

{"cuda": "11.x", "packaging": "pip", "cutensor": {"version": "2.0.1", "filenames": ["libcutensor.so.2.0.1"]}, "nccl": {"version": "2.16.2", "filenames": ["libnccl.so.2.16.2"]}, "cudnn": {"version": "8.8.1", "filenames": ["libcudnn.so.8.8.1", "libcudnn_ops_infer.so.8.8.1", "libcudnn_ops_train.so.8.8.1", "libcudnn_cnn_infer.so.8.8.1", "libcudnn_cnn_train.so.8.8.1", "libcudnn_adv_infer.so.8.8.1", "libcudnn_adv_train.so.8.8.1"]}}
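Side note, not from the original comment: the same file can be located from Python without hard-coding the pyenv path, e.g.

python -c "import cupy, json, os; p = os.path.join(os.path.dirname(cupy.__file__), '.data', '_wheel.json'); print(json.dumps(json.load(open(p)), indent=2))"

which pretty-prints the same JSON shown above.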

That explains why the loader was asked for the exact filename libnccl.so.2.16.2 during import and was not satisfied with any other libnccl, even when RPATH and LD_LIBRARY_PATH were tweaked.

root@marie:/# LD_DEBUG=libs python -c "import cupy" 2>&1 | grep nccl
      6505:	find library=libnccl.so.2.16.2 [0]; searching
      6505:	  trying file=/pyenv/versions/3.9.19/lib/libnccl.so.2.16.2
      6505:	  trying file=/lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling fini: /lib/x86_64-linux-gnu/libnccl.so.2.16.2 [0]

After changing "nccl": {"version": "2.16.2", "filenames": ["libnccl.so.2.16.2"]} to {"version": "2.20.5", "filenames": ["libnccl.so.2"]} in _wheel.json to match PyTorch's requirement and updating LD_LIBRARY_PATH:

root@1cc5aab-lcedt:~# LD_DEBUG=libs python -c "import cupy; import torch" 2>&1 | grep "calling init:.*nccl"
      4041:	calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
root@1cc5aab-lcedt:~# LD_DEBUG=libs python -c "import torch; import cupy" 2>&1 | grep "calling init:.*nccl"
      4182:	calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2

we finally have a single NCCL loaded. 🙃
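For anyone who wants to replicate this workaround before an upstream fix lands, here is a rough sketch of the edit described above; the paths and the 2.20.5 version string are specific to this container and should be treated as assumptions:

python - <<'EOF'
# hedged sketch: rewrite the "nccl" entry of cupy's _wheel.json so the preload
# logic asks for the unversioned soname instead of the exact 2.16.2 file
import cupy, json, os
path = os.path.join(os.path.dirname(cupy.__file__), ".data", "_wheel.json")
with open(path) as f:
    cfg = json.load(f)
cfg["nccl"] = {"version": "2.20.5", "filenames": ["libnccl.so.2"]}
with open(path, "w") as f:
    json.dump(cfg, f)
EOF
# make the NCCL shipped in the nvidia-nccl-cu11 wheel visible to the loader
export LD_LIBRARY_PATH=/pyenv/versions/3.9.19/lib/python3.9/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH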

leofang (Member) commented Jun 6, 2024

Ah, good find, I forgot there's the preload logic...

Could you file a bug in CuPy's issue tracker? I think the preload logic needs to happen as part of the lazy loading, not before it.

Now you have two ways to hack in your CI workflow :D

tingyu66 (Member, Author) commented:
Update: The fix from CuPy is expected to be released in version 13.2.0 sometime this week.
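(Once that release is available, the CI images should be able to pick it up with something like pip install -U "cupy-cuda11x>=13.2.0", reusing the cu11 wheel name from the reproduction above.)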
