[python-package] [CUDA] installing CUDA version with 'pip' hangs, consume all the memory (32 G) and swap space #6824

mw66 · 2025-02-13T00:23:44Z

pip install \
    --no-binary lightgbm \
    --config-settings=cmake.define.USE_CUDA=ON \
    'lightgbm==4.5.0'

This hangs and consumes all the memory (32 GB) and swap space.

No other error message is shown.

top shows hundreds of <my Python virtual environment ...>/ninja --version commands being executed.

I had to kill the install.

mw66 · 2025-02-13T00:47:06Z

Maybe there is an infinite loop in the build script?

A single .../ninja --version command does not use many resources, but there are millions of them.

jameslamb · 2025-02-13T00:50:47Z

Thanks for using LightGBM.

There is no way that `pip install lightgbm` should be generating "hundreds" or "millions" of processes.

Have you confirmed that those processes are directly related to this pip install?

Can you share more details please?

operating system
pip --version
nvidia-smi
are you on a network filesystem?
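
One way to confirm whether those processes really come from this pip install is to walk the process tree below the pip process. The sketch below is only an illustration: it assumes a POSIX-style ps is available, and the names parse_ps, descendants, and snapshot are made up for this example.

```python
import subprocess


def parse_ps(text):
    """Parse lines of 'PID PPID COMMAND' into (pid, ppid, command) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip headers or malformed lines
        command = parts[2] if len(parts) == 3 else ""
        rows.append((int(parts[0]), int(parts[1]), command))
    return rows


def descendants(rows, root_pid):
    """Return every row whose process descends from root_pid."""
    children = {}
    for row in rows:
        children.setdefault(row[1], []).append(row)
    found, seen, stack = [], {root_pid}, [root_pid]
    while stack:
        for row in children.get(stack.pop(), []):
            if row[0] not in seen:
                seen.add(row[0])
                found.append(row)
                stack.append(row[0])
    return found


def snapshot():
    """Take a process snapshot (assumes a POSIX-style ps is on PATH)."""
    cmd = ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "args="]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_ps(result.stdout)
```

With pip_pid set to the PID of the running pip process, sum("ninja" in row[2] for row in descendants(snapshot(), pip_pid)) counts the ninja processes that were started underneath it.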

mw66 · 2025-02-13T02:25:08Z

Thanks for using LightGBM.

There is no way that `pip install lightgbm` should be generating "hundreds" or "millions" of processes.

Have you confirmed that those processes are directly related to this pip install?

I'm pretty sure, it's caused by pip install lightgbm

Can you share more details please?

Python 3.9.1 (default, Dec 11 2020, 14:32:07)

operating system

Ubuntu 22.04.5 LTS \n \l

Linux 5.15.0-126-generic #136-Ubuntu SMP Wed Nov 6 10:38:22 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

pip --version

pip 25.0

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX TITAN X     Off |   00000000:01:00.0 Off |                  N/A |
| 22%   57C    P8             34W /  250W |       2MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX TITAN X     Off |   00000000:06:00.0 Off |                  N/A |
| 22%   33C    P8             16W /  250W |       2MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

are you on a network filesystem?

No.

jameslamb · 2025-02-13T02:38:28Z

I'm pretty sure, it's caused by pip install lightgbm

So you don't see any of those processes immediately after interrupting pip install? Could you share a screenshot of the top output with all those ninja processes?

NVIDIA GeForce GTX TITAN X

I don't think we support that card in LightGBM's default configuration. From https://en.wikipedia.org/wiki/CUDA#GPUs_supported, it looks like that's a Maxwell GPU with CUDA Compute Capability 5.2.

The oldest compute capability LightGBM's build supports is for Pascal (6.x).

LightGBM/CMakeLists.txt

Line 226 in d24260f

set(CUDA_ARCHS "60" "61" "62" "70" "75")

Could you try adding "50" "52" "53" to that list and building the project from source?

Tell me what happens and please share all the logs.

git clone --recursive https://github.com/microsoft/LightGBM.git
cd ./LightGBM
git fetch origin --tags
git checkout v4.5.0
# (manually modify that line I asked you to modify)
cmake -B build -S . -DUSE_CUDA=ON
cmake --build build --target _lightgbm
sh build-python.sh --precompile
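
To make the mismatch concrete, here is a small sketch that checks a compute capability against an architecture list. It assumes only the single CMakeLists.txt line quoted above (the real file may add more architectures depending on the CUDA version), and the helper names are hypothetical. On recent drivers, nvidia-smi --query-gpu=compute_cap --format=csv,noheader should print the capability, but treat that as an assumption about the installed driver.

```python
# Archs copied from the one CMakeLists.txt line quoted above; this is a
# sample for illustration, not the complete list LightGBM may build for.
QUOTED_ARCHS = ["60", "61", "62", "70", "75"]


def to_arch(compute_capability):
    """Convert a compute capability such as "5.2" to an arch string "52"."""
    major, minor = compute_capability.strip().split(".")
    return f"{int(major)}{int(minor)}"


def is_covered(compute_capability, archs=QUOTED_ARCHS):
    """Return True if the GPU's arch appears in the given arch list."""
    return to_arch(compute_capability) in archs


if __name__ == "__main__":
    print(is_covered("5.2"))                                     # GTX TITAN X (Maxwell)
    print(is_covered("5.2", QUOTED_ARCHS + ["50", "52", "53"]))  # after the suggested edit
```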

mw66 · 2025-02-13T03:05:36Z

Tell me what happens and please share all the logs.

cmake -B build -S . -DUSE_CUDA=ON

It errors out:

  #error -- unsupported clang version! clang version must be less than 13 and
  greater than 3.2 .  The nvcc flag '-allow-unsupported-compiler' can be used
  to override this version check; however, using an unsupported host compiler
  may cause compilation failure or incorrect run time execution.  Use at your
  own risk.

while:

$ clang --version
Ubuntu clang version 14.0.0-1ubuntu1.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /bin

Is there a way to set '-allow-unsupported-compiler' on the command line?

jameslamb · 2025-02-14T04:25:24Z

Is there a way to set -allow-unsupported-compiler on the command line?

You can add flags through the CMake variable CMAKE_CUDA_FLAGS (CMake takes its initial value from the CUDAFLAGS environment variable), or just modify this line in your checkout of LightGBM:

LightGBM/CMakeLists.txt

Line 222 in d24260f

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler=${OpenMP_CXX_FLAGS} -Xcompiler=-fPIC -Xcompiler=-Wall")

If it's possible, it would be better to downgrade to an older clang (I guess 12.x) to stay within the range that's known to be compatible with your version of nvcc and targeting Maxwell GPUs. That is what that error message is saying.
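
As an illustration of that range check, the sketch below parses the output of clang --version and compares it with the bounds printed in the error message (greater than 3.2, less than 13). Those bounds belong to the nvcc installed on the reporter's machine and will differ for other CUDA toolkits; the function names are hypothetical.

```python
import re


def clang_version(version_output):
    """Extract (major, minor) from the text printed by `clang --version`."""
    match = re.search(r"clang version (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def within_nvcc_range(version, low=(3, 2), high_major=13):
    """Bounds default to the ones in the error message quoted in this thread."""
    return version > low and version[0] < high_major


if __name__ == "__main__":
    reported = "Ubuntu clang version 14.0.0-1ubuntu1.1\nTarget: x86_64-pc-linux-gnu"
    version = clang_version(reported)
    print(version, within_nvcc_range(version))
```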

mw66 · 2025-02-14T06:07:29Z

I removed the system default ninja:

$ dpkg -l | grep -i ninja                                                                                                                              
ii  ninja-build                                                 1.10.1-1                                                    amd64        small build system closest in spirit to Make       
$ sudo apt purge ninja-build

Now the first problem, the ninja hang, is solved.

I'm still struggling with gcc errors like this one: #5089

Mine: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

What is the best GCC version for compiling LightGBM?

jameslamb · 2025-02-14T06:11:58Z

I'm sorry, but I'm really struggling to understand this report. You said you were compiling with clang, I asked you to downgrade the clang you're using, and in response you're saying "I uninstalled Ninja and am trying to compile with GCC".

But I'm glad to hear that you're no longer seeing Ninja issues (even if I still don't understand what the original problem was that you said was causing "millions" of processes to be spawned).

Please, can you share the full logs as I asked above? Exactly like this report did: #5089 (comment)

mw66 · 2025-02-14T06:19:37Z

Sorry for the confusion. I'm trying two things:

1. pip install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON 'lightgbm==4.5.0'

After I removed ninja-build 1.10.1-1, pip no longer hangs.

But I got the GCC error from #5089.

2. Build from git clone.

I updated CMakeLists.txt:

$ grep -n unsupported CMakeLists.txt
220:    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler=${OpenMP_CXX_FLAGS} -Xcompiler=-fPIC -Xcompiler=-Wall -allow-unsupported-compiler -I/usr/include/c++/11")

However, that flag does not seem to be picked up; I'm still seeing this error:

/usr/include/crt/host_config.h:147:2: error: -- unsupported clang version!
  clang version must be less than 13 and greater than 3.2 .  The nvcc flag
  '-allow-unsupported-compiler' can be used to override this version check;
  however, using an unsupported host compiler may cause compilation failure
  or incorrect run time execution.  Use at your own risk.
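
An alternative that was not tried in this thread is to hand the flag to CMake at configure time instead of editing CMakeLists.txt. CMake initializes CMAKE_CUDA_FLAGS from the CUDAFLAGS environment variable on the first configure, and the variable can also be set with -D. The sketch below only builds the command and environment; whether it clears this particular error is an untested assumption, the helper names are hypothetical, and the optional CMAKE_CUDA_HOST_COMPILER argument is shown only as another knob one might try.

```python
import os


def configure_command(cuda_flags, host_compiler=None):
    """Build a cmake configure command that passes extra flags to nvcc."""
    cmd = [
        "cmake", "-B", "build", "-S", ".", "-DUSE_CUDA=ON",
        f"-DCMAKE_CUDA_FLAGS={cuda_flags}",
    ]
    if host_compiler is not None:
        # Optional: ask nvcc to use a different host compiler than the default.
        cmd.append(f"-DCMAKE_CUDA_HOST_COMPILER={host_compiler}")
    return cmd


def env_with_cudaflags(cuda_flags, base_env=None):
    """Copy the environment and append cuda_flags to CUDAFLAGS."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDAFLAGS"] = " ".join(filter(None, [env.get("CUDAFLAGS", ""), cuda_flags]))
    return env


if __name__ == "__main__":
    flag = "-allow-unsupported-compiler"
    print(" ".join(configure_command(flag)))
    print(env_with_cudaflags(flag, {})["CUDAFLAGS"])
```

Because CMake caches compiler settings, it is probably worth deleting the existing build directory before re-running the configure step with either approach, for example via subprocess.run(configure_command(flag), env=env_with_cudaflags(flag), check=True).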

mw66 · 2025-02-14T06:37:29Z

Also, is there a way to pass the '-allow-unsupported-compiler' flag to the pip command:

pip install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON 'lightgbm==4.5.0'

?

jameslamb · 2025-02-14T06:54:53Z

However, that flag does not seem to be picked up

Try -XCompiler=--allow-unsupported-compiler instead.

Also, is there a way to pass the '-allow-unsupported-compiler' flag to the pip command:

pip install as you're running it will end up invoking CMake. Let's focus just on trying to get the cmake invocation working for you, and then we can figure out how to get that working with pip install.

But if we're going to continue with this, please... when I ask for something, provide it or explain why you can't.

I've asked twice now for the "full" logs. Those contain lots of useful information that would help us make more debugging progress. Please run these commands on your checkout of LightGBM (with the changes we've discussed above, adding the Maxwell compute capabilities).

cmake -B build -S . -DUSE_CUDA=ON
cmake --build build --target _lightgbm
sh build-python.sh --precompile

And share ALL of the logs that that produces (not only error messages), like in #5089 (comment).
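
For anyone unsure how to collect everything those commands print, here is a minimal sketch that runs them in order and writes both stdout and stderr to one file. The function name run_and_log and the log file name are assumptions made for this example; piping each command through tee in a shell would work just as well.

```python
import subprocess

# The three commands requested above, as argument lists.
BUILD_COMMANDS = [
    ["cmake", "-B", "build", "-S", ".", "-DUSE_CUDA=ON"],
    ["cmake", "--build", "build", "--target", "_lightgbm"],
    ["sh", "build-python.sh", "--precompile"],
]


def run_and_log(commands, log_path):
    """Run commands in order, saving all output; stop at the first failure."""
    with open(log_path, "w") as log:
        for cmd in commands:
            log.write("$ " + " ".join(cmd) + "\n")
            proc = subprocess.run(
                cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
            )
            log.write(proc.stdout)
            if proc.returncode != 0:
                log.write(f"[exit code {proc.returncode}]\n")
                return proc.returncode
    return 0
```

Calling run_and_log(BUILD_COMMANDS, "lightgbm-build.log") from the LightGBM checkout would leave a single file that can be attached to the issue.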

mw66 · 2025-02-14T08:07:15Z

I tried both double -- and single -:

-XCompiler=--allow-unsupported-compiler
-XCompiler=-allow-unsupported-compiler

Still the same error message, error: -- unsupported clang version!

I'm just an ordinary ML user (I need to keep my clang version for other tasks), and I found that device: "gpu" (OpenCL) works out of the box (pip install lightgbm), so I won't pursue "cuda" any more. I really do not have more time on this.

BTW, I found the pre-built package size is really small:

Using cached lightgbm-4.5.0-py3-none-manylinux_2_28_x86_64.whl (3.6 MB)

I'm just wondering if you can just provide a pip install lightgbm-CUDA binary package? Then it would save users a lot of CUDA build trouble.

Thanks for all the help.
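
For readers comparing the two GPU paths mentioned here: LightGBM selects the backend through the device_type parameter (device is an alias), which accepts "cpu", "gpu" (OpenCL), and "cuda". The sketch below only builds a parameter dictionary; the helper name and the choice of objective are assumptions made for illustration.

```python
VALID_DEVICE_TYPES = ("cpu", "gpu", "cuda")


def make_params(device_type, **extra):
    """Build a LightGBM parameter dict for the requested device backend."""
    if device_type not in VALID_DEVICE_TYPES:
        raise ValueError(f"device_type must be one of {VALID_DEVICE_TYPES}")
    params = {"objective": "regression", "device_type": device_type}
    params.update(extra)
    return params


# Hypothetical usage, assuming lightgbm is installed and X, y are already loaded:
#   import lightgbm
#   booster = lightgbm.train(make_params("gpu"), lightgbm.Dataset(X, y))
```

Only the "cuda" value needs a CUDA-enabled build like the one attempted in this thread.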

jameslamb · 2025-02-15T06:25:57Z

I'm just wondering if you can just provide a pip install lightgbm-CUDA binary package?

That's a good idea, and yes we may try to support that in the future. I've opened #6828 to document that, you might want to click "subscribe" there to be notified about discussions there.

Even if we did, it wouldn't have helped you in this case. It's unlikely we'd add support for Maxwell GPUs to any pre-built wheels that we distribute. So even if such a package existed, you still would have probably had to build from source.

I tried both double -- and single -:

Sorry that is still not working for you, I must be getting the syntax wrong. I don't have access to a Maxwell or similar GPU to test.

There may be warnings in CMake's logs that help us with that, but since you're not sharing the logs I don't know.

I really do not have more time on this.

Ok, we'll close this.

If you come back to open issues here in the future, we're happy to help, but when the people helping you ask for things, please don't ignore those requests.

jameslamb added the question label Feb 13, 2025

jameslamb changed the title ~~pip install \ --no-binary lightgbm \ --config-settings=cmake.define.USE_CUDA=ON \ 'lightgbm==4.5.0' Hangs, consume all the memory (32 G) and swap space~~ [python-package] installing CUDA version with 'pip' hangs, consume all the memory (32 G) and swap space Feb 13, 2025

jameslamb added the awaiting response label Feb 13, 2025

github-actions bot removed the awaiting response label Feb 13, 2025

jameslamb added the awaiting response label Feb 14, 2025

github-actions bot removed the awaiting response label Feb 14, 2025

jameslamb mentioned this issue Feb 15, 2025

[RFC] [CUDA] distribute Python wheels with CUDA support #6828

Open

jameslamb changed the title ~~[python-package] installing CUDA version with 'pip' hangs, consume all the memory (32 G) and swap space~~ [python-package] [CUDA] installing CUDA version with 'pip' hangs, consume all the memory (32 G) and swap space Feb 15, 2025

jameslamb closed this as completed Feb 15, 2025
