[python-package] [CUDA] installing CUDA version with 'pip' hangs, consume all the memory (32 G) and swap space #6824

Closed
mw66 opened this issue Feb 13, 2025 · 13 comments

@mw66

mw66 commented Feb 13, 2025

pip install \
    --no-binary lightgbm \
    --config-settings=cmake.define.USE_CUDA=ON \
    'lightgbm==4.5.0'

The install hangs and consumes all the memory (32 GB) and swap space.

No other error message.

top shows hundreds of <my Python virtual environment ...>/ninja --version commands being executed.

Have to kill the install.

@mw66
Author

mw66 commented Feb 13, 2025

Maybe there is an infinite loop in the build script?

A single .../ninja --version command does not use many resources, but there are millions of them.

@jameslamb
Collaborator

jameslamb commented Feb 13, 2025

Thanks for using LightGBM.

There is no way that `pip install lightgbm` should be generating "hundreds" or "millions" of processes.

Have you confirmed that those processes are directly related to this pip install?

Can you share more details please?

  • operating system
  • pip --version
  • nvidia-smi
  • are you on a network filesystem?
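
One way to check whether the stray processes really descend from the pip install is to walk the process tree from `ps` output. The following is a minimal sketch, assuming a Linux `ps` that accepts `-eo pid,ppid,args`; the function names `descendants_matching` and `count_live` are hypothetical and not part of LightGBM or pip.

```python
import subprocess
from collections import defaultdict


def descendants_matching(ps_output, root_pid, needle):
    """Return sorted PIDs below root_pid whose command line contains needle."""
    children = defaultdict(list)
    args_by_pid = {}
    for line in ps_output.strip().splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip the header row and malformed lines
        pid, ppid, args = int(parts[0]), int(parts[1]), parts[2]
        children[ppid].append(pid)
        args_by_pid[pid] = args

    found, stack, seen = [], [root_pid], set()
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        for child in children.get(pid, []):
            if needle in args_by_pid[child]:
                found.append(child)
            stack.append(child)
    return sorted(found)


def count_live(root_pid, needle="ninja --version"):
    """Count matching descendants of root_pid on the running system."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(descendants_matching(out, root_pid, needle))
```

Calling `count_live(<pid of pip>)` while the install is running, and again after interrupting it, would show whether the `ninja --version` processes belong to that install.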

@jameslamb jameslamb changed the title pip install \ --no-binary lightgbm \ --config-settings=cmake.define.USE_CUDA=ON \ 'lightgbm==4.5.0' Hangs, consume all the memory (32 G) and swap space [python-package] installing CUDA version with 'pip' hangs, consume all the memory (32 G) and swap space Feb 13, 2025
@mw66
Author

mw66 commented Feb 13, 2025

Thanks for using LightGBM.

There is no way that `pip install lightgbm` should be generating "hundreds" or "millions" of processes.

Have you confirmed that those processes are directly related to this pip install?

I'm pretty sure, it's caused by pip install lightgbm

Can you share more details please?

Python 3.9.1 (default, Dec 11 2020, 14:32:07)

  • operating system

Ubuntu 22.04.5 LTS \n \l

Linux 5.15.0-126-generic #136-Ubuntu SMP Wed Nov 6 10:38:22 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  • pip --version

pip 25.0

  • nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX TITAN X     Off |   00000000:01:00.0 Off |                  N/A |
| 22%   57C    P8             34W /  250W |       2MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX TITAN X     Off |   00000000:06:00.0 Off |                  N/A |
| 22%   33C    P8             16W /  250W |       2MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

  • are you on a network filesystem?

No.

@jameslamb
Collaborator

I'm pretty sure, it's caused by pip install lightgbm

So you don't see any of those processes immediately after interrupting pip install? Could you share a screenshot of the top output with all those ninja processes?

NVIDIA GeForce GTX TITAN X

I don't think we support that card in LightGBM's default configuration. From https://en.wikipedia.org/wiki/CUDA#GPUs_supported, it looks like that's a Maxwell GPU with CUDA Compute Capability 5.2.

The oldest compute capability LightGBM's build supports is for Pascal (6.x).

set(CUDA_ARCHS "60" "61" "62" "70" "75")

Could you try adding "50" "52" "53" to that list and building the project from source?
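
To make the mismatch concrete, here is a small sketch that checks whether a GPU's compute capability appears in an architecture list. It assumes only the `CUDA_ARCHS` line quoted above (the real CMakeLists.txt may extend that list elsewhere) and does an exact-match check only; the names `arch_code` and `is_listed` are hypothetical.

```python
# Architecture codes copied from the CUDA_ARCHS line quoted above.
QUOTED_CUDA_ARCHS = ["60", "61", "62", "70", "75"]


def arch_code(compute_capability):
    """Turn a capability such as "5.2" into the code CMake uses, "52"."""
    major, minor = compute_capability.split(".")
    return f"{int(major)}{int(minor)}"


def is_listed(compute_capability, archs=QUOTED_CUDA_ARCHS):
    """Exact-match check; ignores PTX forward compatibility."""
    return arch_code(compute_capability) in archs


# GTX TITAN X (Maxwell) reports compute capability 5.2.
print(is_listed("5.2"))
# Same check after adding the Maxwell codes suggested above.
print(is_listed("5.2", QUOTED_CUDA_ARCHS + ["50", "52", "53"]))
```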

Tell me what happens and please share all the logs.

git clone --recursive https://github.com/microsoft/LightGBM.git
cd ./LightGBM
git fetch origin --tags
git checkout v4.5.0
# (manually modify that line I asked you to modify)
cmake -B build -S . -DUSE_CUDA=ON
cmake --build build --target _lightgbm
sh build-python.sh --precompile

@mw66
Author

mw66 commented Feb 13, 2025

Tell me what happens and please share all the logs.

cmake -B build -S . -DUSE_CUDA=ON

error out:

  #error -- unsupported clang version! clang version must be less than 13 and
  greater than 3.2 .  The nvcc flag '-allow-unsupported-compiler' can be used
  to override this version check; however, using an unsupported host compiler
  may cause compilation failure or incorrect run time execution.  Use at your
  own risk.

while:

$ clang --version
Ubuntu clang version 14.0.0-1ubuntu1.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /bin

Is there a way to set '-allow-unsupported-compiler' on the command line?

@jameslamb
Collaborator

Is there a way to set -allow-unsupported-compiler on the command line?

You can add flags to the CMake variable CMAKE_CUDA_FLAGS (which CMake initializes from the CUDAFLAGS environment variable on the first configure), or just modify this line in your checkout of LightGBM:

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler=${OpenMP_CXX_FLAGS} -Xcompiler=-fPIC -Xcompiler=-Wall")

If it's possible, it would be better to downgrade to an older clang (I guess 12.x, since the message requires a version below 13) to stay within the range that's known to be compatible with your version of nvcc and targeting Maxwell GPUs. That is what that message is saying.
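
The range in that message can be checked against `clang --version` output before starting a build. This is a rough sketch that takes its bounds (greater than 3.2, less than 13) from the error text quoted in this thread; other CUDA toolkit versions accept different ranges, and the helper names `parse_clang_version` and `passes_nvcc_check` are hypothetical.

```python
import re


def parse_clang_version(version_output):
    """Extract (major, minor) from the first line of `clang --version`."""
    match = re.search(r"clang version (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no clang version found in output")
    return int(match.group(1)), int(match.group(2))


def passes_nvcc_check(version, lower=(3, 2), upper_major=13):
    """Bounds are assumptions taken from the error message in this thread."""
    return version > lower and version[0] < upper_major


reported = "Ubuntu clang version 14.0.0-1ubuntu1.1\nTarget: x86_64-pc-linux-gnu"
version = parse_clang_version(reported)
print(version, passes_nvcc_check(version))
```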

@mw66
Author

mw66 commented Feb 14, 2025

I removed the system default ninja:

$ dpkg -l | grep -i ninja                                                                                                                              
ii  ninja-build                                                 1.10.1-1                                                    amd64        small build system closest in spirit to Make       
$ sudo apt purge ninja-build                    

Now the first problem, the ninja hang, is solved.
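
Since removing the system package changed the behaviour, it may help to see every `ninja` that the build could pick up. A minimal sketch that lists executables with a given name across the directories of a `PATH`-style string; the function name `find_on_path` is hypothetical, and the sketch does not show which copy the build backend actually selects.

```python
import os


def find_on_path(name, path):
    """Return every executable file called `name` in the PATH-style string."""
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


# Lists e.g. a virtual environment's ninja and a system-wide ninja, in PATH order.
for hit in find_on_path("ninja", os.environ.get("PATH", "")):
    print(hit)
```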

I'm still struggling with gcc errors like this one: #5089

Mine: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

What is the best GCC version to compile lightgbm with?

@jameslamb
Collaborator

I'm sorry, but I'm really struggling to understand this report. You said you were compiling with clang, I asked you to downgrade the clang you're using, and in response you're saying "I uninstalled Ninja and am trying to compile with GCC".

But I'm glad to hear that you're no longer seeing Ninja issues (even if I still don't understand what the original problem was that you said was causing "millions" of processes to be spawned).

Please, can you share the full logs as I asked above? Exactly like this report did: #5089 (comment)

@mw66
Author

mw66 commented Feb 14, 2025

Sorry for the confusion. I'm trying two things:

  1. pip install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON 'lightgbm==4.5.0'

After I removed ninja-build 1.10.1-1, pip no longer hangs.

But I got the GCC error from #5089.

  2. build from git clone.

I updated CMakeLists.txt:

$ grep -n unsupported CMakeLists.txt
220:    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler=${OpenMP_CXX_FLAGS} -Xcompiler=-fPIC -Xcompiler=-Wall -allow-unsupported-compiler -I/usr/include/c++/11")

However, that flag does not seem to be picked up; I'm still seeing this error:

/usr/include/crt/host_config.h:147:2: error: -- unsupported clang version!
  clang version must be less than 13 and greater than 3.2 .  The nvcc flag
  '-allow-unsupported-compiler' can be used to override this version check;
  however, using an unsupported host compiler may cause compilation failure
  or incorrect run time execution.  Use at your own risk.

@mw66
Author

mw66 commented Feb 14, 2025

Also, is there a way to pass the '-allow-unsupported-compiler' flag to the pip command:

pip install     --no-binary lightgbm     --config-settings=cmake.define.USE_CUDA=ON     'lightgbm==4.5.0'

?

@jameslamb
Collaborator

However, that flag does not seem to be picked up

Try -XCompiler=--allow-unsupported-compiler instead.

Also, is there a way to pass the '-allow-unsupported-compiler' flag to the pip command:

pip install as you're running it will end up invoking CMake. Let's focus just on trying to get the cmake invocation working for you, and then we can figure out how to get that working with pip install.

But if we're going to continue with this, please... when I ask for something, provide it or explain why you can't.

I've asked twice now for the "full" logs. Those contain lots of useful information that would help us make more debugging progress. Please run these commands on your checkout of LightGBM (with the changes we've discussed above, adding the Maxwell compute capabilities).

cmake -B build -S . -DUSE_CUDA=ON
cmake --build build --target _lightgbm
sh build-python.sh --precompile

And share ALL of the logs that that produces (not only error messages), like in #5089 (comment).
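
On the pip question: the original command already uses `--config-settings=cmake.define.NAME=VALUE`, and the same pattern can carry other CMake defines. A minimal sketch that assembles such a command line; the helper name `pip_install_args` is hypothetical, and whether forwarding `CMAKE_CUDA_FLAGS` this way gets past nvcc's host-compiler check was not confirmed in this thread.

```python
import shlex


def pip_install_args(requirement, defines):
    """Build a pip command that compiles lightgbm from source with CMake defines."""
    args = ["pip", "install", "--no-binary", "lightgbm"]
    for name, value in defines.items():
        # Same pattern as cmake.define.USE_CUDA=ON in the original report.
        args.append(f"--config-settings=cmake.define.{name}={value}")
    args.append(requirement)
    return args


command = pip_install_args(
    "lightgbm==4.5.0",
    {
        "USE_CUDA": "ON",
        # Assumption: untested here; CMAKE_CUDA_FLAGS is a standard CMake variable.
        "CMAKE_CUDA_FLAGS": "-allow-unsupported-compiler",
    },
)
print(shlex.join(command))  # shell-quoted form, ready to paste into a terminal
```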

@mw66
Author

mw66 commented Feb 14, 2025

I tried both double -- and single -:

-XCompiler=--allow-unsupported-compiler
-XCompiler=-allow-unsupported-compiler

Still the same error message, error: -- unsupported clang version!

I'm just an ordinary ML user (I need to keep my clang version for other tasks), and I found that device: "gpu" (OpenCL) works out of the box with pip install lightgbm, so I won't pursue "cuda" any more. I really do not have more time on this.

BTW, I found the pre-built package size is really small:

Using cached lightgbm-4.5.0-py3-none-manylinux_2_28_x86_64.whl (3.6 MB)                                                                                                                     

I'm just wondering if you can just provide a pip install lightgbm-CUDA binary package? That would save users a lot of CUDA build trouble.

Thanks for all the help.

@jameslamb jameslamb changed the title [python-package] installing CUDA version with 'pip' hangs, consume all the memory (32 G) and swap space [python-package] [CUDA] installing CUDA version with 'pip' hangs, consume all the memory (32 G) and swap space Feb 15, 2025
@jameslamb
Collaborator

I'm just wondering if you can just provide a pip install lightgbm-CUDA binary package?

That's a good idea, and yes we may try to support that in the future. I've opened #6828 to document that; you might want to click "subscribe" there to be notified about discussions.

Even if we did, it wouldn't have helped you in this case. It's unlikely we'd add support for Maxwell GPUs to any pre-built wheels that we distribute. So even if such a package existed, you still would have probably had to build from source.

I tried both double -- and single -:

Sorry that is still not working for you, I must be getting the syntax wrong. I don't have access to a Maxwell or similar GPU to test.

There may be warnings in CMake's logs that help us with that, but since you're not sharing the logs I don't know.

I really do not have more time on this.

Ok, we'll close this.

If you come back to open issues here in the future, we're happy to help, but when the people helping you ask for things, please don't ignore those requests.
