Docker build fails on Amazon SageMaker: fatal error: ATen/cuda/DeviceUtils.cuh: No such file or directory #include "ATen/cuda/DeviceUtils.cuh" #168

Open
la-cruche opened this issue Oct 20, 2021 · 3 comments

Comments

@la-cruche

Hi,

I'm trying to run docker build . on a SageMaker-managed EC2 instance in AWS (ml.g4dn.12xlarge, with T4 GPUs).
docker build . runs for a few minutes, prints some output, and then fails with the following error:

csrc/layer_norm_cuda_kernel.cu:4:10: fatal error: ATen/cuda/DeviceUtils.cuh: No such file or directory
 #include "ATen/cuda/DeviceUtils.cuh"
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

Interestingly, early in the build it says

Step 17/18 : RUN cd /home/ && git clone https://github.com/NVIDIA/apex.git apex && cd apex && python setup.py install --cuda_ext --cpp_ext
 ---> Running in e8df4e2bf69e
Cloning into 'apex'...
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'

Warning: Torch did not find available GPUs on this system.
 If your intention is to cross-compile, this is not an error.
By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),
Volta (compute capability 7.0), Turing (compute capability 7.5),
and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0).
If you wish to cross-compile for a single specific architecture,
export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.

which surprises me, since I have 4 GPUs on my machine.

How can I build that Docker image on a SageMaker-managed AWS EC2 instance?

@la-cruche la-cruche changed the title Docker build fails on Amazon SageMaker: fatal error: ATen/cuda/DeviceUtils.cuh: No such file or directory #include "ATen/cuda/DeviceUtils.cuh" on AWS EC2 Docker build fails on Amazon SageMaker: fatal error: ATen/cuda/DeviceUtils.cuh: No such file or directory #include "ATen/cuda/DeviceUtils.cuh" Oct 20, 2021
@doulemint

Hi, did you solve this problem? I also fail to compile apex when I build the Docker image.

@lahiiru

lahiiru commented Nov 13, 2021

ATen/cuda/DeviceUtils.cuh: No such file or directory

This issue is already discussed in NVIDIA/apex#1043

  1. Remove the apex build command from the Dockerfile
    # RUN cd /home/ && git clone https://github.com/NVIDIA/apex.git apex && cd apex && python setup.py install --cuda_ext --cpp_ext
  2. Add the line below to the Dockerfile in place of the removed line.
    RUN cd /home/ && git clone https://github.com/NVIDIA/apex.git apex && cd apex && git reset --hard 3fe10b5597ba14a748ebb271a6ab97c09c5701ac && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
    Note: See the APEX readme for the latest build instructions.
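
Before pinning apex to an older commit, it can help to confirm that the header named in the error really is absent from the PyTorch installed in the image. The sketch below is only an illustration: it assumes the PyTorch package keeps its C++/CUDA headers under an include/ directory inside the installed torch package, and the function names are hypothetical, not part of PyTorch or apex.

```python
import importlib.util
from pathlib import Path
from typing import Optional

# Header path relative to the torch package directory.
# Assumption: headers live under <torch package>/include/.
HEADER = Path("include") / "ATen" / "cuda" / "DeviceUtils.cuh"


def torch_package_dir() -> Optional[Path]:
    """Locate the installed torch package without importing it."""
    spec = importlib.util.find_spec("torch")
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def has_device_utils_header(package_dir: Path) -> bool:
    """Return True if the header from the compile error exists."""
    return (package_dir / HEADER).is_file()


if __name__ == "__main__":
    pkg = torch_package_dir()
    if pkg is None:
        print("torch is not installed in this environment")
    else:
        status = "found" if has_device_utils_header(pkg) else "missing"
        print(f"{pkg / HEADER}: {status}")
```

If the script reports the header as missing, the installed PyTorch does not provide the file that the current apex sources include, which matches the compile error above.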

"Torch did not find available GPUs on this system"

  1. You need to install the NVIDIA Docker plugin (you might already have it), then use nvidia-docker instead of the docker command.

  2. Make sure you expose the GPUs when running the container (e.g. NV_GPU='0,1' nvidia-docker ... or --gpus)

    You might be interested in the AWS guide on deep learning containers.
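
To check whether the GPUs are actually exposed, a small report like the one below can be run inside the running container. This is a minimal sketch, not an official diagnostic: the function name is hypothetical, and it only reads the CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES environment variables and asks PyTorch what it can see. Note that the build log itself says the "did not find available GPUs" warning is not an error when cross-compiling, so this check is meant for a started container rather than for a docker build step.

```python
import os
from typing import Any, Dict


def gpu_visibility_report() -> Dict[str, Any]:
    """Collect what the current process can see of the GPUs."""
    report: Dict[str, Any] = {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        "torch_installed": False,
        "cuda_available": False,
        "device_count": 0,
    }
    try:
        import torch
    except ImportError:
        # Without torch we can only report the environment variables.
        return report
    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

For example, saving this as check_gpus.py (a hypothetical file name) and running it in a container started with --gpus all should show a non-zero device_count when the GPUs are exposed correctly.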

@Matt-Dinh

It worked. It's been a while but still thank you so much.
