CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable on Summit #11

Open
@cwpearson

Running on Summit with jsrun -n 1 -r 1 -c 42 -g 6 -a 6 -b rs js_task_info ../../build/src/weak fails with the following output (lines from the six ranks are interleaved):

score=-0.424665
components: 0 1 2 3 4 5
nodeIdx=[0,0,0] size=[310,465,930] rank=0 gpuId=0 cuda=0
nodeIdx=[1,0,0] size=[310,465,930] rank=1 gpuId=0 cuda=1
nodeIdx=[2,0,0] size=[310,465,930] rank=2 gpuId=0 cuda=2
nodeIdx=[0,1,0] size=[310,465,930] rank=3 gpuId=0 cuda=3
nodeIdx=[1,1,0] size=[310,465,930] rank=4 gpuId=0 cuda=4
nodeIdx=[2,1,0] size=[310,465,930] rank=5 gpuId=0 cuda=5
idx=[0,0,0] size=[310,465,rank=3 gpu=0 (cuda id=3) => [0,1,0]
rank=1 gpu=0 (cuda id=1) => [1,0,0]
930] rank=0 subdomain=0 cuda=0
idx=[1,0,0] size=[310,465,930] rank=1 subdomain=0 cuda=1
idx=[2,0,0] size=rank=5 gpu=0 (cuda id=5) => [2,1,0]
rank=2 gpu=0 (cuda id=2) => [2,0,0]
rank=4 gpu=0 (cuda id=4) => [1,1,0]
[310,465,930] rank=2 subdomain=0 cuda=2
idx=[0,1,0] size=[310,465,930] rank=3 subdomain=0 cuda=3
idx=[1,1,0] size=[310,465,930/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
] rank=4 subdomain=0 cuda=4
idx=[2,1,0] size=[310,465,930] rank=5 subdomain=0 cuda=5
rank=0 gpu=0 (cuda id=0) => [0,0,0]
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable
devices are busy or unavailable
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable
devices are busy or unavailable
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable
devices are busy or unavailable
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable
devices are busy or unavailable
comm plan
create remote
create colocated
create peer copy
DistributedDomain::realize: prepare peerAccessSender
/ccs/home/merth/pearson/stencil/include/stencil/rcstream.hpp@35: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable

A possible cause: all GPUs in this configuration report cudaComputeModeExclusiveProcess. In that mode only one process at a time can hold a context on a given GPU, even though every process can see all six GPUs.
If so, the first MPI rank to create a context on a GPU (for example after calling cudaSetDevice for it) gets exclusive access, and any other rank that later touches the same GPU would fail with error 46.

Running with only a single process on the node works: jsrun -n 1 -r 1 -c 42 -g 6 -a 1 -b rs js_task_info ../../build/src/weak
