Description
Running on Summit with
jsrun -n 1 -r 1 -c 42 -g 6 -a 6 -b rs js_task_info ../../build/src/weak
produces the following output (stdout from the six ranks is interleaved):
score=-0.424665
components: 0 1 2 3 4 5
nodeIdx=[0,0,0] size=[310,465,930] rank=0 gpuId=0 cuda=0
nodeIdx=[1,0,0] size=[310,465,930] rank=1 gpuId=0 cuda=1
nodeIdx=[2,0,0] size=[310,465,930] rank=2 gpuId=0 cuda=2
nodeIdx=[0,1,0] size=[310,465,930] rank=3 gpuId=0 cuda=3
nodeIdx=[1,1,0] size=[310,465,930] rank=4 gpuId=0 cuda=4
nodeIdx=[2,1,0] size=[310,465,930] rank=5 gpuId=0 cuda=5
idx=[0,0,0] size=[310,465,rank=3 gpu=0 (cuda id=3) => [0,1,0]
rank=1 gpu=0 (cuda id=1) => [1,0,0]
930] rank=0 subdomain=0 cuda=0
idx=[1,0,0] size=[310,465,930] rank=1 subdomain=0 cuda=1
idx=[2,0,0] size=rank=5 gpu=0 (cuda id=5) => [2,1,0]
rank=2 gpu=0 (cuda id=2) => [2,0,0]
rank=4 gpu=0 (cuda id=4) => [1,1,0]
[310,465,930] rank=2 subdomain=0 cuda=2
idx=[0,1,0] size=[310,465,930] rank=3 subdomain=0 cuda=3
idx=[1,1,0] size=[310,465,930/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
] rank=4 subdomain=0 cuda=4
idx=[2,1,0] size=[310,465,930] rank=5 subdomain=0 cuda=5
rank=0 gpu=0 (cuda id=0) => [0,0,0]
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
comm plan
create remote
create colocated
create peer copy
DistributedDomain::realize: prepare peerAccessSender
/ccs/home/merth/pearson/stencil/include/stencil/rcstream.hpp@35: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
This is possibly because all GPUs in this configuration are reported to be in cudaComputeModeExclusiveProcess, which allows only one process at a time to use a given GPU, even though every process can see all GPUs.
It may mean that the first MPI rank that calls cudaSetDevice on a GPU gets exclusive access to it, so any later rank that touches the same GPU fails with the error above.
Running with only a single process on the node works: jsrun -n 1 -r 1 -c 42 -g 6 -a 1 -b rs js_task_info ../../build/src/weak
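The exclusive-access hypothesis above can be illustrated with a toy model in plain C++. This is only a sketch: it does not call the CUDA runtime, and ExclusiveNode, acquire, count_failures, all_gpus_plan and own_gpu_plan are hypothetical names that do not exist in the stencil library. It assumes that ranks reach the GPUs in rank order and that a rank which loses the race for a GPU never gets it.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy model of exclusive-process compute mode (NOT the CUDA API):
// each GPU admits at most one owning process (rank).
struct ExclusiveNode {
  std::map<int, int> owner; // gpu id -> rank that holds it

  // Stands in for the first CUDA call a rank makes on `gpu`.
  // Returns false where the real runtime would be expected to report
  // "all CUDA-capable devices are busy or unavailable".
  bool acquire(int rank, int gpu) {
    assert(rank >= 0 && gpu >= 0);
    auto it = owner.find(gpu);
    if (it == owner.end()) {
      owner[gpu] = rank; // first rank to arrive becomes the owner
      return true;
    }
    return it->second == rank; // only the owner may use the GPU again
  }
};

// plan[r] lists the GPUs that rank r touches, in order.
// Ranks are processed in rank order (an assumption of this model).
int count_failures(const std::vector<std::vector<int>> &plan) {
  ExclusiveNode node;
  int failures = 0;
  for (int rank = 0; rank < static_cast<int>(plan.size()); ++rank) {
    for (int gpu : plan[rank]) {
      if (!node.acquire(rank, gpu)) {
        ++failures;
      }
    }
  }
  return failures;
}

// Every rank touches every GPU on the node.
std::vector<std::vector<int>> all_gpus_plan(int ranks, int gpus) {
  std::vector<std::vector<int>> plan(ranks);
  for (auto &p : plan) {
    for (int g = 0; g < gpus; ++g) {
      p.push_back(g);
    }
  }
  return plan;
}

// Rank r touches only GPU r.
std::vector<std::vector<int>> own_gpu_plan(int ranks) {
  std::vector<std::vector<int>> plan(ranks);
  for (int r = 0; r < ranks; ++r) {
    plan[r].push_back(r);
  }
  return plan;
}
```

Under these assumptions, count_failures(all_gpus_plan(6, 6)) is 30 (ranks 1 to 5 each fail on all six GPUs), while count_failures(all_gpus_plan(1, 6)) and count_failures(own_gpu_plan(6)) are both 0. The single-rank case matches the -a 1 run working; whether restricting each rank to its own GPU is enough for the real code is not verified here.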