CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable on Summit #11

Open
cwpearson opened this issue Feb 10, 2020 · 4 comments
Labels: bug (Something isn't working), resource:summit

cwpearson (Owner) commented Feb 10, 2020

Running on Summit with jsrun -n 1 -r 1 -c 42 -g 6 -a 6 -b rs js_task_info ../../build/src/weak causes

score=-0.424665
components: 0 1 2 3 4 5
nodeIdx=[0,0,0] size=[310,465,930] rank=0 gpuId=0 cuda=0
nodeIdx=[1,0,0] size=[310,465,930] rank=1 gpuId=0 cuda=1
nodeIdx=[2,0,0] size=[310,465,930] rank=2 gpuId=0 cuda=2
nodeIdx=[0,1,0] size=[310,465,930] rank=3 gpuId=0 cuda=3
nodeIdx=[1,1,0] size=[310,465,930] rank=4 gpuId=0 cuda=4
nodeIdx=[2,1,0] size=[310,465,930] rank=5 gpuId=0 cuda=5
idx=[0,0,0] size=[310,465,930] rank=0 subdomain=0 cuda=0
idx=[1,0,0] size=[310,465,930] rank=1 subdomain=0 cuda=1
idx=[2,0,0] size=[310,465,930] rank=2 subdomain=0 cuda=2
idx=[0,1,0] size=[310,465,930] rank=3 subdomain=0 cuda=3
idx=[1,1,0] size=[310,465,930] rank=4 subdomain=0 cuda=4
idx=[2,1,0] size=[310,465,930] rank=5 subdomain=0 cuda=5
rank=0 gpu=0 (cuda id=0) => [0,0,0]
rank=1 gpu=0 (cuda id=1) => [1,0,0]
rank=2 gpu=0 (cuda id=2) => [2,0,0]
rank=3 gpu=0 (cuda id=3) => [0,1,0]
rank=4 gpu=0 (cuda id=4) => [1,1,0]
rank=5 gpu=0 (cuda id=5) => [2,1,0]
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
/ccs/home/merth/pearson/stencil/include/stencil/local_domain.cuh@54: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable
comm plan
create remote
create colocated
create peer copy
DistributedDomain::realize: prepare peerAccessSender
/ccs/home/merth/pearson/stencil/include/stencil/rcstream.hpp@35: CUDA Runtime Error(46): all CUDA-capable devices are busy or unavailable

This is possibly because all GPUs in this configuration are reported to be in cudaComputeModeExclusiveProcess, which allows only a single process to hold a context on a given GPU, even though every process has visibility of all GPUs.
That would mean the first MPI rank to cudaSetDevice on a GPU gets exclusive access to it, and any other rank that later touches the same GPU fails with error 46.
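One way to confirm this (a minimal standalone sketch using only the standard CUDA runtime API; it is not part of the stencil code) is to print the compute mode each visible device reports:

```cpp
// Sketch: report the compute mode of every visible device.
// cudaComputeModeExclusiveProcess means only one process may hold a
// context on that device at a time.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int i = 0; i < n; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    const char *mode =
        prop.computeMode == cudaComputeModeExclusiveProcess ? "exclusive-process"
        : prop.computeMode == cudaComputeModeProhibited     ? "prohibited"
                                                            : "default/other";
    std::printf("device %d (%s): computeMode=%d (%s)\n", i, prop.name,
                prop.computeMode, mode);
  }
  return 0;
}
```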

Running with only a single process on the node works: jsrun -n 1 -r 1 -c 42 -g 6 -a 1 -b rs js_task_info ../../build/src/weak

cwpearson added the bug label on Feb 10, 2020
cwpearson (Owner, Author) commented Feb 10, 2020

This is the first call to cudaSetDevice in each process:

CUDA_RUNTIME(cudaSetDevice(src))

This is a race among all ranks: whichever rank calls cudaSetDevice on an exclusive-process GPU first wins it.
It should be changed so that each rank only enables peer access from the GPU that it controls.
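A sketch of what that could look like, reusing the CUDA_RUNTIME macro from the snippet above (myGpu_ and peerGpus_ are hypothetical placeholder names, not the repo's actual members):

```cpp
// Sketch of the proposed fix: the rank sets its context on the one GPU it
// owns, then enables peer access from that GPU to its peers. GPUs owned by
// other ranks are never passed to cudaSetDevice, so there is no race on
// exclusive-process devices.
CUDA_RUNTIME(cudaSetDevice(myGpu_));
for (int peer : peerGpus_) {
  if (peer == myGpu_) continue;
  int canAccess = 0;
  CUDA_RUNTIME(cudaDeviceCanAccessPeer(&canAccess, myGpu_, peer));
  if (canAccess) {
    cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0 /*flags must be 0*/);
    // Re-enabling peer access is benign; anything else is a real failure.
    if (err != cudaSuccess && err != cudaErrorPeerAccessAlreadyEnabled) {
      CUDA_RUNTIME(err);
    }
  }
}
```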

cwpearson (Owner, Author) commented
According to the "GPU Specific Jobs" section of https://www.olcf.ornl.gov/for-users/system-user-guides/summitdev-quickstart-guide/#gpu-specific-jobs, the compute mode can be set to Default with bsub -alloc_flags gpudefault.

cwpearson (Owner, Author) commented
Corrected in 3d018b9

cwpearson (Owner, Author) commented
This will still occur if there are more MPI ranks than GPUs on a node whose GPUs are in exclusive-process mode.
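One way to make that failure mode explicit (a sketch, not code from the repo; the function name and the choice to abort are assumptions) is to compare the node-local rank count against the device count before any rank tries to claim a GPU:

```cpp
// Sketch: abort early when exclusive-process mode cannot accommodate all
// node-local ranks, instead of failing later inside local_domain.cuh.
#include <cstdio>
#include <cuda_runtime.h>
#include <mpi.h>

void check_ranks_vs_gpus() {
  // Count the ranks sharing this node.
  MPI_Comm local;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &local);
  int localSize = 0;
  MPI_Comm_size(local, &localSize);

  int nGpus = 0;
  cudaGetDeviceCount(&nGpus);

  // Inspect device 0's compute mode as a proxy for the node's configuration.
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  if (prop.computeMode == cudaComputeModeExclusiveProcess && localSize > nGpus) {
    std::fprintf(stderr,
                 "%d ranks on this node but only %d GPUs in exclusive-process mode\n",
                 localSize, nGpus);
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  MPI_Comm_free(&local);
}
```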
