Describe the bug
python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0 --gpu_id {MIG_UUID}
When providing the UUID of the MIG device as gpu_id, I am getting the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.10/dist-packages/nnunetv2/run/run_training.py", line 113, in run_ddp
    torch.cuda.set_device(torch.device('cuda', dist.get_rank()))
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Similarly, setting CUDA_VISIBLE_DEVICES (CUDA_VISIBLE_DEVICES={MIG_UUID} python -m monai.apps.nnunet nnUNetV2Runner train_single_model) does not work, because nnUNetV2Runner overwrites it.
Running nnUNet natively works fine with:
CUDA_VISIBLE_DEVICES={MIG_UUID} nnUNetv2_train ... 2d 4
To Reproduce
Steps to reproduce the behavior:
- Use a machine with a MIG device
- Run
python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0 --gpu_id {MIG_UUID}
OR
CUDA_VISIBLE_DEVICES={MIG_UUID} python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0
Expected behavior
CUDA_VISIBLE_DEVICES should not be overwritten if it was provided.
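To make the expected behavior concrete, here is a minimal sketch of how the environment variable could be resolved. It is not MONAI's actual code: `resolve_visible_devices` is a hypothetical helper, and it assumes that `gpu_id` can be left as `None` to mean "not specified", which may not match the current signature of `train_single_model`.

```python
from typing import Mapping, Optional, Sequence, Union

# Hypothetical type alias: an index, a UUID string (e.g. a MIG UUID), or a list of either.
GpuId = Union[int, str, Sequence[Union[int, str]]]


def resolve_visible_devices(gpu_id: Optional[GpuId], env: Mapping[str, str]) -> Optional[str]:
    """Return the value to use for CUDA_VISIBLE_DEVICES, or None to leave it unset.

    Assumption: gpu_id=None means the caller did not request a device,
    so any value the user already exported is kept untouched.
    """
    existing = env.get("CUDA_VISIBLE_DEVICES")
    if gpu_id is None:
        # Nothing requested explicitly: respect the user's environment.
        return existing
    if isinstance(gpu_id, (list, tuple)):
        # Several devices: join indices or UUIDs into a comma-separated list.
        return ",".join(str(g) for g in gpu_id)
    # Single device: pass an index or a UUID string through unchanged.
    return str(gpu_id)
```

A caller would then write the result into the environment of the training subprocess only when it is not `None`. With a single MIG UUID the resolved value names exactly one device, so the single-GPU training path would be the natural choice rather than DDP.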