nnUNetV2Runner cannot be run with NVIDIA MIG configuration #7497

Open
@che85

Describe the bug

python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0 --gpu_id {MIG_UUID}

When providing the UUID of the MIG device as gpu_id, I am getting the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.10/dist-packages/nnunetv2/run/run_training.py", line 113, in run_ddp
    torch.cuda.set_device(torch.device('cuda', dist.get_rank()))
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Similarly, setting CUDA_VISIBLE_DEVICES before the call (CUDA_VISIBLE_DEVICES={MIG_UUID} python -m monai.apps.nnunet nnUNetV2Runner train_single_model) does not work, because nnUNetV2Runner overwrites the variable.

Running nnUNet natively works fine with:

CUDA_VISIBLE_DEVICES={MIG_UUID} nnUNetv2_train ... 2d 4

To Reproduce
Steps to reproduce the behavior:

  1. Use a machine with a MIG-enabled GPU.
  2. Run:
python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0 --gpu_id {MIG_UUID}

OR

CUDA_VISIBLE_DEVICES={MIG_UUID} python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0 

Expected behavior

CUDA_VISIBLE_DEVICES should not be overwritten if it was provided.
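
To make the expected behavior concrete, here is a minimal sketch of how the environment for the training subprocess could be built. It is not MONAI's actual code: build_training_env is a hypothetical helper, the gpu_id=None default is my assumption, and "MIG-abc" in the usage note is a placeholder rather than a real UUID. The idea is that an exported CUDA_VISIBLE_DEVICES is left alone unless gpu_id is passed explicitly, and that a string such as a MIG UUID is passed through unchanged.

```python
import os
from typing import Dict, Mapping, Optional, Sequence, Union

# Accept an integer index, a UUID string, or a list/tuple of either.
GpuId = Union[int, str, Sequence[Union[int, str]]]


def build_training_env(
    gpu_id: Optional[GpuId] = None,
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment for the training subprocess.

    Hypothetical helper: CUDA_VISIBLE_DEVICES is only set when gpu_id is
    given explicitly; otherwise the caller's value (if any) is preserved.
    """
    env = dict(os.environ if base_env is None else base_env)

    if gpu_id is None:
        # No explicit request: keep whatever the user exported.
        return env

    # A str is itself a sequence, so check for scalars first to avoid
    # splitting a UUID into single characters.
    if isinstance(gpu_id, (int, str)):
        devices = [gpu_id]
    else:
        devices = list(gpu_id)

    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    return env
```

With this, build_training_env(None, {"CUDA_VISIBLE_DEVICES": "MIG-abc"}) keeps the exported value, and build_training_env("MIG-abc") sets it from the argument. The resulting dict could then be passed as env= to subprocess.run.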
