Description
Describe the bug
When I use the wholeBody_ct_segmentation bundle for multi-GPU distributed training, the CacheDataset loads properly, but before the first training epoch begins I get the following error: RuntimeError: Failed to evaluate ConfigExpression: "$__local_refs['network_def'].to(__local_refs['device'])".
To Reproduce
Steps to reproduce the behavior:
- Install MONAI and its required dependencies
- Run CUDA_VISIBLE_DEVICES="0,1,2,3,4" torchrun --standalone --nnodes=1 --nproc_per_node=5 -m monai.bundle run --dataset_dir ../totalsegmentator_dataset_monai --config_file "['configs/train.json', 'configs/multi_gpu_train.json']"
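
To help narrow down where the `.to(device)` call fails, here is a minimal stdlib-only sketch that checks whether each worker's local rank maps to a GPU that is actually visible to it. It assumes the multi-GPU config selects the device from the process rank, which has not been confirmed for this bundle. `check_rank_device` and `visible_devices` are hypothetical helper names, not part of MONAI or PyTorch. `LOCAL_RANK` and `LOCAL_WORLD_SIZE` are the environment variables that torchrun sets for each worker.

```python
import os


def visible_devices(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def check_rank_device(env=None):
    """Report whether this worker's local rank can index a visible GPU.

    Assumption: the training config picks "cuda:<local rank>" as its device.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    local_world_size = int(env.get("LOCAL_WORLD_SIZE", "1"))
    devices = visible_devices(env)
    # With no mask set, every GPU is visible, so nothing can be ruled out here.
    ok = devices is None or (
        local_rank < len(devices) and local_world_size <= len(devices)
    )
    return {
        "local_rank": local_rank,
        "local_world_size": local_world_size,
        "visible": devices,
        "ok": ok,
    }


if __name__ == "__main__":
    # Launch with the same torchrun arguments as the failing command.
    print(check_rank_device())
```

Saving this as a script and launching it with the same `CUDA_VISIBLE_DEVICES` and `torchrun` arguments as above prints one line per worker. If every worker reports `ok: True`, the device mapping is probably not the cause and the full traceback below the ConfigExpression error is the next thing to inspect.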
Expected behavior
Each GPU process should start training. Instead, no training loop begins.
Environment
MONAI version: 1.3.0
Numpy version: 1.26.3
Pytorch version: 2.1.0.post300
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: 865972f7a791bf7b42efbcd87c8402bd865b329e
MONAI file: /home//mambaforge/envs/monai/lib/python3.9/site-packages/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.13
ITK version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 5.2.0
scikit-image version: NOT INSTALLED or UNKNOWN VERSION.
scipy version: NOT INSTALLED or UNKNOWN VERSION.
Pillow version: 10.2.0
Tensorboard version: 2.15.1
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: 4.66.1
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 5.9.7
pandas version: 2.1.4
einops version: NOT INSTALLED or UNKNOWN VERSION.
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: NOT INSTALLED or UNKNOWN VERSION.
clearml version: NOT INSTALLED or UNKNOWN VERSION.
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
System: Linux
Linux version: Ubuntu 20.04.6 LTS
Platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.31
Processor: x86_64
Machine: x86_64
Python version: 3.9.18
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 48
Num logical CPUs: 96
Num usable CPUs: 96
CPU usage (%): [100.0, 8.3, 3.5, 3.5, 7.6, 2.1, 3.5, 0.7, 0.0, 0.7, 0.7, 1.4, 0.0, 0.7, 0.7, 2.1, 0.0, 0.0, 0.7, 0.7, 0.7, 0.7, 30.6, 10.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.3, 0.0, 1.4, 1.4, 15.4, 0.0, 0.0, 0.0, 100.0, 100.0, 0.0, 0.0, 100.0, 0.7, 1.4, 4.9, 6.2, 0.7, 0.7, 0.0, 25.0, 0.7, 1.4, 0.7, 0.7, 0.7, 0.0, 1.4, 0.0, 0.0, 1.4, 1.4, 2.8, 0.0, 2.1, 2.8, 1.4, 0.7, 1.4, 17.2, 0.0, 0.7, 0.0, 0.0, 0.0, 0.7, 100.0, 0.0, 15.3, 77.1, 23.6, 4.2, 0.7, 2.1, 0.7, 0.0, 0.0, 0.7, 0.0, 0.0]
CPU freq. (MHz): 3057
Load avg. in last 1, 5, 15 mins (%): [9.6, 13.7, 15.0]
Disk usage (%): 38.1
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 1510.6
Available memory (GB): 1361.0
Used memory (GB): 121.8
Num GPUs: 16
Has CUDA: True
CUDA version: 11.2
cuDNN enabled: True
NVIDIA_TF32_OVERRIDE: None
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE: None
cuDNN version: 8800
Current device: 0
Library compiled for CUDA architectures: ['sm_35', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_86']
GPU 0 Name: Tesla V100-SXM3-32GB
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 80
GPU 0 Total memory (GB): 31.7
GPU 0 CUDA capability (maj.min): 7.0
GPU 1 Name: Tesla V100-SXM3-32GB
GPU 1 Is integrated: False
GPU 1 Is multi GPU board: False
GPU 1 Multi processor count: 80
GPU 1 Total memory (GB): 31.7
GPU 1 CUDA capability (maj.min): 7.0
GPU 2 Name: Tesla V100-SXM3-32GB
GPU 2 Is integrated: False
GPU 2 Is multi GPU board: False
GPU 2 Multi processor count: 80
GPU 2 Total memory (GB): 31.7
GPU 2 CUDA capability (maj.min): 7.0
GPU 3 Name: Tesla V100-SXM3-32GB
GPU 3 Is integrated: False
GPU 3 Is multi GPU board: False
GPU 3 Multi processor count: 80
GPU 3 Total memory (GB): 31.7
GPU 3 CUDA capability (maj.min): 7.0
GPU 4 Name: Tesla V100-SXM3-32GB
GPU 4 Is integrated: False
GPU 4 Is multi GPU board: False
GPU 4 Multi processor count: 80
GPU 4 Total memory (GB): 31.7
GPU 4 CUDA capability (maj.min): 7.0
GPU 5 Name: Tesla V100-SXM3-32GB
GPU 5 Is integrated: False
GPU 5 Is multi GPU board: False
GPU 5 Multi processor count: 80
GPU 5 Total memory (GB): 31.7
GPU 5 CUDA capability (maj.min): 7.0
GPU 6 Name: Tesla V100-SXM3-32GB
GPU 6 Is integrated: False
GPU 6 Is multi GPU board: False
GPU 6 Multi processor count: 80
GPU 6 Total memory (GB): 31.7
GPU 6 CUDA capability (maj.min): 7.0
GPU 7 Name: Tesla V100-SXM3-32GB
GPU 7 Is integrated: False
GPU 7 Is multi GPU board: False
GPU 7 Multi processor count: 80
GPU 7 Total memory (GB): 31.7
GPU 7 CUDA capability (maj.min): 7.0
GPU 8 Name: Tesla V100-SXM3-32GB
GPU 8 Is integrated: False
GPU 8 Is multi GPU board: False
GPU 8 Multi processor count: 80
GPU 8 Total memory (GB): 31.7
GPU 8 CUDA capability (maj.min): 7.0
GPU 9 Name: Tesla V100-SXM3-32GB
GPU 9 Is integrated: False
GPU 9 Is multi GPU board: False
GPU 9 Multi processor count: 80
GPU 9 Total memory (GB): 31.7
GPU 9 CUDA capability (maj.min): 7.0
GPU 10 Name: Tesla V100-SXM3-32GB
GPU 10 Is integrated: False
GPU 10 Is multi GPU board: False
GPU 10 Multi processor count: 80
GPU 10 Total memory (GB): 31.7
GPU 10 CUDA capability (maj.min): 7.0
GPU 11 Name: Tesla V100-SXM3-32GB
GPU 11 Is integrated: False
GPU 11 Is multi GPU board: False
GPU 11 Multi processor count: 80
GPU 11 Total memory (GB): 31.7
GPU 11 CUDA capability (maj.min): 7.0
GPU 12 Name: Tesla V100-SXM3-32GB
GPU 12 Is integrated: False
GPU 12 Is multi GPU board: False
GPU 12 Multi processor count: 80
GPU 12 Total memory (GB): 31.7
GPU 12 CUDA capability (maj.min): 7.0
GPU 13 Name: Tesla V100-SXM3-32GB
GPU 13 Is integrated: False
GPU 13 Is multi GPU board: False
GPU 13 Multi processor count: 80
GPU 13 Total memory (GB): 31.7
GPU 13 CUDA capability (maj.min): 7.0
GPU 14 Name: Tesla V100-SXM3-32GB
GPU 14 Is integrated: False
GPU 14 Is multi GPU board: False
GPU 14 Multi processor count: 80
GPU 14 Total memory (GB): 31.7
GPU 14 CUDA capability (maj.min): 7.0
GPU 15 Name: Tesla V100-SXM3-32GB
GPU 15 Is integrated: False
GPU 15 Is multi GPU board: False
GPU 15 Multi processor count: 80
GPU 15 Total memory (GB): 31.7
GPU 15 CUDA capability (maj.min): 7.0