Describe the bug
When I use the wholeBody_ct_segmentation bundle for multi-GPU distributed training, the CacheDataset loads properly, but before the training epochs begin I get the error RuntimeError: Failed to evaluate ConfigExpression: "$__local_refs['network_def'].to(__local_refs['device'])".
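The "Failed to evaluate ConfigExpression" message is a wrapper, so the underlying exception is usually the more informative one. The following is a minimal sketch for digging it out, assuming the wrapper is raised with a chained cause (standard Python `raise ... from ...`). The names `root_cause` and `failing_step` are hypothetical helpers for illustration and are not part of MONAI.

```python
def root_cause(exc):
    """Follow __cause__ / __context__ links to the innermost exception."""
    seen = {id(exc)}
    while True:
        nxt = exc.__cause__ or exc.__context__
        # Stop at the end of the chain, or if the chain loops back on itself.
        if nxt is None or id(nxt) in seen:
            return exc
        seen.add(id(nxt))
        exc = nxt


def failing_step():
    """Hypothetical stand-in that mimics a wrapped evaluation failure."""
    try:
        raise ValueError("stand-in for the real underlying error")
    except ValueError as err:
        raise RuntimeError("Failed to evaluate ConfigExpression") from err


if __name__ == "__main__":
    try:
        failing_step()
    except RuntimeError as wrapped:
        cause = root_cause(wrapped)
        print(f"{type(cause).__name__}: {cause}")
```

In a real run, the same information normally appears higher up in the full traceback, above the line "The above exception was the direct cause of the following exception".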
To Reproduce
Steps to reproduce the behavior:
1. Install MONAI and the required dependencies.
2. Run CUDA_VISIBLE_DEVICES="0,1,2,3,4" torchrun --standalone --nnodes=1 --nproc_per_node=5 -m monai.bundle run --dataset_dir ../totalsegmentator_dataset_monai --config_file "['configs/train.json', 'configs/multi_gpu_train.json']"
Expected behavior
Each GPU process should start training. Instead, no training loop begins.
Further testing using Docker also ends with the following error: RuntimeError: Failed to evaluate ConfigExpression: "$__local_refs['train::trainer'].run()"
Hi @idinsmore1, I can't reproduce the error. You can try debugging each component like this to check whether network_def is written correctly.
from monai.bundle import ConfigParser
parser = ConfigParser()
parser.read_config(f=["/workspace/Code/model-zoo/models/wholeBody_ct_segmentation/configs/train.json"])
# parse the structured config content
parser.parse()
# instantiate the network component and print the network structure
net = parser.get_parsed_content("network_def")
print(net)
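Since the failing expression moves the network to a device, another thing worth ruling out is a mismatch between the number of launched processes and the GPUs each process can see. Below is a minimal sketch using only the standard library. It assumes the device index is derived from the process rank and relies on the LOCAL_RANK variable that torchrun sets for each worker. `visible_gpu_ids` and `check_local_rank` are hypothetical helper names.

```python
import os


def visible_gpu_ids(env=None):
    """Return the IDs listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset means no restriction on visible GPUs
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def check_local_rank(env=None):
    """Raise if this worker's local rank has no matching visible GPU."""
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    ids = visible_gpu_ids(env)
    if ids is not None and local_rank >= len(ids):
        raise RuntimeError(
            f"LOCAL_RANK={local_rank} but only {len(ids)} GPU(s) are visible: {ids}"
        )
    return local_rank


if __name__ == "__main__":
    print("local rank OK:", check_local_rank())
```

With the command from the report (five visible devices and --nproc_per_node=5), this check passes for ranks 0 to 4, so it only rules out one possible cause.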
Hi @KumoLiu, thanks for the quick response. For whatever reason, my slightly customized environment does not work properly for multi-GPU training. I solved the issue by recreating the environment exactly as specified in the metadata.json file, and everything seems to be working now. Thanks!
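For anyone hitting the same problem, here is a rough sketch of how the installed packages could be compared against the versions listed in a bundle's metadata.json. The key names (monai_version, pytorch_version, numpy_version, required_packages_version, optional_packages_version) and the configs/metadata.json path are assumptions about the bundle layout, so adjust them to whatever your file actually contains. The helper names are hypothetical.

```python
import json
from importlib import metadata
from pathlib import Path


def required_versions(meta):
    """Collect {package: version} pairs from a parsed metadata dict."""
    required = {}
    # Assumed top-level keys and the pip package each one refers to.
    for key, pkg in (("monai_version", "monai"),
                     ("pytorch_version", "torch"),
                     ("numpy_version", "numpy")):
        if key in meta:
            required[pkg] = meta[key]
    # Assumed keys holding a nested {package: version} mapping.
    for key in ("required_packages_version", "optional_packages_version"):
        required.update(meta.get(key, {}))
    return required


def installed_version(pkg):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(required, lookup=installed_version):
    """Return {package: (wanted, found)} wherever the versions differ."""
    mismatches = {}
    for pkg, wanted in required.items():
        found = lookup(pkg)
        if found != wanted:
            mismatches[pkg] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    path = Path("configs/metadata.json")  # assumed location inside the bundle
    if path.exists():
        meta = json.loads(path.read_text())
        for pkg, (wanted, found) in find_mismatches(required_versions(meta)).items():
            print(f"{pkg}: metadata wants {wanted}, installed {found}")
```

The comparison is a strict string match, so a build suffix such as the .post300 in the reported PyTorch version is flagged even when the base version agrees. That is deliberate for a first pass, since the goal is to spot any deviation from the listed environment.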
Environment
MONAI version: 1.3.0
Numpy version: 1.26.3
Pytorch version: 2.1.0.post300
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: 865972f7a791bf7b42efbcd87c8402bd865b329e
MONAI file: /home//mambaforge/envs/monai/lib/python3.9/site-packages/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.13
ITK version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 5.2.0
scikit-image version: NOT INSTALLED or UNKNOWN VERSION.
scipy version: NOT INSTALLED or UNKNOWN VERSION.
Pillow version: 10.2.0
Tensorboard version: 2.15.1
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: 4.66.1
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 5.9.7
pandas version: 2.1.4
einops version: NOT INSTALLED or UNKNOWN VERSION.
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: NOT INSTALLED or UNKNOWN VERSION.
clearml version: NOT INSTALLED or UNKNOWN VERSION.
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
System: Linux
Linux version: Ubuntu 20.04.6 LTS
Platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.31
Processor: x86_64
Machine: x86_64
Python version: 3.9.18
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 48
Num logical CPUs: 96
Num usable CPUs: 96
CPU usage (%): [100.0, 8.3, 3.5, 3.5, 7.6, 2.1, 3.5, 0.7, 0.0, 0.7, 0.7, 1.4, 0.0, 0.7, 0.7, 2.1, 0.0, 0.0, 0.7, 0.7, 0.7, 0.7, 30.6, 10.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.3, 0.0, 1.4, 1.4, 15.4, 0.0, 0.0, 0.0, 100.0, 100.0, 0.0, 0.0, 100.0, 0.7, 1.4, 4.9, 6.2, 0.7, 0.7, 0.0, 25.0, 0.7, 1.4, 0.7, 0.7, 0.7, 0.0, 1.4, 0.0, 0.0, 1.4, 1.4, 2.8, 0.0, 2.1, 2.8, 1.4, 0.7, 1.4, 17.2, 0.0, 0.7, 0.0, 0.0, 0.0, 0.7, 100.0, 0.0, 15.3, 77.1, 23.6, 4.2, 0.7, 2.1, 0.7, 0.0, 0.0, 0.7, 0.0, 0.0]
CPU freq. (MHz): 3057
Load avg. in last 1, 5, 15 mins (%): [9.6, 13.7, 15.0]
Disk usage (%): 38.1
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 1510.6
Available memory (GB): 1361.0
Used memory (GB): 121.8
Num GPUs: 16
Has CUDA: True
CUDA version: 11.2
cuDNN enabled: True
NVIDIA_TF32_OVERRIDE: None
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE: None
cuDNN version: 8800
Current device: 0
Library compiled for CUDA architectures: ['sm_35', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_86']
GPU 0 Name: Tesla V100-SXM3-32GB
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 80
GPU 0 Total memory (GB): 31.7
GPU 0 CUDA capability (maj.min): 7.0
GPU 1 Name: Tesla V100-SXM3-32GB
GPU 1 Is integrated: False
GPU 1 Is multi GPU board: False
GPU 1 Multi processor count: 80
GPU 1 Total memory (GB): 31.7
GPU 1 CUDA capability (maj.min): 7.0
GPU 2 Name: Tesla V100-SXM3-32GB
GPU 2 Is integrated: False
GPU 2 Is multi GPU board: False
GPU 2 Multi processor count: 80
GPU 2 Total memory (GB): 31.7
GPU 2 CUDA capability (maj.min): 7.0
GPU 3 Name: Tesla V100-SXM3-32GB
GPU 3 Is integrated: False
GPU 3 Is multi GPU board: False
GPU 3 Multi processor count: 80
GPU 3 Total memory (GB): 31.7
GPU 3 CUDA capability (maj.min): 7.0
GPU 4 Name: Tesla V100-SXM3-32GB
GPU 4 Is integrated: False
GPU 4 Is multi GPU board: False
GPU 4 Multi processor count: 80
GPU 4 Total memory (GB): 31.7
GPU 4 CUDA capability (maj.min): 7.0
GPU 5 Name: Tesla V100-SXM3-32GB
GPU 5 Is integrated: False
GPU 5 Is multi GPU board: False
GPU 5 Multi processor count: 80
GPU 5 Total memory (GB): 31.7
GPU 5 CUDA capability (maj.min): 7.0
GPU 6 Name: Tesla V100-SXM3-32GB
GPU 6 Is integrated: False
GPU 6 Is multi GPU board: False
GPU 6 Multi processor count: 80
GPU 6 Total memory (GB): 31.7
GPU 6 CUDA capability (maj.min): 7.0
GPU 7 Name: Tesla V100-SXM3-32GB
GPU 7 Is integrated: False
GPU 7 Is multi GPU board: False
GPU 7 Multi processor count: 80
GPU 7 Total memory (GB): 31.7
GPU 7 CUDA capability (maj.min): 7.0
GPU 8 Name: Tesla V100-SXM3-32GB
GPU 8 Is integrated: False
GPU 8 Is multi GPU board: False
GPU 8 Multi processor count: 80
GPU 8 Total memory (GB): 31.7
GPU 8 CUDA capability (maj.min): 7.0
GPU 9 Name: Tesla V100-SXM3-32GB
GPU 9 Is integrated: False
GPU 9 Is multi GPU board: False
GPU 9 Multi processor count: 80
GPU 9 Total memory (GB): 31.7
GPU 9 CUDA capability (maj.min): 7.0
GPU 10 Name: Tesla V100-SXM3-32GB
GPU 10 Is integrated: False
GPU 10 Is multi GPU board: False
GPU 10 Multi processor count: 80
GPU 10 Total memory (GB): 31.7
GPU 10 CUDA capability (maj.min): 7.0
GPU 11 Name: Tesla V100-SXM3-32GB
GPU 11 Is integrated: False
GPU 11 Is multi GPU board: False
GPU 11 Multi processor count: 80
GPU 11 Total memory (GB): 31.7
GPU 11 CUDA capability (maj.min): 7.0
GPU 12 Name: Tesla V100-SXM3-32GB
GPU 12 Is integrated: False
GPU 12 Is multi GPU board: False
GPU 12 Multi processor count: 80
GPU 12 Total memory (GB): 31.7
GPU 12 CUDA capability (maj.min): 7.0
GPU 13 Name: Tesla V100-SXM3-32GB
GPU 13 Is integrated: False
GPU 13 Is multi GPU board: False
GPU 13 Multi processor count: 80
GPU 13 Total memory (GB): 31.7
GPU 13 CUDA capability (maj.min): 7.0
GPU 14 Name: Tesla V100-SXM3-32GB
GPU 14 Is integrated: False
GPU 14 Is multi GPU board: False
GPU 14 Multi processor count: 80
GPU 14 Total memory (GB): 31.7
GPU 14 CUDA capability (maj.min): 7.0
GPU 15 Name: Tesla V100-SXM3-32GB
GPU 15 Is integrated: False
GPU 15 Is multi GPU board: False
GPU 15 Multi processor count: 80
GPU 15 Total memory (GB): 31.7
GPU 15 CUDA capability (maj.min): 7.0