Replies: 6 comments 6 replies
-
Do you have any suggestions about this question in the pathology tutorial? Thanks in advance.
-
Hi @codybum, thanks for reporting this. I will look into this issue and keep you updated.
-
@codybum I have created a ticket to investigate this issue. #5198
-
I have the same issue for 3D segmentation (it worked in monai==0.9.0 and started failing in monai==0.9.1).
-
It seems that it is an issue with the MetaTensor (which is a subclass of torch.Tensor).
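For context on why a MetaTensor ends up inside torch's internal checks at all: as a torch.Tensor subclass, it hooks into every torch operation through the `__torch_function__` protocol (visible in the traceback at `meta_tensor.py`). A minimal sketch of that mechanism, using a made-up `TaggedTensor` class rather than MONAI's actual implementation:

```python
import torch

# Illustration only: TaggedTensor is a hypothetical stand-in for MetaTensor.
# A torch.Tensor subclass intercepts every torch op via __torch_function__,
# which is how MetaTensor propagates its metadata -- and how it gets pulled
# into autograd's view/in-place bookkeeping seen in the traceback.
class TaggedTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Delegate to the default implementation, which re-wraps
        # results in the subclass type.
        return super().__torch_function__(func, types, args, kwargs or {})

x = torch.ones(3).as_subclass(TaggedTensor)
y = torch.relu(x)
assert isinstance(y, TaggedTensor)  # the subclass survives the op
```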
-
Hi @codybum,
-
Multi-GPU "--distributed" training appears to be broken in the last several MONAI containers for MILModel/ResNet-50. Specifically, the following tutorial example fails: https://github.com/Project-MONAI/tutorials/blob/main/pathology/multiple_instance_learning/panda_mil_train_evaluate_pytorch_gpu.py
Is anyone else experiencing this?
Versions:
NVIDIA Release 22.08 (build 42105213)
PyTorch Version 1.13.0a0+d321be6
projectmonai/monai:latest
DIGEST:sha256:109d2204811a4a0f9f6bf436eca624c42ed9bb3dbc6552c90b65a2db3130fefd
Error:
Traceback (most recent call last):
File "MIL.py", line 724, in &lt;module&gt;
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(args,))
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/workspace/MIL.py", line 565, in main_worker
train_loss, train_acc = train_epoch(model, train_loader, optimizer, scaler=scaler, epoch=epoch, args=args)
File "/workspace/MIL.py", line 61, in train_epoch
logits = model(data)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1009, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 970, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/monai/monai/networks/nets/milmodel.py", line 238, in forward
x = self.net(x)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 285, in forward
return self._forward_impl(x)
File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 270, in _forward_impl
x = self.relu(x)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 102, in forward
return F.relu(input, inplace=self.inplace)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1453, in relu
return handle_torch_function(relu, (input,), input, inplace=inplace)
File "/opt/conda/lib/python3.8/site-packages/torch/overrides.py", line 1528, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/opt/monai/monai/data/meta_tensor.py", line 249, in __torch_function__
ret = super().__torch_function__(func, types, args, kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 1089, in __torch_function__
ret = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1455, in relu
result = torch.relu(input)
RuntimeError: Output 0 of SyncBatchNormBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
/opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 400 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
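The RuntimeError comes from `ReLU(inplace=True)` mutating a view produced by SyncBatchNorm's custom autograd Function. One commonly suggested workaround (a sketch under that assumption, not a confirmed MONAI fix) is to switch every ReLU in the backbone to out-of-place before wrapping the model in DistributedDataParallel:

```python
import torch.nn as nn

def disable_inplace_relu(model: nn.Module) -> nn.Module:
    """Set inplace=False on every nn.ReLU so no upstream view is
    mutated in place (sidesteps the SyncBatchNormBackward error)."""
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.inplace = False
    return model

# Hypothetical stand-in for the ResNet-50 backbone used by MILModel.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.BatchNorm2d(8),       # becomes SyncBatchNorm under --distributed
    nn.ReLU(inplace=True),   # the in-place op the error complains about
)
disable_inplace_relu(net)
assert net[2].inplace is False
```

Alternatively, the error message itself points at the other fix: cloning the output of the custom Function (i.e. the SyncBatchNorm output) before any in-place op touches it.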