Replies: 6 comments 6 replies
-
Do you have any suggestions about this question in the pathology tutorial? Thanks in advance.
-
Hi @codybum, thanks for reporting this. I will look into this issue and keep you updated.
-
@codybum I have created a ticket to investigate this issue. #5198
-
I have the same issue for 3D segmentation (it worked in monai==0.9.0 and started failing in monai==0.9.1).
-
It seems that it is an issue with the MetaTensor (which is a subclass of torch.Tensor).
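For context on why a MetaTensor ends up inside torch's internal checks at all: as a torch.Tensor subclass, it hooks into every torch operation through the `__torch_function__` protocol (visible in the traceback at `meta_tensor.py`). A minimal sketch of that mechanism, using a made-up `TaggedTensor` class rather than MONAI's actual implementation:

```python
import torch

# Illustration only: TaggedTensor is a hypothetical stand-in for MetaTensor.
# A torch.Tensor subclass intercepts every torch op via __torch_function__,
# which is how MetaTensor propagates its metadata -- and how it gets pulled
# into autograd's view/in-place bookkeeping seen in the traceback.
class TaggedTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Delegate to the default implementation, which re-wraps
        # results in the subclass type.
        return super().__torch_function__(func, types, args, kwargs or {})

x = torch.ones(3).as_subclass(TaggedTensor)
y = torch.relu(x)
assert isinstance(y, TaggedTensor)  # the subclass survives the op
```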
-
Hi @codybum,
-
Multi-GPU "--distributed" training appears to be broken in the last several MONAI containers for MILModel/ResNet-50. Specifically, the following tutorial example fails: https://github.com/Project-MONAI/tutorials/blob/main/pathology/multiple_instance_learning/panda_mil_train_evaluate_pytorch_gpu.py
Is anyone else experiencing this?
Versions:
NVIDIA Release 22.08 (build 42105213)
PyTorch Version 1.13.0a0+d321be6
projectmonai/monai:latest
DIGEST:sha256:109d2204811a4a0f9f6bf436eca624c42ed9bb3dbc6552c90b65a2db3130fefd
Error:
Traceback (most recent call last):
File "MIL.py", line 724, in &lt;module&gt;
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(args,))
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/workspace/MIL.py", line 565, in main_worker
train_loss, train_acc = train_epoch(model, train_loader, optimizer, scaler=scaler, epoch=epoch, args=args)
File "/workspace/MIL.py", line 61, in train_epoch
logits = model(data)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1009, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 970, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/monai/monai/networks/nets/milmodel.py", line 238, in forward
x = self.net(x)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 285, in forward
return self._forward_impl(x)
File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 270, in _forward_impl
x = self.relu(x)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 102, in forward
return F.relu(input, inplace=self.inplace)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1453, in relu
return handle_torch_function(relu, (input,), input, inplace=inplace)
File "/opt/conda/lib/python3.8/site-packages/torch/overrides.py", line 1528, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/opt/monai/monai/data/meta_tensor.py", line 249, in __torch_function__
ret = super().__torch_function__(func, types, args, kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 1089, in __torch_function__
ret = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1455, in relu
result = torch.relu(input)
RuntimeError: Output 0 of SyncBatchNormBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
/opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 400 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
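The RuntimeError comes from `ReLU(inplace=True)` mutating a view produced by SyncBatchNorm's custom autograd Function. One commonly suggested workaround (a sketch under that assumption, not a confirmed MONAI fix) is to switch every ReLU in the backbone to out-of-place before wrapping the model in DistributedDataParallel:

```python
import torch.nn as nn

def disable_inplace_relu(model: nn.Module) -> nn.Module:
    """Set inplace=False on every nn.ReLU so no upstream view is
    mutated in place (sidesteps the SyncBatchNormBackward error)."""
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.inplace = False
    return model

# Hypothetical stand-in for the ResNet-50 backbone used by MILModel.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.BatchNorm2d(8),       # becomes SyncBatchNorm under --distributed
    nn.ReLU(inplace=True),   # the in-place op the error complains about
)
disable_inplace_relu(net)
assert net[2].inplace is False
```

Alternatively, the error message itself points at the other fix: cloning the output of the custom Function (i.e. the SyncBatchNorm output) before any in-place op touches it.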