ddp_weakref is not set during the backward pass #20390
Unanswered
lokesh-vr-17773 asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
Training runs successfully with Torch 2.0 and PyTorch Lightning 2.0: with this model architecture and dataset, all 150 epochs complete without issue. With the same configuration and dataset on Torch 2.2 and Lightning 2.2, training crashes during the 67th epoch with the error `ddp_weakref is not set during the backward pass`.
env info for 2.0 run:
env info for 2.2 run:
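For context on what the error name suggests: a "weakref" in Python is a reference that does not keep its target alive, and code that dereferences one after the target is gone (or before it was ever set) gets `None` back. The following is only a minimal stdlib sketch of that general failure mode, not PyTorch's actual DDP internals; the `Wrapper` class is a hypothetical stand-in:

```python
import weakref


class Wrapper:
    """Hypothetical stand-in for a DDP-style module wrapper
    that backward hooks hold only a weak reference to."""
    pass


w = Wrapper()
ref = weakref.ref(w)   # a hook stores only a weak reference to the wrapper

# While the wrapper is alive, the weak reference resolves to it.
assert ref() is w

del w                  # wrapper is garbage-collected

# After collection the weak reference resolves to None; a hook that
# assumes it is still set would fail here.
assert ref() is None
```

If the crash follows this pattern, the wrapper object the backward hook expects may be unset or already collected by the time the hook fires, which would explain why the failure appears only after many epochs rather than immediately.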