Training is stuck at saving checkpoint for Llama3.2 #1713
Training is stuck at saving the checkpoint after the 1st epoch, with the message below. […]

Config: […]

If I enable `save_adapter_weights_only: True`, a different error comes up.

Comments
@ebsmothers, what is that command we use at Meta for when nproc=4 and it gets stuck? Do you think it could be related? @apthagowda97, what error do you get when you set `save_adapter_weights_only`? Can you also share the command/config you use to run training?
@felipemello1 the command we use to avoid the hangs is `NCCL_SHM_DISABLE=0`, but I don't think it's relevant for non-Meta hardware (though I guess it's worth a try). @apthagowda97 would also be interested to know what kind of hardware you're running on.
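For readers hitting a similar hang: a minimal sketch of how that environment variable would be passed to a torchtune launch. The recipe and config names here are assumptions for illustration; they are not taken from this thread.

```bash
# Sketch only: the recipe/config names below are illustrative assumptions.
# NCCL_SHM_DISABLE=0 leaves NCCL's shared-memory transport enabled (setting it
# to 1 would disable it); the variable is inherited by all spawned ranks.
NCCL_SHM_DISABLE=0 tune run --nproc_per_node 4 lora_finetune_distributed \
    --config llama3_2/3B_lora
```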
Sorry for the delayed response. @felipemello1, when I use `save_adapter_weights_only: True`, […]. If I disable it, i.e. `save_adapter_weights_only: False`, […]. But I don't get this problem if I do full finetuning, where the checkpoint is saved within seconds. Here is my config: […]

@ebsmothers I am running on a single A100 GPU with CUDA 12.4 PyTorch.
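For context, a hedged sketch of the kind of single-device run being described, toggling the flag as a command-line override (torchtune's `tune run` accepts `key=value` config overrides). The config name is an assumption, since the reporter's actual config was not captured above.

```bash
# Sketch under assumptions: the reporter's real config was not preserved in
# this thread, so the config name below is illustrative.
# LoRA finetune on one GPU; save only the small adapter weights at checkpoint
# time rather than merging and writing out the full model:
tune run lora_finetune_single_device \
    --config llama3_2/3B_lora_single_device \
    save_adapter_weights_only=True
```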