megatron-lm flash-ckpt cannot save ckpt to disk when using pipeline parallel #1146
Comments
I have successfully tested the flash checkpoint using the following command on 4 A100 nodes in the forked repo, whose commit ID is cb995d5.
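For context, here is a rough sketch of how flash checkpoint is typically wired into a Megatron-LM training loop. The module path and call signatures follow the DLRover docs around v0.3.x as I remember them; treat them as assumptions and verify against your installed version.

```python
# Sketch only: the import path and signatures are assumptions based on the
# DLRover flash-checkpoint docs for Megatron-LM (~v0.3.x) -- verify locally.
from dlrover.trainer.torch.flash_checkpoint.megatron import (
    save_checkpoint,  # intended drop-in for megatron.checkpointing.save_checkpoint
    load_checkpoint,  # intended drop-in for megatron.checkpointing.load_checkpoint
)

def save_if_needed(iteration, model, optimizer, opt_param_scheduler, save_interval):
    # Flash ckpt first snapshots each rank's shards to shared memory
    # synchronously, then an agent persists them to disk asynchronously.
    if iteration % save_interval == 0:
        save_checkpoint(iteration, model, optimizer, opt_param_scheduler)
```

Note that some flash-ckpt features (e.g. recovering from the in-memory snapshot after a failure) depend on the dlrover-run agent rather than plain torchrun, which may matter for the torchrun results reported below; the DLRover docs are the authority here.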
Which version of dlrover? Besides, have you tested with torchrun? I tested dlrover 0.3.7 with torchrun on this repo, and it still errors.
dlrover[torch]==0.3.7. I have reproduced the issue if I do not use
I add
This error is related to the storage I use and can be ignored.
The job will hang in the end.
For Megatron-LM training with flash-ckpt, when pipeline parallelism is set, the checkpoint cannot be saved successfully. It seems that not all of the checkpoint is saved to memory:

`Skip persisting the checkpoint of step 60 because the cached step in memory are not consistent.`
Save log:
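For anyone debugging the skipped persistence, below is a minimal illustration, not DLRover's actual implementation, of the kind of cross-rank consistency check that emits a message like the one above. Before the agent persists a cached step, every rank should hold the same step in shared memory; with pipeline parallelism, a stage whose memory snapshot never completed reports a stale step, so the gathered steps disagree and persistence is skipped. It would also explain a hang at exit if some ranks then wait on a collective that others never enter.

```python
# Illustration only -- not DLRover's code. Assumes an initialized
# torch.distributed process group spanning all training ranks.
import torch.distributed as dist

def cached_steps_consistent(local_cached_step: int) -> bool:
    """True if every rank has cached the same checkpoint step in memory."""
    steps = [None] * dist.get_world_size()
    dist.all_gather_object(steps, local_cached_step)
    return len(set(steps)) == 1

def maybe_persist(local_cached_step: int, persist_fn):
    # A pipeline stage whose snapshot never finished reports a stale step,
    # so the whole job skips persisting this step to disk.
    if cached_steps_consistent(local_cached_step):
        persist_fn(local_cached_step)
    else:
        print(f"Skip persisting the checkpoint of step {local_cached_step} "
              "because the cached steps in memory are not consistent.")
```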