Feat ckpt #18
param_offloaded.data.copy_(param_model.data)

## The next part is a fix so that each rank saves a different dataloader rank. It is not efficient because it reads the state from disk twice.
with open(os.path.join(resume_ckpt_path, f"__{world_info.local_rank}_0.pt"), "rb") as f:
This prevents us from resuming with a different number of local ranks, right? For now, since we are only running on 8xH100 nodes, it is fine; just good to keep in mind.
Yes indeed. If we need to resume from a different DiLoCo setup, we can easily implement it.
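For reference, a minimal sketch of a guard for the limitation discussed in this thread. It assumes the per-rank files follow the `__{local_rank}_0.pt` naming from the diff above and that local ranks are numbered from 0; `rank_state_path`, `saved_local_ranks`, and `check_resumable` are hypothetical helpers, not part of this PR.

```python
import os
from typing import List

SUFFIX = "_0.pt"  # assumed from the f"__{local_rank}_0.pt" pattern in the diff


def rank_state_path(ckpt_dir: str, local_rank: int) -> str:
    """Build the per-rank file path using the naming seen in the diff."""
    return os.path.join(ckpt_dir, f"__{local_rank}{SUFFIX}")


def saved_local_ranks(ckpt_dir: str) -> List[int]:
    """Return the sorted local ranks that have a per-rank file in ckpt_dir."""
    ranks = []
    for name in os.listdir(ckpt_dir):
        if name.startswith("__") and name.endswith(SUFFIX):
            middle = name[2:-len(SUFFIX)]
            if middle.isdigit():
                ranks.append(int(middle))
    return sorted(ranks)


def check_resumable(ckpt_dir: str, local_world_size: int) -> None:
    """Fail early if the checkpoint was written by a different number of local ranks."""
    found = saved_local_ranks(ckpt_dir)
    expected = list(range(local_world_size))
    if found != expected:
        raise RuntimeError(
            f"checkpoint has per-rank files for local ranks {found}, "
            f"but this run expects {expected}"
        )
```

Calling `check_resumable(resume_ckpt_path, local_world_size)` before opening the per-rank file would turn a mismatch into an explicit error rather than a missing-file failure on some ranks; here `local_world_size` stands for however the run exposes its number of local ranks.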
wow looks like a difficult PR haha. glad you figured it out :)
approved!
PR for ckpt is ready:
The main difference compared to the OpenDiLoCo ckpt is that:
Here is a screenshot with 4 DiLoCo workers.
Here is the ckpt without DiLoCo.