
Feat ckpt #18

Merged: samsja merged 20 commits into main from feat-ckpt on Sep 30, 2024
Conversation

@samsja (Collaborator) commented Sep 28, 2024

PR for ckpt is ready.

The main differences compared to the open_diloco ckpt are:

  • not using fsspec directly, as we discovered a bug in PyTorch
  • rather, using fsspec to rsync to the remote in a subprocess, which makes ckpt faster (see the sketch after this list)
  • only saving the model once
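For illustration, a minimal sketch of the save-locally-then-copy-in-a-subprocess idea (all function and path names here are hypothetical, not the code in this PR; for simplicity it shells out to the rsync CLI rather than going through fsspec):

```python
import os
import subprocess

import torch


def save_ckpt(model, optimizer, step: int, local_dir: str, remote_dir: str):
    """Save a checkpoint locally, then upload it without blocking training."""
    os.makedirs(local_dir, exist_ok=True)
    path = os.path.join(local_dir, f"step_{step}.pt")
    # Save the model state once (e.g. from a single rank), per the PR description.
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        path,
    )
    # Copy to remote storage in a subprocess; training continues immediately.
    return subprocess.Popen(["rsync", "-a", path, remote_dir])
```

The returned Popen handle lets the training loop keep going and, if desired, wait on the upload before writing the next checkpoint.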

Here is a screenshot with 4 DiLoCo workers:
[Screenshot from 2024-09-29 21-16-28]

Here is ckpt without DiLoCo:
[Screenshot from 2024-09-29 21-34-21]

@samsja force-pushed the feat-ckpt branch 2 times, most recently from b592ac8 to 1f3ee64 on September 29, 2024 03:51
@samsja marked this pull request as ready for review September 30, 2024 04:26
param_offloaded.data.copy_(param_model.data)

## the next part is a fix so that each rank gets its own dataloader state. It is not efficient because it reads the state from disk twice
with open(os.path.join(resume_ckpt_path, f"__{world_info.local_rank}_0.pt"), "rb") as f:
A Contributor commented:
This prevents us from resuming with a different number of local ranks, right? For now, since we are just running on 8xH100 nodes, it is fine; just good to keep in mind.

@samsja (Collaborator, Author) replied:
Yes indeed; if we need to resume with a different DiLoCo configuration, we can easily implement it.
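For context, a minimal sketch of the rank-to-file coupling being discussed (the helper name is hypothetical; only the `__{local_rank}_0.pt` file-name pattern comes from the snippet above):

```python
import os

import torch


def load_dataloader_state(resume_ckpt_path: str, local_rank: int):
    """Read the dataloader state written for this local rank.

    Because files are keyed by local_rank, resuming assumes the same
    number of local ranks as the run that wrote the checkpoint.
    """
    with open(os.path.join(resume_ckpt_path, f"__{local_rank}_0.pt"), "rb") as f:
        return torch.load(f)
```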

@Jackmin801 (Member) left a comment:
wow looks like a difficult PR haha. glad you figured it out :)
approved!

@samsja merged commit 6475ea4 into main on Sep 30, 2024
1 check passed