Better ElasticDeviceMesh #9
Jackmin801 commented Sep 25, 2024 (edited)
- Patch into simulate-multi-node.sh
- Fix sync
b09225b to d7532fa
571184a to 3077262
999caf7 to 4938bb4
I added some comments. Looks good, looking forward to trying it!!
It could be worth adding a small test, either a torchrun one or a distributed one.
We would need to test gloo over VPN ASAP as well.
You did not install the pre-commit hook haha 😿
It would be nice if you added your GitHub comments as code comments.
I got a bit confused by the FSDP.summon_full, but I think I get it now with the shared memory.
LFGTM!