Why is model_optim_rng.pt saved in a separate directory? #1225
Comments
The flash checkpoint in DLRover saves and loads the distributed optimizer checkpoint of Megatron-LM in parallel. That is, each rank saves and loads its own shard of optimizer states into the |
@workingloong Thanks for your quick reply. I got it. I tried benchmarking dlrover and found
Just another question: Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. Have you compared DLRover with v0.7.0?
Did you use the following imports?
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import save_checkpoint
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import load_checkpoint
Not yet. |
Yes, both are used. Oddly, when training a 16B model, saving to memory takes about 50 s. BTW, saving to memory also takes about 50 s when using Megatron-LM's async save. Maybe the disk bandwidth in my environment is low.
Yeah, disk performance may affect how fast the checkpoint is saved into memory, because the async checkpoint uses shared memory, which needs to create a file on the disk. I ran some experiments and found that saving the checkpoint into memory is much faster with SSD than with NAS.
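To check whether the copy into shared memory is the slow step in a given environment, a rough timing sketch like the one below may help. It uses only Python's standard `multiprocessing.shared_memory` module and is not DLRover's or Megatron-LM's actual code path; the helper name `time_copy_to_shared_memory` and the 64 MiB payload size are assumptions chosen for illustration.

```python
import time
from multiprocessing import shared_memory


def time_copy_to_shared_memory(payload: bytes) -> float:
    """Return the seconds spent copying `payload` into a new shared memory block.

    Hypothetical helper for illustration only; it does not reproduce how
    DLRover stages checkpoint tensors.
    """
    # The block is backed by a named segment that the OS creates on demand.
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        start = time.perf_counter()
        # The block may be rounded up to a page boundary, so slice explicitly.
        shm.buf[: len(payload)] = payload
        elapsed = time.perf_counter() - start
    finally:
        shm.close()
        shm.unlink()  # Release the segment so it does not linger after the run.
    return elapsed


if __name__ == "__main__":
    size = 64 * 1024 * 1024  # 64 MiB, an arbitrary test size
    seconds = time_copy_to_shared_memory(b"\x00" * size)
    print(f"copied {size} bytes in {seconds:.4f} s")
```

Running this on the training node and scaling the result to the checkpoint size gives a rough lower bound to compare against the observed 50 s.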
This issue has been automatically marked as stale because it has |
Megatron-LM saves model_optim_rng.pt and distrib_optim.pt in a directory named mp_rank_xx_xxx. But in DLRover, distrib_optim.pt is separated and saved in a directory named rank_xxxx. This works if checkpoints are both saved and loaded with DLRover, but loading fails if the checkpoint was saved by Megatron-LM and then loaded by DLRover. So I am curious why it was designed this way. Thanks @workingloong
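The mismatch described above can be sketched as two path-building helpers. This is a minimal illustration, not code from either project: the function names are hypothetical, and the exact formats (the `iter_` prefix, zero-padding widths, and indexing `rank_xxxx` by global rank) are assumptions inferred from the placeholders in this issue.

```python
from pathlib import Path


def megatron_paths(root: str, iteration: int, tp_rank: int, pp_rank: int):
    """Assumed Megatron-LM layout: both files share one mp_rank_xx_xxx directory."""
    shard_dir = Path(root) / f"iter_{iteration:07d}" / f"mp_rank_{tp_rank:02d}_{pp_rank:03d}"
    return shard_dir / "model_optim_rng.pt", shard_dir / "distrib_optim.pt"


def dlrover_paths(root: str, iteration: int, tp_rank: int, pp_rank: int, global_rank: int):
    """Assumed DLRover layout: distrib_optim.pt lives in a per-rank rank_xxxx directory."""
    base = Path(root) / f"iter_{iteration:07d}"
    model_file = base / f"mp_rank_{tp_rank:02d}_{pp_rank:03d}" / "model_optim_rng.pt"
    optim_file = base / f"rank_{global_rank:04d}" / "distrib_optim.pt"
    return model_file, optim_file


def optimizer_path_differs(root: str, iteration: int, tp_rank: int,
                           pp_rank: int, global_rank: int) -> bool:
    """True when the two layouts disagree on where the optimizer shard is."""
    _, megatron_optim = megatron_paths(root, iteration, tp_rank, pp_rank)
    _, dlrover_optim = dlrover_paths(root, iteration, tp_rank, pp_rank, global_rank)
    return megatron_optim != dlrover_optim


if __name__ == "__main__":
    print(megatron_paths("ckpt", 100, 0, 1)[1].as_posix())
    print(dlrover_paths("ckpt", 100, 0, 1, 5)[1].as_posix())
```

Under these assumptions the model file resolves to the same path in both layouts, while the optimizer shard does not, which is consistent with a Megatron-LM checkpoint failing to load through DLRover.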