Why is model_optim_rng.pt saved in a separate directory? #1225

Open
zhaoyang-star opened this issue Aug 2, 2024 · 8 comments

@zhaoyang-star

Megatron-LM saves model_optim_rng.pt and distrib_optim.pt in a directory named mp_rank_xx_xxx. But in dlrover, distrib_optim.pt is split out and saved in a directory named rank_xxxx.

It is fine if checkpoints are both saved and loaded with dlrover, but it fails if a checkpoint is saved with Megatron-LM and then loaded with dlrover. So I am curious why it is designed this way. Thanks @workingloong

@workingloong
Collaborator

workingloong commented Aug 5, 2024

The flash checkpoint in DLRover saves and loads the distributed optimizer checkpoint of Megatron-LM in parallel. That is, each rank saves and loads its own shard of the optimizer state in a rank_xxxx file. You can see the details at https://github.com/intelligent-machine-learning/dlrover/blob/master/docs/blogs/megatron_flash_checkpoint.md#save-and-load-distributed-optimizer-in-parallel
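For illustration only (this is not DLRover's actual implementation, and the file names here are placeholders), the per-rank pattern described above looks roughly like this sketch, where each rank writes and later reads only the shard it owns:

```python
# Illustrative sketch only, not DLRover's code: each data-parallel rank saves
# and loads just its own shard of the distributed optimizer state, so the
# checkpoint directory ends up with one rank_xxxx file per rank.
import os

import torch
import torch.distributed as dist


def save_optimizer_shard(ckpt_dir: str, step: int, optim_shard_state: dict) -> None:
    rank = dist.get_rank()
    step_dir = os.path.join(ckpt_dir, f"iter_{step:07d}")
    os.makedirs(step_dir, exist_ok=True)
    # One file per rank: rank_0000.pt, rank_0001.pt, ...
    torch.save(optim_shard_state, os.path.join(step_dir, f"rank_{rank:04d}.pt"))


def load_optimizer_shard(ckpt_dir: str, step: int) -> dict:
    rank = dist.get_rank()
    path = os.path.join(ckpt_dir, f"iter_{step:07d}", f"rank_{rank:04d}.pt")
    # Each rank reads back only the shard it owns, so loading is also parallel.
    return torch.load(path, map_location="cpu")
```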

@zhaoyang-star
Author

@workingloong Thanks for your quick reply. I got it.

I tried benchmarking dlrover and found that save_to_memory takes ~55 sec. Is that normal? According to the blog, save_to_memory should take under 1 sec. Please correct me if I misunderstand anything. Part of the log follows:

192.169.125.62: saving checkpoint at iteration     800 to /mnt/home/flash_checkpoint_output_0802/outputs/checkpoint/16b-lr1e-4-tp1-pp4
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 7 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 1 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 5 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,238] [INFO] [engine.py:303:save_state_dict_to_memory] 3 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,249] [INFO] [engine.py:303:save_state_dict_to_memory] 2 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,250] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,250] [INFO] [engine.py:303:save_state_dict_to_memory] 6 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,251] [INFO] [engine.py:303:save_state_dict_to_memory] 4 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:36:35,564] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_memory in 49.314s.
192.169.125.62: [2024-08-02 13:36:36,881] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_memory in 50.645s.
192.169.125.62: [2024-08-02 13:36:37,891] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_memory in 51.654s.
192.169.125.62: [2024-08-02 13:36:38,761] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_memory in 52.525s.
192.169.125.62: [2024-08-02 13:36:40,280] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_memory in 54.031s.
192.169.125.62: [2024-08-02 13:36:42,972] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_memory in 56.722s.
192.169.125.62: [2024-08-02 13:36:55,181] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_memory in 68.931s.
192.169.125.62: [2024-08-02 13:37:33,870] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_memory in 107.633s.
192.169.125.62: [2024-08-02 13:37:33,870] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [ckpt_saver.py:532:_sync_shm_to_storage] ShardingSaver save checkpoint to storage, event CheckpointEvent(type=<CheckpointEventType.SAVE: 1>, step=800, global_shard_num=0)
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_storage in 107.62s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_storage in 107.634s.
192.169.125.62: (min, max) time across ranks (ms):
192.169.125.62:     save-checkpoint ................................: (107635.64, 107635.84)

@zhaoyang-star
Author

One more question: Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. Have you compared dlrover with v0.7.0?

@workingloong
Collaborator

I tried benchmarking dlrover and found that save_to_memory takes ~55 sec. Is that normal? According to the blog, save_to_memory should take under 1 sec.

Did you use distributed_optimizer and the following APIs?

from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import save_checkpoint
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import load_checkpoint
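For reference, the flash checkpoint doc presents these as drop-in replacements for Megatron-LM's own save_checkpoint/load_checkpoint. A minimal sketch of wiring them into the training loop is below; the StorageType import path, the storage_type keyword, and the interval variables are a rough sketch following that doc, so check the linked page for the exact API:

```python
# Sketch of using the flash checkpoint APIs in a Megatron-LM training loop.
# The StorageType import path, the storage_type keyword, and the interval
# variables are assumptions taken from the flash checkpoint doc; verify them.
from dlrover.trainer.torch.flash_checkpoint.megatron import StorageType
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import (
    load_checkpoint,
    save_checkpoint,
)

save_memory_interval = 50     # frequent, cheap snapshots into shared memory
save_storage_interval = 500   # less frequent persistent copies to storage


def maybe_save(args, iteration, model, optimizer, opt_param_scheduler):
    if args.save and iteration % save_memory_interval == 0:
        save_checkpoint(iteration, model, optimizer, opt_param_scheduler,
                        storage_type=StorageType.MEMORY)
    if args.save and iteration % save_storage_interval == 0:
        save_checkpoint(iteration, model, optimizer, opt_param_scheduler,
                        storage_type=StorageType.DISK)


def resume(model, optimizer, opt_param_scheduler):
    # Drop-in for Megatron's load_checkpoint: each rank loads its own
    # rank_xxxx optimizer shard in parallel and returns the resumed iteration.
    return load_checkpoint(model, optimizer, opt_param_scheduler)
```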

@workingloong
Collaborator

One more question: Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. Have you compared dlrover with v0.7.0?

Not yet.

@zhaoyang-star
Author

zhaoyang-star commented Aug 6, 2024

Did you use distributed_optimizer and the following APIs?

Yes, both are used. The weird thing is that when training a 16B model, saving to memory takes about 50 sec. BTW, the memory-save time is also about 50 sec when using Megatron-LM's async save. Maybe the disk bandwidth of my environment is low.

@workingloong
Collaborator

The weird thing is that when training a 16B model, saving to memory takes about 50 sec. BTW, the memory-save time is also about 50 sec when using Megatron-LM's async save. Maybe the disk bandwidth of my environment is low.

Yeah, disk performance can affect how fast the checkpoint is saved into memory, because the async checkpoint uses shared memory, which needs to create a file on the disk. I ran some experiments and found that saving the checkpoint into memory is much faster with an SSD than with NAS.
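One thing worth checking (assuming the checkpoint engine backs its shared memory with files under /dev/shm, as Python's multiprocessing shared memory does on Linux) is whether /dev/shm on the training nodes is actually a RAM-backed tmpfs; if those files land on a disk-backed filesystem, the "save to memory" step is bounded by disk bandwidth:

```python
# Print how /dev/shm is mounted. If it is not a RAM-backed tmpfs, writes to
# the shared-memory files go to disk and "save to memory" pays disk bandwidth.
with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, options = line.split()[:4]
        if mountpoint == "/dev/shm":
            print(f"/dev/shm is mounted as {fstype} with options {options}")
```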


github-actions bot commented Nov 5, 2024

This issue has been automatically marked as stale because it has not had recent activity.

@github-actions github-actions bot added the stale label Nov 5, 2024