support offline consolidate and reshard fsdp checkpoints #28

Merged: 24 commits into main from dev/offline_consolidate, Oct 18, 2024

@hanwen-sun commented Oct 11, 2024

Support offline consolidation and resharding of XLA FSDP checkpoints.

Time and memory requirements:

| Configuration | Load | Consolidate | Reshard | Save | Total | Memory |
| --- | --- | --- | --- | --- | --- | --- |
| llama3-8b model, fsdp=8, reshard=1 | 7s | 1s | | 20s | 28s | |
| llama3-8b optimizer, fsdp=8, reshard=1 | 16s | 2s | | 40s | 58s | |
| llama3-8b consolidate 8 | | | | | 1.5min | |
| llama3-8b model, fsdp=8, reshard=4 | 7s | 1s | 2s | 7s | 17s | |
| llama3-8b optimizer, fsdp=8, reshard=4 | 16s | 2s | 4s | 13s | 35s | |
| llama3-8b reshard 8 - 4 | | | | | 52s | |
| llama3-70b model, fsdp=32, reshard=1 | 50s | 8s | | 180s | 238s | |
| llama3-70b optimizer, fsdp=32, reshard=1 | 100s | 18s | | 480s | 598s | 530G |
| llama3-70b consolidate 32 | | | | | 14min | |
| llama3-70b model, fsdp=32, reshard=16 | 50s | 8s | 15s | 50s | 123s | |
| llama3-70b optimizer, fsdp=32, reshard=16 | 100s | 18s | 36s | 100s | 254s | 530G |
| llama3-70b reshard 32 - 16 | | | | | 6.3min | |
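
For readers unfamiliar with the two operations timed above, here is a minimal sketch of the idea in plain Python. It is not the code in this PR: it assumes each parameter is stored as a flattened tensor split evenly across ranks with zero padding on the tail, and the helper names `consolidate` and `reshard` are hypothetical.

```python
from typing import List


def consolidate(shards: List[List[float]], numel: int) -> List[float]:
    """Concatenate per-rank shards of one flattened tensor and drop tail padding.

    Assumption: shards are listed in rank order and `numel` is the element
    count of the original, unpadded tensor.
    """
    flat = [x for shard in shards for x in shard]
    if len(flat) < numel:
        raise ValueError("shards hold fewer elements than the original tensor")
    return flat[:numel]


def reshard(flat: List[float], world_size: int, pad_value: float = 0.0) -> List[List[float]]:
    """Split a flat tensor into `world_size` equal shards, padding the tail."""
    shard_size = -(-len(flat) // world_size)  # ceiling division
    padded = flat + [pad_value] * (shard_size * world_size - len(flat))
    return [padded[i * shard_size:(i + 1) * shard_size] for i in range(world_size)]


# Toy example: a 10-element tensor saved with fsdp=8, then resharded to 4.
original = [float(i) for i in range(10)]
shards_8 = reshard(original, 8)           # what 8 ranks would each hold
full = consolidate(shards_8, numel=10)    # the "consolidate" step
shards_4 = reshard(full, 4)               # the "reshard" step
```

Under these assumptions, resharding to 1 is the same as consolidation alone, which would be consistent with the reshard=1 rows having no separate reshard time.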

Resolved review threads (outdated):
torchacc/dist/state_dict_utils.py (2 threads)
torchacc/utils/consolidate_and_reshard_ckpts.py (5 threads)
@hanwen-sun changed the title from "offline consolidate util" to "support offline consolidate and reshard checkpoints" on Oct 17, 2024
Resolved review thread (outdated): setup.py
@anw90 changed the title from "support offline consolidate and reshard checkpoints" to "support offline consolidate and reshard fsdp checkpoints" on Oct 17, 2024
@hanwen-sun merged commit 1187a11 into main on Oct 18, 2024
3 checks passed
@hanwen-sun deleted the dev/offline_consolidate branch on Oct 18, 2024, 09:16