[QUESTION] How to enable ZeRO 2/3 stages ? #1156

Open

polisettyvarma opened this issue Sep 24, 2024 · 9 comments
@polisettyvarma commented Sep 24, 2024

How can the ZeRO 2/3 stages be enabled?
Similar to #589.

@lmcafee-nvidia (Contributor)

I responded to this on #589.

@polisettyvarma (Author) commented Sep 25, 2024

Please convert this issue into a feature request for ZeRO 2/3.
Thank you.

@carolove

I think this article is useful: https://www.deepspeed.ai/tutorials/megatron/.
DeepSpeed ZeRO 1/2 works with the latest Megatron-LM code.
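
For reference, a minimal sketch of that path (plain DeepSpeed as in the linked tutorial, not upstream Megatron-LM). The model, batch size, and learning rate are placeholders, and the script assumes it is started with the `deepspeed` launcher so the distributed environment exists:

```python
# Hypothetical example: enabling ZeRO stage 2 through a DeepSpeed config.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # ZeRO-2: shard optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

model = torch.nn.Linear(1024, 1024)    # stand-in for the real network

# deepspeed.initialize wraps the model in an engine that applies ZeRO-2
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

This is the DeepSpeed/Megatron-DeepSpeed route; upstream Megatron-LM does not expose this config today, which is what the rest of the thread is about.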

@polisettyvarma (Author) commented Sep 26, 2024

@carolove Thanks for the input. I am familiar with the DeepSpeed framework and how it enables all ZeRO stages; my question is about enabling ZeRO natively in this repo.
Can you please share the commits that added ZeRO 2 support to the latest code of this repo?
Thank you.

@carolove

I am also looking for such an example.

@SeunghyunSEO

Megatron-LM now has its own ZeRO-1 (called the distributed optimizer in this project), but if you are more familiar with DeepSpeed, how about using Megatron-DeepSpeed, @polisettyvarma?
And to the best of my knowledge, ZeRO-3 is not compatible with the model parallelism (TP or PP) of Megatron-LM.
ZeRO-3 reduces VRAM usage and improves throughput by partitioning model parameters and broadcasting them when needed, whereas TP and PP partition the model in their own way and instead communicate activations (all-reducing activations in the forward and backward passes).
So TP or PP leaves no room for communicating model parameters the way ZeRO-3 does.
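
As a rough single-process illustration of that distinction (an assumed sketch, not Megatron-LM or DeepSpeed code): ZeRO-3 shards whole parameters across data-parallel ranks and must re-gather them before compute, while TP keeps each weight permanently split and gathers activations instead.

```python
import torch

world_size = 4                      # pretend group size
weight = torch.randn(8, 8)          # one full weight matrix
x = torch.randn(2, 8)               # a batch of activations

# ZeRO-3 style: each data-parallel rank holds only a flat shard of the parameter
# and must all-gather the shards to rebuild the full weight before every use.
shards = weight.flatten().chunk(world_size)
rebuilt = torch.cat(shards).view_as(weight)
assert torch.equal(rebuilt, weight)

# TP style: the weight stays split by columns; each rank computes a slice of the
# output, and it is the activations that get gathered, not the parameters.
tp_shards = weight.chunk(world_size, dim=1)
y = torch.cat([x @ w for w in tp_shards], dim=1)
assert torch.allclose(y, x @ weight)
```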

@polisettyvarma (Author)

Thank you @SeunghyunSEO for your inputs. Yes, the Megatron-DeepSpeed repo can be used, but it is not up to date with Megatron-LM. I agree that ZeRO > 1 is not compatible with PP.
My request here is for a similar ZeRO-like feature in Megatron-LM itself.

@deepakn94 (Collaborator)

We should have PyTorch FSDP support compatible with TP in the next couple of weeks.
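
For anyone who wants to experiment in the meantime, a minimal sketch of stock PyTorch FSDP (not the Megatron-LM integration referenced above; the model sizes are placeholders and the script assumes a `torchrun` launch):

```python
# Run with: torchrun --nproc_per_node=<num_gpus> fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # Default FSDP sharding (FULL_SHARD) partitions parameters, gradients, and
    # optimizer state across ranks, roughly the ZeRO-3 memory behavior
    # discussed in this thread.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    model(x).sum().backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```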

polisettyvarma changed the title from "[QUESTION] How to enable ZeRO 1/2/3 stages ?" to "[QUESTION] How to enable ZeRO 2/3 stages ?" on Sep 30, 2024
@polisettyvarma (Author) commented Sep 30, 2024

Thank you @deepakn94 for sharing this information.
