Initial Lightning DeepSpeed Integration #103
Conversation
Just additionally, we're working on ZeRO 3 right now, and once this is in place I'm more than happy to add a separate PR :)
Thanks for the PR! I'll test it out and see if it works.
I forget why I set that, so no issue changing it here.
Sorry about the late response: IRL has been busy! Testing in the 1-GPU case in Colab, things look good, although unfortunately it goes OOM with your simulated 1.5B model (even w/ gradient checkpointing) despite the DeepSpeed literature asserting that the framework allows that (maybe that support is a ZeRO 3 thing). Checking the multi-GPU case now. If that works I'm good with merging.
However, I'll merge it since this PR seems to be pretty safe for non-DeepSpeed use cases. The GPU issue I mentioned above is likely unrelated (I'll file an issue with pytorch-lightning if I get a better test case). Thanks again!
Woop! No need to thank me, you've done a lot for the community :) Regarding the issue, this is probably because we're using multiprocessing within a notebook... I think this opens up the discussion of a spawn-based DeepSpeed plugin! I can make an issue and cross-reference this now.
Made the issue to track this. @minimaxir, if you have the ability to test outside of a notebook via a terminal, let me know if it works, or I can prioritize getting the spawn version of DeepSpeed together!
Thanks @minimaxir for your hard work, this repo is super awesome! I learnt a lot about how text generation really works back when this repo was released :)
Related to #97.
I've enabled DeepSpeed using default parameters to start with, which does not include CPU offloading since that comes with a speed degradation by default. I've also got a PR set up for PyTorch Lightning to update its README, as the information there was slightly outdated!
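For context, a minimal sketch (not the actual script from this PR) of what enabling DeepSpeed with its defaults looks like through the PyTorch Lightning Trainer; the exact argument name has changed across Lightning versions (`plugins="deepspeed"` in older releases, `strategy="deepspeed"` in newer ones), and the GPU count here is just a placeholder:

```python
# Minimal sketch, assuming a PyTorch Lightning 1.x install: enable DeepSpeed
# with its default settings (states sharded across GPUs, no CPU offloading
# by default) purely through Trainer flags.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=8,               # optimizer/gradient states are sharded across the GPUs
    precision=16,         # DeepSpeed is normally run in FP16
    plugins="deepspeed",  # use strategy="deepspeed" on newer Lightning versions
)
```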
Whilst testing DeepSpeed, I noticed that when training with default parameters the gradient clip value is set to 0, turning off any gradient clipping. Is this by choice? When training with FP16 my loss did not converge, and NaNed without DeepSpeed. To remedy this, I reduced the LR to 1e-4 and set `max_grad_norm` to 1.0, which I think is the default in HF Transformers as well.

Since DeepSpeed really takes effect at larger parameter sizes (the buffers themselves are around 3GB of RAM by default), I tested DeepSpeed using a much larger network size (1.5B parameters) across 8 GPUs, seeing good scaling of parameter sizes. As the number of GPUs increased, I did see the memory usage decrease, which is expected as we shard the states across GPUs. I'll continue to try to push the numbers on our A100 server.
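To make that remedy concrete, here's a hedged sketch of the corresponding Trainer settings; `gradient_clip_val` is Lightning's counterpart to `max_grad_norm`, and where the 1e-4 learning rate ends up depends on how the LightningModule builds its optimizer:

```python
# Minimal sketch of the stabilization described above: clip gradients at 1.0
# (which I believe matches the HF Transformers max_grad_norm default) and
# train in FP16 with a lowered learning rate of 1e-4.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=1,
    precision=16,
    gradient_clip_val=1.0,  # Lightning-side equivalent of max_grad_norm=1.0
)

# The lowered LR would be applied where the optimizer is created, e.g. in the
# LightningModule:
#   def configure_optimizers(self):
#       return torch.optim.AdamW(self.parameters(), lr=1e-4)
```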
I'll try to do some tests with DeepSpeed at sizes that fit DDP, and I'll continue to test. Let me know if you have any issues/feedback!
Script: