DeepSpeed + TPU support via transformer's Trainer #97
Comments
Now ZeRO-3 Offload is available, which in theory should be easier to implement (once it works with base Transformers).
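For reference, a ZeRO-3 offload configuration might look roughly like the sketch below, written as a Python dict that gets dumped to the JSON file DeepSpeed consumes. Key names follow the DeepSpeed docs, but exact fields vary across DeepSpeed versions, so treat this as illustrative rather than canonical:

```python
# Sketch of a ZeRO-3 offload DeepSpeed config (assumption: key names as
# documented by DeepSpeed at the time; fields vary across versions).
import json

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # ZeRO stage 3: partition parameters as well
        "offload_optimizer": {"device": "cpu"},  # move optimizer state to CPU memory
        "offload_param": {"device": "cpu"},      # move parameters to CPU memory
    },
}

# DeepSpeed reads this as a JSON file, e.g. ds_config.json
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```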
Thanks for highlighting this! @SeanNaren can help get this solved on the PL side.
Hey @minimaxir, what issues are you running into? If you're able to point to issues, I can help escalate/resolve them for PL! ZeRO-3 Offload has its own quirks that will require both HuggingFace Transformers and us to figure out, so it may take a bit longer to integrate; however, we're working together on this where we can. We do have experimental support in place, and can give some pointers if you're keen to try :)
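For anyone keen to try, that experimental support was exposed through pytorch-lightning's Trainer plugin interface. A rough sketch follows, assuming PL around 1.2; the plugin surface was changing quickly at the time, and `MyModel` is a placeholder LightningModule:

```python
# Rough sketch of pytorch-lightning's experimental DeepSpeed integration
# (PL ~1.2; treat the exact arguments as version-dependent).
import pytorch_lightning as pl

model = MyModel()  # placeholder: any LightningModule

trainer = pl.Trainer(
    gpus=1,
    precision=16,         # DeepSpeed's ZeRO offload requires fp16
    plugins="deepspeed",  # string shortcut enabling the default DeepSpeedPlugin
)
trainer.fit(model)
```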
Dear @minimaxir, would you mind joining the PyTorch Lightning Slack? I sent you an invitation. We can coordinate efforts there to resolve your issues with Sean and me. Best,
Currently, training via pytorch-lightning's implementation of DeepSpeed/TPUs is not working, and it's impossible to debug where the issues lie (i.e. within aitextgen, transformers, pytorch-lightning, or pytorch-xla) since the entire ecosystem is very fragile and error messages are unhelpful.
A short-term workaround is to use transformers' native Trainer for DeepSpeed + TPUs (and only those specific use cases for now), as it limits potential breakage and also serves as a baseline for pytorch-lightning's approach once that is more stable.
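A minimal sketch of what that workaround might look like, assuming the `deepspeed` argument on `TrainingArguments` (available in recent transformers releases at the time); the model and dataset names are placeholders:

```python
# Sketch of the proposed workaround: let transformers' native Trainer drive
# DeepSpeed via a config file. Model and dataset here are placeholders.
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

model = GPT2LMHeadModel.from_pretrained("gpt2")

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed="ds_config.json",  # e.g. the ZeRO-3 config sketched above
)

# train_dataset: placeholder for a tokenized dataset
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

The TPU path is separate: DeepSpeed itself is CUDA-only, so a TPU run would launch the same Trainer script through transformers' `xla_spawn.py` helper instead of passing a DeepSpeed config.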
The downside is that Trainer is not as good as pytorch-lightning UX-wise, but given that DeepSpeed + TPUs are a more niche use case for power users, that's acceptable.