Fix save pretrained for TPUs #105
Conversation
```diff
@@ -172,10 +176,19 @@ def on_batch_end(self, trainer, pl_module):
            desc += f" — GPU Mem: {gpu_memory} MB"
        self.main_progress_bar.update(self.progress_bar_refresh_rate)
        self.main_progress_bar.set_description(desc)

        if _TPU_AVAILABLE and self.save_every_check:
```
Could we get some more information around this and why it's necessary? I think we can iterate on this to get something cleaner; the idea is that the boilerplate can live within lightning, outside of the user code.
Converting the PR into a Draft, as I found an issue during
Hi Kaushik and Sean! Thanks for the PR! A few notes/questions:
Hi @minimaxir,
I understand the use case, as most users will train on Colab for TPUs. I could work on a test Colab and will add it here. It would be a good test to check whether it's working as expected.
Noted! Let me spend some time on it and try to debug it. I will get back to you regarding it early next week.
The pytorch-lightning update appears to be live now, so I can test this. Also:
That may have been the issue I was hitting earlier, so good timing!
@minimaxir Awesome! We added a few more fixes for TPUs.
Created a draft Colab notebook: https://colab.research.google.com/drive/1MF-3hMdLpxzF7uH-MWJVaxX1u03WnVU_?usp=sharing

However, it's having issues with the init changes:
@minimaxir I have patched a fix in this PR: https://github.com/kaushikb11/aitextgen/blob/tpu_save/fix/aitextgen/train.py#L29

We would have to use
I have updated the Colab notebook to correctly pull from your PR branch, made it self-sufficient, and added some test cases. The good news is that the model now runs on the Colab TPU! Unfortunately, there's a lot of bad news that prevents merging:
Feel free to play with the notebook if I am missing anything, but I'm OK with merging once these issues are addressed.
Same issue/overall performance with the alternate Colab TPU installation via Lightning-AI/pytorch-lightning#6794
@minimaxir Noted! Thanks for linking the notebook. There's an issue regarding the hang on Colab: Lightning-AI/pytorch-lightning#6390. We have tagged it as a priority bug, and I will personally look into getting it resolved by next week. We did the testing on a GCP TPU + VM, and it is working as expected. We will also investigate the training speed for 1 vs. 8 cores. Thanks, I appreciate the feedback!
@minimaxir There's been some progress on the reported issues:
I believe the above TPU issues have been resolved in this PR. It's available in the latest release (v1.2.7). I would love for you to try it out. But there seems to be a tricky issue during
Regarding this, I am afraid you are benchmarking for the same number of global steps (number of optimizer calls), whereas training with 8 TPU cores will have a larger effective batch size.
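To make that concrete, here is a quick back-of-the-envelope sketch; the numbers and variable names are illustrative, not taken from the PR:

```python
# Rough illustration of effective batch size on TPUs; numbers are assumptions.
per_core_batch_size = 1
num_tpu_cores = 8
gradient_accumulation_steps = 1

# Each optimizer call ("global step") consumes this many samples in total,
# so at equal global steps an 8-core run has seen 8x the data of a 1-core run.
effective_batch_size = per_core_batch_size * num_tpu_cores * gradient_accumulation_steps
print(effective_batch_size)  # -> 8
```

Comparing runs at equal samples seen (global steps times effective batch size), rather than at equal global steps, avoids that mismatch.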
Thanks for the updates @kaushikb11! I just tested the notebook. There's still one blocker issue: the trained weights don't seem to get back to the CPU from the TPU correctly, which shows up when generating from the trained model. There are still a few other issues:
It appears ending training also hangs on 1 TPU core but not 8 TPU cores, oddly.
That would make sense; however, I'm not sure that's happening in practice. I changed the notebook to fine-tune the 124M GPT-2 with a batch size of 1; on paper, an effective batch size of 8 should result in a substantially faster loss decrease than a batch size of 1, but that does not appear to be the case in the test notebook, where the moving-average losses are about the same. I may need to set up better logging to help visualize this. It may be OK to merge this PR soon (with a note that this behavior is BETA); I'll increase the minimum pytorch-lightning requirement to
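As a sketch of the kind of logging that could make this comparison clearer (not code from this PR; the names are illustrative), one could track a smoothed loss against samples seen rather than optimizer steps, so a 1-core run and an 8-core run share the same x-axis:

```python
from collections import deque


class SmoothedLoss:
    """Running-window average of the per-step training loss."""

    def __init__(self, window: int = 100):
        self.values = deque(maxlen=window)

    def update(self, loss: float) -> float:
        self.values.append(loss)
        return sum(self.values) / len(self.values)


# Hypothetical usage inside a training callback:
# smoothed = smoother.update(current_loss)
# samples_seen = global_step * effective_batch_size
# log({"samples_seen": samples_seen, "smoothed_loss": smoothed})
```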
Hi @minimaxir, there was a bit of refactoring for DDP Spawn last week, and as TPU Spawn is a subclass of it, it was affected. It has been resolved in this PR: Lightning-AI/pytorch-lightning#7074
As of pytorch-lightning 1.3, training now works and completes without hangs! However, generation is still not correct, implying the model does not go back to the CPU. That should hopefully be debuggable on my end. Given that, plus the fact that this PR has some fixes for #129, I'll merge it now, but I won't advertise TPU support until it's more hardened and there's a more up-to-date test notebook. Thanks, and sorry for the late update!
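One possible direction for that debugging (a sketch, not something from the PR): explicitly move the weights off the XLA device before generating or saving. This assumes `ai` is the trained aitextgen instance from the notebook:

```python
# After TPU training, the model parameters may still live on the XLA device.
# Moving them to CPU before generating should avoid device-bound weights.
# `ai` is assumed to be the trained aitextgen object from the notebook.
ai.model = ai.model.to("cpu")
ai.generate(1, prompt="ROMEO:")  # illustrative prompt
```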
Thanks for the update!!
Do you have any idea when TPU training might be ready? I'd love to use aitextgen with my TRC TPUs.
Related to #97
Hello @minimaxir! 👋🏽
The code hangs for TPUs when it calls `xm.save` for the `save_pretrained` method, so `xm_save` is passed to the `save_function` param for TPUs.

There's one more update on the Lightning side for TPUs. The update will be available in the next patch release on Tuesday next week. Thank you!
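To make the failure mode and the workaround concrete, a minimal sketch follows. It assumes (a reading of the thread, not something stated in it verbatim) that `xm.save` waits on all TPU processes, so a master-only call stalls the rest, and it uses `xm.save` as a stand-in for whatever `xm_save` variant the PR actually passes to `save_function`:

```python
import torch_xla.core.xla_model as xm
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hang scenario (assumption): xm.save synchronizes across TPU processes, so if
# only the master process reaches this call, the others never arrive and the
# run stalls at save time.
if xm.is_master_ordinal():
    xm.save(model.state_dict(), "pytorch_model.bin")

# Workaround described above: let Hugging Face's save_pretrained handle the
# save, but hand it an XLA-aware function through its save_function parameter
# (shown here as xm.save; the PR refers to an xm_save variant).
model.save_pretrained("trained_model", save_function=xm.save)
```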