Training stability: Continue training even if a data batch was hit which causes OOM #113
Conversation
you can load the batch data
@RuntimeRacer should we close this?
I've applied a similar approach in my private training code a few days ago. But I have to point out that this kind of fix can't recover from OOM errors that happen during the backward pass. In my experience, this PR can avoid OOM hangs for AR model training, but not for the NAR model. I really want to solve this problem too, but I don't know how. And one more thing: explicit cleanup is not needed; most of the memory will be released when leaving the exception handling block.
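For illustration, here is a minimal sketch of the kind of fix being discussed: the training step is wrapped so that a CUDA OOM raised inside it makes the batch get skipped instead of killing the run. The names (`model`, `batch`, `criterion`, `optimizer`) are placeholders and do not refer to this repository's actual training code.

```python
import torch

def train_step(model, batch, optimizer, criterion):
    """Run one training step; skip the batch on a CUDA OOM instead of crashing."""
    try:
        optimizer.zero_grad()
        loss = criterion(model(batch))
        loss.backward()  # as noted above, an OOM raised during backward is much harder to recover from
        optimizer.step()
        return loss.item()
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        # Drop references to the failed step and return cached blocks to the allocator.
        # As pointed out above, most memory is released simply by leaving this block;
        # empty_cache() is kept here only to free cached-but-unused blocks as well.
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.empty_cache()
        return None  # caller skips this batch and continues training
```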
@chenjiasheng thank you for your detailed comment. Yes, this issue probably needs a bit more in-depth analysis. In my case it turned out that the error was thrown because of a token generation problem with non-Latin characters, causing VRAM to be flooded. As pointed out by @lifeiteng in #111 (comment), the model needs better support for symbols from various languages to overcome this. However, the main purpose of this PR, recovering the training process after an unexpected OOM error (I still don't fully understand why these errors were not detected by the OOM check on start in the first place), is not fully solved by this yet. So maybe this PR should be converted to draft state in the meantime.
As I pointed out, the other ranks that didn't encounter an OOM error should also continue past the batch, to avoid inconsistency among ranks. The inconsistency may be caused by the processes that follow. @RuntimeRacer
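A sketch of the rank synchronisation suggested here, assuming a standard `torch.distributed` setup; the helper name and the usage shown in the comments are hypothetical, not the repository's own. Every rank reports whether it hit an OOM, and if any rank did, all ranks skip the batch together so the collective operations that follow stay aligned.

```python
import torch
import torch.distributed as dist

def batch_ok_on_all_ranks(local_oom: bool, device: torch.device) -> bool:
    """Return True only if no rank hit an OOM on the current batch."""
    flag = torch.tensor([1 if local_oom else 0], device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.SUM)  # sum of OOM votes across ranks
    return flag.item() == 0

# Hypothetical usage inside the training loop:
#   local_oom = (train_step(model, batch, optimizer, criterion) is None)
#   if not batch_ok_on_all_ranks(local_oom, device):
#       optimizer.zero_grad(set_to_none=True)
#       continue  # every rank skips this batch, keeping the ranks in lockstep
```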
Closing this since the issue was related to the input data format / combination itself, not the way errors are being handled.
Following my investigations mentioned in #110, I implemented code which does the following when a training batch causes an OOM error:
This should improve the training process in various aspects, such as:
Additionally, I discovered that a grad scaling step is missing in the pre-training OOM check, which was exactly the code path where the OOM exception happened during training in my case, so I added that as well (see the sketch below).
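For reference, a sketch of what including the grad scaling step in a pre-training OOM check could look like, assuming the run uses AMP with `torch.cuda.amp.GradScaler`; the function and argument names are illustrative, not the repository's own.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def oom_probe(model, worst_case_batch, optimizer, criterion, scaler: GradScaler):
    """Dry-run the largest expected batch before training starts.

    Running the scaled backward pass makes the probe exercise the same code
    path as real AMP training; without it, the check can pass even though the
    actual training step later runs out of memory.
    """
    optimizer.zero_grad()
    with autocast(enabled=scaler.is_enabled()):
        loss = criterion(model(worst_case_batch))
    scaler.scale(loss).backward()  # the grad scaling step described above
    # No optimizer step: the probe only measures memory, weights stay untouched.
    optimizer.zero_grad(set_to_none=True)
```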