Training Log 2022-06-20 #35
zh-zheng
announced in
Training Logs
-
Did you just restart to solve this problem? It's strange that OOM occurred again after so many iterative steps.
-
CPM-Live Training Log (June, 20)
Time: June 20, 2022, 16:00
Recorder: @zh-zheng
[Charts: Loss, Completed Data, Average Grad Norm, Progress]
Comment
After the restart, our model ran fine and trained steadily for a whole day. We suspect that yesterday's CUDA OOM issue may be related to GPU memory fragmentation in PyTorch's caching allocator.
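One rough way to check for this kind of fragmentation is to compare how much memory PyTorch's caching allocator has reserved against how much is actually backing live tensors. The sketch below uses the real PyTorch APIs `torch.cuda.memory_reserved` and `torch.cuda.memory_allocated`; the helper function and its interpretation are our own illustration, not part of the CPM-Live training code:

```python
def fragmentation_ratio(reserved_bytes: int, allocated_bytes: int) -> float:
    """Fraction of reserved CUDA memory not currently backing live tensors.

    A high ratio suggests the caching allocator is holding memory it cannot
    reuse for new allocation sizes (fragmentation), or simply caching freed
    blocks.
    """
    if reserved_bytes == 0:
        return 0.0
    return 1.0 - allocated_bytes / reserved_bytes


def report(device: int = 0) -> float:
    """Print and return the fragmentation ratio for one GPU (requires CUDA)."""
    import torch  # imported lazily so the helper above works without a GPU

    reserved = torch.cuda.memory_reserved(device)    # held by the caching allocator
    allocated = torch.cuda.memory_allocated(device)  # actually backing tensors
    ratio = fragmentation_ratio(reserved, allocated)
    print(f"device {device}: {ratio:.1%} of reserved memory is unused")
    return ratio
```

Restarting the process resets the caching allocator entirely, which is consistent with a restart clearing the OOM. Short of a restart, `torch.cuda.empty_cache()` releases cached-but-unused blocks back to the driver, and the `PYTORCH_CUDA_ALLOC_CONF` environment variable (e.g. its `max_split_size_mb` option) can reduce fragmentation from large allocations.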
It's also worth mentioning that our WeChat official account (OpenBMB) posted an article today about the technical principles of the BMTrain toolkit, which is used to train CPM-Live efficiently. Read it if you are interested; any discussion is welcome!