Hello, I'm working on a project that involves pre-training GPT-2 Medium. Using your code (DeepSpeed + bf16 + FlashAttention), it took around 15 days to pre-train for the full 400K steps on 4 A100 GPUs. Do you have any suggestions for speeding up pre-training further, e.g., by 2x?
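For context, the setup is along these lines (an illustrative DeepSpeed config only; the batch sizes, ZeRO stage, and clipping value here are placeholders, not my exact file or anything from this repo):

```python
# Illustrative DeepSpeed config for a bf16 + ZeRO run on 4 GPUs.
# All numeric values below are placeholders, not the settings used in this repo.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # placeholder
    "gradient_accumulation_steps": 16,     # placeholder
    "bf16": {"enabled": True},             # matches the bf16 setting mentioned above
    "zero_optimization": {"stage": 1},     # placeholder ZeRO stage
    "gradient_clipping": 1.0,              # placeholder
}

# Written to disk so it can be handed to whatever launcher flag the training script expects.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```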
One possible solution I'm considering is increasing the learning rate. It looks like GPT-2 Medium uses a learning rate of 1.5e-4. Did you experiment with a larger learning rate? Was the model able to converge faster during pre-training without losing too much perplexity?
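Concretely, something like the sketch below is what I have in mind: PyTorch AdamW with a linear warmup + cosine decay schedule, where the 3e-4 peak, the shortened step count, the warmup length, and the optimizer hyperparameters are all placeholders I would still have to tune, not values taken from this repo.

```python
# Hypothetical sketch: raising the GPT-2 Medium peak LR to shorten training.
# The 1.5e-4 baseline comes from the question above; everything else is assumed.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def cosine_with_warmup(optimizer, warmup_steps, total_steps, min_ratio=0.1):
    """Linear warmup, then cosine decay down to min_ratio * peak LR."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return min_ratio + (1.0 - min_ratio) * cosine
    return LambdaLR(optimizer, lr_lambda)


model = torch.nn.Linear(1024, 1024)  # stand-in for the GPT-2 Medium model

# Baseline recipe: peak LR 1.5e-4 over 400K steps.
# Hypothetical faster run: peak LR 3e-4 over 200K steps with a short warmup.
# Whether perplexity holds up at the higher LR would have to be checked empirically.
peak_lr = 3e-4
total_steps = 200_000
warmup_steps = 2_000

optimizer = AdamW(model.parameters(), lr=peak_lr,
                  betas=(0.9, 0.95), weight_decay=0.1)  # hyperparameters assumed
scheduler = cosine_with_warmup(optimizer, warmup_steps, total_steps)
```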
Any suggestions would be greatly appreciated!