Hello, I'm working on a project that involves pre-training GPT-2 Medium. Using your code (DeepSpeed + bf16 + FlashAttention), it took around 15 days to pre-train for the full 400K steps on 4 A100 GPUs. Do you have any suggestions for speeding up pre-training further, e.g., by 2x?
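For context, the setup is along these lines (an illustrative DeepSpeed config only; the batch sizes, ZeRO stage, and clipping value here are placeholders, not my exact file or anything from this repo):

```python
# Illustrative DeepSpeed config for a bf16 + ZeRO run on 4 GPUs.
# All numeric values below are placeholders, not the settings used in this repo.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # placeholder
    "gradient_accumulation_steps": 16,     # placeholder
    "bf16": {"enabled": True},             # matches the bf16 setting mentioned above
    "zero_optimization": {"stage": 1},     # placeholder ZeRO stage
    "gradient_clipping": 1.0,              # placeholder
}

# Written to disk so it can be handed to whatever launcher flag the training script expects.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```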
One possible solution I'm considering is increasing the learning rate. It looks like GPT-2 Medium uses a learning rate of 1.5e-4. Did you experiment with a larger learning rate? Was the model able to converge faster during pre-training without losing too much perplexity?
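Concretely, something like the sketch below is what I have in mind: PyTorch AdamW with a linear warmup + cosine decay schedule, where the 3e-4 peak, the shortened step count, the warmup length, and the optimizer hyperparameters are all placeholders I would still have to tune, not values taken from this repo.

```python
# Hypothetical sketch: raising the GPT-2 Medium peak LR to shorten training.
# The 1.5e-4 baseline comes from the question above; everything else is assumed.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def cosine_with_warmup(optimizer, warmup_steps, total_steps, min_ratio=0.1):
    """Linear warmup, then cosine decay down to min_ratio * peak LR."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return min_ratio + (1.0 - min_ratio) * cosine
    return LambdaLR(optimizer, lr_lambda)


model = torch.nn.Linear(1024, 1024)  # stand-in for the GPT-2 Medium model

# Baseline recipe: peak LR 1.5e-4 over 400K steps.
# Hypothetical faster run: peak LR 3e-4 over 200K steps with a short warmup.
# Whether perplexity holds up at the higher LR would have to be checked empirically.
peak_lr = 3e-4
total_steps = 200_000
warmup_steps = 2_000

optimizer = AdamW(model.parameters(), lr=peak_lr,
                  betas=(0.9, 0.95), weight_decay=0.1)  # hyperparameters assumed
scheduler = cosine_with_warmup(optimizer, warmup_steps, total_steps)
```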
Any suggestions would be greatly appreciated!