| Field | Value |
|---|---|
| title | When and How to Grow? On Efficient Pre-training via Model Growth |
| booktitle | Proceedings of the 16th Asian Conference on Machine Learning |
| year | 2025 |
| volume | 260 |
| series | Proceedings of Machine Learning Research |
| month | 0 |
| publisher | PMLR |
| url | |
| openreview | DDI359KC7v |
| abstract | The remarkable performance of GPT models has drawn widespread attention to large-scale language models. Despite their strong performance, their huge pre-training cost is prohibitive. Progressive pre-training exploits the faster convergence of small models to reduce computing overhead and shows great potential for accelerating pre-training. This work studies the two key issues in progressive pre-training: the growth schedule and the growth operation. First, we estimate the optimal growth point theoretically. We then find experimentally that performing the growth operation after the model enters the convergence stage achieves a high speed-up ratio. In addition, we propose progressive dimensionality growth for width expansion and redundant layers for depth expansion. Progressive dimensionality growth is a smoothed operation that improves training stability. Redundant layers achieve function preservation at small cost and inherit the core parameters of adjacent layers, improving the utilization of the knowledge learned by the original model. Our method follows strict function preservation and produces good training dynamics. Experimental results show that our method outperforms the baselines and achieves a speed-up of about 1.5x while achieving the same training quality. |
| layout | inproceedings |
| issn | 2640-3498 |
| id | wang25a |
| tex_title | When and How to Grow? On Efficient Pre-training via Model Growth |
| firstpage | 95 |
| lastpage | 110 |
| page | 95-110 |
| order | 95 |
| cycles | false |
| bibtex_editor | Nguyen, Vu and Lin, Hsuan-Tien |
| editor | |
| bibtex_author | Wang, Jikai and Li, Juntao and Zhang, Min and Li, Zechang and Xia, Qingrong and Duan, Xinyu and Wang, Zhefeng and Huai, Baoxing |
| author | |
| date | 2025-01-14 |
| address | |
| container-title | Proceedings of the 16th Asian Conference on Machine Learning |
| genre | inproceedings |
| issued | |
| extras | |
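The abstract above describes depth expansion via redundant layers that inherit the parameters of an adjacent layer while strictly preserving the network function. The paper's actual implementation is not reproduced here; the following is a minimal, hypothetical PyTorch sketch of that general idea, assuming a simplified residual `Block` and a helper `grow_depth_function_preserving` (both names are illustrative, not from the paper). Duplicating an existing block and zeroing its output projection makes the new block compute the identity at insertion time, so the grown model initially matches the original.

```python
import copy
import torch
import torch.nn as nn

class Block(nn.Module):
    """Simplified pre-norm residual block: y = x + fc_out(relu(fc_in(norm(x))))."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc_in = nn.Linear(dim, hidden)
        self.fc_out = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.fc_out(torch.relu(self.fc_in(self.norm(x))))

def grow_depth_function_preserving(blocks: nn.ModuleList, index: int) -> nn.ModuleList:
    """Insert a copy of blocks[index] right after it, with its output projection
    zeroed so the new block initially contributes nothing to the residual stream."""
    new_block = copy.deepcopy(blocks[index])   # inherit the adjacent layer's parameters
    nn.init.zeros_(new_block.fc_out.weight)    # zero the residual-branch output ...
    nn.init.zeros_(new_block.fc_out.bias)      # ... so the new block is an identity map
    grown = list(blocks)
    grown.insert(index + 1, new_block)
    return nn.ModuleList(grown)

if __name__ == "__main__":
    torch.manual_seed(0)
    blocks = nn.ModuleList([Block(16, 64) for _ in range(4)])
    x = torch.randn(2, 8, 16)

    def run(mods, h):
        for b in mods:
            h = b(h)
        return h

    before = run(blocks, x)
    after = run(grow_depth_function_preserving(blocks, 1), x)
    print(torch.allclose(before, after))  # True: depth growth preserved the function
```

Running the sketch prints `True`, confirming that outputs before and after growth coincide; training would then resume from this function-preserving initialization.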