Further questions pertaining to pre-training esm-1b model #163
-
Hi @vigneshvalliappan, thanks for the detailed questions!
Regarding training epochs:
Regarding the learning rate scheduler:
Regarding weight decay:
-
Hello,
Thank you for the amazing work.
Based on the information that there are around 500,000 updates and that there are 512 sequences associated with each update, this multiplies out to 500,000 × 512 = 256,000,000 sequences in total. That roughly matches the 250,000,000 sequences in the title of the Facebook ESM article: https://www.pnas.org/content/118/15/e2016239118
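As a quick sanity check in Python (the 512 sequences per update and the ~500,000 updates are numbers I have pieced together, not values confirmed by the authors):

```python
updates = 500_000        # approximate number of optimizer updates (my assumption)
seqs_per_update = 512    # sequences per update (my assumption)

total_sequences = updates * seqs_per_update
print(f"{total_sequences:,}")  # 256,000,000 -- close to the ~250M in the paper title
```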
I noticed the following in the supplementary materials:
"The model was optimized using Adam (β1 = 0.9, β2 = 0.999) with learning rate 10−4. We trained with 131,072 tokens per batch (128 gpus x 1024 tokens). The models follow a warm-up period of 16000 updates, during which the learning rate increases linearly. Afterwards, the learning rate follows an inverse square root decay schedule. All models were trained using the fairseq toolkit on 128 NVIDIA V100 GPUs."
From the above, I understand that the starting learning rate is 0.0001. It also appears that there is a target learning rate at the end of the 16,000 warm-up updates, and that the learning rate rises linearly from 0.0001 to that target. Is my understanding correct, and can I kindly find out approximately what the target learning rate at the end of the 16,000 updates is?
From the above, it appears that between the 16,000th update and the ~500,000th update, an inverse square root schedule was used. Is my understanding correct?
I found the following learning rate scheduler on the fairseq GitHub page. Is it similar to what you used for the ESM-1b model? https://github.com/pytorch/fairseq/blob/main/fairseq/optim/lr_scheduler/inverse_square_root_schedule.py
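For reference, here is a minimal sketch in Python of how I understand that fairseq scheduler to behave, with the 10⁻⁴ learning rate and the 16,000 warm-up updates from the supplementary plugged in. Whether 10⁻⁴ is the warm-up starting point or the peak reached at the end of warm-up is exactly what I would like to confirm; the warmup_init_lr of 0 below is just fairseq's default when warm-up is enabled, not a value stated in the paper.

```python
import math

def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_updates=16_000, warmup_init_lr=0.0):
    """Sketch of an inverse-square-root schedule with linear warm-up.

    Assumptions: peak_lr = 1e-4 and warmup_updates = 16,000 are taken from the
    ESM-1b supplementary; warmup_init_lr = 0 is my assumption and is not a value
    confirmed by the authors.
    """
    if step < warmup_updates:
        # linear warm-up from warmup_init_lr up to peak_lr
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # after warm-up: decay proportional to 1/sqrt(step), continuous at the boundary
    return peak_lr * math.sqrt(warmup_updates / step)

print(inverse_sqrt_lr(16_000))   # 1e-4 at the end of warm-up
print(inverse_sqrt_lr(500_000))  # ~1.8e-5 near the end of training
```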
Can I kindly find out whether any weight decay was used, and if so, what value of weight decay you used? Also, was Adam or AdamW used?
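To make the last question concrete, these are the two optimizer set-ups I am asking about; the 0.01 weight decay below is only a placeholder, not a value taken from the paper:

```python
import torch

# Stand-in module; in practice this would be the ESM-1b model.
model = torch.nn.Linear(8, 8)

# Option 1: plain Adam with L2-style weight decay coupled into the gradient.
opt_adam = torch.optim.Adam(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999),
    weight_decay=0.01,  # placeholder value, not from the paper
)

# Option 2: AdamW with decoupled weight decay applied directly to the weights.
opt_adamw = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999),
    weight_decay=0.01,  # placeholder value, not from the paper
)
```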
Thank you very much for your time and consideration.