Further questions pertaining to pre-training esm-1b model #163
-
Hi @vigneshvalliappan, thanks for the detailed questions!
Regarding training epochs:
Regarding the learning rate scheduler:
Regarding weight decay:
-
Hello,
Thank you for the amazing work.
Based on the information that there are around 500,000 updates and that there are 512 sequences associated with each update, this multiplies out to 500,000 × 512 = 256,000,000 sequences in total. That roughly matches the 250,000,000 sequences in the title of the Facebook ESM article: https://www.pnas.org/content/118/15/e2016239118
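As a quick sanity check in Python (the 512 sequences per update and the ~500,000 updates are numbers I have pieced together, not values confirmed by the authors):

```python
updates = 500_000        # approximate number of optimizer updates (my assumption)
seqs_per_update = 512    # sequences per update (my assumption)

total_sequences = updates * seqs_per_update
print(f"{total_sequences:,}")  # 256,000,000 -- close to the ~250M in the paper title
```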
I noticed the following in the supplementary materials:
"The model was optimized using Adam (β1 = 0.9, β2 = 0.999) with learning rate 10−4. We trained with 131,072 tokens per batch (128 gpus x 1024 tokens). The models follow a warm-up period of 16000 updates, during which the learning rate increases linearly. Afterwards, the learning rate follows an inverse square root decay schedule. All models were trained using the fairseq toolkit on 128 NVIDIA V100 GPUs."
From the above, I understand that the starting learning rate is 0.0001. It also appears that there is a target learning rate at the end of the 16,000 warm-up updates, and that the learning rate rises linearly from 0.0001 to that target. Is my understanding correct, and can I kindly find out approximately what the target learning rate at the end of the 16,000 updates is?
From the above, it appears that between the 16,000th update and the ~500,000th update, an inverse square root schedule was used. Is my understanding correct?
I found the following learning rate scheduler on the fairseq GitHub page. Is it similar to what you used for the ESM-1b model? https://github.com/pytorch/fairseq/blob/main/fairseq/optim/lr_scheduler/inverse_square_root_schedule.py
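For reference, here is a minimal sketch in Python of how I understand that fairseq scheduler to behave, with the 10⁻⁴ learning rate and the 16,000 warm-up updates from the supplementary plugged in. Whether 10⁻⁴ is the warm-up starting point or the peak reached at the end of warm-up is exactly what I would like to confirm; the warmup_init_lr of 0 below is just fairseq's default when warm-up is enabled, not a value stated in the paper.

```python
import math

def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_updates=16_000, warmup_init_lr=0.0):
    """Sketch of an inverse-square-root schedule with linear warm-up.

    Assumptions: peak_lr = 1e-4 and warmup_updates = 16,000 are taken from the
    ESM-1b supplementary; warmup_init_lr = 0 is my assumption and is not a value
    confirmed by the authors.
    """
    if step < warmup_updates:
        # linear warm-up from warmup_init_lr up to peak_lr
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # after warm-up: decay proportional to 1/sqrt(step), continuous at the boundary
    return peak_lr * math.sqrt(warmup_updates / step)

print(inverse_sqrt_lr(16_000))   # 1e-4 at the end of warm-up
print(inverse_sqrt_lr(500_000))  # ~1.8e-5 near the end of training
```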
Can I kindly find out whether any weight decay was used, and if so, what value of weight decay you used? Also, was Adam or AdamW used?
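To make the last question concrete, these are the two optimizer set-ups I am asking about; the 0.01 weight decay below is only a placeholder, not a value taken from the paper:

```python
import torch

# Stand-in module; in practice this would be the ESM-1b model.
model = torch.nn.Linear(8, 8)

# Option 1: plain Adam with L2-style weight decay coupled into the gradient.
opt_adam = torch.optim.Adam(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999),
    weight_decay=0.01,  # placeholder value, not from the paper
)

# Option 2: AdamW with decoupled weight decay applied directly to the weights.
opt_adamw = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999),
    weight_decay=0.01,  # placeholder value, not from the paper
)
```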
Thank you very much for your time and consideration.