
Questions relating to pre-training #196

Answered by tomsercu
vigneshvalliappan asked this question in Q&A

  1. Yes, in the end everything uses a vocab of size 33; you can treat the 33 as an implementation detail (it includes "system" special tokens and the mask token), while the actual sequences themselves only use part of the vocab, a subset of size 25: the 20 standard amino acids plus the ambiguous amino-acid codes (see the sketch after this list).
  2. Clustering is commonly done, e.g. for sequence search. In the case of pre-training, yes, it reduces near-duplicates and has a rebalancing effect.
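
A minimal sketch of point 1, assuming the fair-esm package (`pip install fair-esm`) and the ESM-1b alphabet; the toy sequence is arbitrary, and the exact token list can differ between alphabet versions:

```python
# Inspect the alphabet to see how the 33-token vocab breaks down:
# a few special tokens (<cls>, <pad>, <eos>, <unk>, <mask>, ...) plus
# the 20 standard amino acids and the ambiguous/non-standard residue codes.
import esm

alphabet = esm.Alphabet.from_architecture("ESM-1b")

print(len(alphabet.all_toks))   # 33 tokens in total
print(alphabet.all_toks)        # full token list: special tokens + amino-acid codes

# Indices of the special tokens the model uses internally.
print(alphabet.cls_idx, alphabet.padding_idx, alphabet.eos_idx, alphabet.mask_idx)

# Tokenizing an actual protein sequence only touches the amino-acid part
# of the vocab (plus the prepended <cls> / appended <eos> tokens).
batch_converter = alphabet.get_batch_converter()
_, _, tokens = batch_converter([("protein1", "MKTAYIAKQR")])
print(tokens)
```
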
