Hi, congrats on the interesting paper. I'm currently trying to see if I could use the predicted contacts in a downstream analysis, and for this I would like to better understand what I am working with. I followed the steps to produce the contact maps using the ESM-1b model. Specifically this:

Also, I have noticed that the output close to the diagonal is quite small (close to 0). This is counter-intuitive, as I would expect there to be contact between neighboring residues (as can be seen in the ground-truth contact map in this notebook https://github.com/facebookresearch/esm/blob/master/examples/esm_structural_dataset.ipynb, under "Visualize Distance + Contact Map"). Short-range contacts seem to be mostly absent from the predictions.

Also, have you experimented at all with sequences longer than 1024? Do you think predicting contacts of sub-structures is valid (splitting up larger proteins into pieces shorter than 1024)?

Best,
Replies: 1 comment, 1 reply
Hi, thanks for your interest.
The contacts correspond to the output of the logistic regression model, as described in "Transformer protein language models are unsupervised structure learners" (Rao et al., 2020).
See also the unsupervised contact prediction section of the README: https://github.com/facebookresearch/esm#unsupervised-contact-prediction
The paper mentions that the logistic regression weights were fit using a minimum sequence separation of 6, which explains why short-range contacts are absent from the predictions.
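To make the effect concrete, here is a minimal sketch (using numpy, purely for illustration, not code from the ESM repo) of what a minimum-separation criterion does to a contact map: every pair of residues closer than 6 positions apart in sequence is excluded, which empties the band around the diagonal.

```python
import numpy as np

def mask_local_contacts(contacts, min_separation=6):
    """Zero out entries for residue pairs whose sequence separation
    |i - j| is below min_separation, mirroring the fitting criterion."""
    n = contacts.shape[0]
    i, j = np.indices((n, n))
    masked = contacts.copy()
    masked[np.abs(i - j) < min_separation] = 0.0
    return masked

# A uniform "contact map" loses its near-diagonal band:
cmap = np.ones((10, 10))
masked = mask_local_contacts(cmap)
print(masked[0, :8])  # first row: zeros up to separation 6
```

So low values near the diagonal are expected behavior of the contact head, not a failure of the underlying representations.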
The LM is trained with a maximum sequence length of 1024, but yes, you could split longer proteins into shorter pieces. There have been some previous questions around this if you search through the discussions/GitHub issues.
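One common way to do the splitting is overlapping windows, so that contacts near chunk boundaries are still seen with enough flanking context. The helper below is a hypothetical sketch (the function name, window length, and overlap are my choices, not part of the ESM API); it returns each window together with its start offset so per-chunk contact maps can be mapped back to full-sequence coordinates.

```python
def chunk_sequence(seq, max_len=1000, overlap=128):
    """Split a protein sequence into overlapping windows of at most
    max_len residues. Returns (start_offset, subsequence) pairs so that
    a contact predicted at (i, j) in a chunk corresponds to
    (start + i, start + j) in the full sequence."""
    if len(seq) <= max_len:
        return [(0, seq)]
    chunks = []
    step = max_len - overlap
    start = 0
    while start < len(seq):
        chunks.append((start, seq[start:start + max_len]))
        if start + max_len >= len(seq):
            break  # last window reaches the end of the sequence
        start += step
    return chunks

chunks = chunk_sequence("A" * 2500)
print([(start, len(sub)) for start, sub in chunks])
```

Note that contacts between residues that never share a window cannot be predicted this way, so splitting is most defensible when the pieces correspond to plausible structural units (e.g. domains) rather than arbitrary cuts.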
Hope this helps!