Hi, congrats on the interesting paper. I'm currently trying to see if I could use the predicted contacts in a downstream analysis, and for this I would like to better understand what I am working with. I followed the steps to produce the contact maps using the ESM-1b model. Specifically this:

Also, I have noticed that the output close to the diagonal is quite small (close to 0). This is counter-intuitive, as I would expect there to be contact between neighboring residues (as can be seen in the ground-truth contact map in this notebook https://github.com/facebookresearch/esm/blob/master/examples/esm_structural_dataset.ipynb, under "Visualize Distance + Contact Map"). Short-range contacts seem to be mostly absent from the predictions.

Also, have you experimented at all with sequences longer than 1024? Do you think predicting contacts of sub-structures is valid (splitting up larger proteins into pieces shorter than 1024)?

Best,
Replies: 1 comment, 1 reply
Hi, thanks for your interest.
The contacts correspond to the output of the logistic regression model, as described in "Transformer protein language models are unsupervised structure learners" (Rao et al., 2020).
See also the unsupervised contact prediction section of the README: https://github.com/facebookresearch/esm#unsupervised-contact-prediction
The paper mentions that the logistic regression weights were fit using a minimum sequence separation of 6, which explains why short-range contacts are absent from the predictions.
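To make the effect concrete, here is a minimal sketch (using numpy, purely for illustration, not code from the ESM repo) of what a minimum-separation criterion does to a contact map: every pair of residues closer than 6 positions apart in sequence is excluded, which empties the band around the diagonal.

```python
import numpy as np

def mask_local_contacts(contacts, min_separation=6):
    """Zero out entries for residue pairs whose sequence separation
    |i - j| is below min_separation, mirroring the fitting criterion."""
    n = contacts.shape[0]
    i, j = np.indices((n, n))
    masked = contacts.copy()
    masked[np.abs(i - j) < min_separation] = 0.0
    return masked

# A uniform "contact map" loses its near-diagonal band:
cmap = np.ones((10, 10))
masked = mask_local_contacts(cmap)
print(masked[0, :8])  # first row: zeros up to separation 6
```

So low values near the diagonal are expected behavior of the contact head, not a failure of the underlying representations.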
The LM is trained with a maximum sequence length of 1024, but yes, you could split longer proteins into shorter pieces. There have been some previous questions around this if you search through the discussions/GitHub issues.
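One common way to do the splitting is overlapping windows, so that contacts near chunk boundaries are still seen with enough flanking context. The helper below is a hypothetical sketch (the function name, window length, and overlap are my choices, not part of the ESM API); it returns each window together with its start offset so per-chunk contact maps can be mapped back to full-sequence coordinates.

```python
def chunk_sequence(seq, max_len=1000, overlap=128):
    """Split a protein sequence into overlapping windows of at most
    max_len residues. Returns (start_offset, subsequence) pairs so that
    a contact predicted at (i, j) in a chunk corresponds to
    (start + i, start + j) in the full sequence."""
    if len(seq) <= max_len:
        return [(0, seq)]
    chunks = []
    step = max_len - overlap
    start = 0
    while start < len(seq):
        chunks.append((start, seq[start:start + max_len]))
        if start + max_len >= len(seq):
            break  # last window reaches the end of the sequence
        start += step
    return chunks

chunks = chunk_sequence("A" * 2500)
print([(start, len(sub)) for start, sub in chunks])
```

Note that contacts between residues that never share a window cannot be predicted this way, so splitting is most defensible when the pieces correspond to plausible structural units (e.g. domains) rather than arbitrary cuts.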
Hope this helps!