Non-canonical amino acid sequence seen #37

Open
sungyounjoo opened this issue May 6, 2024 · 1 comment
sungyounjoo commented May 6, 2024

I ran conditional sequence generation via evodiff.ipynb with my own MSA file.
However, it produced an amino acid sequence containing non-canonical amino acid codes such as "Z" and "B".

I would like to ask whether this is fine, and whether the non-canonical codes mean something or my own MSA is the problem.

Thank you.

Collaborator

sarahalamdari commented Aug 8, 2024

Since our model is trained over additional amino acid codes (JOUBZX), it's possible to observe them at inference in your generations.

These are:

U = selenocysteine
O = pyrrolysine
B = D or N
J = I or L
Z = E or Q
X = unknown
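
As a quick check on generated output, the codes listed above can be flagged with a few lines of Python. This is only a minimal sketch and not part of EvoDiff: the function name `find_noncanonical` and the choice to skip the gap character `-` are assumptions made for illustration.

```python
# Illustrative helper, not part of EvoDiff.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")

# Meanings of the additional codes, as listed in this thread.
EXTRA_CODES = {
    "U": "selenocysteine",
    "O": "pyrrolysine",
    "B": "D or N",
    "J": "I or L",
    "Z": "E or Q",
    "X": "unknown",
}


def find_noncanonical(seq, ignore="-"):
    """Return (position, code, meaning) for every residue outside the 20 canonical codes.

    Positions are 0-based. Characters in `ignore` (by default the gap
    character) are skipped; anything else that is not recognized is
    reported with the meaning "unrecognized".
    """
    hits = []
    for i, aa in enumerate(seq.upper()):
        if aa in ignore or aa in CANONICAL:
            continue
        hits.append((i, aa, EXTRA_CODES.get(aa, "unrecognized")))
    return hits


if __name__ == "__main__":
    for pos, code, meaning in find_noncanonical("MKZLB"):
        print(pos, code, meaning)
```

Running a check like this on each generated sequence shows how often, and where, the extra codes appear before deciding whether to restrict sampling.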

Seeing these codes could have many causes; it's not clear that the MSA is the problem.

To prevent the model from predicting these amino acids at inference, you can change lines 257 and 259 from:

p = preds[:, random_x, random_y, :]

to:

p = preds[:, random_x, random_y, :20]

This will force the model to generate sequences using only the first 20 amino acids in MSA_ALPHABET: ACDEFGHIKLMNPQRSTVWY
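
The effect of the `:20` slice can be illustrated on a single position with plain Python. This is a hedged sketch rather than the EvoDiff sampling code: it assumes a per-position vector of non-negative weights ordered so that the 20 canonical codes come first, and the order of the trailing codes in `EXTRA` as well as the name `sample_canonical` are hypothetical.

```python
import random

CANONICAL = "ACDEFGHIKLMNPQRSTVWY"  # first 20 entries, as stated above
EXTRA = "JOUBZX"                    # hypothetical order, for illustration only
ALPHABET = CANONICAL + EXTRA


def sample_canonical(weights, rng=random):
    """Sample one residue using only the first 20 entries of `weights`.

    Dropping the trailing entries gives the extra codes zero chance of
    being drawn. `random.choices` accepts unnormalized weights, so the
    truncated vector does not need to sum to 1.
    """
    truncated = weights[:len(CANONICAL)]
    if sum(truncated) <= 0:
        raise ValueError("no probability mass on the canonical amino acids")
    idx = rng.choices(range(len(truncated)), weights=truncated, k=1)[0]
    return ALPHABET[idx]


if __name__ == "__main__":
    # Most of the mass sits on "Z", but only canonical codes can be drawn.
    w = [0.0] * len(ALPHABET)
    w[ALPHABET.index("A")] = 0.1
    w[ALPHABET.index("Z")] = 0.9
    print(sample_canonical(w, random.Random(0)))
```

In this toy example the mass that was on "Z" is simply discarded, so the draw is made from whatever weight remains on the canonical codes.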
