fine-tuning with more protein sequences #53
Comments
Hi, typically this is a regression task: you input a protein sequence to the model and get an output value for the fitness effect. If your 500,000 protein sequences are derived from the same wild-type protein, then a normal pipeline to fine-tune SaProt would be:
The steps above constitute a normal pipeline for fine-tuning SaProt on your own dataset. This might be complicated for people who are not very familiar with ML techniques. Alternatively, we recommend using ColabSaprot to train your own model with only a few clicks, see here. With ColabSaprot you only have to upload your dataset, and the system will automatically train the model on your data. It also plots the training curve so you can track the training process.
Thanks, I'll have a look at ColabSaprot. My 500,000 protein sequences are part of a corpus that hasn't been seen by any model, but I could use AF2 or similar to generate 3D models for them. We don't have empirical data for the fitness, only the protein sequences, but this corpus of data hopefully will modify the existing models enough so that the answers are not biased by the species that are most represented, e.g. human or mouse. Hopefully that makes sense.
If you don't have experimental labels for the fitness, you could predict the mutational effect in a zero-shot manner. In this case, you don't have to tune the model further and can directly make predictions for the mutations of interest. ColabSaprot provides a specific module for doing so (see this part 3.2), or you can run the provided code to make predictions (see this part). Even though the model didn't see those protein sequences during training, I think it is capable of predicting the change in fitness to some degree. Hope you can try it out and advance your research :)
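To make the zero-shot idea concrete, here is a minimal sketch of a scoring rule commonly used with masked protein language models: the log-ratio of the model's probability for the mutant residue versus the wild-type residue at the mutated position. It is not taken from the SaProt code base. The function names and the `toy_probs` table are hypothetical; in practice the per-position probabilities would come from the model's output at the masked position.

```python
import math
import re


def parse_mutation(mutation):
    """Split a mutation such as 'K2R' into (wild type, 1-based position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Unrecognised mutation format: {mutation!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


def zero_shot_score(sequence, mutation, position_probs):
    """Return log p(mutant) - log p(wild type) at the mutated position.

    position_probs is assumed to be a list with one dict per residue,
    mapping amino-acid letters to predicted probabilities.
    """
    wild_type, position, mutant = parse_mutation(mutation)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"Position {position} is outside the sequence")
    if sequence[position - 1] != wild_type:
        raise ValueError(f"Sequence has {sequence[position - 1]} at {position}, not {wild_type}")
    probs = position_probs[position - 1]
    return math.log(probs[mutant]) - math.log(probs[wild_type])


# Toy probabilities for a 3-residue sequence (illustrative values, not model output).
toy_probs = [
    {"M": 0.9, "K": 0.05, "R": 0.05},
    {"M": 0.1, "K": 0.6, "R": 0.3},
    {"V": 0.8, "K": 0.1, "R": 0.1},
]
score = zero_shot_score("MKV", "K2R", toy_probs)
print(score)  # negative: R is less likely than K at position 2
```

A score below zero means the model considers the mutant residue less likely than the wild type at that position, which is usually read as a predicted decrease in fitness.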
I have a quick question: if we want to fine-tune SaProt with our own labeled data, how should we prepare the .mdb file? The .mdb files on the website seem to be password-protected, so we can't access the data structure. Could you provide a non-password-protected version, for instance, for the thermostability dataset? Thanks in advance!
Hi, you could refer to this issue #16 for some details.
Thanks for the timely reply. Good day!
Hello, extensive use of Colab's GPU requires some tricks and additional funding. I primarily conduct wet lab experiments and am not very strong in machine learning. Could you please advise how to convert a CSV file into the "foldseek" folder and the "normal" folder with .mdb files for local training, similar to what ColabSaprot does? Is there a detailed tutorial available? Thanks.
Hi, if you have local GPUs for training, you could deploy ColabSaprot on your local server without using Google Cloud. Here is a quick tutorial for the deployment: https://github.com/westlake-repl/SaprotHub/tree/main/local_server
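For readers who would rather build the .mdb files directly, the following is a rough sketch of turning a CSV into LMDB key/value records. The record layout is an assumption, not confirmed by this thread: integer string keys, JSON values with `seq` and `fitness` fields, and a `length` entry. Check issue #16 for the actual format. The column names `sequence` and `label` and both function names are hypothetical. The sequence column is assumed to already be in the form the model expects (for example, structure-aware tokens for the "foldseek" folder).

```python
import csv
import io
import json


def csv_to_records(csv_text, seq_col="sequence", label_col="label"):
    """Convert CSV text into (key, value) byte pairs for an LMDB database.

    The layout used here is an assumption; verify it against issue #16.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for index, row in enumerate(reader):
        value = json.dumps({"seq": row[seq_col], "fitness": float(row[label_col])})
        records.append((str(index).encode(), value.encode()))
    # Store the number of samples under a dedicated key (assumed convention).
    records.append((b"length", str(len(records)).encode()))
    return records


def write_lmdb(records, path, map_size=1 << 30):
    """Write the records to an LMDB directory (requires `pip install lmdb`)."""
    import lmdb  # third-party package, imported lazily

    env = lmdb.open(path, map_size=map_size)
    with env.begin(write=True) as txn:
        for key, value in records:
            txn.put(key, value)
    env.close()


# Toy CSV with two labeled sequences (illustrative values only).
example_csv = "sequence,label\nMKV,0.5\nMRV,1.25\n"
records = csv_to_records(example_csv)
# write_lmdb(records, "my_dataset/train")  # uncomment to write to disk
```

Splitting the CSV into train, validation, and test subsets before conversion, with one LMDB directory per subset, would be the natural next step.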
Thanks for the timely reply.
Hello, can I post-train SaProt on my own protein sequences (not fine-tune it)?
Hi, I have a corpus of about 500,000 protein sequences and would like to apply them to existing models like ESM2 or this one for predicting the fitness effect of substituting one amino acid for another.
How could I add my sequences to the models referred to in this repo and then use the modified model for such a task? Thanks.