Controllable Protein Sequence Generation with LLM Preference Optimization
The dataset is constructed by extracting specific labels from UniProtKB based on Gene Ontology (GO) annotations. The corresponding structures should be downloaded from the AFDB and stored in the designated directory.
dataset/
│── component/ # Cellular component data
│ ├── 0.tsv # Cytoplasm (GO:0005737)
│ ├── 1.tsv # Nucleus (GO:0005634)
│
│── function/ # Molecular function data
│ ├── 0.tsv # Metal ion binding (GO:0046872)
│ ├── 1.tsv # RNA binding (GO:0003723)
│
│── process/ # Biological process data
│ ├── 0.tsv # Phosphorylation (GO:0016310)
│ ├── 1.tsv # Translation (GO:0006412)
│
│── structure/ # Protein structures
| ├── component_0/
| | ├── pdb/
| | | ├── *.pdb # PDB files for corresponding proteins
| |
| ├── component_1/
| | ├── pdb/
| | | ├── *.pdb
| ...
Before running the scripts, ensure that you have the necessary dependencies installed. You can install them using:
pip install -r requirements.txt
Other required models and dependencies can be obtained from the following sources:
ProtGPT2, ProteinMPNN, ESMFold, ESM-2, PyRosetta and Foldseek
The evaluation dataset and trained classifiers can be downloaded here
Here, we show how to run CtrlProt to generate protein sequences with desired attributes.
Finetune protein language models.
python prefix_tuning_prot.py --batch_size 16 --epochs 50 --dataset_path ./dataset/function/0.tsv --dataset_name function_0 --output_path ./candidate_prefix_tuning_model/
This script generates candidate sequences using the chosen prefix-tuning model.
python generate_candidate_sequence.py --model_path ./prefix_tuning_model/
We first use ESMFold to predict the structure and get the pdb files of generated proteins.
python evaluate_candidate_sequence.py --dataset_path ./generate_candidate_sequence --output_path ./evaluate_candidate_sequence
Then we use Rosetta Relaxation on Generated Sequences for evaluating structural stability.
python generate_rosetta_relax.py
We use the structural embedding from ProteinMPNN to evaluate the Functionality
python structure_similarity.py
Finally, we use the above score to get the quality score
python evaluate_candidate_sequence2.py --dataset_path ./evaluate_candidate_sequence --output_path ./mlpo_candidate_sequence
These scripts build the dataset.
python build_mlpo_dataset.py --dataset_path ./mlpo_candidate_sequence
We then train the model on the constructed preference optimization dataset.
python train_mlpo.py --batch_size 16 --epochs 50 --lr 5e-5 --dataset_path ./mlpo_candidate_sequence/function_0/mlpo_dataset --dataset_name function_0 --model_path ./prefix_tuning_model/function_0/
Then we can generate and test the result.
python single_function_generation.py
eval_classifier provides CLS-score, TM-score, rmsd and pLDDT.
python eval_classifier.py
@inproceedings{CtrlProt,
title={Controllable Protein Sequence Generation with LLM Preference Optimization},
author={Liu, Xiangyu and Liu, Yi and Chen, Silei and Hu, Wei},
booktitle={AAAI},
year={2025}
}