Skip to content

Latest commit

 

History

History
142 lines (97 loc) · 3.95 KB

README.md

File metadata and controls

142 lines (97 loc) · 3.95 KB

CtrlProt

Controllable Protein Sequence Generation with LLM Preference Optimization

Dataset

The dataset is constructed by extracting specific labels from UniProtKB based on Gene Ontology (GO) annotations. The corresponding structures should be downloaded from the AFDB and stored in the designated directory.

dataset/
│── component/ # Cellular component data
│ ├── 0.tsv # Cytoplasm (GO:0005737)
│ ├── 1.tsv # Nucleus (GO:0005634)
│
│── function/ # Molecular function data
│ ├── 0.tsv # Metal ion binding (GO:0046872)
│ ├── 1.tsv # RNA binding (GO:0003723)
│
│── process/ # Biological process data
│ ├── 0.tsv # Phosphorylation (GO:0016310)
│ ├── 1.tsv # Translation (GO:0006412)
│
│── structure/ # Protein structures
| ├── component_0/
| | ├── pdb/
| | | ├── *.pdb # PDB files for corresponding proteins
| |
| ├── component_1/
| | ├── pdb/
| | | ├── *.pdb 
| ...

Prerequisites

Before running the scripts, ensure that you have the necessary dependencies installed. You can install them using:

pip install -r requirements.txt

Other required models and dependencies can be obtained from the following sources:

ProtGPT2, ProteinMPNN, ESMFold, ESM-2, PyRosetta and Foldseek

The evaluation dataset and trained classifiers can be downloaded here

Run

Here, we show how to run CtrlProt to generate protein sequences with desired attributes.

1. Prefix Tuning

Finetune protein language models.

python prefix_tuning_prot.py --batch_size 16 --epochs 50 --dataset_path ./dataset/function/0.tsv --dataset_name function_0 --output_path ./candidate_prefix_tuning_model/

2. Candidate Sequences Generation

This script generates candidate sequences using the chosen prefix-tuning model.

python generate_candidate_sequence.py --model_path ./prefix_tuning_model/

3. Candidate Sequences Evaluation

We first use ESMFold to predict the structure and get the pdb files of generated proteins.

python evaluate_candidate_sequence.py --dataset_path ./generate_candidate_sequence --output_path ./evaluate_candidate_sequence

Then we use Rosetta Relaxation on Generated Sequences for evaluating structural stability.

python generate_rosetta_relax.py

We use the structural embedding from ProteinMPNN to evaluate the Functionality

python structure_similarity.py

Finally, we use the above score to get the quality score

python evaluate_candidate_sequence2.py --dataset_path ./evaluate_candidate_sequence --output_path ./mlpo_candidate_sequence

4. Construct preference optimization dataset and train the model:

These scripts build the dataset.

python build_mlpo_dataset.py --dataset_path ./mlpo_candidate_sequence

We then train the model on the constructed preference optimization dataset.

python train_mlpo.py --batch_size 16 --epochs 50 --lr 5e-5 --dataset_path ./mlpo_candidate_sequence/function_0/mlpo_dataset --dataset_name function_0 --model_path ./prefix_tuning_model/function_0/

5.Generation and evaluation

Then we can generate and test the result.

python single_function_generation.py

eval_classifier provides CLS-score, TM-score, rmsd and pLDDT.

python eval_classifier.py

Citation

@inproceedings{CtrlProt,
  title={Controllable Protein Sequence Generation with LLM Preference Optimization},
  author={Liu, Xiangyu and Liu, Yi and Chen, Silei and Hu, Wei},
  booktitle={AAAI},
  year={2025}
}