Fusion of sequence, structure and feature information to improve protein solubility prediction!
- [2024.11.04] The code and README have been updated to support your own dataset. Please give it a try!
- [2024.08.22] Congratulations! Our paper was accepted as a short paper at the IEEE International Conference on Bioinformatics and Biomedicine 2024 (IEEE BIBM 2024)!
PDBSol and ExternalTest pdb files can be found at https://huggingface.co/datasets/tyang816/ProtSolM_PDB.
The labels are stored in CSV files, which can be found in data/PDBSol and data/ExternalTest.
cd data/PDBSol
wget https://huggingface.co/datasets/tyang816/ProtSolM_PDB/resolve/main/PDBSol_ESMFold_PDB.zip
unzip PDBSol_ESMFold_PDB.zip
cd ../ExternalTest
wget https://huggingface.co/datasets/tyang816/ProtSolM_PDB/resolve/main/ExternalTest_ESMFold_PDB.zip
unzip ExternalTest_ESMFold_PDB.zip
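If you prefer to unpack the archives from Python, the following is a minimal sketch using only the standard library. The helper name `extract_pdb_archive` is hypothetical and not part of this repository. It refuses to extract a file that is not a real zip archive, which is the usual symptom when a web page was saved under a `.zip` name.

```python
import zipfile
from pathlib import Path


def extract_pdb_archive(zip_path, out_dir):
    """Extract a downloaded archive, failing early if it is not a real zip."""
    zip_path = Path(zip_path)
    # An HTML page saved under a .zip name fails this check.
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive; re-download it")
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        # Return only the PDB entries so the caller can sanity-check the count.
        return [n for n in archive.namelist() if n.lower().endswith(".pdb")]
```

For example, `extract_pdb_archive("data/PDBSol/PDBSol_ESMFold_PDB.zip", "data/PDBSol")` returns the names of the extracted PDB entries.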
Please make sure you have installed Anaconda3 or Miniconda3.
conda env create -f environment.yaml
conda activate protsolm
We use the pre-trained checkpoints from ProtSSN and recommend k20_h512 for fine-tuning on downstream tasks.
mkdir model
cd model
wget https://huggingface.co/tyang816/ProtSSN/resolve/main/protssn_k20_h512.pt
python get_feature.py \
--pdb_dir data/PDBSol/esmfold_pdb \
--out_file data/PDBSol/PDBSol_feature.csv
Example scripts can be found in script/.
python eval.py \
--supv_dataset data/PDBSol \
--test_file data/PDBSol/test.csv \
--test_result_dir result/protssn_k20_h512/PDBSol \
--feature_file data/PDBSol/PDBSol_feature.csv \
--feature_name "aa_composition" "gravy" "ss_composition" "hygrogen_bonds" "exposed_res_fraction" "pLDDT" \
--use_plddt_penalty \
--batch_token_num 3000
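Judging by its name, `--batch_token_num` caps the total number of residues in a batch rather than the number of proteins. That reading is an assumption and is not confirmed here. The sketch below illustrates the general token-budget batching technique under that assumption; `batch_by_token_budget` is a hypothetical helper, not the repository's implementation.

```python
def batch_by_token_budget(lengths, max_tokens):
    """Group item indices so that each batch holds at most max_tokens residues."""
    batches, current, used = [], [], 0
    for idx, length in enumerate(lengths):
        # Start a new batch when adding this protein would exceed the budget.
        if current and used + length > max_tokens:
            batches.append(current)
            current, used = [], 0
        # A single protein longer than the budget still gets its own batch.
        current.append(idx)
        used += length
    if current:
        batches.append(current)
    return batches


# Five proteins with a 3000-residue budget end up in two batches.
print(batch_by_token_budget([1200, 900, 800, 2500, 400], 3000))
# prints [[0, 1, 2], [3, 4]]
```

Under this scheme a lower budget reduces peak memory at the cost of more batches, which is one plausible reason to lower the value if evaluation runs out of GPU memory.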
- A directory of PDB files (e.g. data/<YourDataset>/pdb).
- A CSV file (e.g. data/<YourDataset>/test.csv) with the following columns: name, aa_seq, label. If you don't have labels, you can use 0 as a placeholder.
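Before running feature extraction, it can help to check that the two inputs above agree with each other. The following is a minimal sketch using only the standard library. The helper `check_dataset` is hypothetical, and it assumes each structure is stored as `<name>.pdb` inside the PDB directory, which is a guess about the naming convention rather than something stated here.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = ("name", "aa_seq", "label")


def check_dataset(csv_path, pdb_dir):
    """Return a list of problems found in the dataset (empty if none)."""
    problems = []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing columns: {', '.join(missing)}"]
        for row in reader:
            # Assumption: each structure is stored as <name>.pdb inside pdb_dir.
            if not (Path(pdb_dir) / f"{row['name']}.pdb").is_file():
                problems.append(f"no PDB file for {row['name']}")
            if not row["aa_seq"]:
                problems.append(f"empty sequence for {row['name']}")
    return problems
```

Calling `check_dataset("data/<YourDataset>/test.csv", "data/<YourDataset>/pdb")` returns an empty list when every row has a sequence and a matching structure file.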
dataset_name=<YourDataset>
python get_feature.py \
--pdb_dir data/$dataset_name/pdb \
--out_file data/$dataset_name/"$dataset_name"_feature.csv
The results will be saved in result/$dataset_name.
python eval.py \
--supv_dataset data/$dataset_name \
--test_file data/$dataset_name/test.csv \
--test_result_dir result/$dataset_name \
--feature_file data/$dataset_name/"$dataset_name"_feature.csv \
--feature_name "aa_composition" "gravy" "ss_composition" "hygrogen_bonds" "exposed_res_fraction" "pLDDT" \
--use_plddt_penalty \
--batch_token_num 3000
Example scripts can be found in script/.
K=20
H=512
pooling_method=attention1d
model_name=feature_"$pooling_method"_k"$K"_h"$H"
CUDA_VISIBLE_DEVICES=0 python run_ft.py \
--seed 3407 \
--gnn_hidden_dim $H \
--gnn_model_path model/protssn_k"$K"_h"$H".pt \
--pooling_method $pooling_method \
--model_dir result/sol/debug/protssn_k"$K"_h"$H" \
--model_name $model_name.pt \
--num_labels 2 \
--supv_dataset data/PDBSol \
--train_file data/PDBSol/train.csv \
--valid_file data/PDBSol/valid.csv \
--test_file data/PDBSol/test.csv \
--feature_file data/PDBSol/PDBSol_feature.csv \
--feature_name "aa_composition" "gravy" "ss_composition" "hygrogen_bonds" "exposed_res_fraction" "pLDDT" \
--c_alpha_max_neighbors $K \
--learning_rate 5e-4 \
--num_train_epochs 10 \
--batch_token_num 16000 \
--gradient_accumulation_steps 1 \
--patience 3 \
--wandb \
--wandb_entity ty_ang \
--wandb_project protssn-sol_debug \
--wandb_run_name $model_name
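The command above builds several paths from `K`, `H` and `pooling_method`. The sketch below mirrors that shell variable expansion in Python so the resulting names can be inspected before launching a run. The helper `finetune_paths` is hypothetical and only reproduces the string formatting; it does not call any code from this repository.

```python
from pathlib import Path


def finetune_paths(k=20, h=512, pooling_method="attention1d"):
    """Mirror the shell variable expansion used by the fine-tuning command."""
    model_name = f"feature_{pooling_method}_k{k}_h{h}"
    return {
        "gnn_model_path": Path("model") / f"protssn_k{k}_h{h}.pt",
        "model_dir": Path("result/sol/debug") / f"protssn_k{k}_h{h}",
        "model_name": f"{model_name}.pt",
        "wandb_run_name": model_name,
    }


paths = finetune_paths()
print(paths["gnn_model_path"].as_posix())  # prints model/protssn_k20_h512.pt
print(paths["model_name"])                 # prints feature_attention1d_k20_h512.pt
```

Checking `paths["gnn_model_path"].is_file()` before training is a quick way to confirm that the pre-trained checkpoint was downloaded to the location the command expects.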
Please cite our work if you use our code or data. We would be pleased to see improvements in subsequent work.
@article{tan2024protsolm,
title={ProtSolM: Protein Solubility Prediction with Multi-modal Features},
author={Tan, Yang and Zheng, Jia and Hong, Liang and Zhou, Bingxin},
journal={arXiv preprint arXiv:2406.19744},
year={2024}
}