CSUBioGroup/DPFunc

DPFunc

Accurately predicting protein function via deep learning with domain-guided structure information

Usage

Here we provide instructions for two use cases: (1) retraining our model on our data or your own, and (2) testing data with trained models.

If you encounter any bugs or issues, feel free to contact us.

🚀 Latest Updates

[June 2025] Data Processing Tutorial Update: We have streamlined our data processing pipeline with a comprehensive step-by-step tutorial. Users can now easily generate all required data using our new Jupyter notebook instead of following the previous complex workflow.


Data Processing Tutorial Update

We have updated our data processing pipeline with a comprehensive tutorial! Instead of following the previous complex and redundant data generation process, you can now easily prepare all required data step-by-step using our new Jupyter notebook tutorial.

New Streamlined Data Processing

📍 Location: ./DataProcess/Process_data.ipynb

This notebook provides a complete walkthrough from raw data to model-ready datasets using the latest benchmark data as examples.

Sample Data Provided

We have included sample benchmark datasets in ./DataProcess/data_dpfunc/ to help you get started quickly.

Citation Requirements: If you use these benchmark datasets in your research, please cite the following publications:

  1. Wang W, Shuai Y, Zeng M, et al. DPFunc: accurately predicting protein function via deep learning with domain-guided structure information[J]. Nature Communications, 2025, 16(1): 70.

  2. DPGOK: A deep learning-based method for protein function prediction by fusing GO knowledge with protein features. (Submitted, under review). GitHub: https://github.com/CSUBioGroup/DPGOK.

Note: PDB structure files are not included due to size constraints. Please download and place the required PDB files in the corresponding folders as indicated in the tutorial.
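Before running the notebook, it can help to confirm that every protein in your list has a structure file. The following is a minimal sketch, not part of the repository: the function name is hypothetical, and it assumes each file is named `<protein ID>.pdb` (as in 5NTC_RAT.pdb) and that all files sit in one folder.

```python
from pathlib import Path


def find_missing_pdb_files(protein_ids, pdb_dir):
    """Return the protein IDs that have no `<ID>.pdb` file in `pdb_dir`.

    Assumption: one file per protein, named after the protein ID.
    """
    pdb_dir = Path(pdb_dir)
    return [pid for pid in protein_ids if not (pdb_dir / f"{pid}.pdb").is_file()]


if __name__ == "__main__":
    # The folder below is only an example path; adjust it to your layout.
    missing = find_missing_pdb_files(["5NTC_RAT", "6PGL_SCHPO"], "./data/PDB/PDB_folder")
    print("Missing structures:", missing)
```

Running the check first avoids discovering a missing structure halfway through graph generation.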

Getting Started

  1. Follow the Tutorial: Open ./DataProcess/Process_data.ipynb and execute cells step-by-step
  2. Prepare PDB Files: Download required PDB structure files and place them in the specified directories
  3. Run Processing: The notebook will guide you through generating all necessary data files for training and prediction

What You'll Generate

The tutorial will help you create:

  • Protein ID lists and GO annotations
  • PDB graph structures with features
  • InterPro domain annotations
  • Multi-label binarizer files

This new approach significantly simplifies the data preparation process and ensures reproducibility across different environments. After this step, you can train DPFunc or use it for prediction!
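For readers unfamiliar with the term, a multi-label binarizer turns each protein's set of GO terms into a fixed-length 0/1 vector, one column per GO term. The sketch below illustrates the idea in plain Python; it is not the code used by the notebook, and the protein IDs and function names are made up for the example.

```python
def fit_label_index(go_annotations):
    """Assign a fixed column index to every GO term seen in the annotations."""
    terms = sorted({term for terms in go_annotations.values() for term in terms})
    return {term: idx for idx, term in enumerate(terms)}


def binarize(go_terms, label_index):
    """Encode one protein's GO terms as a 0/1 row; unknown terms are ignored."""
    row = [0] * len(label_index)
    for term in go_terms:
        if term in label_index:
            row[label_index[term]] = 1
    return row


# Hypothetical protein IDs with real molecular-function GO terms.
annotations = {
    "P1": ["GO:0003674", "GO:0005515"],
    "P2": ["GO:0003824"],
}
label_index = fit_label_index(annotations)
rows = {pid: binarize(terms, label_index) for pid, terms in annotations.items()}
```

The stored binarizer file plays the role of `label_index` here: it fixes which output column corresponds to which GO term, so training and prediction must use the same one.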

Train DPFunc

Key Environment

PyTorch: 1.12.0
DGL: 1.1.0
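One way to confirm the environment matches these versions is to query the installed package metadata. This is a minimal sketch using only the standard library; the distribution names `torch` and `dgl` are assumptions (some installs use a differently named wheel), and the prefix match is there because local builds often carry a suffix such as a CUDA tag.

```python
from importlib import metadata

# Versions listed above; the distribution names are assumptions.
EXPECTED = {"torch": "1.12.0", "dgl": "1.1.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected):
    """Map each package to (found_version, matches_expected_prefix)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        report[package] = (found, found is not None and found.startswith(wanted))
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_environment(EXPECTED).items():
        print(f"{name}: found={found} ok={ok}")
```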

Data Download

The download link for our trained models is listed in ./data/download_link.txt.

Data Construction

You can follow our newest data processing tutorial mentioned above and ignore this part now!

You should prepare your configuration file as required (see ./configure/[mf/cc/bp].yaml for an example). The following items must be specified:

name: mf   # The ontology you want to choose: mf/bp/cc. Make sure it matches the file name of the configuration file (mf.yaml/bp.yaml/cc.yaml).
mlb: ./mlb/mf_go.mlb  # The predicted labels used in DPFunc, which is generated automatically during training.
results: ./results  # The directory to save predicted results of test data.

base:
  interpro_whole: ./data/interpro/{}.pkl  # The interpro files of proteins. Each interpro file corresponds to an array with x columns, where each column is an interpro property (IPR...), as listed in './data/inter_idx.pkl'
  residue_feature: # The residue-level esm features of your test proteins. The details can be found in later sections.
  pdb_points: # The coordinate file of proteins, generated by `./DataProcess/generate_points.py`. The details can be found in later sections.

train:
  name: train
  pid_list_file: ./data/mf_train_used_pid_list.pkl
  pid_go_file: ./data/mf_train_go.txt
  pid_pdb_file: ./data/PDB/graph_feature/mf_train_whole_pdb_part{}.pkl
  train_file_count: 7
  interpro_file: ./data/mf_train_interpro.pkl # The path of interpro file including training proteins, which is generated automatically during training.

valid:
  name: valid
  pid_list_file: ./data/mf_test1_used_pid_list.pkl
  pid_go_file: ./data/mf_test1_go.txt
  pid_pdb_file: ./data/PDB/graph_feature/mf_test1_whole_pdb_part0.pkl
  interpro_file: ./data/mf_test1_interpro.pkl # The path of interpro file including validation proteins, which is generated automatically during training.
  
test:
  name: test
  pid_list_file: ./data/mf_test2_used_pid_list.pkl # The test protein list ('.pkl' format).
  pid_go_file: ./data/mf_test2_go.txt # The GO annotations of the test proteins (for evaluation if provided).
  pid_pdb_file: ./data/PDB/graph_feature/mf_test2_whole_pdb_part0.pkl # The structure graphs of test proteins.
  interpro_file: ./data/mf_test2_interpro.pkl # The path of interpro file including test proteins, which is generated automatically during training.
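A quick sanity check of the configuration can catch missing keys before a long training run. The sketch below is not part of DPFunc: the function names are hypothetical, `load_config` assumes PyYAML is installed, and expanding the `{}` placeholder with indices 0 to train_file_count - 1 is an assumption based on the `part0` file names used for the validation and test sets.

```python
REQUIRED_SECTIONS = ("base", "train", "valid", "test")
REQUIRED_SPLIT_KEYS = ("name", "pid_list_file", "pid_go_file", "pid_pdb_file", "interpro_file")


def load_config(path):
    """Read a YAML configuration file (requires PyYAML)."""
    import yaml  # imported lazily so the checks below work without PyYAML

    with open(path) as handle:
        return yaml.safe_load(handle)


def validate_config(cfg, ontology):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if cfg.get("name") != ontology:
        problems.append(f"name is {cfg.get('name')!r}, expected {ontology!r}")
    for key in ("mlb", "results") + REQUIRED_SECTIONS:
        if key not in cfg:
            problems.append(f"missing top-level key: {key}")
    for split in ("train", "valid", "test"):
        for key in REQUIRED_SPLIT_KEYS:
            if key not in cfg.get(split, {}):
                problems.append(f"missing {split}.{key}")
    if "train_file_count" not in cfg.get("train", {}):
        problems.append("missing train.train_file_count")
    return problems


def expand_train_parts(cfg):
    """List the training graph files, assuming zero-based part numbering."""
    template = cfg["train"]["pid_pdb_file"]
    return [template.format(i) for i in range(cfg["train"]["train_file_count"])]
```

For example, `validate_config(load_config("./configure/mf.yaml"), "mf")` would return an empty list for a complete file under these assumptions.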

Notably, to generate pid_pdb_file, you need to complete the following steps:

  1. For pid_pdb_file:

    1.1 You should place your PDB files of proteins (5NTC_RAT.pdb, 6PGL_SCHPO.pdb, ...) at ./data/PDB/PDB_folder/.

    1.2 Use generate_points.py to generate the coordinate file of the proteins. The result file will be placed at ./data/pdb_points.pkl.

    python ./DataProcess/generate_points.py -i ./data/mf_test2_used_pid_list.pkl -o pdb_points
    

    1.3 Use a pre-trained protein language model (ESM or another PLM) to generate the residue features. Because the number of proteins may be large, we suggest partitioning the data into several parts. An additional map file, map_pid_esm_file (dict format), is then needed to map each protein to its part id.

    1.4 Based on pdb_points.pkl, map_pid_esm_file.pkl, and pdb_residue_esm_embeddings_part{part_id}.pkl, use process_graph.py to generate the structure graphs for the test data. (Note: change the paths in the file first.)

    python ./DataProcess/process_graph.py -d mf
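To make the intermediate files in steps 1.2 and 1.3 more concrete, here is a rough sketch of the kind of data they hold. It is not the code in generate_points.py: taking one point per residue from the C-alpha atom is an assumption, as are the function names and the fixed-size partitioning scheme for the map file.

```python
def parse_ca_coords(pdb_lines):
    """Collect (x, y, z) of every C-alpha atom from PDB-format ATOM records.

    Uses the fixed column layout of the PDB format: atom name in columns
    13-16 and coordinates in columns 31-54. Alternate locations, if present,
    are not filtered in this sketch.
    """
    coords = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords


def build_part_map(protein_ids, part_size):
    """Map each protein ID to a part id, filling parts of `part_size` in order."""
    return {pid: index // part_size for index, pid in enumerate(protein_ids)}


def read_points(pdb_path):
    """Parse one PDB file from disk into a list of C-alpha coordinates."""
    with open(pdb_path) as handle:
        return parse_ca_coords(handle)
```

With a map like the one `build_part_map` returns, the embedding for a protein can be looked up by first finding its part id and then opening only that part file, which keeps memory use bounded.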
    

Train our model on our or your own data

Once the data is prepared, you can train our model on your data as follows (make sure your configuration file is correct):

python DPFunc_main.py -d mf -n 0 -e 15 -p temp_model

arguments:
    -d: the ontology (mf/cc/bp)
    -n: gpu number (default: 0)
    -e: training epoch (default: 15)
    -p: the prefix of results (default: temp_model)
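The argument list above could be expressed with `argparse` roughly as follows. This is an illustrative sketch rather than the parser in DPFunc_main.py: the `dest` names are hypothetical, and treating `-d` as required is an assumption, since no default is documented for it.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Train DPFunc (illustrative sketch)")
    parser.add_argument("-d", dest="ontology", choices=["mf", "cc", "bp"], required=True,
                        help="the ontology (mf/cc/bp)")
    parser.add_argument("-n", dest="gpu", type=int, default=0,
                        help="gpu number")
    parser.add_argument("-e", dest="epochs", type=int, default=15,
                        help="number of training epochs")
    parser.add_argument("-p", dest="prefix", default="temp_model",
                        help="the prefix of results")
    return parser


# Equivalent to the command line: -d mf (all other flags left at their defaults)
args = build_parser().parse_args(["-d", "mf"])
```

Because the defaults match the command shown earlier, passing only `-d mf` here yields the same settings as `-d mf -n 0 -e 15 -p temp_model`.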

Test

If you want to test proteins on trained models, comment out the training and validation code, as shown in DPFunc_pred.py.

Model Download

You can also download our trained model from: https://drive.google.com/file/d/1V0VTFTiB29ilbAIOZn0okBQWPlbOI3wN/view?usp=drive_link

Contact

Please feel free to contact us for any further questions.

References

Wang W, Shuai Y, Zeng M, et al. DPFunc: accurately predicting protein function via deep learning with domain-guided structure information[J]. Nature Communications, 2025, 16(1): 70.
