DPFunc

Accurately predicting protein function via deep learning with domain-guided structure information

Usage

Here we provide instructions for two use cases: (1) retraining our model on our data or your own, and (2) testing data on trained models.

If you encounter any bugs or issues, feel free to contact us.

🚀 Latest Updates

[June 2025] Data Processing Tutorial Update: We have streamlined our data processing pipeline with a comprehensive step-by-step tutorial. Users can now easily generate all required data using our new Jupyter notebook instead of following the previous complex workflow.

Data Processing Tutorial Update

We have updated our data processing pipeline with a comprehensive tutorial! Instead of following the previous complex and redundant data generation process, you can now easily prepare all required data step-by-step using our new Jupyter notebook tutorial.

New Streamlined Data Processing

📍 Location: ./DataProcess/Process_data.ipynb

This notebook provides a complete walkthrough from raw data to model-ready datasets using the latest benchmark data as examples.

Sample Data Provided

We have included sample benchmark datasets in ./DataProcess/data_dpfunc/ to help you get started quickly.

Citation Requirements: If you use these benchmark datasets in your research, please cite the following publications:

Wang W, Shuai Y, Zeng M, et al. DPFunc: accurately predicting protein function via deep learning with domain-guided structure information[J]. Nature Communications, 2025, 16(1): 70.
DPGOK: A deep learning-based method for protein function prediction by fusing GO knowledge with protein features. (Submitted, under review). GitHub: https://github.com/CSUBioGroup/DPGOK.

Note: PDB structure files are not included due to size constraints. Please download and place the required PDB files in the corresponding folders as indicated in the tutorial.

Getting Started

Follow the Tutorial: Open ./DataProcess/Process_data.ipynb and execute cells step-by-step
Prepare PDB Files: Download required PDB structure files and place them in the specified directories
Run Processing: The notebook will guide you through generating all necessary data files for training and prediction
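
Before running the notebook, it can help to check that every protein in your ID list has a structure file in place. The sketch below is a minimal example and not part of DPFunc: the function name `find_missing_pdbs` is hypothetical, and it assumes one file per protein named `<protein_id>.pdb` (as in `5NTC_RAT.pdb`) inside a single folder.

```python
from pathlib import Path


def find_missing_pdbs(protein_ids, pdb_dir):
    """Return the protein IDs that have no '<id>.pdb' file in pdb_dir.

    Assumption: one structure file per protein, named '<protein_id>.pdb'.
    """
    pdb_dir = Path(pdb_dir)
    return [pid for pid in protein_ids if not (pdb_dir / f"{pid}.pdb").is_file()]


if __name__ == "__main__":
    # Hypothetical IDs and folder; adjust to the directories named in the tutorial.
    missing = find_missing_pdbs(["5NTC_RAT", "6PGL_SCHPO"], "./data/PDB/PDB_folder")
    print(f"{len(missing)} protein(s) without a PDB file:", missing)
```

Any IDs it reports are the structures that still need to be downloaded before the graph-building cells will work.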

What You'll Generate

The tutorial will help you create:

Protein ID lists and GO annotations
PDB graph structures with features
InterPro domain annotations
Multi-label binarizer files

This new approach significantly simplifies data preparation and ensures reproducibility across different environments. After this step, you can train DPFunc or use it for prediction.
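
To illustrate what a multi-label binarizer file encodes, the following plain-Python sketch turns per-protein GO term sets into fixed-length 0/1 vectors. It is only an illustration of the idea: the function names are hypothetical, the protein IDs are toy values, and the notebook may well build its `.mlb` files differently (for example with a library class).

```python
def fit_label_index(annotations):
    """Map every GO term seen in the annotations to a column index (sorted order)."""
    terms = sorted({go for gos in annotations.values() for go in gos})
    return {go: i for i, go in enumerate(terms)}


def binarize(annotations, label_index):
    """Encode each protein's GO terms as a 0/1 vector over the known labels."""
    vectors = {}
    for pid, gos in annotations.items():
        row = [0] * len(label_index)
        for go in gos:
            if go in label_index:  # terms unseen at fit time are ignored
                row[label_index[go]] = 1
        vectors[pid] = row
    return vectors


if __name__ == "__main__":
    # Toy annotations; real protein IDs and GO terms come from your data files.
    toy = {"P1": {"GO:0003674", "GO:0005488"}, "P2": {"GO:0003824"}}
    index = fit_label_index(toy)
    print(index)
    print(binarize(toy, index))
```

Fitting the index once and reusing it matters because training and prediction must agree on which column stands for which GO term.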

Train DPFunc

Key Environment

PyTorch: 1.12.0
DGL: 1.1.0
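
A quick way to compare an environment against these versions is sketched below. This is a hypothetical helper, not part of the repository. It assumes the packages are installed under the distribution names `torch` and `dgl`, and it ignores local build suffixes such as `+cu113`.

```python
from importlib import metadata

# Versions listed under "Key Environment"; the distribution names are assumptions.
EXPECTED = {"torch": "1.12.0", "dgl": "1.1.0"}


def base_version(version):
    """Drop a local build suffix, e.g. '1.12.0+cu113' -> '1.12.0'."""
    return version.split("+", 1)[0]


def check_versions(installed, expected=EXPECTED):
    """Return {package: (installed_or_None, expected)} for every mismatch."""
    problems = {}
    for name, want in expected.items():
        have = installed.get(name)
        if have is None or base_version(have) != want:
            problems[name] = (have, want)
    return problems


def installed_versions(names):
    """Look up installed versions; packages that are not installed map to None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, (have, want) in check_versions(installed_versions(EXPECTED)).items():
        print(f"{name}: found {have}, README lists {want}")
```

A mismatch reported here is not necessarily a problem; it only flags that the environment differs from the one listed above.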

Data Download

Download links are listed in ./data/download_link.txt; use them to obtain our trained model.

Data Construction

You can now follow the new data processing tutorial described above and ignore this part.

Prepare your configuration file as required (see ./configure/[mf/cc/bp].yaml for examples). The following items must be specified:

name: mf   # The ontology you want to choose: mf/bp/cc. Make sure it matches the file name of the configuration file (mf.yaml/bp.yaml/cc.yaml).
mlb: ./mlb/mf_go.mlb  # The predicted labels used in DPFunc, which are generated automatically during training.
results: ./results  # The directory to save predicted results of test data.

base:
  interpro_whole: ./data/interpro/{}.pkl  # The interpro files of proteins. Each interpro file corresponds to an array with x columns, where each column is an interpro property (IPR...); the column index can be found in './data/inter_idx.pkl'
  residue_feature: # The residue-level esm features of your test proteins. The details can be found in later sections.
  pdb_points: # The coordinate file of proteins, generated by `./DataProcess/generate_points.py`. The details can be found in later sections.

train:
  name: train
  pid_list_file: ./data/mf_train_used_pid_list.pkl
  pid_go_file: ./data/mf_train_go.txt
  pid_pdb_file: ./data/PDB/graph_feature/mf_train_whole_pdb_part{}.pkl
  train_file_count: 7
  interpro_file: ./data/mf_train_interpro.pkl # The path of interpro file including training proteins, which is generated automatically during training.

valid:
  name: valid
  pid_list_file: ./data/mf_test1_used_pid_list.pkl
  pid_go_file: ./data/mf_test1_go.txt
  pid_pdb_file: ./data/PDB/graph_feature/mf_test1_whole_pdb_part0.pkl
  interpro_file: ./data/mf_test1_interpro.pkl # The path of interpro file including validation proteins, which is generated automatically during training.
  
test:
  name: test
  pid_list_file: ./data/mf_test2_used_pid_list.pkl # The test protein list ('.pkl' format).
  pid_go_file: ./data/mf_test2_go.txt # The GO annotations of the test proteins (used for evaluation if provided).
  pid_pdb_file: ./data/PDB/graph_feature/mf_test2_whole_pdb_part0.pkl # The structure graphs of test proteins.
  interpro_file: ./data/mf_test2_interpro.pkl # The path of interpro file including test proteins, which is generated automatically during training.
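
The `{}` placeholder in the training `pid_pdb_file`, together with `train_file_count`, suggests that the training graphs are split across several numbered files. The sketch below expands such a template into concrete paths. It assumes zero-based part numbers (consistent with the `part0` files listed for validation and test), which is not stated explicitly here; the function name is hypothetical.

```python
def expand_part_files(template, file_count):
    """Fill the '{}' placeholder with part numbers 0 .. file_count - 1.

    Assumption: parts are numbered from 0, as in '..._part0.pkl'.
    """
    if file_count < 1:
        raise ValueError("file_count must be at least 1")
    return [template.format(i) for i in range(file_count)]


if __name__ == "__main__":
    paths = expand_part_files(
        "./data/PDB/graph_feature/mf_train_whole_pdb_part{}.pkl", 7
    )
    for path in paths:
        print(path)
```

Checking that each expanded path exists before training is a cheap way to catch a `train_file_count` that does not match the files on disk.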

To generate pid_pdb_file, complete the following steps:

1.1 Place the PDB files of your proteins (5NTC_RAT.pdb, 6PGL_SCHPO.pdb, ...) in ./data/PDB/PDB_folder/.

1.2 Use generate_points.py to generate the coordinate file of the proteins. The result is written to ./data/pdb_points.pkl.
```
python ./DataProcess/generate_points.py -i ./data/mf_test2_used_pid_list.pkl -o pdb_points
```
1.3 Use a pre-trained protein language model (ESM or other PLLMs) to generate the residue features. Because the number of proteins may be large, we suggest partitioning the data into several parts. An additional map file, map_pid_esm_file (dict format), is then needed to map each protein to its part id.

1.4 Based on pdb_points.pkl, map_pid_esm_file.pkl, and pdb_residue_esm_embeddings_part{part_id}.pkl, use process_graph.py to generate the structure graphs for the test data. (Note: change the paths in the script to match your setup.)
```
python ./DataProcess/process_graph.py -d mf
```
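
Step 1.3 can be sketched as follows: split the protein list into fixed-size parts and record, for each protein, which part holds its embeddings. This is an illustration under assumptions, not the repository's code. The helper names, the part size, and the exact layout of `map_pid_esm_file` (here a plain dict from protein ID to part id) are hypothetical, and the embeddings themselves would come from your language model of choice.

```python
import pickle


def partition_proteins(protein_ids, part_size):
    """Split IDs into consecutive parts and map each ID to its part id."""
    if part_size < 1:
        raise ValueError("part_size must be at least 1")
    parts = [
        protein_ids[start:start + part_size]
        for start in range(0, len(protein_ids), part_size)
    ]
    pid_to_part = {pid: part_id for part_id, part in enumerate(parts) for pid in part}
    return parts, pid_to_part


def save_map(pid_to_part, path):
    """Pickle the protein-to-part dict (assumed format of map_pid_esm_file)."""
    with open(path, "wb") as handle:
        pickle.dump(pid_to_part, handle)


if __name__ == "__main__":
    # Toy IDs; in practice these would come from your protein ID list file.
    ids = ["5NTC_RAT", "6PGL_SCHPO", "P3", "P4", "P5"]
    parts, pid_to_part = partition_proteins(ids, part_size=2)
    for part_id, part in enumerate(parts):
        # Each part's embeddings would be saved separately, e.g. as
        # pdb_residue_esm_embeddings_part{part_id}.pkl
        print(part_id, part)
    print(pid_to_part)
    # save_map(pid_to_part, "map_pid_esm_file.pkl")  # uncomment to write the map
```

Keeping parts small enough to fit in memory is the point of the split; the map lets later steps load only the part that contains a given protein.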

Train our model on our or your own data

Once the data is prepared, you can train our model on it as follows (make sure your configuration file is correct):

python DPFunc_main.py -d mf -n 0 -e 15 -p temp_model

arguments:
    -d: the ontology (mf/cc/bp)
    -n: gpu number (default: 0)
    -e: training epoch (default: 15)
    -p: the prefix of results (default: temp_model)
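
If you want to train all three ontologies in turn, the command above can be assembled programmatically. The following sketch only builds the argument lists from the flags documented above. The helper name is hypothetical, and actually launching the runs (for example with `subprocess.run`) is left commented out.

```python
import sys


def build_train_command(ontology, gpu=0, epochs=15, prefix="temp_model"):
    """Build the DPFunc_main.py command line for one ontology (mf/cc/bp)."""
    if ontology not in ("mf", "cc", "bp"):
        raise ValueError(f"unknown ontology: {ontology!r}")
    return [
        sys.executable, "DPFunc_main.py",
        "-d", ontology,
        "-n", str(gpu),
        "-e", str(epochs),
        "-p", prefix,
    ]


if __name__ == "__main__":
    for ontology in ("mf", "cc", "bp"):
        command = build_train_command(ontology)
        print(" ".join(command))
        # To launch training, uncomment the next two lines:
        # import subprocess
        # subprocess.run(command, check=True)
```

Each ontology needs its own configuration file (mf.yaml, cc.yaml, bp.yaml) in place before the corresponding command is run.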

Test

To test proteins on trained models, comment out the training and validation code, as shown in DPFunc_pred.py.

Model Download

You can also download our trained model from: https://drive.google.com/file/d/1V0VTFTiB29ilbAIOZn0okBQWPlbOI3wN/view?usp=drive_link

Contact

Please feel free to contact us for any further questions.

Wenkang Wang [email protected]
Min Li [email protected]

References

Wang W, Shuai Y, Zeng M, et al. DPFunc: accurately predicting protein function via deep learning with domain-guided structure information[J]. Nature Communications, 2025, 16(1): 70.
