- PDBBind: The PDBBind dataset in MoleculeNet [1] processed from the PDBBind database. The PDBBind database consists of experimentally measured binding affinities for bio-molecular complexes [2], [3]. It provides detailed 3D Cartesian coordinates of both ligands and their target proteins derived from experimental(e.g., X-ray crystallography) measurements. The availability of coordinates of the protein-ligand complexes permits structure-based featurization that is aware of the protein-ligand binding geometry. The authors of [1] use the "refined" and "core" subsets of the database [4], more carefully processed for data artifacts, as additional benchmarking targets.
-
Atomic Convolutional Networks (ACNN) [5]: Constructs nearest neighbor graphs separately for the ligand, protein and complex based on the 3D coordinates of the atoms and predicts the binding free energy.
-
PotentialNet [6]: A 3-stage model that combines graph convolutional neural network (GCNN) with molecular graphs and KNN graphs and fully connected neural network (FCNN). The model consists of three main steps: (1) covalent-only propagation, (2) dual noncovalent and covalent propagation, and (3) ligand-based graph gather.
- Stage 1. Both ligand and protein graphs are constructed and featurized based on covalent information using
dgllife.utils.CanonicalAtomFeaturizer
anddgllife.utils.CanonicalBondFeaturizer
so that only chemically bonded atoms are connected in the graph. The feature propagation is passed onto a multi-step gated recurrent unit (GRU) followed by a linear layer with sigmoid activation. - Stage 2. A new pair of knn-graphs of ligand and protein is constructed from 3-D coordinates of the molecules. The edge type between atoms is a combination of covalent bond types and their physical distances. The output from stage 1 is used as initial features. The feature propagation is again passed onto a multi-step gated recurrent unit (GRU) followed by a linear layer with sigmoid activation.
- Stage 3. Feature gathering is performed only on ligand atoms in graphs from stage 2. The final prediction is computed from feature propagation through a multi-layer fully connected neural network with ReLU activation.
- Stage 1. Both ligand and protein graphs are constructed and featurized based on covalent information using
Use main.py
with arguments
-m
, Model to use. Available:{ACNN, PotentialNet}
.-v
, Version of PDBBind dataset. Available:{v2007, v2015}
.-d
, PDBBind subset (core or refined), and splitting method. Currently implemented:{PDBBind_core_pocket_random, PDBBind_core_pocket_scaffold, PDBBind_core_pocket_stratified, PDBBind_core_pocket_temporal, PDBBind_refined_pocket_random, PDBBind_refined_pocket_scaffold, PDBBind_refined_pocket_stratified, PDBBind_refined_pocket_temporal, PDBBind_refined_pocket_structure, PDBBind_refined_pocket_sequence}
- Note that
structure
refers to "Agglomerative Structure Split" provided by [6], and is only implemented for PDBBind v2007 Refined set;sequence
refers to "Agglomerative Sequence Split" provided by [6], and is only implemented for PDBBind v2007 Refined set.
- Note that
--pdb_path
, local path of existing PDBBind dataset. Specify this argument to a local path of customized dataset, which should follow the structure and the naming format of PDBBind v2015.--test_on_core
, bool, whether to use the whole core set as test set when training on refined set, default True.--save_r2
, path to save r2 at each epoch, default not save.--num_workers
, number of workers for loading PDBBind molecules and Dataloader, default to the number of CPUs.-t
, int, number of trials to run, default to 1.
f_in
: int. The dimension size of input features to GatedGraphConv, equivalent to the dimension size of atomic features in the molecular graph.f_bond
: int. The dimension size of the output from GatedGraphConv in stage 1, equivalent to the dimension size of input to the linear layer at the end of stage 1.f_spatial
: int. The dimension size of the output from GatedGraphConv in stage 2, equivalent to the dimension size of input to the linear layer at the end of stage 2.f_gather
: int. The dimension size of the output from stage 1 & 2, equivalent to the dimension size of output from the linear layer at the end of stage 1 & 2.n_etypes
: int. The number of heterogeneous edge types for stage 2. Currently implemented as 5(the number of covalent bond types in stage 1) + the number of distance bins in stage 2.n_bond_conv_steps
: int. The number of bond convolution layers(steps) of GatedGraphConv in stage 1.n_spatial_conv_steps
: int. The number of spatial convolution layers(steps) of GatedGraphConv in stage 2.n_rows_fc
: list of int. The widths of the fully connected neural networks at each layer in stage 3.dropouts
: list of 3 floats. The amount of dropout applied at the end of each stage.
Subset | Splitting Method | Test MAE | Test R2 |
---|---|---|---|
Core | Random | 1.7688 | 0.1511 |
Core | Scaffold | 2.5420 | 0.1471 |
Core | Stratified | 1.7419 | 0.1520 |
Core | Temporal | 1.9543 | 0.1640 |
Refined | Random | 1.1948 | 0.4373 |
Refined | Scaffold | 1.4021 | 0.2086 |
Refined | Stratified | 1.6376 | 0.3050 |
Refined | Temporal | 1.2457 | 0.3438 |
Training Set | Test Set | Splitting Method | PDBBind version | Train R2 (std) | Validation R2 (std) | Test R2 (std) |
---|---|---|---|---|---|---|
Refined | Core | Random | v2007 | 0.7250 (0.0610) | 0.4239 (0.0290) | 0.7952 (0.0479) |
Refined | Core | Random | v2015 | 0.5545 (0.0462) | 0.4294 (0.0143) | 0.5862 (0.0420) |
Refined (core removed) | Core | Random | v2007 | 0.7382 (0.0757) | 0.5171 (0.0302) | 0.3825 (0.0392) |
Refined (core removed) | Core | Random | v2015 | 0.5480 (0.0441) | 0.4727 (0.0199) | 0.4508 (0.0308) |
The results are computed over 100 trials and R2 from the best validation epoch is reported.
Training Set | Test Set | PDBBind version | Test R2 (std) |
---|---|---|---|
Refined (core removed) | Core | v2007 | 0.668 (0.043) |
Comparing to the DeepChem's implementation, we achieve a speedup by roughly 3.3 for training time per epoch (from 1.40s to 0.42s). If we do not care about randomness introduced by some kernel optimization, we can achieve a speedup by roughly 4.4 (from 1.40s to 0.32s).
[1] Wu et al. (2017) MoleculeNet: a benchmark for molecular machine learning. Chemical Science 9, 513-530.
[2] Wang et al. (2004) The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures. J Med Chem 3;47(12):2977-80.
[3] Wang et al. (2005) The PDBbind database: methodologies and updates. J Med Chem 16;48(12):4111-9.
[4] Liu et al. (2015) PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics 1;31(3):405-12.
[5] Gomes et al. (2017) Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity. arXiv preprint arXiv:1703.10603.
[6] Feinberg et al. (2018) PotentialNet for molecular property prediction. ACS central science 4.11: 1520-1530.