Merge pull request #538 from DeepRank/minor_review_joss_gcroci2
docs: implement minor suggestions for JOSS paper
gcroci2 committed Dec 21, 2023
2 parents 5cde0bb + b5d6307 commit f6a93ea
Showing 5 changed files with 63 additions and 9 deletions.
2 changes: 1 addition & 1 deletion .vscode/settings.json
@@ -1,7 +1,7 @@
{
"[python]": {
"editor.codeActionsOnSave": {
"source.organizeImports": true
"source.organizeImports": "explicit"
},
"files.trimTrailingWhitespace": true,
},
57 changes: 56 additions & 1 deletion docs/features.md
@@ -5,7 +5,9 @@ Features implemented in the code-base are defined in `deeprank2.feature` subpack

## Custom features

Users can add custom features by creating a new module and placing it in `deeprank2.feature` subpackage. One requirement for any feature module is to implement an `add_features` function, as shown below. This will be used in `deeprank2.models.query` to add the features to the nodes or edges of the graph.
Users can add custom features by cloning the repository, creating a new module, and placing it in the `deeprank2.feature` subpackage. The custom features can then be used by installing the package in editable mode (see [here](https://deeprank2.readthedocs.io/en/latest/installation.html#install-deeprank2) for more details). We strongly recommend submitting a pull request (PR) to merge the new feature into the official repository.

One requirement for any feature module is to implement an `add_features` function, as shown below. This will be used in `deeprank2.models.query` to add the features to the nodes or edges of the graph.

```python
from typing import Optional
@@ -21,6 +23,59 @@ def add_features(
pass
```

Additionally, the name of the custom feature should be added to `deeprank2.domain.edgestorage` or `deeprank2.domain.nodestorage`, depending on whether it is an edge or a node feature.
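
As a sketch of what this registration looks like (the custom constant and feature names below are invented for illustration, not copied from the code-base), feature names are exposed as module-level string constants:

```python
# Hypothetical sketch: deeprank2.domain.nodestorage exposes feature names as
# module-level string constants, and a new custom feature adds one more.
# "MYCUSTOMFEAT" / "my_custom_feature" are invented names for illustration.
RESTYPE = "res_type"                # existing node feature name
MYCUSTOMFEAT = "my_custom_feature"  # hypothetical new custom feature
```

The string value is the feature name used throughout the API (e.g., when selecting features for a dataset), while the constant is what feature modules reference when assigning values to nodes or edges.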

As an example, this is the implementation of the node feature `res_type`, which represents the one-hot encoding of the amino acid residue and is defined in the `deeprank2.features.components` module:

```python
from typing import Optional

from deeprank2.domain import nodestorage as Nfeat
from deeprank2.molstruct.atom import Atom
from deeprank2.molstruct.residue import Residue, SingleResidueVariant
from deeprank2.utils.graph import Graph

def add_features(
    pdb_path: str,
    graph: Graph,
    single_amino_acid_variant: Optional[SingleResidueVariant] = None,
):
    for node in graph.nodes:
        if isinstance(node.id, Residue):
            residue = node.id
        elif isinstance(node.id, Atom):
            atom = node.id
            residue = atom.residue
        else:
            raise TypeError(f"Unexpected node type: {type(node.id)}")

        node.features[Nfeat.RESTYPE] = residue.amino_acid.onehot
```
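
For intuition, the one-hot encoding behind `res_type` can be sketched in plain Python (an illustration only; DeepRank2 obtains the encoding from `residue.amino_acid.onehot`, and the residue list below is truncated for brevity):

```python
# Illustrative one-hot encoding of a residue type over a fixed alphabet.
# The real encoding covers all 20 standard amino acids; three are shown here.
AMINO_ACIDS = ["ALA", "ARG", "ASN"]

def onehot(residue_name: str) -> list[int]:
    # 1 at the position of the matching residue, 0 elsewhere
    return [int(residue_name == aa) for aa in AMINO_ACIDS]

encoding = onehot("ARG")  # [0, 1, 0]
```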

`RESTYPE` is the name of the variable assigned to the feature `res_type` in `deeprank2.domain.nodestorage`. To use the feature through the DeepRank2 API, its module needs to be imported and specified during the processing of the queries:

```python
from deeprank2.features import components

feature_modules = [components]

# Save data into 3D-graphs only
hdf5_paths = queries.process(
    "<output_folder>/<prefix_for_outputs>",
    feature_modules=feature_modules,
)
```

Then, the feature `res_type` can be used from the DeepRank2 datasets API:

```python
from deeprank2.dataset import GraphDataset

node_features = ["res_type"]

dataset = GraphDataset(
    hdf5_path=hdf5_paths,
    node_features=node_features,
)
```

The following is a brief description of the features already implemented in the code-base, for each feature module.

## Default node features
3 changes: 1 addition & 2 deletions docs/index.rst
@@ -60,7 +60,7 @@ Notes

Package reference
=================

.. toctree::
:caption: API
:hidden:
@@ -70,7 +70,6 @@ Package reference
:doc:`reference/deeprank2`
This section documents the DeepRank2 API.


Indices and tables
==================

File renamed without changes.
10 changes: 5 additions & 5 deletions paper/paper.md
@@ -58,7 +58,7 @@ bibliography: paper.bib
# Summary
[comment]: <> (CHECK FOR AUTHORS: Do the summary describe the high-level functionality and purpose of the software for a diverse, non-specialist audience?)

We present DeepRank2, a deep learning (DL) framework geared towards making predictions on 3D protein structures for variety of biologically relevant applications. Our software can be used for predicting structural properties in drug design, immunotherapy, or designing novel proteins, among other fields. DeepRank2 allows for transformation and storage of 3D representations of both protein-protein interfaces (PPIs) and protein single-residue variants (SRVs) into either graphs or volumetric grids containing structural and physico-chemical information. These can be used for training neural networks for a variety of patterns of interest, using either our pre-implemented training pipeline for graph neural networks (GNNs) or convolutional neural networks (CNNs) or external pipelines. The entire framework flowchart is visualized in \autoref{fig:flowchart}. The package is fully open-source, follows the community-endorsed FAIR principles for research software, provides user-friendly APIs, publicily available [documentation](https://deeprank2.readthedocs.io/en/latest/), and in-depth [tutorials](https://github.com/DeepRank/deeprank2/blob/main/tutorials/TUTORIAL.md).
We present DeepRank2, a deep learning (DL) framework geared towards making predictions on 3D protein structures for a variety of biologically relevant applications. Our software can be used for predicting structural properties in drug design, immunotherapy, or designing novel proteins, among other fields. DeepRank2 allows for transformation and storage of 3D representations of both protein-protein interfaces (PPIs) and protein single-residue variants (SRVs) into either graphs or volumetric grids containing structural and physico-chemical information. These can be used for training neural networks for a variety of patterns of interest, using either our pre-implemented training pipeline for graph neural networks (GNNs) or convolutional neural networks (CNNs) or external pipelines. The entire framework flowchart is visualized in \autoref{fig:flowchart}. The package is fully open-source, follows the community-endorsed FAIR principles for research software, provides user-friendly APIs, publicly available [documentation](https://deeprank2.readthedocs.io/en/latest/), and in-depth [tutorials](https://github.com/DeepRank/deeprank2/blob/main/tutorials/TUTORIAL.md).

[comment]: <> (CHECK FOR AUTHORS: Do the authors clearly state what problems the software is designed to solve and who the target audience is?)
[comment]: <> (CHECK FOR AUTHORS: Do the authors describe how this software compares to other commonly-used packages?)
@@ -74,12 +74,12 @@ The 3D structure of proteins and protein complexes provides fundamental informat
In the past decades, a variety of experimental methods (e.g., X-ray crystallography, nuclear magnetic resonance, cryogenic electron microscopy) have determined and accumulated a large number of atomic-resolution 3D structures of proteins and protein-protein complexes [@schwede_protein_2013]. Since experimental determination of structures is a tedious and expensive process, several computational prediction methods have been developed over the past decades, exploiting classical molecular modelling [@rosetta; @modeller; @haddock], and, more recently, DL [@alphafold_2021; @alphafold_multi]. The large amount of data available makes it possible to use DL to leverage 3D structures and learn their complex patterns. Unlike other machine learning (ML) techniques, deep neural networks hold the promise of learning from millions of data points without reaching a performance plateau quickly, which is made computationally feasible by hardware accelerators (i.e., GPUs, TPUs) and parallel file system technologies.

[comment]: <> (Examples of DL with PPIs and SRVs)
The main types of data structures in vogue for representing 3D structures are 3D grids, graphs and surfaces. 3D CNNs have been trained on 3D grids for the classification of biological vs. crystallographic PPIs [@renaud_deeprank_2021], and for the scoring of models of protein-protein complexes generated by computational docking [@renaud_deeprank_2021; @dove]. Gaiza et al. have applied geodesic CNNs to extract protein interaction fingerprints by applying 2D CNNs on spread-out protein surface patches [@gainza2023novo]. 3D CNNs have been used for exploiting protein structure data for predicting mutation-induced changes in protein stability [@mut_cnn; @ramakrishnan2023] and identifying novel gain-of-function mutations [@shroff]. Contrary to CNNs, in GNNs the convolution operations on graphs can rely on the relative local connectivity between nodes and not on the data orientation, making graphs rotationally invariant. Additionally, GNNs can accept any size of graph, while in a CNN the size of the 3D grid for all input data needs to be the same, which may be problematic for datasets containing highly variable in size structures. Based on these arguments, different GNN-based tools have been designed to predict patterns from PPIs [@dove_gnn; @fout_protein_nodate; @reau_deeprank-gnn_2022]. Eisman et al. developed a rotation-equivariant neural network trained on point-based representation of the protein atomic structure to classify PPIs [@rot_eq_gnn].
The main types of data structures in vogue for representing 3D structures are 3D grids, graphs, and surfaces. 3D CNNs have been trained on 3D grids for the classification of biological vs. crystallographic PPIs [@renaud_deeprank_2021], and for the scoring of models of protein-protein complexes generated by computational docking [@renaud_deeprank_2021; @dove]. Gainza et al. have applied geodesic CNNs to extract protein interaction fingerprints by applying 2D CNNs on spread-out protein surface patches [@gainza2023novo]. 3D CNNs have been used for exploiting protein structure data for predicting mutation-induced changes in protein stability [@mut_cnn; @ramakrishnan2023] and identifying novel gain-of-function mutations [@shroff]. Contrary to CNNs, in GNNs the convolution operations on graphs can rely on the relative local connectivity between nodes and not on the data orientation, making graphs rotationally invariant. Additionally, GNNs can accept graphs of any size, while in a CNN the size of the 3D grid needs to be the same for all input data, which may be problematic for datasets containing structures of highly variable size. Based on these arguments, different GNN-based tools have been designed to predict patterns from PPIs [@dove_gnn; @fout_protein_nodate; @reau_deeprank-gnn_2022]. Eismann et al. developed a rotation-equivariant neural network trained on a point-based representation of the protein atomic structure to classify PPIs [@rot_eq_gnn].

# Statement of need

[comment]: <> (Motivation for a flexible framework)
Data mining 3D structures of proteins presents several challenges. These include complex physico-chemical rules governing structural features, the possibility of characterizartion at different scales (e.g., atom-level, residue level, and secondary structure level), and the large diversity in shape and size. Furthermore, because a structure can easily comprise of hundreds to thousands of residues (and ~15 times as many atoms), efficient processing and featurization of many structures is critical to handle the computational cost and file storage requirements. Existing software solutions are often highly specialized and not developed as reusable and flexible frameworks, and cannot be easily adapted to diverse applications and predictive tasks. Examples include DeepAtom [@deepatom] for protein-ligand binding affinity prediction only, and MaSIF [@gainza2023novo] for deciphering patterns in protein surfaces. While some frameworks, such as TorchProtein and TorchDrug [@torchdrug], configure themselves as general-purpose ML libraries for both molecular sequences and 3D structures, they only implement geometric-related features and do not incorporate fundamental physico-chemical information in the 3D representation of molecules.
Data mining 3D structures of proteins presents several challenges. These include complex physico-chemical rules governing structural features, the possibility of characterization at different scales (e.g., atom-level, residue-level, and secondary structure level), and the large diversity in shape and size. Furthermore, because a structure can easily comprise hundreds to thousands of residues (and ~15 times as many atoms), efficient processing and featurization of many structures is critical to handle the computational cost and file storage requirements. Existing software solutions are often highly specialized and not developed as reusable and flexible frameworks, and cannot be easily adapted to diverse applications and predictive tasks. Examples include DeepAtom [@deepatom] for protein-ligand binding affinity prediction only, and MaSIF [@gainza2023novo] for deciphering patterns in protein surfaces. While some frameworks, such as TorchProtein and TorchDrug [@torchdrug], configure themselves as general-purpose ML libraries for both molecular sequences and 3D structures, they only implement geometric-related features and do not incorporate fundamental physico-chemical information in the 3D representation of molecules.

These limitations create a growing demand for a generic and flexible DL framework that researchers can readily utilize for their specific research questions while cutting down the tedious data preprocessing stages. Generic DL frameworks have already emerged in diverse scientific fields, such as computational chemistry (e.g., DeepChem [@deepchem]) and condensed matter physics (e.g., NetKet [@netket]), which have promoted collaborative efforts, facilitated novel insights, and benefited from continuous improvements and maintenance by engaged user communities.

@@ -93,14 +93,14 @@ As input, DeepRank2 takes [PDB-formatted](https://www.cgl.ucsf.edu/chimera/docs/

The physico-chemical and geometrical features are then computed and assigned to each node and edge. The user can choose which features to generate from several pre-existing options defined in the package, or define custom feature modules, as explained in the documentation. Examples of pre-defined node features are the type of the amino acid, its size and polarity, as well as more complex features such as its buried surface area and secondary structure features. Examples of pre-defined edge features are distance, covalency, and potential energy. A detailed list of predefined features can be found in the [documentation's features page](https://deeprank2.readthedocs.io/en/latest/features.html). Graphs can either be used directly or mapped to volumetric grids (i.e., 3D image-like representations), together with their features. Multiple CPUs can be used to parallelize and speed up the featurization process. The processed data are saved into HDF5 files, designed to efficiently store and organize big data. Users can then use the data for any ML or DL framework suited for the application. Specifically, graphs can be used for the training of GNNs, and 3D grids can be used for the training of CNNs.
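
The parallel featurization pattern can be sketched in plain Python (an illustration of the pattern only, not DeepRank2 code; a thread pool is used here for brevity, whereas DeepRank2 distributes the work across multiple CPU processes, and the structure IDs and the feature computation are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def featurize(structure_id: str) -> dict:
    # Placeholder for per-structure graph building and feature computation.
    return {"id": structure_id, "n_chars": len(structure_id)}

structure_ids = ["1ak4", "7cei", "1atn"]  # placeholder PDB IDs
with ThreadPoolExecutor(max_workers=2) as pool:
    # map() preserves input order while the workers run concurrently
    results = list(pool.map(featurize, structure_ids))
```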

DeepRank2 also provides convenient pre-implemented modules for training simple [PyTorch](https://pytorch.org/)-based GNNs and CNNs using the data generated in the previous step. Alternatively, users can implement custom PyTorch networks in the DeepRank package (or export the data to external software). Data can be loaded across multiple CPUs, and the training can be run on GPUs. The data stored within the HDF5 files are read into customized datasets, and the user-friendly API allows for selection of individual features (from those generated above), definition of the targets, and the predictive task (classfication or regression), among other settings. Then the datasets can be used for training, validating, and testing the chosen neural network. The final model and results can be saved using built-in data exporter modules.
DeepRank2 also provides convenient pre-implemented modules for training simple [PyTorch](https://pytorch.org/)-based GNNs and CNNs using the data generated in the previous step. Alternatively, users can implement custom PyTorch networks in the DeepRank package (or export the data to external software). Data can be loaded across multiple CPUs, and the training can be run on GPUs. The data stored within the HDF5 files are read into customized datasets, and the user-friendly API allows for selection of individual features (from those generated above), definition of the targets, and the predictive task (classification or regression), among other settings. Then the datasets can be used for training, validating, and testing the chosen neural network. The final model and results can be saved using built-in data exporter modules.

DeepRank2 embraces the best practices of open-source development by utilizing platforms like GitHub and Git, unit testing (as of August 2023 coverage is 83%), continuous integration, automatic documentation, and Findable, Accessible, Interoperable, and Reusable (FAIR) principles. Detailed [documentation](https://deeprank2.readthedocs.io/en/latest/?badge=latest) and [tutorials](https://github.com/DeepRank/deeprank2/blob/main/tutorials/TUTORIAL.md) for getting started with the package are publicly available. The project aims to create high-quality software that can be easily accessed, used, and contributed to by a wide range of researchers.

We believe this project will have a positive impact across all of structural bioinformatics, enabling advancements that rely on molecular complex analysis, such as structural biology, protein engineering, and rational drug design. The target community includes researchers working with molecular complex data, such as computational biologists, immunologists, and structural bioinformaticians. The existing features, as well as the sustainable package formatting and its modular design, make DeepRank2 an excellent framework to build upon. Taken together, DeepRank2 provides all the requirements to become the all-purpose DL tool that is currently lacking in the field of biomolecular interactions.

# Acknowledgements

This work was supported by the [Netherlands eScience Center](https://www.esciencecenter.nl/) under grant number NLESC.OEC.2021.008, and [SURF](https://www.surf.nl/en) infrastructure, and was developed in collaboration with the [Department of Medical BioSciences](https://www.radboudumc.nl/en/research/departments/medical-biosciences) at RadboudUMC (Hypatia Fellowship, Rv819.52706). This work was also supported from NVIDIA Acamedic Award.
This work was supported by the [Netherlands eScience Center](https://www.esciencecenter.nl/) under grant number NLESC.OEC.2021.008, and [SURF](https://www.surf.nl/en) infrastructure, and was developed in collaboration with the [Department of Medical BioSciences](https://www.radboudumc.nl/en/research/departments/medical-biosciences) at RadboudUMC (Hypatia Fellowship, Rv819.52706). This work was also supported by an NVIDIA Academic Award.

# References
