Skip to content

Latest commit

 

History

History
286 lines (217 loc) · 11.8 KB

README.md

File metadata and controls

286 lines (217 loc) · 11.8 KB

DeeprankCore

Badges
fairness fair-software.eu CII Best Practices
package PyPI version Codacy Badge
docs Documentation Status DOI
tests Build Status Linting status Coverage Status
license License

Overview

alt-text

DeeprankCore is a deep learning framework for data mining Protein-Protein Interactions (PPIs) using Graph Neural Networks.

DeeprankCore contains useful APIs for pre-processing PPIs data, computing features and targets, as well as training and testing GNN models.

Main features:

  • Predefined atom-level and residue-level PPI feature types
    • e.g. atomic density, vdw energy, residue contacts, PSSM, etc.
  • Predefined target type
    • e.g. binary class, CAPRI categories, DockQ, RMSD, FNAT, etc.
  • Flexible definition of both new features and targets
  • Graphs feature mapping
  • Efficient data storage in HDF5 format
  • Support both classification and regression (based on PyTorch and PyTorch Geometric)

DeeprankCore documentation can be found here : https://deeprankcore.rtfd.io/.

Table of contents

Installation

Dependencies

Before installing deeprankcore you need to install:

  • reduce: follow the instructions in the README of the reduce repository.
    • How to build it without sudo privileges on a Linux machine. After having run make in the reduce/ root directory, go to reduce/reduce_src/Makefile and modify /usr/local/ to a folder in your home directory, such as /home/user_name/apps. Note that such a folder needs to be added to the PATH in the .bashrc file. Then run make install from reduce/.
  • msms: conda install -c bioconda msms. For MacOS with M1 chip users: you can follow these instructions.
  • pytorch: conda install pytorch torchvision torchaudio cpuonly -c pytorch or conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia, for taking advantage of GPUs.
  • pytorch-geometric: conda install pyg -c pyg
  • Dependencies for pytorch geometric from wheels: pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html.
    • Here, ${TORCH} and ${CUDA} should be replaced by the pytorch and CUDA versions installed. You can find these using:
      • python -c "import torch; print(torch.__version__)" and
      • python -c "import torch; print(torch.version.cuda)"
    • For example: https://data.pyg.org/whl/torch-2.0.0+cpu.html
  • Only if you have a MacOS with M1 chip, additional steps are needed:
    • conda install pytables
    • See this solution to install PyQt5 or run conda install pyqt

Deeprank-Core Package

Once the dependencies installed, you can install the latest release of deeprankcore using the PyPi package manager:

pip install deeprankcore

You can get all the new developments by cloning the repo and installing the code with

git clone https://github.com/DeepRank/deeprank-core
cd deeprank-core
pip install -e ./

Documentation

More extensive and detailed documentation can be found here.

Quick start

Graphs generation

The process of generating graphs takes as input .pdb files representing protein-protein structural complexes and the correspondent Position-Specific Scoring Matrices (PSSMs) in the form of .pssm files. Query objects describe how the graphs should be built.

from deeprankcore.query import QueryCollection, ProteinProteinInterfaceResidueQuery

queries = QueryCollection()

# Append data points
queries.add(ProteinProteinInterfaceResidueQuery(
    pdb_path = "1ATN_1w.pdb",
    chain_id1 = "A",
    chain_id2 = "B",
    targets = {
        "binary": 0
    },
    pssm_paths = {
        "A": "1ATN.A.pdb.pssm",
        "B": "1ATN.B.pdb.pssm"
    }
))
queries.add(ProteinProteinInterfaceResidueQuery(
    pdb_path = "1ATN_2w.pdb",
    chain_id1 = "A",
    chain_id2 = "B",
    targets = {
        "binary": 1
    },
    pssm_paths = {
        "A": "1ATN.A.pdb.pssm",
        "B": "1ATN.B.pdb.pssm"
    }
))
queries.add(ProteinProteinInterfaceResidueQuery(
    pdb_path = "1ATN_3w.pdb",
    chain_id1 = "A",
    chain_id2 = "B",
    targets = {
        "binary": 0
    },
    pssm_paths = {
        "A": "1ATN.A.pdb.pssm",
        "B": "1ATN.B.pdb.pssm"
    }
))

# Generate graphs and save them in hdf5 files
output_paths = queries.process("<output_folder>/<prefix_for_outputs>")

The user is free to implement his/her own query class. Each implementation requires the build method to be present.

Dataset

Data can be split in sets implementing custom splits according to the specific application. Utility splitting functions are currently under development.

Assuming that the training, validation and testing ids have been chosen (keys of the hdf5 file), then the corresponding graphs can be saved in hdf5 files containing only references (external links) to the original one. For example:

from deeprankcore.dataset import save_hdf5_keys

save_hdf5_keys("<original_hdf5_path.hdf5>", train_ids, "<train_hdf5_path.hdf5>")
save_hdf5_keys("<original_hdf5_path.hdf5>", valid_ids, "<val_hdf5_path.hdf5>")
save_hdf5_keys("<original_hdf5_path.hdf5>", test_ids, "<test_hdf5_path.hdf5>")

Now the GraphDataset objects can be defined:

from deeprankcore.dataset import GraphDataset

node_features = ["bsa", "res_depth", "hse", "info_content", "pssm"]
edge_features = ["distance"]
target = "binary"

# Creating GraphDataset objects
dataset_train = GraphDataset(
    hdf5_path = "<train_hdf5_path.hdf5>",
    node_features = node_features,
    edge_features = edge_features,
    target = target
)
dataset_val = GraphDataset(
    hdf5_path = "<val_hdf5_path.hdf5>",
    node_features = node_features,
    edge_features = edge_features,
    target = target

)
dataset_test = GraphDataset(
    hdf5_path = "<test_hdf5_path.hdf5>",
    node_features = node_features,
    edge_features = edge_features,
    target = target
)

Training

Let's define a Trainer instance, using for example of the already existing GNNs, GINet:

from deeprankcore.trainer import Trainer
from deeprankcore.ginet import GINet

trainer = Trainer(
    GINet,
    dataset_train,
    dataset_val,
    dataset_test
)

By default, the Trainer class creates the folder ./output for storing predictions information collected later on during training and testing. HDF5OutputExporter is the exporter used by default, but the user can specify any other implemented exporter or implement a custom one.

Optimizer (torch.optim.Adam by default) and loss function can be defined by using dedicated functions:

import torch

trainer.configure_optimizers(torch.optim.Adamax, lr = 0.001, weight_decay = 1e-04)

Then the Trainer can be trained and tested, and the model can be saved:

trainer.train(nepoch = 50, batch_size = 64, validate = True)
trainer.test()
trainer.save_model(filename = "<output_model_path.pth.tar>")

Custom GNN

It is also possible to define new network architectures:

import torch 

def normalized_cut_2d(edge_index, pos):
    row, col = edge_index
    edge_attr = torch.norm(pos[row] - pos[col], p=2, dim=1)
    return normalized_cut(edge_index, edge_attr, num_nodes=pos.size(0))

class CustomNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = SplineConv(d.num_features, 32, dim=2, kernel_size=5)
        self.conv2 = SplineConv(32, 64, dim=2, kernel_size=5)
        self.fc1 = torch.nn.Linear(64, 128)
        self.fc2 = torch.nn.Linear(128, 1)

    def forward(self, data):
        data.x = F.elu(self.conv1(data.x, data.edge_index, data.edge_attr))
        weight = normalized_cut_2d(data.edge_index, data.pos)
        cluster = graclus(data.edge_index, weight)
        data = max_pool(cluster, data)

        data.x = F.elu(self.conv2(data.x, data.edge_index, data.edge_attr))
        weight = normalized_cut_2d(data.edge_index, data.pos)
        cluster = graclus(data.edge_index, weight)
        x, batch = max_pool_x(cluster, data.x, data.batch)

        x = scatter_mean(x, batch, dim=0)
        x = F.elu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        return F.log_softmax(self.fc2(x), dim=1)

trainer = Trainer(
    CustomNet,
    dataset_train,
    dataset_val,
    dataset_test
)

trainer.train(nepoch=50, batch_size = 64)

h5x support

After installing h5xplorer (https://github.com/DeepRank/h5xplorer), you can execute the python file deeprankcore/h5x/h5x.py to explorer the connection graph used by deeprankcore. The context menu (right click on the name of the structure) allows to automatically plot the graphs using plotly.

Package development

  • Branching
    • When creating a new branch, please use the following convention: <issue_number>_<description>_<author_name>.
  • Pull Requests
    • When creating a pull request, please use the following convention: <type>: <description>. Example types are fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:, and others based on the Angular convention.
  • Software release
    • Before creating a new package release, make sure to have updated all version strings in the source code. An easy way to do it is to run bump2version [part] from command line after having installed bump2version on your local environment. Instead of [part], type the part of the version to increase, e.g. minor. The settings in .bumpversion.cfg will take care of updating all the files containing version strings.