PharML

PharML is a framework for predicting compound affinity for protein structures. It uses a novel Molecular-Highway Graph Neural Network (MH-GNN) architecture based on state-of-the-art techniques in deep learning. This repository contains the visualization, preprocessing, training, and inference code, written in Python and C. We also provide an ensemble of pre-trained models that can be used to quickly generate rank-ordered predictions of compound affinity relative to a given target.

DISCLAIMER: Compounds predicted by PharML.Bind should not be used without consulting a doctor or pharmacist. All results should be considered unverified and used only as a starting point for further investigation. Use at your own risk!

Setup

  1. Edit the conda environment script to reflect your system configuration

    vim tools/environments/setup_conda_env.sh

    -> Ensure you have cudatoolkit and appropriate drivers to match the conda environment

    -> See README under /tools/environments for more details

  2. Activate the conda environment

    source activate /path/to/conda-pharml-env

    -> Install the required packages, for example with pip install -r tools/requirements.txt

    [todo: list package version requirements and put those in tools/environments/requirements.txt]

  3. Preprocess the Dataset

    -> The examples directory is set up with scripts for:

       a) preprocessing of the COVID-19 structure PDB and the BindingDB FDA-approved compounds in SDF format
       b) preprocessing of the full BindingDB dataset
       c) visualization of structures with associated compounds

    -> After preprocessing completes, you will have a directory containing:

       -> data/lig: the ligand graph files
       -> data/nhg: the protein neighborhood graph (NHG) files
       -> data/pdb: the raw PDB files used to generate the ligands and NHGs
       -> data/map: the map file used for inference, which specifies the ligand-to-target pairs to be tested
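Before launching inference, it can help to confirm that preprocessing produced the layout described above. The following is a minimal sketch, not part of the repository: the subdirectory names come from this README, while the helper `missing_subdirs` and the default root `data` are hypothetical.

```python
from pathlib import Path

# Subdirectory names as listed in this README; the helper itself is hypothetical.
EXPECTED_SUBDIRS = ("lig", "nhg", "pdb", "map")


def missing_subdirs(data_root):
    """Return the expected subdirectories that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


# Example: report anything missing under ./data (assumed location).
missing = missing_subdirs("data")
if missing:
    print("missing:", ", ".join(missing))
else:
    print("all expected subdirectories present")
```

An empty result only shows that the directories exist; it does not validate the contents of the graph or map files.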

Running

  1. Test inference across the example map file

    -> Launch with the following command to test against the COVID-19 6VSU structure and BindingDB's FDA-approved compound list generated in Setup step 3

     python mldock_gnn.py \
         --map_test ../datasets/covid19/map/6VSU.map \
         --batch_size_test 16 \
         --mlp_latent ${MLP_LATENT} \
         --mlp_layers ${MLP_LAYERS} \
         --gnn_layers ${GNN_LAYERS} \
         --hvd True \
         --num_features ${NUM_FEATURES} \
         --data_threads 2 \
         --mode classification \
         --inference_only True \
         --restore=../pretrain-models/mh-gnnx5-ensemble/ensemble_member_${n}/checkpoints/model0.ckpt \
         --inference_out covid19_inference_${MAP_TEST_NAME}.out \
         --epochs 1 2>&1 | tee log-covid19-${MAP_TEST_NAME}.out
    

    -> The --inference_out option tells PharML.Bind to save the outputs to disk, indexed by compound ID
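The command above depends on several shell variables (`n`, `MAP_TEST_NAME`, and the model hyperparameters). As a rough sketch, the same argument list can be assembled in Python so that nothing is left unset. The flag names and paths are copied from the command above. The default hyperparameter values are taken from the `Namespace(...)` line in the sample log later in this README and may not match the pre-trained ensemble. The helper names and the member count of 5 (guessed from the `mh-gnnx5-ensemble` directory name) are assumptions.

```python
import shlex


def build_inference_command(pdb_id, member, mlp_latent="32,32", mlp_layers="2,2",
                            gnn_layers="5,5", num_features="16,16"):
    """Build the mldock_gnn.py argument list for one ensemble member (hypothetical helper)."""
    restore = ("../pretrain-models/mh-gnnx5-ensemble/"
               f"ensemble_member_{member}/checkpoints/model0.ckpt")
    return [
        "python", "mldock_gnn.py",
        "--map_test", f"../datasets/covid19/map/{pdb_id}.map",
        "--batch_size_test", "16",
        "--mlp_latent", mlp_latent,
        "--mlp_layers", mlp_layers,
        "--gnn_layers", gnn_layers,
        "--hvd", "True",
        "--num_features", num_features,
        "--data_threads", "2",
        "--mode", "classification",
        "--inference_only", "True",
        f"--restore={restore}",
        "--inference_out", f"covid19_inference_{pdb_id}.out",
        "--epochs", "1",
    ]


def ensemble_commands(pdb_ids, num_members):
    """Yield one command per (structure, ensemble member) pair."""
    for pdb_id in pdb_ids:
        for member in range(num_members):
            yield build_inference_command(pdb_id, member)


# num_members=5 is an assumption based on the directory name, not a documented value.
for cmd in ensemble_commands(["6VSU"], num_members=5):
    print(shlex.join(cmd))
```

This only prints the commands. Passing each list to `subprocess.run` would execute them, but on a cluster the `srun` wrapper shown in the next step is still needed.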

  2. Run inference with each ensemble member to generate a rank-ordered compound set

    -> You can also use the runit_inference.sh script as follows:

     Note: This launches one pre-trained ensemble member per iteration, indexed by n. Looping over the members, it generates predictions for the compound set against the target PDB ID given by MAP_TEST_NAME. MAP_TEST_NAME iterates over the structure list, which by default is PDB_ID_LIST="6LZG 6VSB 6LU7"
    
     salloc -N 8 -n 64 --ntasks-per-node=8 -t 24:00:00
    
     #Start CUDA MPS Server on each of the Dense GPU nodes
     time srun --cpu_bind=none -p spider -C V100 -l -N 8 --ntasks-per-node=1 -u ./restart_mps.sh 2>&1 | tee mps_result.txt
    
     #Start the inference run on a single model
     srun -c 4 --hint=multithread -C V100 -p spider -l -N 8 -n 64 --ntasks-per-node=8 --ntasks-per-socket=4 -u --cpu_bind=none python mldock_gnn.py \
     --map_test ../datasets/covid19/map/${MAP_TEST_NAME}.map \
     --batch_size_test 16 \
     --mlp_latent ${MLP_LATENT} \
     --mlp_layers ${MLP_LAYERS} \
     --gnn_layers ${GNN_LAYERS} \
     --hvd True \
     --num_features ${NUM_FEATURES} \
     --data_threads 2 \
     --mode classification \
     --inference_only \
     --restore=../pretrain-models/mh-gnnx5-ensemble/ensemble_member_${n}/checkpoints/model0.ckpt \
     --inference_out covid19_inference_${MAP_TEST_NAME}.out \
     --epochs 1 2>&1 | tee covid19-${MAP_TEST_NAME}.out
    

    -> If you launch the runit_inference.sh script via srun under the SLURM workload manager, you should see the following output:

     "Starting COVID-19 Structure Inference for 6VSB, 6LZG, 6LU7 for ensemble member 0..."
     "========================================================================================"
     "--> 6LZG: Spike receptor-binding domain complexed with its receptor ACE2: https://www.rcsb.org/structure/6LZG"
     "--> 6VSB: Prefusion 2019-nCoV spike glycoprotein with a single receptor-binding domain up: https://www.rcsb.org/structure/6vsb"
     "--> Starting at: 4/3/2020 12:00:00 PM CST"
    

    -> Next, PharML.Bind will initialize and TensorFlow will print some logging information:

     0: WARNING:tensorflow:From mldock_gnn.py:57: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
     0: Initialization of horovod complete...
     0: Rank 0  is saving inference output to model0_dataset_inference_0.map
     0: ----------------------------------------
     0: PharML.Bind-GNN: Version 1.0.1 - Framework for Open Therapeutics with Graph Neural Networks.
     0: ----------------------------------------
     0: ============================================================================================
     0: ----------------------------------------
     0:   Developed by
     0: ----------------------------------------
     0: ----------------------------------------
     0:       Jacob Balma: [email protected]
     0: ----------------------------------------
     0: ----------------------------------------
     0:       Aaron Vose:  [email protected]
     0: ----------------------------------------
     0: ----------------------------------------
     0:       Yuri Petersen: [email protected]
     0: ----------------------------------------
     0: 
     0: ----------------------------------------
     0: This work is supported by collaboration with Cray, Inc, Medical University of South Carolina (MUSC) and Hewlett Packard Enterprise (HPE). 
     0: ----------------------------------------
     0: ----------------------------------------
     0: ============================================================================================
     0: ----------------------------------------
     0: Namespace(batch_size=2, batch_size_test=2, data_threads=4, debug=True, epochs=1, gnn_layers='5,5', hvd=True, inference_only=True, inference_out='model0_dataset_inference.map', lr_init=0.01, map_test='/lus/scratch/jbalma/DataSets/Binding/mldock/tools/covid19-fda-bindingdb/data-6vsb-bindingdb-fda/map/dataset.map', map_train='/lus/scratch/jbalma/avose_backup/data/map/l0_1pct_train.map', mlp_latent='32,32', mlp_layers='2,2', mode='classification', num_features='16,16', plot_history=False, restore='./pretrained-models/mh-gnnx5-ensemble/model_0/checkpoints/model0.ckpt', use_clr=False, use_fp16=False)
     0: ----------------------------------------
     0: Loading data.
     0: ----------------------------------------
     0: Loading map file as distributed dataset for 8 ranks:
     0:   Map file:          /lus/scratch/jbalma/data/6vsb-bindingdb-fda/map/l0_1pct_train.map
     0:   Map items:         23608
     0:   Worker chunk size: 2951
     0:   Number of items:   2951
     0:   Sum(target[0]):    1422.000000
     0:   Avg(target[0]):    0.481871
     0:   Total load time:   0.4
     0: Done processing map file.
     0: Loading map file as distributed dataset for 8 ranks:
     0:   Map file:          /lus/scratch/jbalma/data/6vsb-bindingdb-fda/map/dataset.map
     0:   Map items:         440
     0:   Worker chunk size: 55
     0:   Number of items:   55
     0:   Sum(target[0]):    55.000000
     0:   Avg(target[0]):    1.000000
     0:   Total load time:   0.0
     0: Done processing map file.
    

    -> When inference finishes over all ensemble members, you will find the results stored to disk (if --inference_out was set) in the pharml/results/${MAP_TEST_NAME} directory
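The map-loading lines in the sample log above are consistent with each map file being split evenly across the 8 ranks, with `Avg(target[0])` equal to `Sum(target[0])` divided by the per-rank item count. The short check below reproduces those numbers. The function name is hypothetical, and how the loader treats a remainder when the item count is not divisible by the rank count is not shown in the log, so that case is an open assumption.

```python
def shard_stats(map_items, num_ranks):
    """Return (items per rank, leftover items) for an even split across ranks."""
    return divmod(map_items, num_ranks)


# Values copied from the sample log: 23608 and 440 map items over 8 ranks.
for map_items in (23608, 440):
    chunk, leftover = shard_stats(map_items, 8)
    print(f"{map_items} items / 8 ranks -> chunk {chunk}, leftover {leftover}")

# Per-rank average of the first target column, as reported for the first map file.
print(round(1422 / 2951, 6))
```

Both map files in the log divide evenly, which matches the reported worker chunk sizes of 2951 and 55.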

Evaluate and Visualize

  1. Evaluate the results

    -> Using the scripts provided in pharml/results, we can evaluate the rank-ordered compound set generated in the previous step

    "[TODO: provide example of using the ensemble summary tool]"
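Until that example is available, the idea of turning per-member predictions into a single rank-ordered list can be sketched as follows. This is not the repository's ensemble summary tool. The on-disk format written by --inference_out is not documented here, so the sketch starts from already-parsed dictionaries, and averaging scores across members is an assumed aggregation rule. The compound IDs and scores are made-up toy values.

```python
from collections import defaultdict
from statistics import mean


def rank_compounds(member_scores):
    """Average each compound's score across members and sort from highest to lowest.

    member_scores: list of dicts mapping compound ID -> predicted score,
    one dict per ensemble member (hypothetical in-memory format).
    """
    pooled = defaultdict(list)
    for scores in member_scores:
        for compound_id, score in scores.items():
            pooled[compound_id].append(score)
    averaged = {cid: mean(values) for cid, values in pooled.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Toy example with two ensemble members and three compounds.
toy_scores = [
    {"cmpd_A": 0.9, "cmpd_B": 0.2, "cmpd_C": 0.5},
    {"cmpd_A": 0.7, "cmpd_B": 0.4, "cmpd_C": 0.6},
]
for compound_id, score in rank_compounds(toy_scores):
    print(compound_id, round(score, 3))
```

Compounds that are missing from some members are averaged over the members that did score them. A stricter variant could require a score from every member.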

  2. Visualize the target with predicted compounds

    -> Using mlvoxelizer, we can visualize the compounds, proteins, or both relative to one another (when active-site training was used)

    "[TODO: Fill in an example of using mlvoxelizer for visualization of a covid-19 structure]"
