Integrating Deep Neural Networks and Symbolic Inference for Organic Reactivity Prediction [ChemRxiv]
Code for the inference pipeline is here, and the complete training pipeline will be updated once the paper is accepted.
conda=4.7.10
(from here with Python 3.6.9)tensorflow-gpu=1.12.0
(from conda, and GPU is recommended)rdkit=2019.09.1
(from conda)gurobi=8.1.1
(from conda, and free academic licence can be found here)h5py=2.9.0
(from conda)- other softwares required by the above packages
Follwoings are the intructions for installing our software and the dependencies specified above:
# the total installation time should be less than 15 minutes
# download the source code
git clone https://github.com/WesleyyC/symbolic_rxn.git
cd symbolic_rxn
# create a py3 conda env
conda create -n symbolic_rxn python=3
conda activate symbolic_rxn
# install the dependencies
# tensorflow-gpu
conda install tensorflow-gpu=1.12.0
# RDKit
conda install -c conda-forge rdkit=2019.09.1
# gurobi
conda config --add channels http://conda.anaconda.org/gurobi
conda install gurobi=8.1.1
To better balance the GPU/CPU workload during training/inference, we precompute the graph features and store them in HDF5 format for fast random query. Followings are the instructions for digesting the raw text input into our graph feature representation.
# prepare the data
unzip data.zip
# digesting a small demo dataset w/ 2000 reactions (~30 secs)
python -m reactivity_prediction.data_digestion --input data/demo.txt --output data/hdf5_demo
# digesting the full test dataset (~10 mins)
python -m reactivity_prediction.data_digestion --input data/test.txt --output data/hdf5_test
We provided the pre-trained models for reproducing our Top-K performance on the USPTO dataset:
# download the ensembles models
wget https://www.dropbox.com/s/x8ruovxxsc5q7q6/ckpt.zip
# unzip the models
unzip ckpt.zip
Followings are the instructions for running a single pre-trained model on the provided demo dataset:
# reactivity prediction w/ existing ckpt (<1 mins with GPU)
python -m reactivity_prediction.run_model \
--mode infer --ckpt ckpt/mdl_0 \
--eval_input data/hdf5_demo --eval_output demo_output
# evaluate reactivity prediction (<30 secs)
python -m reactivity_prediction.eval --input demo_output/delta_predictions.pkl
# and the program will report the bond changes coverage:
# ==========================================================
# ================ Reaction Bond Prediction ================
# ==========================================================
# as well as the delta prediction accuracy/F-1 socre:
# ==========================================================
# ==================== Delta Evaluation ====================
# ==========================================================
# run symbolic inference and evaluation (~30 mins with multi cores)
python -m octet_sampling.run_sampler --input demo_output/delta_predictions.pkl
# and the program will report in the last row with the Top-K prediction accuracy:
# ==================================== #2000 of Reactions Evaluated ====================================
# Gurobi: Top1: ... Top2: ... Top3: ... Top5: ... Top20: ... Average Time: ...
Followings are the instruction for reproducing our reprorted Top-k prediction accuracy on the USPTO test dataset using the ensemble models:
# running the following script to genereate ensembled predictions (~80 mins or ~10 mins/model)
bash reactivity_prediction/run_ensemble_inference.sh
# evaluate reactivity prediction (<10 mins)
python -m reactivity_prediction.eval --input ensemble_output/delta_predictions.pkl
# run symbolic inference and evaluation (<1 day)
python -m octet_sampling.run_sampler --input ensemble_output/delta_predictions.pkl
- Wesley Wei Qian ([email protected])