- Project description
- Citation
- Installation using conda
- Prerequisites
- How to install
- How to use maxsmi
- Examples
- How to train and evaluate a model using augmentation
- How to make predictions
- Examples
- Documentation
- Repository structure and important files
- Acknowledgments
Accurate prediction of molecular properties and activities is one of the main goals of computer-aided drug design, in which deep learning has become an important tool. Since neural networks are data greedy and both physico-chemical and bioactivity data sets remain scarce, augmentation techniques have become a powerful aid for accurate predictions.
This repository provides the code to exploit data augmentation, based on the fact that one compound can be represented by various SMILES (simplified molecular-input line-entry system) strings.
Augmentation strategies
- No augmentation
- Augmentation with duplication
- Augmentation without duplication
- Augmentation with reduced duplication
- Augmentation with estimated maximum
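As a rough illustration of the first strategies listed above, the sketch below contrasts keeping and dropping repeated strings when several random SMILES are drawn for one compound. It is not the maxsmi implementation: the function names are hypothetical, and a small fixed pool of equivalent ethanol SMILES stands in for a real random-SMILES generator.

```python
import random

# Hypothetical pool of equivalent SMILES for ethanol. It stands in for a
# real random-SMILES generator such as a cheminformatics toolkit.
ETHANOL_SMILES = ["CCO", "OCC", "C(C)O", "C(O)C"]


def sample_random_smiles(pool, rng):
    """Draw one SMILES string at random from the pool (toy generator)."""
    return rng.choice(pool)


def augment_with_duplication(pool, n, rng):
    """Draw n random SMILES and keep every draw, repeats included."""
    return [sample_random_smiles(pool, rng) for _ in range(n)]


def augment_without_duplication(pool, n, rng):
    """Draw n random SMILES and keep each distinct string only once."""
    draws = augment_with_duplication(pool, n, rng)
    return list(dict.fromkeys(draws))  # preserves first-seen order


rng = random.Random(0)  # seeded so the sketch is reproducible
with_dup = augment_with_duplication(ETHANOL_SMILES, 10, rng)
without_dup = augment_without_duplication(ETHANOL_SMILES, 10, rng)
print(len(with_dup), with_dup)
print(len(without_dup), without_dup)
```

With duplication, the augmented set always has exactly the requested size. Without duplication, it can be smaller, because a small molecule only has a limited number of distinct SMILES.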
Data sets
- Physico-chemical data from MoleculeNet, available as part of DeepChem
  - ESOL
  - FreeSolv
  - Lipophilicity
- Bioactivity data on the EGFR kinase, retrieved from Kinodata
Deep learning models
- 1D convolutional neural network (CONV1D)
- 2D convolutional neural network (CONV2D)
- Recurrent neural network (RNN)
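Models like these need a numeric representation of each SMILES string. One common choice is a character-level one-hot matrix, sketched below. This README does not state which encoding maxsmi uses, so the function name, the alphabet construction, and the zero padding are assumptions for illustration only. A character-level alphabet is also a simplification, since two-letter atoms such as Cl or Br would need their own tokens.

```python
def one_hot_encode(smiles, alphabet, max_length):
    """Return a max_length x len(alphabet) matrix of 0/1 rows, zero-padded."""
    index = {char: i for i, char in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(max_length)]
    for position, char in enumerate(smiles):
        matrix[position][index[char]] = 1
    return matrix


# Two equivalent SMILES for ethanol, used as a toy data set.
smiles_list = ["CCO", "C(C)O"]

# The alphabet is every character seen in the data set, in sorted order.
alphabet = sorted(set("".join(smiles_list)))

# All matrices are padded to the length of the longest string.
max_length = max(len(s) for s in smiles_list)

encoded = [one_hot_encode(s, alphabet, max_length) for s in smiles_list]
for smiles, matrix in zip(smiles_list, encoded):
    print(smiles, matrix)
```

Each row of a matrix corresponds to one position in the string and each column to one character of the alphabet, so rows beyond the end of a shorter string stay all zero.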
The results of our study show that data augmentation improves accuracy regardless of the deep learning model and the size of the data set. The best strategy leads to the Maxsmi models, which are available here for predictions on novel compounds for the provided data sets.
If you use maxsmi, don't forget to cite the work. The paper is available at https://doi.org/10.1016/j.ailsci.2021.100014.
@article{kimber_2021_AILSCI,
title = {Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning},
author = {Talia B. Kimber and Maxime Gagnebin and Andrea Volkamer},
journal = {Artificial Intelligence in the Life Sciences},
volume = {1},
pages = {100014},
year = {2021},
issn = {2667-3185},
doi = {https://doi.org/10.1016/j.ailsci.2021.100014},
url = {https://www.sciencedirect.com/science/article/pii/S2667318521000143}
}
Anaconda and Git should be installed. See Anaconda's website and Git's website for download.
- Clone the GitHub repository:
git clone https://github.com/volkamerlab/maxsmi.git
- Change directory:
cd maxsmi
- Create the conda environment:
conda env create -n maxsmi -f devtools/conda-envs/test_env.yaml
- Activate the environment:
conda activate maxsmi
- Install the maxsmi package:
pip install -e .
To get an overview of all available options:
python maxsmi/full_workflow.py --help
To train a model with the ESOL data set, augmenting the training set 5 times and the test set 2 times, training for 5 epochs:
python maxsmi/full_workflow.py --task="ESOL" --aug-strategy-train="augmentation_without_duplication" --aug-nb-train=5 --aug-nb-test=2 --nb-epochs 5
To evaluate without ensemble learning, add the --eval-strategy=False flag as shown below:
Note: with ensemble learning, one prediction is computed per compound, whereas without ensemble learning, one prediction is computed per SMILES.
python maxsmi/full_workflow.py --task="ESOL" --aug-strategy-train="augmentation_without_duplication" --aug-nb-train=5 --aug-nb-test=2 --nb-epochs 5 --eval-strategy=False
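The difference between per SMILES and per compound predictions can be sketched as follows. This is not the maxsmi code: the compound names and predicted values are made up, and aggregating with the mean and standard deviation is an assumption about how an ensemble over SMILES could be summarized.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical per SMILES predictions: (compound, SMILES, predicted value).
per_smiles_predictions = [
    ("ethanol", "CCO", -0.5),
    ("ethanol", "OCC", -0.7),
    ("propane", "CCC", -1.9),
    ("propane", "C(C)C", -2.1),
]


def aggregate_per_compound(predictions):
    """Group per SMILES predictions by compound and return (mean, std)."""
    grouped = defaultdict(list)
    for compound, _smiles, value in predictions:
        grouped[compound].append(value)
    return {
        compound: (mean(values), pstdev(values))
        for compound, values in grouped.items()
    }


per_compound = aggregate_per_compound(per_smiles_predictions)
for compound, (avg, std) in per_compound.items():
    print(f"{compound}: mean={avg:.2f}, std={std:.2f}")
```

In this sketch the spread across the SMILES of one compound doubles as a simple indication of how much the individual predictions agree.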
To train a model with all arguments set explicitly:
Note: This command uses the default number of epochs (which is set to 250). Please allow time for the model to train.
python maxsmi/full_workflow.py --task="FreeSolv" --string-encoding="smiles" --aug-strategy-train="augmentation_with_duplication" --aug-strategy-test="augmentation_with_reduced_duplication" --aug-nb-train=5 --aug-nb-test=2 --ml-model="CONV1D" --eval-strategy=True --nb-epochs=250
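To compare several augmentation numbers, the training command can be assembled from Python. The sketch below only builds and prints the command lines using flags that appear in this README. The helper name and the chosen values of the augmentation number are assumptions, and the subprocess call is left commented out so that nothing is trained by accident.

```python
import shlex
import sys


def build_training_command(task, strategy, nb_train, nb_test, nb_epochs):
    """Return the full_workflow.py call as an argument list (hypothetical helper)."""
    return [
        sys.executable,  # the interpreter of the active environment
        "maxsmi/full_workflow.py",
        f"--task={task}",
        f"--aug-strategy-train={strategy}",
        f"--aug-nb-train={nb_train}",
        f"--aug-nb-test={nb_test}",
        f"--nb-epochs={nb_epochs}",
    ]


# Sweep over three illustrative training augmentation numbers.
commands = [
    build_training_command("ESOL", "augmentation_without_duplication", n, 2, 5)
    for n in (1, 5, 10)
]

for command in commands:
    print(shlex.join(command))
    # To actually train, run from the repository root:
    # import subprocess; subprocess.run(command, check=True)
```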
To train a model with early stopping (this command may take a while to run):
python maxsmi/full_workflow_earlystopping.py --aug-nb-train=3 --aug-nb-test=2
These predictions use the precalculated Maxsmi models (the best performing models in the study).
To predict the affinity of a compound against the EGFR kinase, e.g. given by the SMILES CC1CC1, run:
python maxsmi/prediction_unlabeled_data.py --task="affinity" --smiles_prediction="CC1CC1"
To predict the lipophilicity of the drug semaxanib, run:
python maxsmi/prediction_unlabeled_data.py --task="lipophilicity" --smiles_prediction="O=C2C(\c1ccccc1N2)=C/c3c(cc([nH]3)C)C"
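SMILES strings such as the one above contain backslashes and parentheses, which need care when the command is built programmatically. A minimal sketch, assuming the script is called from the repository root: a raw string keeps the backslash intact in Python, and passing the arguments as a list avoids shell quoting altogether. The helper name is hypothetical, and the call itself is left commented out.

```python
import sys

# Raw string: without the r prefix, "\c" is an invalid escape sequence
# and triggers a warning in recent Python versions.
SEMAXANIB = r"O=C2C(\c1ccccc1N2)=C/c3c(cc([nH]3)C)C"


def build_prediction_command(task, smiles):
    """Return the prediction call as an argument list (hypothetical helper)."""
    return [
        sys.executable,
        "maxsmi/prediction_unlabeled_data.py",
        f"--task={task}",
        f"--smiles_prediction={smiles}",
    ]


command = build_prediction_command("lipophilicity", SEMAXANIB)
print(command[-1])
# To actually predict, run from the repository root:
# import subprocess; subprocess.run(command, check=True)
```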
The maxsmi package documentation is available here.
|-- LICENSE
|-- README.md
|-- devtools
|-- docs
|-- maxsmi
| |-- output_ <- Saved outputs for results analysis
| |-- prediction_models <- Weights for Maxsmi models
| |-- pytorch_utils <- Utilities for PyTorch
| |-- results_analysis <- Notebooks for results analysis
| |-- tests <- Unit tests
| |-- utils <- Utilities for data, encodings, smiles
| |-- augmentation_strategies.py <- SMILES augmentation strategies
| |-- full_workflow.py <- Training and evaluation of deep learning models
| |-- full_workflow_earlystopping.py <- Training using early stopping
| |-- prediction_unlabeled_data.py <- Maxsmi models available for user prediction
Project based on the Computational Molecular Science Python Cookiecutter version 1.4.
Documentation and packaging: A special thank you to dominiquesydow for sharing her valuable knowledge with patience and kindness.
Copyright (c) 2020, Talia B. Kimber at VolkamerLab.