Maxsmi: data augmentation for molecular property prediction using deep learning

Project description

SMILES augmentation for deep-learning-based molecular property and activity prediction.

Accurate prediction of molecular properties and activities is one of the main goals of computer-aided drug design, where deep learning has become an important tool. Because neural networks are data-hungry and both physico-chemical and bioactivity data sets remain scarce, augmentation techniques have become a powerful aid for accurate predictions.

This repository provides the code base for exploiting data augmentation, using the fact that one compound can be represented by various SMILES (simplified molecular-input line-entry system) strings.

Augmentation strategies

No augmentation
Augmentation with duplication
Augmentation without duplication
Augmentation with reduced duplication
Augmentation with estimated maximum
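
As a rough illustration of how the first three strategies differ, the sketch below draws random SMILES for one compound and either keeps or drops repeated strings. It does not use the maxsmi API: the function names are hypothetical, the behaviour is inferred from the strategy names only, and a hard-coded list of ethanol SMILES stands in for a real random-SMILES generator. The reduced-duplication and estimated-maximum strategies are not shown.

```python
import random

# Three valid SMILES strings for the same compound (ethanol).
# Assumption: in practice a cheminformatics toolkit would generate these.
ETHANOL_SMILES = ["CCO", "OCC", "C(O)C"]


def sample_smiles(rng):
    """Hypothetical stand-in for a random-SMILES generator."""
    return rng.choice(ETHANOL_SMILES)


def no_augmentation(smiles):
    """Keep the single input string unchanged."""
    return [smiles]


def with_duplication(rng, n):
    """Draw n random SMILES and keep every draw, repeats included."""
    return [sample_smiles(rng) for _ in range(n)]


def without_duplication(rng, n):
    """Draw n random SMILES and keep each distinct string once."""
    draws = with_duplication(rng, n)
    return list(dict.fromkeys(draws))  # preserves first-seen order


# Same seed for both calls, so both see the same sequence of draws.
dup = with_duplication(random.Random(0), 10)
uniq = without_duplication(random.Random(0), 10)
print(len(dup), dup)
print(len(uniq), uniq)
```

With duplication, the augmented set always has the requested size. Without duplication, it can be smaller, since a small molecule has only a few distinct SMILES.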

Data sets

Physico-chemical data from MoleculeNet, available as part of DeepChem
- ESOL
- FreeSolv
- lipophilicity
Bioactivity data on the EGFR kinase, retrieved from Kinodata

Deep learning models

1D convolutional neural network (CONV1D)
2D convolutional neural network (CONV2D)
Recurrent neural network (RNN)

The results of our study show that data augmentation improves the accuracy independently of the deep learning model and the size of the data. The best strategy leads to the Maxsmi models, which are available here for predictions on novel compounds on the provided data sets.

Citation

If you use maxsmi, please cite the paper. Its DOI and URL are given in the entry below.

@article{kimber_2021_AILSCI,
  title = {Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning},
  author = {Talia B. Kimber and Maxime Gagnebin and Andrea Volkamer},
  journal = {Artificial Intelligence in the Life Sciences},
  volume = {1},
  pages = {100014},
  year = {2021},
  issn = {2667-3185},
  doi = {https://doi.org/10.1016/j.ailsci.2021.100014},
  url = {https://www.sciencedirect.com/science/article/pii/S2667318521000143}
}

Installation using conda

Prerequisites

Anaconda and Git should be installed. See Anaconda's website and Git's website for download.

How to install

Clone the GitHub repository:

git clone https://github.com/volkamerlab/maxsmi.git

Change directory:

cd maxsmi

Create the conda environment:

conda env create -n maxsmi -f devtools/conda-envs/test_env.yaml

Activate the environment:

conda activate maxsmi

Install the maxsmi package:

pip install -e .

How to use maxsmi

Examples

How to train and evaluate a model using augmentation

To get an overview of all available options:

python maxsmi/full_workflow.py --help

To train a model with the ESOL data set, augmenting the training set 5 times and the test set 2 times, training for 5 epochs:

python maxsmi/full_workflow.py --task="ESOL" --aug-strategy-train="augmentation_without_duplication" --aug-nb-train=5 --aug-nb-test=2 --nb-epochs 5

To evaluate without ensemble learning, add the --eval-strategy=False flag as below:

Note: With ensemble learning, one prediction is computed per compound; without it, one prediction is computed per SMILES.

python maxsmi/full_workflow.py --task="ESOL" --aug-strategy-train="augmentation_without_duplication" --aug-nb-train=5 --aug-nb-test=2 --nb-epochs 5 --eval-strategy=False
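
The difference between per-SMILES and per-compound predictions noted above can be sketched as follows. This is a minimal illustration, not the maxsmi implementation: the prediction values are invented, `per_compound` is a hypothetical helper, and averaging is an assumed aggregation rule. The standard deviation is included only as one plausible indicator of how much the per-SMILES predictions disagree.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical per-SMILES predictions: (compound id, SMILES, predicted value).
per_smiles = [
    ("ethanol", "CCO", 1.0),
    ("ethanol", "OCC", 1.2),
    ("propane", "CCC", -1.5),
    ("propane", "C(C)C", -1.7),
]


def per_compound(predictions):
    """Group per-SMILES predictions by compound; return (mean, std) per compound."""
    grouped = defaultdict(list)
    for compound, _smiles, value in predictions:
        grouped[compound].append(value)
    return {c: (mean(v), pstdev(v)) for c, v in grouped.items()}


ensemble = per_compound(per_smiles)
for compound, (avg, spread) in ensemble.items():
    print(f"{compound}: mean={avg:.2f} std={spread:.2f}")
```

Without ensemble learning, the evaluation would work with the four rows of `per_smiles` directly. With it, each compound contributes a single aggregated value.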

To train a model with all chosen arguments:

Note: This command uses the default number of epochs (which is set to 250). Please allow time for the model to train.

python maxsmi/full_workflow.py --task="FreeSolv" --string-encoding="smiles" --aug-strategy-train="augmentation_with_duplication" --aug-strategy-test="augmentation_with_reduced_duplication" --aug-nb-train=5 --aug-nb-test=2 --ml-model="CONV1D" --eval-strategy=True --nb-epochs=250

To train a model with early stopping (this command could take time to be executed):

python maxsmi/full_workflow_earlystopping.py --aug-nb-train=3 --aug-nb-test=2
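
For readers unfamiliar with early stopping, the sketch below shows a common patience-based form of the technique. The stopping criterion used by full_workflow_earlystopping.py is not described here, so the `patience` parameter, the function name, and the loss values are all assumptions made for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    val_losses: validation loss per epoch (hypothetical, precomputed here).
    patience: number of consecutive non-improving epochs tolerated.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without a new best loss
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch)
```

In this example the best loss is reached at epoch 2, and training stops three non-improving epochs later, before the late improvement in the last entry is seen. That is the usual trade-off when choosing the patience.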

How to make predictions

These predictions use the precalculated Maxsmi models (best performing models in the study).

To predict the affinity of a compound against the EGFR kinase, e.g. given by the SMILES CC1CC1, run:

python maxsmi/prediction_unlabeled_data.py --task="affinity" --smiles_prediction="CC1CC1"

To predict the lipophilicity of the drug semaxanib, run:

python maxsmi/prediction_unlabeled_data.py --task="lipophilicity" --smiles_prediction="O=C2C(\c1ccccc1N2)=C/c3c(cc([nH]3)C)C"
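
To run several predictions in a row, the two commands above could be assembled from Python, as in the sketch below. It uses only the script path and the two flags shown above; `build_prediction_command` is a hypothetical helper, and the sketch assumes it is run from the repository root with the maxsmi environment active. The call that would launch each command is left commented out.

```python
import shlex
import sys

SCRIPT = "maxsmi/prediction_unlabeled_data.py"  # path relative to the repository root


def build_prediction_command(task, smiles):
    """Build the argument list for one prediction run."""
    return [sys.executable, SCRIPT, f"--task={task}", f"--smiles_prediction={smiles}"]


# (task, SMILES) pairs taken from the two examples above.
queries = [
    ("affinity", "CC1CC1"),
    ("lipophilicity", r"O=C2C(\c1ccccc1N2)=C/c3c(cc([nH]3)C)C"),
]

commands = [build_prediction_command(task, smi) for task, smi in queries]
for cmd in commands:
    print(shlex.join(cmd))  # shell-quoted form, for inspection only
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with characters such as backslashes and brackets, which are common in SMILES.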

Documentation

The maxsmi package documentation is available here.

Repository structure and important files

|-- LICENSE
|-- README.md
|-- devtools
|-- docs
|-- maxsmi
|   |-- output_                         <- Saved outputs for results analysis
|   |-- prediction_models               <- Weights for Maxsmi models
|   |-- pytorch_utils                   <- Utilities for PyTorch
|   |-- results_analysis                <- Notebooks for results analysis
|   |-- tests                           <- Unit tests
|   |-- utils                           <- Utilities for data, encodings, smiles
|   |-- augmentation_strategies.py      <- SMILES augmentation strategies
|   |-- full_workflow.py                <- Training and evaluation of deep learning models
|   |-- full_workflow_earlystopping.py  <- Training using early stopping
|   |-- prediction_unlabeled_data.py    <- Maxsmi models available for user prediction

Acknowledgements

Project based on the Computational Molecular Science Python Cookiecutter version 1.4.

Documentation and packaging: A special thank you to dominiquesydow for sharing her valuable knowledge with patience and kindness.
