-
Notifications
You must be signed in to change notification settings - Fork 27
License
GPL-3.0, GPL-3.0 licenses found
Licenses found
GPL-3.0
LICENSE
GPL-3.0
COPYING
Merck/DeepNeuralNet-QSAR
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
=================================================================== ============ DeepNeuralNet_QSAR Documentation ============== =================================================================== Authors: Yuting Xu, Junshui Ma. Contact: [email protected], [email protected]. Affiliation: Merck Biometrics Research, Merck Sharp & Dohme Corp. a subsidiary of Merck & Co., Inc., Kenilworth, NJ, USA. Date: 02/07/2017 Acknowledgement: This set of codes were developed based on George Dahl's Kaggle codes in Dec. 2012. If you use the DeepNeuralNet_QSAR for scientific work that gets published, you should include in that publication a citation of the paper below: Xu, Yuting, Junshui Ma, Andy Liaw, Robert P. Sheridan, and Vladimir Svetnik. "Demystifying Multitask Deep Neural Networks for Quantitative Structure–Activity Relationships." Journal of chemical information and modeling 57, no. 10 (2017): 2490-2504. =================================================================== Basic info. =================================================================== System requirements: * Python 2.7+ * Required Python Modules: - Python Modules installed by default: sys, os, argparse, itertools, gzip, time - General Python Modules: numpy, scipy.sparse - Special Python Modules: gnumpy, cudamat (if use GPU) or npmat (if use multiplec-core CPU) * CUDA toolkit: a prerequisite of cudamat Python Module. Installation of Special Python Modules: * gnumpy: http://www.cs.toronto.edu/~tijmen/gnumpy.html * npmat: http://www.cs.toronto.edu/~ilya/npmat.py * cudamat: https://github.com/cudamat/cudamat Note: - Modules "gnumpy" and "npmat" are also provided in this distribution. - If you have not GPU card or have problem installing cudamat module, the npmat.py module will use multiplec-core CPU to simulate the GPU computing. - Create a directory for this moduel of DeepNeuralNet_QSAR, and keep all the python scripts in that directory. Usage: * Start a commandline-window (in windows) or a terminal (in linux), and run the python scripts. Please refer to details below. =================================================================== Brief explaination of all python files =================================================================== All the files are listed in alphabetical order, not ordered by importance. Please find more detailed comments of all individual functions inside each python file. [activationFunctions.py] Define several classes of common activiation functions, such as ReLU/Linear/Sigmoid, along with their derivation or error function (if used for ouput layer). Used by [dnn.py] [counter.py] Utilize sys.stderr to produce progress bar for each training epoch. Include several different classes of progress bar, but only "Progress" and "DummyProgBar" are used. Used by [dnn.py] [DeepNeuralNetPredict.py] For making predictions for new compound structure with a single-task/multi-task DNN, which is trained by DeepNeuralNetTrain.py or DeepNeuralNetTrain_dense.py. [DeepNeuralNetTrain.py] For training a multi-task/single-task DNN with sparse QSAR dataset(s), accepts raw csv datasets or processed npz datasets. [DeepNeuralNetTrain_dense.py] For training a multi-task DNN with dense QSAR dataset(s), accepts raw csv datasets or processed npz datasets. [dnn.py] Key components of a simple feed forward neural network. Used by [DeepNeuralNetTrain.py], [DeepNeuralNetPredict.py], [DeepNeuralNetTrain_multi.py] and [DeepNeuralNetPredict_multi.py] [DNNSharedFunc.py] A group of assistant functions, such as calculating R-squared, writing predictions into file. Used by many other files in the package. [gnumpy.py] A simple python module for GPU computing, the "GPU-version" of numpy module. [npmat.py] A simple python module which is required by gnumpy.py for the simulation mode. If failed to import cudamat, using npmat (CPU computing) instead. [processData_sparse.py], [processData_dense.py] Pre-processing a group of raw csv QSAR data sets(either sparse or dense) to sparse-matrix python file format (save as *.npz), to facilitate later use. Contains many data-manipulation functions used by other files in the package. =================================================================== How to use - Example scripts =================================================================== 0) Prepare input datasets [sparse datasets] * Arrange all the datasets as examples in "data_sparse" folder. * Example #1 (It is a subset of three tasks from the 15 Kaggle datasets): - Folder name: data_sparse - Contains several datasets, each has training set and test set: METAB_training.csv METAB_test.csv OX1_training.csv OX1_test.csv TDI_training.csv TDI_test.csv * Example #2 (It is a single task selected from Kaggle datasets): - Folder name: data_sparse_single - Contains one pair of training set and test set: METAB_training.csv METAB_test.csv [dense datasets] * Arrange all the datasets as examples in "data_dense_raw" folder. * Example (It is a subsample from CYP datasets, which has 3 tasks): - Folder name: data_dense - Contains two datasets, one training set and one test set: training.csv test.csv 1) Pre-process data (Optional, can be skipped.) * preprocess sparse format datasets: create a new folder "data_sparse" under the working directory to save processed data. python processData_sparse.py data_sparse data_sparse_processed * preprocess dense format datasets: create a new folder "data_dense" under the working directory to save processed data, need to tell how many tasks are there in the dense dataset, such as "3" in the example datasets. python processData_dense.py data_dense data_dense_processed 3 2) Train a single-task DNN for one QSAR task Default transformation of inputs is log; activation function is ReLU, minibatch size 128.... The key parameters that need to be specify by user: - seed: random seed for the program. It is optional but better to be given for reproducibility. - CV: (optional) proportation of cross-validation subset which randomly sampled from training set - test: (optional) whether to use the corresponding external test set for checking performance on test set during training. - hid: DNN structure, specify the number of nodes at each layer. - dropouts: the drop out probability for each layer, to prevent over-fitting. - epochs: number of epochs for training - data: path to the folder which contains a single QSAR task data, could contain raw csv file or processed npz file - the last argument: where you want to save the trained model, if the folder doesn't exists it'll be created automatically * Example: use .csv raw data to train a single-task DNN for METAB, each corresponding processed .npz files will be automatically save to input data path python DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single * Example: use .npz processed data to train a single-task DNN for METAB (recommended, loading data faster than raw data) Parameters are the same as above. The processed datasets in folder "data_sparse_single" is created in last step. python DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single * Example: Without the optional 'CV' and 'test' arguments. python DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single 3) Prediction with a single-task DNN The key parameters that need to be specify by user: - model: the path to previous trained model folder, e.g. the "models/METAB_single" from step 2). - data: path to the folder which contains a single QSAR task data, could contain raw csv file or processed npz file - label: whether the "test" dataset have true label. Default is 0, but in this example it has true label. - rep: (optional) number of dropout prediction rounds. Default is 0, means don't perform dropout prediction. - seed: random seed for the program, useful for dropout prediction. Optional but better to be given for reproducibility. - result: (optional) specify where to save the prediction results. Default is the same as model folder. * Example: use the previous trained single DNN model for METAB to perform prediction for its test data python DeepNeuralNetPredict.py --seed=0 --label=1 --rep=10 --data=data_sparse_single --model=models/METAB_single --result=predictions/METAB_single * Example: Without the optional 'rep' and 'PredictResultPath': python DeepNeuralNetPredict.py --label=1 --data=data_sparse_single --model=models/METAB_single 4) Train a multi-task DNN for the sparse datasets Need to use the processed datasets but not raw datasets. Parameters that are different from single-task DNN: - data: path to the data folder that stores all the QSAR datasets (Below are optional) - mbsz: the minibatch size, default is 20, but for multi-task it may be modified to achieve better results - keep: the datasets to keep in the model, if don't want to include all datasets in the 'data' folder - watch: if use internal cross-validation set or external test set, choose to monitor the MSE and R-squared for certain task - reducelearnRateVis: sometimes reduce the learning rate of the first layer helps the training process to converge better * Example: a multi-task DNN to model all the three sparse datasets: METAB, OX1, TDI python DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=5 --data=data_sparse models/multi_sparse_1 * Example: load the previous trained model and continue the training process for more epochs. python DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse --loadModel=models/multi_sparse_1 models/multi_sparse_continue * Example: with more optional parameters, keep only METAB and OX1 tasks and monitor OX1 task performance python DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --mbsz=30 --keep=METAB --keep=OX1 --watch=OX1 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse models/multi_sparse_2 5) Prediction with multi-task DNN for the sparse datasets The parameter settings are the same as single-task DNN for sparse dataset. See step 3). Only difference: - data: path to the data folder that stores all the processed datasets (including test datasets). * Example: prediction for all the three sparse datasets with the model trained in previous step, save results to model folder: python DeepNeuralNetPredict.py --label=1 --data=data_sparse --model=models/multi_sparse_1 * Example: prediction with the model for METAB and OX1, trained in previous step, with dropout prediction, and save result to another folder. python DeepNeuralNetPredict.py --label=1 --seed=0 --rep=10 --data=data_sparse --model=models/multi_sparse_2 --result=predictions/multi_sparse_2 6) Train a multi-task DNN for the dense datasets Most of the parameter settings are the same as multi-task DNN for sparse datasets Difference: use integer parameters for the 'keep' and 'watch' arguments The key parameters that need to be specify by user: - numberOfOutputs: number of QSAR task output columns in the raw training set (.csv) * Example: keep only the first two output tasks and monitor the first output during training process, with internal cross-validation set and external test set, using raw data python DeepNeuralNetTrain_dense.py --numberOfOutputs=3 --CV=0.4 --test --keep=0_1 --watch=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_dense models/multi_dense_1 * Example: Without the optional arguments, using pre-processed data Note: for processed data, don't need to specify "--numberOfOutputs=3" python DeepNeuralNetTrain_dense.py --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_dense_processed models/multi_dense_2 7) Prediction with multi-task DNN for the dense datasets Parameter settings are the same as prediction for sparse datasets * Example: Prediction using trained DNN from previous step python DeepNeuralNetPredict.py --label=1 --dense --data=data_dense --model=models/multi_dense_1 --result=predictions/multi_dense_1
About
No description, website, or topics provided.
Resources
License
GPL-3.0, GPL-3.0 licenses found
Licenses found
GPL-3.0
LICENSE
GPL-3.0
COPYING
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published