To make a structure-based prediction of the bio-activity of molecules, a list of features is generated with a KNIME workflow. This list is used as input for a Support Vector Machine (SVM) predictor. In the script, the compounds contained in the training file are used to train the predictor. The parameters of the predictor are tuned with GridSearchCV: the predictor is trained multiple times with different combinations of the available parameter values, each combination is scored by cross-validation, and the best predictor is then used to predict the bio-activity.
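The parameter search described above can be sketched roughly as follows. This is a minimal illustration, not the code of SVM_GridSearch.py: it assumes the bio-activity is a binary class label (so `SVC` is used), and the synthetic data, the parameter grid, and the number of folds are all made-up placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the feature table produced by the KNIME workflow
# (hypothetical data; real input would come from the -train / -test files).
X, y = make_classification(n_samples=120, n_features=10, random_state=0)
X_train, y_train = X[:90], y[:90]
X_test = X[90:]

# Candidate parameter values; this grid is an assumption for illustration.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": [0.01, 0.1],
}

# GridSearchCV fits one SVC per parameter combination and cross-validation
# fold, then refits the best combination on the full training set.
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_train, y_train)

# predict() delegates to the refitted best estimator.
predictions = search.predict(X_test)
print("best parameters:", search.best_params_)
```

If the activity were a continuous value instead, `SVR` with a regression score would be the analogous choice.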
The KNIME workflow featureGeneration.knar receives a comma-separated CSV file containing the SMILES and the predicted bio-activity of the molecules. It generates a list of features for the molecules and outputs a comma-separated file containing the activity, the SMILES structure, and the corresponding features of each molecule.
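As a rough sketch of how such a feature file could be split into labels, structures, and feature vectors: the example below assumes a header row, integer activity labels, and the column order activity, SMILES, features. The column names and values are invented for illustration and are not taken from the actual workflow output.

```python
import csv
import io

# Hypothetical sample of the workflow output; names and values are made up.
sample = io.StringIO(
    "activity,smiles,feat_1,feat_2\n"
    "1,CCO,0.12,3.4\n"
    "0,c1ccccc1,0.56,1.2\n"
)

def read_feature_table(handle):
    """Split each row into activity label, SMILES string and feature vector."""
    reader = csv.reader(handle)
    next(reader)  # skip the (assumed) header row
    labels, smiles, features = [], [], []
    for row in reader:
        labels.append(int(row[0]))                     # first column: activity
        smiles.append(row[1])                          # second column: SMILES
        features.append([float(v) for v in row[2:]])   # remaining columns: features
    return labels, smiles, features

labels, smiles, features = read_feature_table(sample)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle.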
To run the program, specify the following arguments:
- -train: Path of the input CSV file generated by the KNIME workflow, containing the training molecules
- -test: Path of the input CSV file generated by the KNIME workflow, containing the molecules to be tested
- -out: Destination path of the resulting prediction CSV
SVM_GridSearch.py -train trainingData_Features.csv -test testData_Features.csv -out SVM_GridSearch_res.csv
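For illustration, a command line like the one above could be parsed with `argparse` as sketched below. How SVM_GridSearch.py actually reads its arguments is not shown here, so the function name `build_parser` and the choice to make all three options required are assumptions.

```python
import argparse

def build_parser():
    """Build a parser for the three options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Train an SVM on KNIME-generated features and predict bio-activity."
    )
    parser.add_argument("-train", required=True,
                        help="CSV file with the training molecules")
    parser.add_argument("-test", required=True,
                        help="CSV file with the molecules to be tested")
    parser.add_argument("-out", required=True,
                        help="destination path of the prediction CSV")
    return parser

# Parse an explicit argument list instead of sys.argv so the sketch runs as is.
args = build_parser().parse_args([
    "-train", "trainingData_Features.csv",
    "-test", "testData_Features.csv",
    "-out", "SVM_GridSearch_res.csv",
])
print(args.train, args.test, args.out)
```

In a real script, calling `parse_args()` without a list would read the options from the command line.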
- KNIME - Analytics Platform (3.7)
- RDKit - Software Package to read and analyse SMILES data (3.4.0)
- Python - Python programming language (3.6)
- scikit-learn - Software Package for Machine Learning (0.20.1)
- matplotlib - 2D Plotting Library (2.2.2)
- pandas - Data Structures and DataFrames (0.23.4)
Jennifer Bödker Tobias Nietsch