
WIP (no active development) - New talktorial: ML splitting schemes #72

Draft
wants to merge 8 commits into master

Conversation

@kimheeye (Contributor) commented Nov 23, 2020

Details

  • Talktorial ID: xxx
  • Title: Performance of ligand-based machine learning methods for the classification of active/inactive compounds, considering various validation approaches
  • Original authors: Hee-yeong Kim
  • Reviewer(s): Andrea Volkamer
  • Date of review: 20.12.20

General

  • Search the literature on the topic and filter the information towards splitting schemes.

  • Structure the notebook for the student’s presentation (~20 min).

  • Write a project report of at least 3000 words on the topic.

    • Examine the results for the different ML methods across the data splitting schemes (single random split, n-fold and time-split CV, cluster-based split).

Programming Tasks

  • Create an EGFR compound dataset, comprising bioactivity data and document year, using T001, and
    filter it by Lipinski‘s rule of five as in T002 (a minimal filtering sketch follows below).
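
A minimal sketch of the Lipinski filtering step, using standard RDKit descriptor calls; the CSV file name and the `smiles` column are placeholders for whatever the T001 output actually provides:

```python
# Minimal sketch of the Lipinski rule-of-five filter (T002).
# The file name and column name are placeholders for the T001 output.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski


def fulfills_ro5(smiles):
    """Return True if the molecule violates at most one of Lipinski's rules."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    violations = sum(
        [
            Descriptors.MolWt(mol) > 500,
            Descriptors.MolLogP(mol) > 5,
            Lipinski.NumHDonors(mol) > 5,
            Lipinski.NumHAcceptors(mol) > 10,
        ]
    )
    return violations <= 1


compounds = pd.read_csv("EGFR_compounds.csv")  # bioactivity data + document year
compounds = compounds[compounds["smiles"].apply(fulfills_ro5)]
```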

Use the ML approaches from T007 and validate them with different data splitting schemes. To that end, implement the following methods to split the data into train/test and validation sets (a sketch of the first two follows after the list):

  • Single Random Split.

  • n-fold Cross Validation (CV).
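
A minimal sketch of these two schemes; `X` and `y` are synthetic stand-ins for the EGFR fingerprint matrix and activity labels, and the random forest classifier and 80/20 split are illustrative choices, not prescriptions from the talktorial:

```python
# Minimal sketch of the single random split and n-fold CV; X and y are
# synthetic stand-ins for the EGFR fingerprint matrix and activity labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=64, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 1) Single random split: hold out 20% of the compounds as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_train, y_train)
print("Random split accuracy:", model.score(X_test, y_test))

# 2) n-fold cross-validation (here n=5), stratified by the activity label.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"5-fold CV AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
```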

Time-split CV

  • Implement a naive time-split CV using sklearn's TimeSeriesSplit().
  • Create a plot to investigate the distribution of data depending on time of publication (document year).
  • Assess the train/test splits from the time-split CV implemented above and specify the splits such that the cut lies between distinct publication years.
  • Compare the classifiers’ performance for both approaches (a minimal sketch of both variants follows below).
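
A minimal sketch of the naive and the year-aware variant; the publication years and features are synthetic stand-ins for the document year column, and the 2010 cutoff is only an illustrative choice:

```python
# Minimal sketch of the time-split CV; the years and features are synthetic
# stand-ins for the document year column and fingerprint matrix.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
years = np.sort(rng.integers(2000, 2016, size=200))  # placeholder publication years
X = rng.normal(size=(200, 8))                        # placeholder features

# Naive time-split: with the data sorted by year, TimeSeriesSplit always
# trains on earlier entries and tests on later ones.
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(
        f"Fold {fold}: train up to {years[train_idx].max()}, "
        f"test {years[test_idx].min()}-{years[test_idx].max()}"
    )

# Year-aware variant: place the cut between distinct publication years,
# e.g. train on everything published before an (illustrative) cutoff year.
cutoff = 2010
train_idx = np.where(years < cutoff)[0]
test_idx = np.where(years >= cutoff)[0]
print(f"Cutoff {cutoff}: {len(train_idx)} train / {len(test_idx)} test compounds")
```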

Cluster-based Split

  • Use the Butina algorithm provided by RDKit (functions from T005) to cluster the compounds by their fingerprint representation with the Tanimoto similarity measure.
  • Cluster the compounds using the sklearn.cluster.KMeans() function and the Euclidean distance measure, based on their physicochemical properties, calculated with CalcDescriptors() from RDKit.
  • Select the train set from large clusters and use the remaining small clusters and/or singletons to create an external test set.
  • Run the classifiers on the generated train/test sets (a Butina-based sketch follows below).
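
A minimal sketch of the Butina-based variant; the SMILES, the 0.4 distance threshold, and the singleton-vs-larger-cluster rule are placeholder choices, not the actual talktorial settings:

```python
# Minimal sketch of the cluster-based split with RDKit's Butina algorithm (T005).
# SMILES, distance threshold and train/test rule are placeholder choices.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O", "CC(=O)O", "CCCC", "CCCCC"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# Condensed distance matrix (1 - Tanimoto similarity) for Butina clustering.
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)

clusters = Butina.ClusterData(dists, len(fps), distThresh=0.4, isDistData=True)
clusters = sorted(clusters, key=len, reverse=True)

# Larger clusters form the training set; singletons form the external test set.
train_idx = [i for cluster in clusters if len(cluster) > 1 for i in cluster]
test_idx = [i for cluster in clusters if len(cluster) == 1 for i in cluster]
print("train:", train_idx, "test:", test_idx)
```

The KMeans variant would follow the same assignment logic, only with clusters obtained from sklearn.cluster.KMeans (Euclidean distances) on the physicochemical descriptor matrix instead of Butina on fingerprints.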

@AndreaVolkamer (Member) commented Dec 8, 2020

@kimheeye really well done! Great notebook, text and illustrations!
Just a few 'cosmetic' comments here:

@jaimergp jaimergp marked this pull request as draft December 11, 2020 09:26
@dominiquesydow dominiquesydow added the new talktorial New talktorial label Sep 8, 2021
@dominiquesydow dominiquesydow changed the title ML-splitting_schemes ML splitting schemes Sep 14, 2021
@dominiquesydow dominiquesydow changed the title ML splitting schemes New talktorial: ML splitting schemes Sep 23, 2021
@AndreaVolkamer AndreaVolkamer changed the title New talktorial: ML splitting schemes WIP (no active development) - New talktorial: ML splitting schemes Nov 15, 2021