
WIP (no active development) - New talktorial: ML splitting schemes #72

Draft
wants to merge 8 commits into master

Conversation

@kimheeye (Contributor) commented Nov 23, 2020

Details

  • Talktorial ID: xxx
  • Title: Performance of ligand-based machine learning methods for the classification of active/inactive compounds, considering various validation approaches
  • Original authors: Hee-yeong Kim
  • Reviewer(s): Andrea Volkamer
  • Date of review: 20.12.20

General

  • Search the literature on the topic and filter the information towards splitting schemes.

  • Structure the notebook for the student’s presentation (~20 min).

  • Write a project report of at least 3000 words on the topic.

    • Examine the results for the different ML methods across the data splitting schemes (single random split, n-fold and time-split CV, cluster-based split).

Programming Tasks

  • Create an EGFR compound dataset, comprising bioactivity data and document year, using T001, and
    filter it by Lipinski‘s rule of five as in T002 (a minimal filtering sketch follows below).
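
A minimal sketch of the Lipinski filtering step, using standard RDKit descriptor calls; the CSV file name and the `smiles` column are placeholders for whatever the T001 output actually provides:

```python
# Minimal sketch of the Lipinski rule-of-five filter (T002).
# The file name and column name are placeholders for the T001 output.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski


def fulfills_ro5(smiles):
    """Return True if the molecule violates at most one of Lipinski's rules."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    violations = sum(
        [
            Descriptors.MolWt(mol) > 500,
            Descriptors.MolLogP(mol) > 5,
            Lipinski.NumHDonors(mol) > 5,
            Lipinski.NumHAcceptors(mol) > 10,
        ]
    )
    return violations <= 1


compounds = pd.read_csv("EGFR_compounds.csv")  # bioactivity data + document year
compounds = compounds[compounds["smiles"].apply(fulfills_ro5)]
```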

Use the ML approaches from T007 and validate them with different data splitting schemes. To that end, implement the following methods to split the data into train/test and validation sets (a sketch of the first two follows after the list):

  • Single Random Split.

  • n-fold Cross Validation (CV).
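
A minimal sketch of these two schemes; `X` and `y` are synthetic stand-ins for the EGFR fingerprint matrix and activity labels, and the random forest classifier and 80/20 split are illustrative choices, not prescriptions from the talktorial:

```python
# Minimal sketch of the single random split and n-fold CV; X and y are
# synthetic stand-ins for the EGFR fingerprint matrix and activity labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=64, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 1) Single random split: hold out 20% of the compounds as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_train, y_train)
print("Random split accuracy:", model.score(X_test, y_test))

# 2) n-fold cross-validation (here n=5), stratified by the activity label.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"5-fold CV AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
```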

Time-split CV

  • Implement a naive time-split CV using sklearn's TimeSeriesSplit().
  • Create a plot to investigate the distribution of data depending on time of publication (document year).
  • Assess the train/test splits from the time-split CV implemented above and specify the splits such that the cut lies between distinct publication years.
  • Compare the classifiers’ performance for both approaches (a minimal sketch of both variants follows below).
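
A minimal sketch of the naive and the year-aware variant; the publication years and features are synthetic stand-ins for the document year column, and the 2010 cutoff is only an illustrative choice:

```python
# Minimal sketch of the time-split CV; the years and features are synthetic
# stand-ins for the document year column and fingerprint matrix.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
years = np.sort(rng.integers(2000, 2016, size=200))  # placeholder publication years
X = rng.normal(size=(200, 8))                        # placeholder features

# Naive time-split: with the data sorted by year, TimeSeriesSplit always
# trains on earlier entries and tests on later ones.
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(
        f"Fold {fold}: train up to {years[train_idx].max()}, "
        f"test {years[test_idx].min()}-{years[test_idx].max()}"
    )

# Year-aware variant: place the cut between distinct publication years,
# e.g. train on everything published before an (illustrative) cutoff year.
cutoff = 2010
train_idx = np.where(years < cutoff)[0]
test_idx = np.where(years >= cutoff)[0]
print(f"Cutoff {cutoff}: {len(train_idx)} train / {len(test_idx)} test compounds")
```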

Cluster-based Split

  • Use the Butina algorithm provided by RDKit (functions from T005) to cluster the compounds by their fingerprint representation with the Tanimoto similarity measure.
  • Cluster the compounds using the sklearn.cluster.KMeans() function and the Euclidean distance measure, based on their physicochemical properties, calculated with CalcDescriptors() from RDKit.
  • Select the train set from large clusters and use the remaining small clusters and/or singletons to create an external test set.
  • Run the classifiers on the generated train/test sets (a Butina-based sketch follows below).
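
A minimal sketch of the Butina-based variant; the SMILES, the 0.4 distance threshold, and the singleton-vs-larger-cluster rule are placeholder choices, not the actual talktorial settings:

```python
# Minimal sketch of the cluster-based split with RDKit's Butina algorithm (T005).
# SMILES, distance threshold and train/test rule are placeholder choices.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O", "CC(=O)O", "CCCC", "CCCCC"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# Condensed distance matrix (1 - Tanimoto similarity) for Butina clustering.
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)

clusters = Butina.ClusterData(dists, len(fps), distThresh=0.4, isDistData=True)
clusters = sorted(clusters, key=len, reverse=True)

# Larger clusters form the training set; singletons form the external test set.
train_idx = [i for cluster in clusters if len(cluster) > 1 for i in cluster]
test_idx = [i for cluster in clusters if len(cluster) == 1 for i in cluster]
print("train:", train_idx, "test:", test_idx)
```

The KMeans variant would follow the same assignment logic, only with clusters obtained from sklearn.cluster.KMeans (Euclidean distances) on the physicochemical descriptor matrix instead of Butina on fingerprints.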

@AndreaVolkamer (Member) commented Dec 8, 2020

@kimheeye really well done! Great notebook, text and illustrations!
Just a few 'cosmetic' comments here:

@jaimergp jaimergp marked this pull request as draft December 11, 2020 09:26
@dominiquesydow dominiquesydow added the new talktorial New talktorial label Sep 8, 2021
@dominiquesydow dominiquesydow changed the title ML-splitting_schemes ML splitting schemes Sep 14, 2021
@dominiquesydow dominiquesydow changed the title ML splitting schemes New talktorial: ML splitting schemes Sep 23, 2021
@AndreaVolkamer AndreaVolkamer changed the title New talktorial: ML splitting schemes WIP (no active development) - New talktorial: ML splitting schemes Nov 15, 2021