
Non-canonical smiles confound string-based classifiers #15

Open
cyrusmaher opened this issue Dec 15, 2020 · 10 comments
Labels
bug Something isn't working

Comments

@cyrusmaher

Running a string kernel classifier on the ClinTox dataset, I can obtain an AUROC of 0.96. When I canonicalize the SMILES, the AUROC drops to 0.69. This implies there is a bias in the SMILES formatting between positive and negative examples that string-based classifiers can exploit to obtain unrealistically high performance, thereby tainting downstream benchmarks.

A solution to this would be to update the dataset to include only canonicalized SMILES.
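
For concreteness, here's a minimal sketch of that fix, assuming the deepchem and rdkit packages (the load_clintox call reflects the standard MoleculeNet loader and may differ across versions):

from rdkit import Chem
import deepchem as dc

# Load ClinTox via the MoleculeNet loader (return signature may vary by version)
tasks, (train, valid, test), transformers = dc.molnet.load_clintox()

def canonicalize(smi):
    # Return the RDKit canonical form, or None if the SMILES fails to parse
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol, canonical=True) if mol is not None else None

canonical_ids = [canonicalize(smi) for smi in train.ids]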

@rbharath
Member

Oh wow, that's quite the find! Yes, this definitely needs to be fixed as we're overhauling MoleculeNet for the next v2 release. I'll mark this as a bug.

@gabegrand

@cyrusmaher do you have any insight as to what the bias in the SMILES is?

@rbharath
Member

Thinking about this some more: I believe we canonicalize SMILES before computing descriptors in DeepChem, which should handle this (but I'm not sure).

@cyrusmaher Would it be possible to provide a brief reproducing code snippet? That would help us figure out what's happening :)

@cjmielke

One consideration: maybe a character frequency count between the raw and canonicalized forms? Maybe there are extra parentheses or aromatic (:) operators added?
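
A quick sketch of that check (the raw list here is an illustrative placeholder; substitute the actual ClinTox SMILES):

from collections import Counter
from rdkit import Chem

def char_counts(smiles_list):
    # Tally every character across a list of SMILES strings
    counts = Counter()
    for smi in smiles_list:
        counts.update(smi)
    return counts

raw = ["C1=CC=CC=C1O", "OC(C)(C)C"]  # placeholder raw SMILES
canon = [Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in raw]

raw_counts, canon_counts = char_counts(raw), char_counts(canon)
for ch in sorted(set(raw_counts) | set(canon_counts)):
    print(repr(ch), raw_counts[ch], canon_counts[ch])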

@cyrusmaher
Author

cyrusmaher commented Dec 19, 2020

@rbharath The easiest way to reproduce this will be to run a string model on ClinTox with and without SMILES canonicalization. Here's an example of the canonicalization step:

from rdkit import Chem

smi = "C1=CC=CC=C1O"  # any SMILES string from the dataset
canonical_smi = Chem.MolToSmiles(Chem.MolFromSmiles(smi), canonical=True)

Without bloating this with the helper code for the string kernel, etc., here's what I ran:
[image: screenshot of the string-kernel training and evaluation code]

@gabegrand I'm not sure precisely, but delocalization, tautomers, salts, etc. can all be handled differently in systematic ways.
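
As one illustrative case, two systematically different encodings of benzene (Kekulé vs. aromatic) collapse to a single string after canonicalization:

from rdkit import Chem

# Kekulé and aromatic encodings of the same molecule
for smi in ["C1=CC=CC=C1", "c1ccccc1"]:
    print(smi, "->", Chem.MolToSmiles(Chem.MolFromSmiles(smi)))
# Both print "c1ccccc1"; a string model sees two different token
# sequences where a canonicalized pipeline sees one.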

@cyrusmaher
Author

cyrusmaher commented Dec 22, 2020

@rbharath It's worth considering that canonicalization would not entirely eliminate this bias, e.g. if one source database is more likely to include charged species (assuming a different pH or preparation). You can see evidence for this in the different frequencies of "+" and "." characters between positive and negative examples in ClinTox:

Edit: it appears that much of this significance is driven by SMILES that turn out to be duplicated once they're canonicalized.
[image: counts of "+" and "." characters in positive vs. negative ClinTox examples]
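
A sketch of that duplicate check, grouping entries by their canonical form (standalone; pass in the raw ClinTox SMILES):

from collections import defaultdict
from rdkit import Chem

def find_collapsing_smiles(smiles_list):
    # Group raw SMILES by canonical form; keep groups with more than one member
    groups = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            groups[Chem.MolToSmiles(mol)].append(smi)
    return {canon: raws for canon, raws in groups.items() if len(raws) > 1}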

@TWRogers

TWRogers commented Mar 24, 2021

Firstly, thanks to the DeepChem & MoleculeNet contributors, it is a great library and a great benchmark!

However, I think this issue really needs to be fixed before people publish papers (if they haven't already), and potentially the ClinTox dataset should be dropped altogether.

I was reproducing the textcnn result on the ClinTox dataset and was very pleased to reproduce the benchmark AUC of ~0.995!

However, when I examined the underlying dataset I found severe biases, which should be fixed in the overall ClinTox benchmark. The benchmark shows textcnn winning by around 11%, which is very unlikely to be true.

[image: MoleculeNet ClinTox leaderboard screenshot showing textcnn ahead by ~11%]

I observed the following in my experiments.

  1. Training textcnn on SMILES as given by the deepchem dataloader: Train: 0.991, Val: 0.995, Test: 0.994
  2. Training on only the first and last 2 characters of the SMILES: Train: 0.850, Val: 0.700, Test: 0.956
  3. Training on RDKit canonical SMILES: Train: 0.916, Val: 0.837, Test: 0.905

I think the most surprising result was using only the first and last 2 characters of the SMILES. You can still achieve a very high (apparently +7% over SOTA) test AUC of 0.956, but would you really trust such a classifier to detect toxic molecules?!
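
For reference, the truncation in experiment 2 amounts to something like this (hypothetical helper; the model and training code are omitted):

def truncate(smi, k=2):
    # Keep only the first and last k characters of a SMILES string
    return smi if len(smi) <= 2 * k else smi[:k] + smi[-k:]

print(truncate("CC(=O)Oc1ccccc1C(=O)O"))  # -> "CC)O"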

The third experiment, on canonical SMILES, agrees with the findings of @cyrusmaher.

The model dc.models.TextCNNModel seems to use the SMILES in dataset.ids, which are not canonical. My suggestion would be to canonicalize them in the dataset loader by default for all datasets.
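
Until something like that lands in the loaders, one unofficial workaround is to rebuild the dataset with canonicalized ids, e.g. (a sketch only, using public deepchem/rdkit APIs; the helper name is mine, and it assumes every SMILES parses):

import deepchem as dc
from rdkit import Chem

def with_canonical_ids(dataset):
    # Rebuild an in-memory dataset whose ids are RDKit canonical SMILES
    ids = [Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in dataset.ids]
    return dc.data.NumpyDataset(dataset.X, dataset.y, dataset.w, ids)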

@rbharath
Member

rbharath commented Apr 6, 2021

Thanks for the detailed analysis! We're working towards a MoleculeNet 2.0 paper and will update the recommendations and benchmark analysis for ClinTox as part of that release.

@TWRogers

Thanks for your reply, that's good to know 😄

@cmahervir

Any updates on this? Thanks!
