
Non-canonical smiles confound string-based classifiers #15

Open
cyrusmaher opened this issue Dec 15, 2020 · 10 comments
Labels
bug Something isn't working

Comments

@cyrusmaher

Running a string kernel classifier on the ClinTox dataset, I can obtain an AUROC of 0.96. When I canonicalize the SMILES, the AUROC drops to 0.69. This implies there is a bias in the SMILES formatting between positive and negative examples that string-based classifiers can exploit to obtain unrealistically high performance, thereby tainting downstream benchmarks.

A solution to this would be to update the dataset to include only canonicalized SMILES.
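
For concreteness, here's a minimal sketch of that fix, assuming the deepchem and rdkit packages (the load_clintox call reflects the standard MoleculeNet loader and may differ across versions):

from rdkit import Chem
import deepchem as dc

# Load ClinTox via the MoleculeNet loader (return signature may vary by version)
tasks, (train, valid, test), transformers = dc.molnet.load_clintox()

def canonicalize(smi):
    # Return the RDKit canonical form, or None if the SMILES fails to parse
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol, canonical=True) if mol is not None else None

canonical_ids = [canonicalize(smi) for smi in train.ids]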

@rbharath
Member

Oh wow, that's quite the find! Yes, this definitely needs to be fixed as we're overhauling MoleculeNet for the next v2 release. I'll mark this as a bug.

@gabegrand

@cyrusmaher do you have any insight as to what the bias in the SMILES is?

@rbharath
Member

Thinking about this some more: I believe we canonicalize SMILES before computing descriptors in DeepChem, which should handle this (but I'm not sure).

@cyrusmaher Would it be possible to provide a brief reproducing code snippet? That would help us figure out what's happening :)

@cjmielke

One consideration: maybe a character frequency count between the raw and canonicalized forms? Maybe there are extra parentheses or aromatic (:) operators added?
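
A quick sketch of that check (the raw list here is an illustrative placeholder; substitute the actual ClinTox SMILES):

from collections import Counter
from rdkit import Chem

def char_counts(smiles_list):
    # Tally every character across a list of SMILES strings
    counts = Counter()
    for smi in smiles_list:
        counts.update(smi)
    return counts

raw = ["C1=CC=CC=C1O", "OC(C)(C)C"]  # placeholder raw SMILES
canon = [Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in raw]

raw_counts, canon_counts = char_counts(raw), char_counts(canon)
for ch in sorted(set(raw_counts) | set(canon_counts)):
    print(repr(ch), raw_counts[ch], canon_counts[ch])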

@cyrusmaher
Author

cyrusmaher commented Dec 19, 2020

@rbharath The easiest way to reproduce this will be to run a string model on ClinTox with and without SMILES canonicalization. Here's an example of the canonicalization step:

from rdkit import Chem

smi = "C1=CC=CC=C1O"  # any SMILES string from the dataset
canonical_smi = Chem.MolToSmiles(Chem.MolFromSmiles(smi), canonical=True)

Without bloating this with the helper code for the string kernel, etc., here's what I ran:
[image: screenshot of the string-kernel training and evaluation code]

@gabegrand I'm not sure precisely, but delocalization, tautomers, salts, etc. can all be handled differently in systematic ways.
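
As one illustrative case, two systematically different encodings of benzene (Kekulé vs. aromatic) collapse to a single string after canonicalization:

from rdkit import Chem

# Kekulé and aromatic encodings of the same molecule
for smi in ["C1=CC=CC=C1", "c1ccccc1"]:
    print(smi, "->", Chem.MolToSmiles(Chem.MolFromSmiles(smi)))
# Both print "c1ccccc1"; a string model sees two different token
# sequences where a canonicalized pipeline sees one.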

@cyrusmaher
Author

cyrusmaher commented Dec 22, 2020

@rbharath It's worth considering that canonicalization would not entirely eliminate this bias, e.g. if one source database is more likely to include charged species (assuming a different pH or preparation). You can see evidence for this in the different frequencies of "+" and "." characters between positive and negative examples in ClinTox:

Edit: it appears that much of this significance is driven by SMILES that turn out to be duplicated once they're canonicalized.
[image: counts of "+" and "." characters in positive vs. negative ClinTox examples]
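
A sketch of that duplicate check, grouping entries by their canonical form (standalone; pass in the raw ClinTox SMILES):

from collections import defaultdict
from rdkit import Chem

def find_collapsing_smiles(smiles_list):
    # Group raw SMILES by canonical form; keep groups with more than one member
    groups = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            groups[Chem.MolToSmiles(mol)].append(smi)
    return {canon: raws for canon, raws in groups.items() if len(raws) > 1}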

@TWRogers

TWRogers commented Mar 24, 2021

Firstly, thanks to the DeepChem & MoleculeNet contributors, it is a great library and a great benchmark!

However, I think this issue really needs to be fixed before people publish papers (if they haven't already), and potentially the ClinTox dataset should be dropped altogether.

I was reproducing the textcnn result on the ClinTox dataset and was very pleased to reproduce the benchmark AUC of ~0.995!

However, when I examined the underlying dataset I found severe biases, which should be fixed in the overall ClinTox benchmark. The benchmark shows textcnn winning by around 11%, which is very unlikely to be true.

[image: MoleculeNet ClinTox leaderboard screenshot showing textcnn ahead by ~11%]

I observed the following in my experiments.

  1. Training textcnn on SMILES as given by the deepchem dataloader: Train: 0.991, Val: 0.995, Test: 0.994
  2. Training on only the first and last 2 characters of the SMILES: Train: 0.850, Val: 0.700, Test: 0.956
  3. Training on RDKit canonical SMILES: Train: 0.916, Val: 0.837, Test: 0.905

I think the most surprising result was using only the first and last 2 characters of the SMILES. You can still achieve a very high (apparently +7% over SOTA) test AUC of 0.956, but would you really trust such a classifier to detect toxic molecules?!
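
For reference, the truncation in experiment 2 amounts to something like this (hypothetical helper; the model and training code are omitted):

def truncate(smi, k=2):
    # Keep only the first and last k characters of a SMILES string
    return smi if len(smi) <= 2 * k else smi[:k] + smi[-k:]

print(truncate("CC(=O)Oc1ccccc1C(=O)O"))  # -> "CC)O"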

The third experiment, on canonical SMILES, agrees with the findings of @cyrusmaher.

The model dc.models.TextCNNModel seems to use the SMILES in dataset.ids, which are not canonical. My suggestion would be to canonicalize them in the dataset loader by default for all datasets.
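
Until something like that lands in the loaders, one unofficial workaround is to rebuild the dataset with canonicalized ids, e.g. (a sketch only, using public deepchem/rdkit APIs; the helper name is mine, and it assumes every SMILES parses):

import deepchem as dc
from rdkit import Chem

def with_canonical_ids(dataset):
    # Rebuild an in-memory dataset whose ids are RDKit canonical SMILES
    ids = [Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in dataset.ids]
    return dc.data.NumpyDataset(dataset.X, dataset.y, dataset.w, ids)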

@rbharath
Member

rbharath commented Apr 6, 2021

Thanks for the detailed analysis! We're working towards a MoleculeNet 2.0 paper and will update the recommendations and benchmark analysis for ClinTox as part of that release.

@TWRogers

Thanks for your reply, that's good to know 😄

@cmahervir

Any updates on this? Thanks!
