
String kernels can exploit biases in SMILES string format, skewing performance metrics #26

Open
cyrusmaher opened this issue Dec 15, 2020 · 4 comments
@cyrusmaher

Just a heads up on this issue:
deepchem/moleculenet#15

I propose that string-based classifiers canonicalize SMILES prior to processing, to prevent confounded estimates of performance, confidence intervals, etc.
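A minimal sketch of the proposed preprocessing step, using RDKit (the helper name `canonicalize` is illustrative, not from this repository):

```python
# Canonicalize SMILES before featurization so that different spellings of
# the same molecule map to one string, removing source-specific artifacts.
from rdkit import Chem

def canonicalize(smiles: str) -> str:
    """Return RDKit's canonical SMILES, raising on unparseable input."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return Chem.MolToSmiles(mol, canonical=True)

# Two different spellings of ethanol collapse to the same canonical form:
assert canonicalize("OCC") == canonicalize("CCO")
```

Applying this once to every dataset split (before computing string-kernel features) is enough; the shorthand `Chem.CanonSmiles` does the same parse-and-write round trip in one call.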

@Ryan-Rhys
Owner

Thanks for raising this! I made a change to the photoswitch dataset a couple of weeks ago to canonicalise all SMILES as a preprocessing step, and I will make sure this is implemented for the other datasets!

@Ryan-Rhys Ryan-Rhys self-assigned this Dec 19, 2020
@cyrusmaher
Author

cyrusmaher commented Dec 22, 2020

[image: character frequencies compared between positive and negative examples in ClinTox]
It's possible that canonicalization doesn't fully eliminate the bias (e.g. if one set computes SMILES at a different pH, or is more likely to include salt forms). You can see this in the enrichment of the "." and "+" characters between positive and negative examples in ClinTox.

@Ryan-Rhys
Owner

Interesting! @henrymoss and I will keep track of this conversation you guys are having in DeepChem!

@henrymoss
Collaborator

This is interesting indeed!

I wonder if this aligns with the lack of improvement we observed when augmenting the data with extra (non-canonical) SMILES. Essentially, the model could only learn from training data in canonical form, since our test data was also canonical; even increasing the data 5x through augmentation, we couldn't improve performance.
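For reference, the kind of augmentation described above can be sketched with RDKit's `doRandom` option, which emits a random valid atom ordering on each call (the helper name `random_smiles` is ours, not from this codebase):

```python
# Generate extra non-canonical SMILES spellings of the same molecule,
# the augmentation strategy discussed in this thread.
from rdkit import Chem

def random_smiles(smiles: str, n: int):
    """Return up to n distinct randomized SMILES for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    variants = set()
    for _ in range(10 * n):  # cap attempts; duplicates are possible
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        if len(variants) >= n:
            break
    return sorted(variants)

variants = random_smiles("CC(=O)Oc1ccccc1C(=O)O", 5)  # aspirin
# Every variant still canonicalizes back to one and the same molecule:
assert len({Chem.CanonSmiles(v) for v in variants}) == 1
```

This also illustrates the point above: if the test set is all-canonical, these variants add string diversity the model never needs at evaluation time.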
