Two Submissions on Clearance #20

Status: Open · wants to merge 3 commits into base: master
Conversation

@mufeili (Contributor) commented Jan 9, 2021

@rbharath @miaecle This PR is for two submissions (random forest + ECFP & GCN + GC) on Clearance.

Also, the dataset seems small and the labels span a wide range of scales, e.g. from 0.xx to 22. As a result, the RMSE values are pretty large. Please check whether this is expected. @peastman
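For context, here is a minimal sketch of how the label spread can be checked (this assumes the `dc.molnet.load_clearance` loader from deepchem/deepchem#484 is available; exact keyword arguments and default transformers may differ between DeepChem versions):

```python
# Minimal sketch: inspect the Clearance label distribution to see why the
# raw-scale RMSE comes out large. Assumes dc.molnet.load_clearance exists
# (added in deepchem/deepchem#484).
import numpy as np
import deepchem as dc

tasks, (train, valid, test), transformers = dc.molnet.load_clearance()

y = np.concatenate([train.y, valid.y, test.y])
# The default loader may attach a normalization transformer on y;
# undo it to look at the raw clearance values.
y = dc.trans.undo_transforms(y, transformers).ravel()

print(f"n = {y.size}")
print(f"label range: {y.min():.3g} to {y.max():.3g}")
print(f"mean = {y.mean():.3g}, std = {y.std():.3g}")
```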

@peastman commented

> Also, the dataset seems small and the labels span a wide range of scales, e.g. from 0.xx to 22. As a result, the RMSE values are pretty large.

I'm not too familiar with this dataset. That does make sense. Perhaps a different metric would be more appropriate?

@mufeili (Contributor, Author) commented Jan 11, 2021

> Also, the dataset seems small and the labels span a wide range of scales, e.g. from 0.xx to 22. As a result, the RMSE values are pretty large.
>
> I'm not too familiar with this dataset. That does make sense. Perhaps a different metric would be more appropriate?

What's the source of the dataset? Has anyone used it before? An alternative metric could be R2.

@rbharath (Member) commented

Sorry for the slow response! I lost track of this PR in my inbox. It looks like we added the clearance dataset in deepchem/deepchem#484, but for some reason it isn't listed among the original 17 datasets in MoleculeNet v1. @miaecle would you happen to remember why we didn't add clearance to the MoleculeNet v1 datasets?

A couple of thoughts: perhaps we should log-transform the output? We do this for some regression tasks where the targets span a large range of values; in that case, the RMSE on the logarithmic scale would be meaningful. Another option is switching to R^2. I'm open to either, given that we didn't include Clearance in v1, so this won't break any existing benchmark standard.
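To make the two options concrete, here is a small sketch with made-up numbers (plain numpy/scikit-learn, not tied to the PR's code; DeepChem's `LogTransformer` could be used instead of transforming by hand):

```python
# Sketch of the two proposed metrics on hypothetical predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([0.5, 1.2, 3.0, 8.5, 22.0])   # hypothetical clearance labels
y_pred = np.array([0.7, 1.0, 2.5, 10.0, 18.0])  # hypothetical model outputs

# Option 1: RMSE on a log scale (log1p keeps values near zero well-behaved).
log_rmse = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

# Option 2: R^2 on the raw scale.
r2 = r2_score(y_true, y_pred)

print(f"RMSE (raw):   {np.sqrt(mean_squared_error(y_true, y_pred)):.3f}")
print(f"RMSE (log1p): {log_rmse:.3f}")
print(f"R^2 (raw):    {r2:.3f}")
```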

@miaecle commented Jan 21, 2021

@mufeili @rbharath Sorry, I don't quite remember why/if it was included in the initial version of the benchmark. As for metrics, I agree with Bharath on log-transforming. Depending on what the label distribution looks like, R2 could also suffer from outliers (assuming the samples with labels near 22 are quite rare).
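A quick illustration of that point with made-up numbers: a single sample near the top of the label range can dominate R2, so the score mostly reflects how well that one point is fit.

```python
# Made-up example: R^2 looks high because the lone large-label sample is fit
# well, even though the fit on the remaining points is poor.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0.3, 0.8, 1.1, 1.9, 2.4, 22.0])  # one label near 22
y_pred = np.array([0.9, 0.2, 1.8, 1.2, 3.0, 21.0])

print(r2_score(y_true, y_pred))            # ~0.99, dominated by the outlier
print(r2_score(y_true[:-1], y_pred[:-1]))  # ~0.28 once it is removed
```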
