Possible pre-training data leakage? #166
wzhang2022
started this conversation in General

I'm currently working on a project where I pre-train a neural network on ogbg-molpcba to improve performance on ogbg-molhiv. However, I've noticed that there seem to be some overlapping molecules between the training set of ogbg-molpcba and the test set of ogbg-molhiv. If the goal is to demonstrate that a certain pre-training method improves performance, is it important to exclude these overlapping molecules during pre-training? If so, did this paper also account for this?
Replies: 1 comment

- Yes, we have accounted for that. Section 5.1 of our pre-training paper says: "Furthermore, to prevent data […]" For the purpose of the leaderboard, we only allow unsupervised pre-training to keep the comparison fair. In fact, some labels in ogbg-molpcba may be highly correlated with those of ogbg-molhiv. For the purpose of your own investigation, you should feel free to explore supervised pre-training.