Possible pre-training data leakage? #166
wzhang2022
started this conversation in General

I'm currently working on a project where I pre-train a neural network on ogbg-molpcba to improve performance on ogbg-molhiv. However, I've noticed that there seem to be some overlapping molecules between the training set of ogbg-molpcba and the test set of ogbg-molhiv. If the goal is to demonstrate that a certain pre-training method improves performance, is it important to exclude these overlapping molecules during pre-training? If so, did this paper also account for this?
Replies: 1 comment

- Yes, we have accounted for that. Section 5.1 of our pre-training paper says: "Furthermore, to prevent data […]" For the purpose of the leaderboard, we only allow unsupervised pre-training to keep the comparison fair. In fact, some labels in ogbg-molpcba may be highly correlated with those of ogbg-molhiv. For the purpose of your own investigation, you should feel free to explore supervised pre-training.