Duplicate samples in training data #17
Comments
If my calculations are correct, there are around 7061525 duplicate lines (i.e. 3.8% of the data is duplicated) in
How did you calculate this? Wouldn't it be best to just remove all duplicates?
I've run
Perhaps these duplicates are actually meant to be there and the data is correct. I'm gonna check whether someone has noticed this on Kaggle.
I still need to look into this myself (but don't have time today), but my first thought is that if Tom didn't make any mistakes in his calculations, the mistake must be in the data. If two rows are exactly the same, including their IDs, they are just duplicates and we should remove them because they obscure the dataset. @andregalvez79 Thanks for checking this out on Kaggle!
So apparently no one has mentioned this on Kaggle. We can assume this is an error and delete the duplicates. :)
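For anyone who wants to reproduce the count and do the removal in one pass, here is a minimal sketch. It is not the command used earlier in this thread. It assumes the CSV has a single header line and that "duplicate" means a byte-for-byte identical row; the function name `dedupe_lines` is made up for illustration. To keep memory manageable on a large file it stores a 16-byte hash of each row instead of the row itself, which is a design assumption, not something taken from the discussion.

```python
import hashlib


def dedupe_lines(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first occurrence of each row.

    The first line is treated as a header (assumption): it is copied as-is
    and not counted. Returns (total_data_rows, duplicate_rows).
    """
    seen = set()          # hashes of rows already written
    total = 0
    duplicates = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        header = src.readline()
        dst.write(header)
        for line in src:
            total += 1
            # Hash the row without its line ending so a final row lacking
            # a trailing newline still matches an earlier identical row.
            key = hashlib.blake2b(line.rstrip(b"\r\n"), digest_size=16).digest()
            if key in seen:
                duplicates += 1
                continue
            seen.add(key)
            dst.write(line)
    return total, duplicates
```

Calling `total, dup = dedupe_lines("train.csv", "train_dedup.csv")` (the output filename is hypothetical) and then computing `100 * dup / total` gives the duplicated share as a percentage. Note that this counts only the extra copies of a row, not its first occurrence.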
I've just noticed that train_sample.csv contains two occurrences of the following line:
Is it of any importance? train.csv contains loads more data and thus, I'm afraid, many more duplicate samples... Thoughts?