Duplicate samples in training data #17

Open
twesterhout opened this issue Apr 27, 2018 · 6 comments
Labels: help wanted, preprocessing, question

Comments

@twesterhout (Member)

I've just noticed that train_sample.csv contains two occurrences of the following line:

```
871,12,1,13,178,2017-11-08 10:00:05,,0
```

Is it of any importance? train.csv contains loads more data and thus, I'm afraid, many more duplicate samples... Thoughts?
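
For reference, a minimal sketch of how this can be checked with pandas (the file path and the pandas approach are my assumptions, not something prescribed here):

```python
# Sketch: count and show fully duplicated rows in train_sample.csv (path assumed).
import pandas as pd

sample = pd.read_csv("train_sample.csv")

# Rows that are exact copies of an earlier row.
print(sample.duplicated().sum(), "duplicated rows out of", len(sample))

# Show every occurrence of the repeated rows (should include the line quoted above).
print(sample[sample.duplicated(keep=False)])
```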

twesterhout added the help wanted, question, and preprocessing labels on Apr 27, 2018
@twesterhout (Member, Author)

If my calculations are correct, there are around 7061525 duplicate lines (i.e. 3.8% of the data is duplicated) in train.csv.

@johannadevos (Member)

How did you calculate this? Wouldn't it just be best to remove all duplicates?

@twesterhout (Member, Author)

I ran sort on the data, asked uniq to print the duplicates, and counted them with wc. I might've made a mistake though, so it'd be great if someone could double-check this before we screw up the data.
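
For anyone who wants to verify independently, here is a rough Python cross-check (the train.csv path and the hashing approach are assumptions on my part; like sort | uniq -d | wc -l, it counts distinct lines that occur more than once, but it keeps one small hash per distinct line in memory, so the full train.csv needs several GB of RAM, whereas the sort-based pipeline uses disk instead):

```python
# Rough cross-check of the sort | uniq -d | wc -l count (path assumed).
# Counts distinct lines that occur more than once, matching `uniq -d`.
import hashlib

seen, repeated = set(), set()
with open("train.csv", "rb") as f:
    next(f)  # skip the CSV header
    for line in f:
        digest = hashlib.blake2b(line, digest_size=8).digest()
        if digest in seen:
            repeated.add(digest)
        else:
            seen.add(digest)

print(len(repeated), "distinct lines occur more than once")
```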

@andregalvez79 (Contributor)

Perhaps these duplicates are actually meant to be there and the data is correct. I'm going to check whether anyone has noticed this on Kaggle.

@johannadevos (Member)

I still need to look into this myself (I don't have time today), but my first thought is that if Tom didn't make any mistakes in his calculations, then the error must be in the data. If two rows are exactly the same, including their IDs, they are simply duplicates, and we should remove them because they obscure the dataset. @andregalvez79 Thanks for checking this on Kaggle!

@andregalvez79 (Contributor)

So apparently no one has mentioned this on Kaggle. We can assume this is an error and delete the duplicates. :)
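
A sketch of what that cleanup could look like with pandas (file names are placeholders, and the full train.csv may need chunked or out-of-core processing rather than a single read_csv):

```python
# Sketch of the proposed cleanup (file names are placeholders).
import pandas as pd

df = pd.read_csv("train_sample.csv")
before = len(df)

# Keep the first occurrence of each fully identical row, drop the rest.
df = df.drop_duplicates(keep="first")
df.to_csv("train_sample_dedup.csv", index=False)

print("dropped", before - len(df), "duplicate rows")
```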
