Duplicate samples in training data #17

Open
twesterhout opened this issue Apr 27, 2018 · 6 comments
Labels: help wanted, preprocessing, question

Comments

@twesterhout (Member)

I've just noticed that train_sample.csv contains two occurrences of the following line:

```
871,12,1,13,178,2017-11-08 10:00:05,,0
```

Is it of any importance? train.csv contains loads more data and thus, I'm afraid, many more duplicate samples... Thoughts?
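
For reference, a minimal sketch of how this can be checked with pandas (the file path and the pandas approach are my assumptions, not something prescribed here):

```python
# Sketch: count and show fully duplicated rows in train_sample.csv (path assumed).
import pandas as pd

sample = pd.read_csv("train_sample.csv")

# Rows that are exact copies of an earlier row.
print(sample.duplicated().sum(), "duplicated rows out of", len(sample))

# Show every occurrence of the repeated rows (should include the line quoted above).
print(sample[sample.duplicated(keep=False)])
```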

twesterhout added the help wanted, question, and preprocessing labels on Apr 27, 2018
@twesterhout (Member, Author)

If my calculations are correct, there are around 7061525 duplicate lines (i.e. 3.8% of the data is duplicated) in train.csv.

@johannadevos (Member)

How did you calculate this? Wouldn't it just be best to remove all duplicates?

@twesterhout (Member, Author)

I ran sort on the data, asked uniq to print the duplicates, and counted them with wc. I might've made a mistake though, so it'd be great if someone could double-check this before we screw up the data.
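
For anyone who wants to verify independently, here is a rough Python cross-check (the train.csv path and the hashing approach are assumptions on my part; like sort | uniq -d | wc -l, it counts distinct lines that occur more than once, but it keeps one small hash per distinct line in memory, so the full train.csv needs several GB of RAM, whereas the sort-based pipeline uses disk instead):

```python
# Rough cross-check of the sort | uniq -d | wc -l count (path assumed).
# Counts distinct lines that occur more than once, matching `uniq -d`.
import hashlib

seen, repeated = set(), set()
with open("train.csv", "rb") as f:
    next(f)  # skip the CSV header
    for line in f:
        digest = hashlib.blake2b(line, digest_size=8).digest()
        if digest in seen:
            repeated.add(digest)
        else:
            seen.add(digest)

print(len(repeated), "distinct lines occur more than once")
```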

@andregalvez79 (Contributor)

Perhaps these duplicates are actually meant to be there and the data is correct. I'm going to check whether anyone has noticed this on Kaggle.

@johannadevos (Member)

I still need to look into this myself (I don't have time today), but my first thought is that if Tom didn't make any mistakes in his calculations, then the error must be in the data. If two rows are exactly the same, including their IDs, they are simply duplicates, and we should remove them because they obscure the dataset. @andregalvez79 Thanks for checking this on Kaggle!

@andregalvez79 (Contributor)

So apparently no one has mentioned this on Kaggle. We can assume this is an error and delete the duplicates. :)
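
A sketch of what that cleanup could look like with pandas (file names are placeholders, and the full train.csv may need chunked or out-of-core processing rather than a single read_csv):

```python
# Sketch of the proposed cleanup (file names are placeholders).
import pandas as pd

df = pd.read_csv("train_sample.csv")
before = len(df)

# Keep the first occurrence of each fully identical row, drop the rest.
df = df.drop_duplicates(keep="first")
df.to_csv("train_sample_dedup.csv", index=False)

print("dropped", before - len(df), "duplicate rows")
```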
