Duplicate ids in Dolma v1.7 #157

Vedaad-Shakib · 2024-05-03T20:43:50Z

Hi,

While downloading and processing Dolma v1.7, I noticed that many samples in the dataset share the same id field. For example, in the Project Gutenberg source there are 175 duplicates that can be found just by looking at the id column. One duplicated id is 8fddd3535f86e159339e1ff9be64fdda, in the RefinedWeb split. This was surprising given that you had done significant deduping in Dolma 1.7. Is this a bug in the dataset?
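
A check like the one described above can be sketched roughly as follows. This is a minimal sketch, not the Dolma tooling: it assumes each shard is a gzip-compressed JSON Lines file whose records carry a top-level `id` field, and the function name `find_duplicate_ids` is hypothetical.

```python
import gzip
import json
from collections import Counter


def find_duplicate_ids(paths):
    """Return {id: count} for every id seen more than once across the given shards."""
    counts = Counter()
    for path in paths:
        # Assumption: each shard is gzip-compressed JSON Lines, one document per line.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    counts[json.loads(line)["id"]] += 1
    # Keep only ids that occur at least twice.
    return {doc_id: n for doc_id, n in counts.items() if n > 1}
```

Calling it with the list of shard paths for one source returns the repeated ids with their counts. Counting only the `id` field keeps memory proportional to the number of distinct ids rather than to the size of the text.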
