While downloading and processing Dolma v1.7, I noticed that there are many duplicate samples with the same `id` field in the dataset. For example, in the Project Gutenberg source, there are 175 duplicates that can be found by just looking at the `id` column. An example of a duplicate `id` is `8fddd3535f86e159339e1ff9be64fdda` in the RefinedWeb split. This was surprising given that you had done significant deduping in Dolma 1.7. Is this a bug in the dataset?
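For reference, here is a minimal sketch of the kind of check described above. It assumes the shards are gzip-compressed JSON Lines files in which every record has a top-level `id` key; the helper names `iter_ids` and `duplicate_ids` and the file path in the usage note are hypothetical and not part of any Dolma tooling.

```python
import gzip
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def iter_ids(paths: Iterable[str]) -> Iterator[str]:
    """Yield the `id` of every record in the given .jsonl.gz shards.

    Assumption: one JSON object per line, each with a top-level "id" key.
    """
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():  # skip blank lines
                    yield json.loads(line)["id"]


def duplicate_ids(ids: Iterable[str]) -> Dict[str, int]:
    """Return a mapping of id -> occurrence count for ids seen more than once."""
    counts = Counter(ids)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}
```

Calling `duplicate_ids(iter_ids(["shard-0000.json.gz"]))` on a locally downloaded shard (the file name here is a placeholder) would list each repeated `id` with its count. This keeps every distinct `id` in memory, so for very large sources it may be necessary to process one source at a time.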