Duplicate ids in Dolma v1.7 #157

Open
Vedaad-Shakib opened this issue May 3, 2024 · 0 comments

Hi,

While downloading and processing Dolma v1.7, I noticed many samples that share the same id field. For example, in the Project Gutenberg source, 175 duplicates can be found just by looking at the id column. Another example of a duplicated id is 8fddd3535f86e159339e1ff9be64fdda in the RefinedWeb split. This was surprising given the significant deduplication done for Dolma v1.7. Is this a bug in the dataset?
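
For reference, here is a minimal sketch of the kind of check that surfaces repeated ids. It assumes the shards are gzip-compressed JSON Lines files in which each record has a top-level `id` key; the helper names `count_ids` and `duplicate_ids` and the glob pattern in the usage note are hypothetical, not part of the Dolma tooling.

```python
import gzip
import json
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable


def count_ids(paths: Iterable[Path]) -> Counter:
    """Count occurrences of each `id` across gzipped JSONL shards.

    Assumption: one JSON object per line with a top-level "id" key.
    """
    counts: Counter = Counter()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                counts[json.loads(line)["id"]] += 1
    return counts


def duplicate_ids(paths: Iterable[Path]) -> Dict[str, int]:
    """Return only the ids that appear more than once, with their counts."""
    return {doc_id: n for doc_id, n in count_ids(paths).items() if n > 1}
```

Something like `duplicate_ids(sorted(Path("books").glob("*.json.gz")))` (directory name hypothetical) would then list the repeated ids for one source. Note that this only compares the `id` values, not the document text, and it keeps every id in memory, which may not scale to the largest sources.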
