Potential Language Contamination Inquiry #108

Open
iBibek opened this issue Mar 13, 2024 · 1 comment

iBibek commented Mar 13, 2024

The repo mentions that the dataset is composed of only these languages: en, de, fr, and es. Is there any possibility of contamination with other languages in the dataset? I would greatly appreciate your response. Thank you in advance.

@mauriceweber
Collaborator

Hi @iBibek

Thanks for your question -- it is very likely that other languages are also present in the dataset. This is because the language of each document is identified using a FastText classifier, and any document with a score >= 0.5 is considered to be in the respective language and is kept in the corpus. I would expect documents with lower language scores to be more likely to contain text in other languages, so if you want to remove such instances you can filter the dataset with a higher language-score threshold (e.g., RefinedWeb uses 0.6, C4 goes as high as 0.99).
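
As a rough illustration of that kind of filtering, here is a minimal sketch. It assumes each document is available as a dict carrying its classifier score; the field name `language_score`, the helper `filter_by_language_score`, and the sample records are hypothetical and not taken from the dataset's actual schema, so they would need to be adapted to however the score is stored in your copy of the data.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_by_language_score(
    docs: Iterable[Dict[str, Any]],
    min_score: float = 0.6,
    score_key: str = "language_score",  # hypothetical field name
) -> Iterator[Dict[str, Any]]:
    """Yield only documents whose language score is at least min_score.

    Documents without a score are dropped, since their language
    cannot be checked against the threshold.
    """
    for doc in docs:
        score = doc.get(score_key)
        if score is not None and score >= min_score:
            yield doc


# Made-up sample records, for illustration only.
docs = [
    {"id": "a", "text": "An English document.", "language_score": 0.98},
    {"id": "b", "text": "Mixed text, ein bisschen Deutsch.", "language_score": 0.52},
    {"id": "c", "text": "Another document.", "language_score": 0.60},
    {"id": "d", "text": "No score recorded."},
]

kept = list(filter_by_language_score(docs, min_score=0.6))
kept_ids = [doc["id"] for doc in kept]
print(kept_ids)  # ['a', 'c']
```

Raising `min_score` trades recall for purity: a stricter threshold removes more mixed-language documents but also discards more documents that are in the target language.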

I hope this helps!
