Potential Language Contamination Inquiry #108

iBibek · 2024-03-13T20:40:19Z

The repo mentions that the dataset is composed of only these languages: en, de, fr, and es. Is there any possibility of contamination with other languages in the dataset? I would greatly appreciate your response. Thank you in advance.

mauriceweber · 2024-03-20T08:05:48Z

Hi @iBibek

Thanks for your question -- it is very likely that other languages are also present in the dataset. The language of each document is identified with a FastText classifier, and any document with a score >= 0.5 is assigned that language and kept in the corpus. I would expect documents with lower language scores to be more likely to contain text in other languages, so if you want to filter such instances out, you can filter the dataset with a higher language-score threshold (e.g., RefinedWeb uses 0.6, C4 goes as high as 0.99).
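As a rough illustration of that kind of filtering, here is a minimal sketch. It assumes each document is a dict that carries its classifier score under a hypothetical key `language_score`; the key name, the helper `filter_by_language_score`, and the sample records are made up for the example and are not taken from the dataset schema.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_by_language_score(
    docs: Iterable[Dict[str, Any]],
    min_score: float = 0.5,
    score_key: str = "language_score",  # hypothetical field name, adjust to your data
) -> Iterator[Dict[str, Any]]:
    """Yield only documents whose language score is at least min_score."""
    for doc in docs:
        score = doc.get(score_key)
        # Documents without a score are dropped rather than assumed clean.
        if score is not None and score >= min_score:
            yield doc


# Toy records for illustration only.
docs = [
    {"text": "The quick brown fox.", "language_score": 0.98},
    {"text": "Hello, bonjour, hola.", "language_score": 0.55},
    {"text": "No score attached."},
]

# Raising the threshold from 0.5 to 0.6 drops the mixed-language record.
kept = list(filter_by_language_score(docs, min_score=0.6))
print([d["text"] for d in kept])
```

A stricter threshold trades recall for purity: it removes more mixed-language documents but also discards some valid ones, so it is worth checking how many documents survive at each setting.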

I hope this helps!
