My suspicion is that OSCAR downloaded a single webpage that consisted of, say, 4B backslashes. It then happily sliced it into 0.5M-character records (which I deduce is its maximum document length) and thus introduced thousands of records made up of nothing but backslashes.
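As a minimal sketch of this suspected failure mode (the slicing function and MAX_RECORD_LEN below are my own illustration, not code from OSCAR or its pipeline), cutting one huge page at a fixed maximum length produces many identical records:

# Illustration only: how one pathological page could become thousands of
# identical records if documents are cut at a fixed maximum length.
# MAX_RECORD_LEN is an assumption (2**19, ~0.5M chars), not a value taken
# from the OSCAR/goclassy source.
MAX_RECORD_LEN = 2 ** 19  # 524288

def slice_document(text, max_len=MAX_RECORD_LEN):
    """Split a single document into fixed-size records."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

page = "\\" * 40_000_000   # scaled down from ~4B backslashes so it runs quickly
records = slice_document(page)
print(len(records))        # 77 records
print(len(records[0]))     # 524288 - every record is just backslashes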
Checking that the original indeed contains these records:
cd cache/downloads
find . -type f -size +50k | xargs -n1 gunzip -c | fgrep -a '\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\' | tee data-with-many-slashes.txt
Since goclassy (the pipeline used to generate OSCAR 2019) and its successor ungoliant download and generate OSCAR from CommonCrawl dumps, it seems that the whole downloading and slicing of the 4B backslashes happened there.
I would like some clarification about the word "record", since it can mean many things in this context.
The issue may present itself again in the latest OSCAR 21.09, since the filtering is more or less the same.
We will look into what can be done to improve detection of such low-quality content.
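One rough idea of what such detection could look like - this is only a sketch with arbitrary thresholds, not the filter actually used by OSCAR or ungoliant - is to flag records dominated by a single non-alphanumeric character:

from collections import Counter

def looks_degenerate(text, max_single_char_ratio=0.5, min_len=1000):
    # Heuristic sketch: flag long records dominated by one non-alphanumeric
    # character (e.g. hundreds of thousands of backslashes). The thresholds
    # are illustrative assumptions, not OSCAR/ungoliant values.
    if len(text) < min_len:
        return False
    char, count = Counter(text).most_common(1)[0]
    return (not char.isalnum()) and count / len(text) > max_single_char_ratio

print(looks_degenerate("\\" * 524287))            # True
print(looks_degenerate("ordinary prose " * 100))  # False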
At https://github.com/bigscience-workshop/bigscience we found 3835 records full of backslashes in OSCAR-en. To reproduce:
pip install datasets
Look at the lengths: the largest is 524287 (2^19 - 1, i.e. just under 0.5M), which is also the most common.
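The reproduction snippet itself is truncated above; a rough reconstruction of the length check using the datasets library (the config name, streaming approach and sample size below are my assumptions, not the exact script from the issue) could look like this:

from collections import Counter
from datasets import load_dataset

# Stream OSCAR-en instead of downloading the whole corpus; the config name
# "unshuffled_deduplicated_en" is an assumption about which variant is meant.
ds = load_dataset("oscar", "unshuffled_deduplicated_en",
                  split="train", streaming=True)

lengths = Counter()
for i, example in enumerate(ds):
    lengths[len(example["text"])] += 1
    if i >= 1_000_000:        # look at a sample; a full pass takes much longer
        break

print(max(lengths))           # largest record length seen in the sample
print(lengths.most_common(5)) # most frequent record lengths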