3835 records full of backslashes #4

stas00 · 2021-10-27T17:05:56Z

At https://github.com/bigscience-workshop/bigscience we found 3835 records full of backslashes in OSCAR-en.

My suspicion is that OSCAR downloaded a single webpage that consisted of, say, 4B backslashes. It then happily sliced the page into 0.5M-long records (which I deduce is its max doc length) and thus introduced thousands of records containing nothing but backslashes.
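To illustrate the hypothesis, here is a minimal Python sketch of how fixed-size slicing would turn one huge page into many backslash-only records. The function name `slice_into_records` is hypothetical, and the cap of 524287 characters is an assumption taken from the largest length reported further down in this comment; this is not the actual OSCAR pipeline code.

```python
# Assumed maximum record length (2**19 - 1 = 524287); the real pipeline's
# limit and slicing logic are not confirmed by this sketch.
MAX_LEN = 2**19 - 1


def slice_into_records(text: str, max_len: int = MAX_LEN) -> list[str]:
    """Split text into consecutive chunks of at most max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


# A scaled-down "page" made only of backslashes: five full records plus a tail.
page = "\\" * (5 * MAX_LEN + 1000)
records = slice_into_records(page)

for index, record in enumerate(records):
    print(index, len(record))
```

Every full-size chunk has exactly the maximum length, which would explain why one length dominates the counts.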

Checking that the original indeed contains these records:

Download the dataset (after running pip install datasets):

python -c "from datasets import load_dataset; load_dataset('oscar', 'unshuffled_deduplicated_en', split='train', keep_in_memory=False, cache_dir='cache')"

Check the original records:

cd cache/downloads
find . -type f -size +50k | xargs -n1  gunzip -c | fgrep -a '\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\' | tee data-with-many-slashes.txt

Validate:

$ perl -lne 'm|(\\{10000,})| && print length $1' data-with-many-slashes.txt | wc -l
4245

Look at the lengths:

perl -lne 'm|(\\{10000,})| && print length $1' data-with-many-slashes.txt | sort -V

The largest length is 524287 (2^19 - 1), which is also the most common record length.
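For anyone who prefers Python to the perl one-liners, the same check can be sketched as follows. It mirrors the perl logic (first run of at least 10000 backslashes per line) and adds a tally of lengths, which makes the most common length easy to see. The helper names `run_lengths` and `tally` are made up for this example; the file name is the one produced by the tee command above.

```python
import re
from collections import Counter

# A run of at least 10000 consecutive backslashes, as in the perl pattern.
RUN = re.compile(r"\\{10000,}")


def run_lengths(lines):
    """Yield the length of the first long backslash run found in each line."""
    for line in lines:
        match = RUN.search(line)
        if match:
            yield len(match.group(0))


def tally(path):
    """Count how often each run length occurs in the file at path."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return Counter(run_lengths(handle))


if __name__ == "__main__":
    import os

    path = "data-with-many-slashes.txt"
    if os.path.exists(path):
        counts = tally(path)
        print("matching lines:", sum(counts.values()))
        print("largest run:", max(counts, default=0))
        print("most common lengths:", counts.most_common(5))
```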


Uinelj · 2021-10-28T11:30:02Z

Hi and thank you for the report.

Since goclassy (the pipeline used to generate OSCAR 2019) and its successor ungoliant download CommonCrawl dumps and generate OSCAR from them, it seems that the downloading and slicing of the 4B backslashes happened there.

Could you clarify what you mean by "record"? The word can mean several things in this context.

The issue may well appear again in the latest OSCAR 21.09, since the filtering is more or less the same.

We will look into what can be done to improve detection of such low-quality content.
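As one possible direction, here is a rough Python sketch of a heuristic that flags documents dominated by a single character. This is only an illustration, not what goclassy or ungoliant do: the function names and the 0.9 threshold are assumptions and would need tuning on real data.

```python
from collections import Counter


def dominant_char_ratio(text: str) -> float:
    """Return the share of the text taken up by its most frequent character."""
    if not text:
        return 0.0
    _, count = Counter(text).most_common(1)[0]
    return count / len(text)


def looks_degenerate(text: str, threshold: float = 0.9) -> bool:
    """Flag text where one character accounts for at least threshold of it.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return dominant_char_ratio(text) >= threshold


print(looks_degenerate("\\" * 1000))
print(looks_degenerate("The quick brown fox jumps over the lazy dog."))
```

A check like this is cheap to run per document, but it only catches the single-character case; repeated multi-character patterns would need a different signal.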

Uinelj added the bug, lang:en, ver:21.09, and ver:2019 labels and removed the ver:21.09 label on Oct 28, 2021

Uinelj added this to OSCAR Feb 10, 2022
