3835 records full of backslashes #4

Open
stas00 opened this issue Oct 27, 2021 · 1 comment
Labels: bug (Something isn't working), lang:en (Language: English), ver:2019 (Version: OSCAR 2019)

stas00 commented Oct 27, 2021

At https://github.com/bigscience-workshop/bigscience we found 3835 records full of backslashes in OSCAR-en.

My suspicion is that OSCAR downloaded a single webpage consisting of, say, 4B backslashes. It then happily sliced it into 0.5M-long records (which I deduce is its max doc length), thus introducing thousands of records of just backslashes.
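To illustrate the hypothesis, here is a minimal sketch of what fixed-size slicing would do to one huge run of backslashes. This is not the actual OSCAR pipeline code: `slice_into_records` is a hypothetical helper, and the cap of 524287 characters is an assumption taken from the largest length reported in this issue.

```python
# Hypothetical sketch: NOT the real pipeline, just the suspected behaviour.
MAX_LEN = 524_287  # assumed max record length (2**19 - 1), from the lengths seen in this issue


def slice_into_records(text: str, max_len: int = MAX_LEN) -> list[str]:
    """Cut one long document into consecutive chunks of at most max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


# A scaled-down "page": ten full chunks of backslashes plus a 5-character tail.
page = "\\" * (10 * MAX_LEN + 5)
records = slice_into_records(page)

full = [r for r in records if len(r) == MAX_LEN]
print(len(records), len(full))  # 11 records in total, 10 of them at the maximum length
```

If the suspicion is right, almost every resulting record has exactly the maximum length, with at most one shorter remainder per source page.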

Checking that the original indeed contains these records:

  • Download the dataset (after pip install datasets)
python -c "from datasets import load_dataset; load_dataset('oscar', 'unshuffled_deduplicated_en', split='train', keep_in_memory=False, cache_dir='cache')"
  • Check the original records:
cd cache/downloads
find . -type f -size +50k | xargs -n1  gunzip -c | fgrep -a '\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\' | tee data-with-many-slashes.txt
  • Validate:
$ perl -lne 'm|(\\{10000,})| && print length $1' data-with-many-slashes.txt | wc -l
4245

Look at the lengths:

perl -lne 'm|(\\{10000,})| && print length $1' data-with-many-slashes.txt | sort -V

The largest length is 524287, which is also the most common record length.
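For anyone who prefers Python to perl, the same tally can be sketched as follows. It mirrors the one-liner above: for each line, take the first run of at least 10000 backslashes and record its length. The function name `backslash_run_lengths` is made up for this example; the file name in the usage note is the one produced by the `tee` command earlier.

```python
import re
from collections import Counter
from typing import Iterable


def backslash_run_lengths(lines: Iterable[str], min_len: int = 10000) -> list[int]:
    """Length of the first run of >= min_len backslashes in each line that has one."""
    pattern = re.compile(r"\\{%d,}" % min_len)  # a literal backslash, repeated min_len+ times
    lengths = []
    for line in lines:
        match = pattern.search(line)
        if match:
            lengths.append(len(match.group(0)))
    return lengths


# Tiny synthetic check instead of the real multi-gigabyte file.
sample = ["plain text", "\\" * 10000, "x" + "\\" * 12000 + "y", "\\" * 9999]
lengths = backslash_run_lengths(sample)
histogram = Counter(lengths)
print(sorted(histogram.items()))
```

On the real data this would be called as `backslash_run_lengths(open("data-with-many-slashes.txt", errors="replace"))`, and `Counter(...).most_common(1)` then gives the most frequent length.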

@Uinelj added the bug, lang:en and ver:2019 labels Oct 28, 2021 (ver:21.09 was added, then removed)

Uinelj commented Oct 28, 2021

Hi and thank you for the report.

Since goclassy (the pipeline used to generate OSCAR 2019) and its sequel ungoliant download and generate OSCAR from CommonCrawl dumps, it seems that the whole downloading and slicing of the 4B backslashes happened there.

Could you clarify what you mean by "record"? The word can mean several things in this context.

The issue may well show up again in the latest OSCAR 21.09, since the filtering is more or less the same.

We will look into what can be done to improve detection of such low-quality content.
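As one possible direction, a simple heuristic could flag documents dominated by a single character. The sketch below is only an illustration and is not what goclassy or ungoliant currently do; the function names and both thresholds (`threshold=0.9`, `min_len=1000`) are assumptions chosen for the example.

```python
from collections import Counter


def dominant_char_ratio(text: str) -> float:
    """Fraction of the text taken up by its single most frequent character."""
    if not text:
        return 0.0
    _, count = Counter(text).most_common(1)[0]
    return count / len(text)


def looks_degenerate(text: str, threshold: float = 0.9, min_len: int = 1000) -> bool:
    """Flag long documents where one character makes up at least `threshold` of the text.

    Both thresholds are illustrative guesses, not tuned values.
    """
    return len(text) >= min_len and dominant_char_ratio(text) >= threshold


print(looks_degenerate("\\" * 5000))                     # a record of only backslashes
print(looks_degenerate("The quick brown fox. " * 100))   # ordinary-looking text
```

A check like this would catch the backslash-only records described above, though real filtering would need tuning to avoid dropping legitimate content such as ASCII art or tables.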

@Uinelj Uinelj added this to OSCAR Feb 10, 2022