
OSCAR-2109 datasets are misaligned and truncated #3704

Closed · adrianeboyd opened this issue Feb 11, 2022 · 10 comments
Labels: bug (Something isn't working)

@adrianeboyd

Describe the bug

The oscar-corpus/OSCAR-2109 data appears to be misaligned and truncated by the dataset builder for subsets that contain more than one part and for cases where the texts contain non-unix newlines.

Steps to reproduce the bug

A few examples, although I'm not sure how deterministic the particular (mis)alignment is in various configurations:

from datasets import load_dataset
dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_fi", split="train", use_auth_token=True)
entry = dataset[0]
# entry["text"] is from fi_part_3.txt.gz
# entry["meta"] is from fi_meta_part_2.jsonl.gz

dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_no", split="train", use_auth_token=True)
entry = dataset[900000]
# entry["text"] is from no_part_3.txt.gz and contains a blank line
# entry["meta"] is from no_meta_part_1.jsonl.gz

dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_mk", split="train", streaming=True, use_auth_token=True)
# 9088 texts in the dataset are empty
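
For reference, a minimal sketch of how the empty-text count above can be reproduced in streaming mode (the exact number may depend on the datasets version):

from datasets import load_dataset

dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_mk", split="train", streaming=True, use_auth_token=True)
# count entries whose extracted text is empty
n_empty = sum(1 for entry in dataset if not entry["text"])
print(n_empty)  # 9088 with the buggy loader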

For deduplicated_fi, the raw texts exported from the dataset total 17GB rather than the 20GB reported in the data splits overview table. The wc -w token count for the exported texts is 2,067,556,874 rather than the 2,357,264,196 expected from the data splits table.

For deduplicated_no, the exported texts contain 624,040,887 rather than the expected 776,354,517 tokens.

For deduplicated_mk, it is 122,236,936 rather than 134,544,934 tokens.

I'm not expecting the wc -w counts to line up exactly with the data splits table, but for comparison, the wc -w count for deduplicated_mk on the raw texts is 134,545,424.
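
As a rough sketch of the kind of count used above, assuming the raw part files have been downloaded locally (file names illustrative), a wc -w-style whitespace token count:

import gzip

# approximate `wc -w` over the raw OSCAR text parts
n_tokens = 0
for part in ["mk_part_1.txt.gz", "mk_part_2.txt.gz"]:
    with gzip.open(part, "rt", encoding="utf-8") as f:
        for line in f:
            n_tokens += len(line.split())
print(n_tokens)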

Issues

  • The meta and text files are not paired correctly at load time, so the extracted texts start at the wrong offsets, the metadata is associated with the wrong texts, and a text file may be left partially unread or read past its end (yielding empty texts).
  • The line-count offset is not reset per file, so in every part after the first the texts are read from the wrong offsets, leading to truncation once the effectively blank lines are no longer skipped.
  • Non-Unix newline characters are treated as line breaks when reading the text files, while the metadata counts only Unix newlines for its line offsets, producing further misalignment between metadata and extracted texts, and further truncation (see the sketch after this list).
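
To illustrate the newline point, a minimal sketch using io.StringIO in place of a gzip text stream (gzip.open in "rt" mode defaults to universal newlines, i.e. newline=None):

import io

raw = "one metadata line, but it contains a stray \r carriage return\n"

# newline=None (the text-mode default): "\r" also ends a line for readline()
print(io.StringIO(raw, newline=None).readlines())  # two lines

# newline="\n": only "\n" ends a line, matching the metadata line counts
print(io.StringIO(raw, newline="\n").readlines())  # one line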

Expected results

All texts from the OSCAR release are extracted according to the metadata and aligned with the correct metadata.

Fixes

These are not necessarily the exact fixes/checks you may want to use (I didn't test all languages or do any cross-platform testing, and I'm not sure all the details are compatible with streaming), but they highlight the issues:

diff --git a/OSCAR-2109.py b/OSCAR-2109.py
index bbac1076..5eee8de7 100644
--- a/OSCAR-2109.py
+++ b/OSCAR-2109.py
@@ -20,6 +20,7 @@
 import collections
 import gzip
 import json
+import os
 
 import datasets
 
@@ -387,9 +388,20 @@ class Oscar2109(datasets.GeneratorBasedBuilder):
         with open(checksum_file, encoding="utf-8") as f:
             data_filenames = [line.split()[1] for line in f if line]
             data_urls = [self.config.base_data_path + data_filename for data_filename in data_filenames]
-        text_files = dl_manager.download([url for url in data_urls if url.endswith(".txt.gz")])
-        metadata_files = dl_manager.download([url for url in data_urls if url.endswith(".jsonl.gz")])
+        # sort filenames so corresponding parts are aligned
+        text_files = sorted(dl_manager.download([url for url in data_urls if url.endswith(".txt.gz")]))
+        metadata_files = sorted(dl_manager.download([url for url in data_urls if url.endswith(".jsonl.gz")]))
+        assert len(text_files) == len(metadata_files)
         metadata_and_text_files = list(zip(metadata_files, text_files))
+        for meta_path, text_path in metadata_and_text_files:
+            # check that meta/text part numbers are the same
+            if "part" in os.path.basename(text_path):
+                assert (
+                    os.path.basename(text_path).replace(".txt.gz", "").split("_")[-1]
+                    == os.path.basename(meta_path).replace(".jsonl.gz", "").split("_")[-1]
+                )
+            else:
+                assert len(metadata_and_text_files) == 1
         return [
             datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"metadata_and_text_files": metadata_and_text_files}),
         ]
@@ -397,10 +409,14 @@ class Oscar2109(datasets.GeneratorBasedBuilder):
     def _generate_examples(self, metadata_and_text_files):
         """This function returns the examples in the raw (text) form by iterating on all the files."""
         id_ = 0
-        offset = 0
         for meta_path, text_path in metadata_and_text_files:
+            # line offsets are per text file
+            offset = 0
             logger.info("generating examples from = %s", text_path)
-            with gzip.open(open(text_path, "rb"), "rt", encoding="utf-8") as text_f:
+            # some texts contain non-Unix newlines that should not be
+            # interpreted as line breaks for the line counts in the metadata
+            # with readline()
+            with gzip.open(open(text_path, "rb"), "rt", encoding="utf-8", newline="\n") as text_f:
                 with gzip.open(open(meta_path, "rb"), "rt", encoding="utf-8") as meta_f:
                     for line in meta_f:
                         # read meta
@@ -411,7 +427,12 @@ class Oscar2109(datasets.GeneratorBasedBuilder):
                             offset += 1
                             text_f.readline()
                         # read text
-                        text = "".join([text_f.readline() for _ in range(meta["nb_sentences"])]).rstrip()
+                        text_lines = [text_f.readline() for _ in range(meta["nb_sentences"])]
+                        # all lines contain text (no blank lines or EOF)
+                        assert all(text_lines)
+                        assert "\n" not in text_lines
                         offset += meta["nb_sentences"]
+                        # only strip the trailing newline
+                        text = "".join(text_lines).rstrip("\n")
                         yield id_, {"id": id_, "text": text, "meta": meta}
                         id_ += 1

I've tested this with a number of smaller deduplicated languages with 1-20 parts, and the resulting datasets looked correct in terms of word count and size when compared to the data splits table and raw texts; the text/metadata alignments were correct in all my spot checks. However, there are many languages I didn't test, and I can't rule out, for instance, texts containing blank lines somewhere in the corpus. For the cases I tested, the assertions related to blank lines and EOF made it easier to verify that the text and metadata were aligned as intended, since there would be little chance of spurious alignments of variable-length texts across so much data.
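
For anyone wanting to replicate that kind of spot check, a sketch of one alignment invariant (it should hold with the fixed builder, assuming each meta record's nb_sentences matches the number of lines in its text):

from datasets import load_dataset

dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_mk", split="train", use_auth_token=True)
for entry in dataset:
    # each text should be exactly nb_sentences non-empty lines joined by "\n"
    lines = entry["text"].split("\n")
    assert len(lines) == entry["meta"]["nb_sentences"]
    assert all(lines)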

adrianeboyd added the bug label on Feb 11, 2022
@albertvillanova (Member)

Hi @adrianeboyd, thanks for reporting.

There is indeed a bug in that community dataset. The line:

metadata_and_text_files = list(zip(metadata_files, text_files))

should be replaced with:

metadata_and_text_files = list(zip(sorted(metadata_files), sorted(text_files)))

I am going to contact its owners (https://huggingface.co/oscar-corpus) to inform them about the bug.

I'll keep you informed.

@adrianeboyd (Author)

That fix is part of it, but it's clearly not the only issue.

I had also already contacted the OSCAR creators, but I reported it here because the git history suggested that Hugging Face members were the main authors. Is there a better place to report this?

@Uinelj commented Feb 11, 2022

Hello,

We've had an issue that could be related to this one: oscar-project/corpus#15.

I have been spot-checking the source (.txt/.jsonl) files for a while and have not found issues, especially at the start/end of corpora (but I concede that more integration testing would be necessary on our side).

The text and metadata files are designed to be used in sync (with lang_part_n.txt and lang_meta_part_n.jsonl working together), while staying independent from part to part, so that anyone could randomly choose a part and work with it.
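
For illustration, a sketch of reading one part pair directly in that spirit (file names illustrative; it assumes each meta record carries an "offset" line index and an "nb_sentences" count, as in the builder script):

import gzip
import json

with gzip.open("mk_part_1.txt.gz", "rt", encoding="utf-8", newline="\n") as text_f:
    with gzip.open("mk_meta_part_1.jsonl.gz", "rt", encoding="utf-8") as meta_f:
        offset = 0
        for line in meta_f:
            meta = json.loads(line)
            # skip any lines before this record's starting offset
            while offset < meta["offset"]:
                offset += 1
                text_f.readline()
            # read this record's nb_sentences lines as one text
            text = "".join(text_f.readline() for _ in range(meta["nb_sentences"])).rstrip("\n")
            offset += meta["nb_sentences"]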

The fix @albertvillanova proposed should resolve the problem, as the parts will be in sync again.

Let me know if you need help or more details, I'd be glad to help!

@adrianeboyd (Author)

I'm happy to move the discussion to the other repo!

Merely sorting the files fixes, at best, the processing of the first part. If the first part contains non-Unix newlines, it will still be misaligned/truncated, and all following parts will be truncated with incorrect text offsets and metadata due to the offset and newline bugs.

@albertvillanova (Member) commented Feb 14, 2022

Fixed:

@norakassner commented Mar 2, 2022

Hi @Uinelj, this is a total noob question, but how can I integrate that bugfix into my code? I reinstalled the datasets library, this time from source. Should that have fixed the issue? I am still facing the misalignment issue. Do I need to re-download the dataset from scratch?

@norakassner commented Mar 4, 2022

Hi, I re-downloaded the dataset and still have the problem. See: oscar-project/corpus#18

@albertvillanova (Member)

Sorry @norakassner for the late reply.

There are indeed several issues causing the misalignment, as @adrianeboyd cleverly pointed out:

@albertvillanova (Member)

The issues should all be fixed now:

Feel free to reopen if you find additional misalignments/truncations.

CC: @adrianeboyd @norakassner @Uinelj

@adrianeboyd (Author)

Thanks for the updates!

The purist in me would still like the rstrip not to strip additional characters from the original text (mainly Unicode whitespace, I think), but the differences are extremely small in practice and it doesn't actually matter for my current task:

text = "".join([text_f.readline() for _ in range(meta["nb_sentences"])]).rstrip("\n")
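
To see the difference, a tiny illustration (the string is made up; U+3000 is ideographic space, which str.rstrip() treats as whitespace):

text = "final line ends in an ideographic space\u3000\n"

print(repr(text.rstrip()))      # strips the "\u3000" too
print(repr(text.rstrip("\n")))  # keeps it, stripping only the newline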
