
OSCAR-2109 datasets are misaligned and truncated #3704

Closed · adrianeboyd opened this issue Feb 11, 2022 · 10 comments
Labels: bug (Something isn't working)

@adrianeboyd

Describe the bug

The oscar-corpus/OSCAR-2109 data appears to be misaligned and truncated by the dataset builder for subsets that contain more than one part and for cases where the texts contain non-unix newlines.

Steps to reproduce the bug

A few examples, although I'm not sure how deterministic the particular (mis)alignment is in various configurations:

from datasets import load_dataset
dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_fi", split="train", use_auth_token=True)
entry = dataset[0]
# entry["text"] is from fi_part_3.txt.gz
# entry["meta"] is from fi_meta_part_2.jsonl.gz

dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_no", split="train", use_auth_token=True)
entry = dataset[900000]
# entry["text"] is from no_part_3.txt.gz and contains a blank line
# entry["meta"] is from no_meta_part_1.jsonl.gz

dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_mk", split="train", streaming=True, use_auth_token=True)
# 9088 texts in the dataset are empty
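
For reference, a minimal sketch of how the empty-text count above can be reproduced in streaming mode (the exact number may depend on the datasets version):

from datasets import load_dataset

dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_mk", split="train", streaming=True, use_auth_token=True)
# count entries whose extracted text is empty
n_empty = sum(1 for entry in dataset if not entry["text"])
print(n_empty)  # 9088 with the buggy loader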

For deduplicated_fi, the raw texts exported from the dataset total 17GB rather than the 20GB reported in the data splits overview table. The wc -w token count for the exported texts is 2,067,556,874 rather than the 2,357,264,196 expected from the data splits table.

For deduplicated_no, the exported texts contain 624,040,887 rather than the expected 776,354,517 tokens.

For deduplicated_mk, it is 122,236,936 rather than 134,544,934 tokens.

I'm not expecting the wc -w counts to line up exactly with the data splits table, but for comparison, the wc -w count for deduplicated_mk on the raw texts is 134,545,424.
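
As a rough sketch of the kind of count used above, assuming the raw part files have been downloaded locally (file names illustrative), a wc -w-style whitespace token count:

import gzip

# approximate `wc -w` over the raw OSCAR text parts
n_tokens = 0
for part in ["mk_part_1.txt.gz", "mk_part_2.txt.gz"]:
    with gzip.open(part, "rt", encoding="utf-8") as f:
        for line in f:
            n_tokens += len(line.split())
print(n_tokens)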

Issues

  • The meta and text files are not paired correctly at load time, so the extracted texts start at the wrong offsets, the metadata is associated with the wrong texts, and a text file may be left partially unread or read past its end (yielding empty texts).
  • The line-count offset is not reset per file, so in every part after the first the texts are read from the wrong offsets, leading to truncation once the effectively blank lines are no longer skipped.
  • Non-Unix newline characters are treated as line breaks when reading the text files, while the metadata counts only Unix newlines for its line offsets, producing further misalignment between metadata and extracted texts, and further truncation (see the sketch after this list).
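
To illustrate the newline point, a minimal sketch using io.StringIO in place of a gzip text stream (gzip.open in "rt" mode defaults to universal newlines, i.e. newline=None):

import io

raw = "one metadata line, but it contains a stray \r carriage return\n"

# newline=None (the text-mode default): "\r" also ends a line for readline()
print(io.StringIO(raw, newline=None).readlines())  # two lines

# newline="\n": only "\n" ends a line, matching the metadata line counts
print(io.StringIO(raw, newline="\n").readlines())  # one line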

Expected results

All texts from the OSCAR release are extracted according to the metadata and aligned with the correct metadata.

Fixes

These are not necessarily the exact fixes/checks you may want to use (I didn't test all languages or do any cross-platform testing, and I'm not sure all the details are compatible with streaming), but they highlight the issues:

diff --git a/OSCAR-2109.py b/OSCAR-2109.py
index bbac1076..5eee8de7 100644
--- a/OSCAR-2109.py
+++ b/OSCAR-2109.py
@@ -20,6 +20,7 @@
 import collections
 import gzip
 import json
+import os
 
 import datasets
 
@@ -387,9 +388,20 @@ class Oscar2109(datasets.GeneratorBasedBuilder):
         with open(checksum_file, encoding="utf-8") as f:
             data_filenames = [line.split()[1] for line in f if line]
             data_urls = [self.config.base_data_path + data_filename for data_filename in data_filenames]
-        text_files = dl_manager.download([url for url in data_urls if url.endswith(".txt.gz")])
-        metadata_files = dl_manager.download([url for url in data_urls if url.endswith(".jsonl.gz")])
+        # sort filenames so corresponding parts are aligned
+        text_files = sorted(dl_manager.download([url for url in data_urls if url.endswith(".txt.gz")]))
+        metadata_files = sorted(dl_manager.download([url for url in data_urls if url.endswith(".jsonl.gz")]))
+        assert len(text_files) == len(metadata_files)
         metadata_and_text_files = list(zip(metadata_files, text_files))
+        for meta_path, text_path in metadata_and_text_files:
+            # check that meta/text part numbers are the same
+            if "part" in os.path.basename(text_path):
+                assert (
+                    os.path.basename(text_path).replace(".txt.gz", "").split("_")[-1]
+                    == os.path.basename(meta_path).replace(".jsonl.gz", "").split("_")[-1]
+                )
+            else:
+                assert len(metadata_and_text_files) == 1
         return [
             datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"metadata_and_text_files": metadata_and_text_files}),
         ]
@@ -397,10 +409,14 @@ class Oscar2109(datasets.GeneratorBasedBuilder):
     def _generate_examples(self, metadata_and_text_files):
         """This function returns the examples in the raw (text) form by iterating on all the files."""
         id_ = 0
-        offset = 0
         for meta_path, text_path in metadata_and_text_files:
+            # line offsets are per text file
+            offset = 0
             logger.info("generating examples from = %s", text_path)
-            with gzip.open(open(text_path, "rb"), "rt", encoding="utf-8") as text_f:
+            # some texts contain non-Unix newlines that should not be
+            # interpreted as line breaks for the line counts in the metadata
+            # with readline()
+            with gzip.open(open(text_path, "rb"), "rt", encoding="utf-8", newline="\n") as text_f:
                 with gzip.open(open(meta_path, "rb"), "rt", encoding="utf-8") as meta_f:
                     for line in meta_f:
                         # read meta
@@ -411,7 +427,12 @@ class Oscar2109(datasets.GeneratorBasedBuilder):
                             offset += 1
                             text_f.readline()
                         # read text
-                        text = "".join([text_f.readline() for _ in range(meta["nb_sentences"])]).rstrip()
+                        text_lines = [text_f.readline() for _ in range(meta["nb_sentences"])]
+                        # all lines contain text (no blank lines or EOF)
+                        assert all(text_lines)
+                        assert "\n" not in text_lines
                         offset += meta["nb_sentences"]
+                        # only strip the trailing newline
+                        text = "".join(text_lines).rstrip("\n")
                         yield id_, {"id": id_, "text": text, "meta": meta}
                         id_ += 1

I've tested this with a number of smaller deduplicated languages with 1-20 parts, and the resulting datasets looked correct in terms of word count and size when compared to the data splits table and raw texts; the text/metadata alignments were correct in all my spot checks. However, there are many languages I didn't test, and I can't rule out, for instance, texts containing blank lines somewhere in the corpus. For the cases I tested, the assertions related to blank lines and EOF made it easier to verify that the text and metadata were aligned as intended, since there would be little chance of spurious alignments of variable-length texts across so much data.
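
For anyone wanting to replicate that kind of spot check, a sketch of one alignment invariant (it should hold with the fixed builder, assuming each meta record's nb_sentences matches the number of lines in its text):

from datasets import load_dataset

dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_mk", split="train", use_auth_token=True)
for entry in dataset:
    # each text should be exactly nb_sentences non-empty lines joined by "\n"
    lines = entry["text"].split("\n")
    assert len(lines) == entry["meta"]["nb_sentences"]
    assert all(lines)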

adrianeboyd added the bug label on Feb 11, 2022
@albertvillanova (Member)

Hi @adrianeboyd, thanks for reporting.

There is indeed a bug in that community dataset. The line:

metadata_and_text_files = list(zip(metadata_files, text_files))

should be replaced with:

metadata_and_text_files = list(zip(sorted(metadata_files), sorted(text_files)))

I am going to contact its owners (https://huggingface.co/oscar-corpus) to inform them about the bug.

I'll keep you informed.

@adrianeboyd (Author)

That fix is part of it, but it's clearly not the only issue.

I had also already contacted the OSCAR creators, but I reported it here because the git history suggested that Hugging Face members were the main authors. Is there a better place to report this?

@Uinelj commented Feb 11, 2022

Hello,

We've had an issue that could be related to this one: oscar-project/corpus#15.

I have been spot-checking the source (.txt/.jsonl) files for a while and have not found issues, especially at the start/end of corpora (but I concede that more integration testing would be necessary on our side).

The text and metadata files are designed to be used in sync (with lang_part_n.txt and lang_meta_part_n.jsonl working together), while staying independent from part to part, so that anyone could randomly choose a part and work with it.
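
For illustration, a sketch of reading one part pair directly in that spirit (file names illustrative; it assumes each meta record carries an "offset" line index and an "nb_sentences" count, as in the builder script):

import gzip
import json

with gzip.open("mk_part_1.txt.gz", "rt", encoding="utf-8", newline="\n") as text_f:
    with gzip.open("mk_meta_part_1.jsonl.gz", "rt", encoding="utf-8") as meta_f:
        offset = 0
        for line in meta_f:
            meta = json.loads(line)
            # skip any lines before this record's starting offset
            while offset < meta["offset"]:
                offset += 1
                text_f.readline()
            # read this record's nb_sentences lines as one text
            text = "".join(text_f.readline() for _ in range(meta["nb_sentences"])).rstrip("\n")
            offset += meta["nb_sentences"]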

The fix @albertvillanova proposed should resolve the problem, as the parts will be in sync again.

Let me know if you need help or more details, I'd be glad to help!

@adrianeboyd (Author)

I'm happy to move the discussion to the other repo!

Merely sorting the files fixes, at best, the processing of the first part. If the first part contains non-Unix newlines, it will still be misaligned/truncated, and all following parts will be truncated with incorrect text offsets and metadata due to the offset and newline bugs.

@albertvillanova (Member) commented Feb 14, 2022

Fixed:

@norakassner commented Mar 2, 2022

Hi @Uinelj, this is a total noob question, but how can I integrate that bugfix into my code? I reinstalled the datasets library, this time from source. Should that have fixed the issue? I am still facing the misalignment issue. Do I need to re-download the dataset from scratch?

@norakassner commented Mar 4, 2022

Hi, I re-downloaded the dataset and still have the problem. See: oscar-project/corpus#18

@albertvillanova (Member)

Sorry @norakassner for the late reply.

There are indeed several issues causing the misalignment, as @adrianeboyd cleverly pointed out:

@albertvillanova (Member)

The issues should all be fixed now:

Feel free to reopen if you find additional misalignments/truncations.

CC: @adrianeboyd @norakassner @Uinelj

@adrianeboyd (Author)

Thanks for the updates!

The purist in me would still like the rstrip not to strip additional characters from the original text (mainly Unicode whitespace, I think), but the differences are extremely small in practice and it doesn't actually matter for my current task:

text = "".join([text_f.readline() for _ in range(meta["nb_sentences"])]).rstrip("\n")
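
To see the difference, a tiny illustration (the string is made up; U+3000 is ideographic space, which str.rstrip() treats as whitespace):

text = "final line ends in an ideographic space\u3000\n"

print(repr(text.rstrip()))      # strips the "\u3000" too
print(repr(text.rstrip("\n")))  # keeps it, stripping only the newline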
