OSCAR-2109 datasets are misaligned and truncated #3704
Comments
Hi @adrianeboyd, thanks for reporting. There is indeed a bug in that community dataset: `metadata_and_text_files = list(zip(metadata_files, text_files))` should be replaced with `metadata_and_text_files = list(zip(sorted(metadata_files), sorted(text_files)))`. I am going to contact the owners (https://huggingface.co/oscar-corpus) to inform them about the bug. I will keep you informed.
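For context, a minimal sketch of the pairing bug and the proposed one-line fix (the glob patterns are illustrative, not the real builder's):

```python
from glob import glob

# Builder variable names taken from the comment above; the patterns here are
# placeholders for whatever the real loading script globs.
metadata_files = glob("*.jsonl")
text_files = glob("*.txt")

# Buggy: relies on the filesystem returning both globs in matching order,
# which is not guaranteed, so text part N can be paired with metadata part M.
metadata_and_text_files = list(zip(metadata_files, text_files))

# Fixed: sorting both lists makes the pairing deterministic, so
# *_part_1.txt always lines up with *_part_1.jsonl.
metadata_and_text_files = list(zip(sorted(metadata_files), sorted(text_files)))
```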
That fix is part of it, but it's clearly not the only issue. I already contacted the OSCAR creators as well, but I reported it here because it looked like Hugging Face members were the main authors in the git history. Is there a better place to report this?
Hello, we've had an issue that could be linked to this one here: oscar-project/corpus#15. I have been spot checking the source files. The text and metadata files are designed to be used in sync. The fix @albertvillanova proposed should fix the problem, as the parts will be in sync again. Let me know if you need help or more details, I'd be glad to help!
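To illustrate that sync design, here is a toy example; the `offset`/`nb_sentences` field names follow later comments in this thread, but the exact record layout is an assumption:

```python
import io
import json

# Each line of the .jsonl metadata file describes one document in the .txt
# file: `offset` is the 0-based line where the document starts and
# `nb_sentences` is the number of lines it spans. Documents are separated by
# a blank line, which the offsets account for.
text_f = io.StringIO("doc one line one\ndoc one line two\n\ndoc two line one\n")
meta_f = io.StringIO(
    '{"offset": 0, "nb_sentences": 2}\n'
    '{"offset": 3, "nb_sentences": 1}\n'
)

text_lines = text_f.readlines()
for line in meta_f:
    meta = json.loads(line)
    doc = "".join(text_lines[meta["offset"]:meta["offset"] + meta["nb_sentences"]])
    print(repr(doc))
```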
I'm happy to move the discussion to the other repo! Merely sorting the files may only fix the processing of the first part: if the first part contains non-unix newlines, it will still be misaligned/truncated, and all the following parts will be truncated with incorrect text offsets and metadata due to the offset and newline bugs.
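A minimal demonstration of the newline half of that problem, on synthetic data rather than the OSCAR builder itself: Python's default universal-newline mode counts a lone `\r` as a line break, so newline-based line counts drift.

```python
# A document containing a stray "\r" followed by a second document.
data = b"first sentence\rsame document\nnext document\n"
with open("demo.txt", "wb") as f:
    f.write(data)

with open("demo.txt", encoding="utf-8") as f:       # universal newlines
    print(sum(1 for _ in f))                        # 3 "lines"

with open("demo.txt", encoding="utf-8", newline="\n") as f:  # "\n" only
    print(sum(1 for _ in f))                        # 2 lines
```

Opening with `newline="\n"` preserves the line structure the metadata's newline-based counts were computed against.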
Hi @Uinelj, this is a total noob's question, but how can I integrate that bugfix into my code? I reinstalled the datasets library, this time from source. Should that have fixed the issue? I am still facing the misalignment issue. Do I need to download the dataset from scratch?
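One way to pick up a fixed community loading script (a hedged suggestion, not confirmed in this thread; the config name is just an example) is to force a fresh download so the stale cache is not reused:

```python
from datasets import load_dataset

ds = load_dataset(
    "oscar-corpus/OSCAR-2109",
    "deduplicated_fi",
    use_auth_token=True,               # OSCAR-2109 is gated on the Hub
    download_mode="force_redownload",  # ignore previously cached data
)
```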
Hi, I re-downloaded the dataset and still have the problem. See: oscar-project/corpus#18 |
Sorry @norakassner for the late reply. There are indeed several issues creating the misalignment, as @adrianeboyd cleverly pointed out. Normally, the issues should be fixed now. Feel free to reopen if you find additional misalignments/truncations.
Thanks for the updates! The purist in me would still like to have the `rstrip` not strip additional characters from the original text (mainly unicode whitespace, I think), but the differences are extremely small in practice and it doesn't actually matter for my current task:

```python
text = "".join([text_f.readline() for _ in range(meta["nb_sentences"])]).rstrip("\n")
```
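For reference, a stricter variant along those lines (my sketch, not the builder's code) would drop only the final newline:

```python
# Remove only the single newline returned by the last readline(), leaving any
# other trailing characters -- including unicode whitespace -- untouched.
def read_text(text_f, nb_sentences):
    text = "".join(text_f.readline() for _ in range(nb_sentences))
    return text[:-1] if text.endswith("\n") else text
```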
Describe the bug

The `oscar-corpus/OSCAR-2109` data appears to be misaligned and truncated by the dataset builder for subsets that contain more than one part and for cases where the texts contain non-unix newlines.

Steps to reproduce the bug

A few examples, although I'm not sure how deterministic the particular (mis)alignment is in various configurations:

- For `deduplicated_fi`, all exported raw texts from the dataset total 17GB rather than the 20GB reported in the data splits overview table. The token count with `wc -w` for the raw texts is 2,067,556,874 rather than the expected 2,357,264,196 from the data splits table.
- For `deduplicated_no`, all exported raw texts contain 624,040,887 rather than the expected 776,354,517 tokens.
- For `deduplicated_mk`, it is 122,236,936 rather than 134,544,934 tokens.

I'm not expecting the `wc -w` counts to line up exactly with the data splits table, but for comparison the `wc -w` count for `deduplicated_mk` on the raw texts is 134,545,424.

Issues

- The metadata and text files for each subset are zipped together without sorting, so a text part can be paired with the wrong metadata part.
- The text files are read with universal newlines enabled, so texts containing non-unix newlines are split into more lines than the metadata's newline-based sentence counts expect.
- The line offsets are not tracked correctly across multiple parts, so once one part is misaligned, all following parts are truncated with incorrect text offsets and metadata.
Expected results
All texts from the OSCAR release are extracted according to the metadata and aligned with the correct metadata.
Fixes
Not necessarily the exact fixes/checks you may want to use (I didn't test all languages or do any cross-platform testing, and I'm not sure all the details are compatible with streaming), but to highlight the issues:
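A sketch of roughly what a fixed reading loop with those checks could look like; the file handling and field names are assumptions based on this thread, and I have not verified it against the real builder or its streaming mode:

```python
import json

def generate_examples(metadata_files, text_files):
    # Fix 1: sort both lists so text part N is paired with metadata part N.
    for meta_path, text_path in zip(sorted(metadata_files), sorted(text_files)):
        # Fix 2: newline="\n" disables universal-newline translation, so a
        # stray "\r" inside a document is not miscounted as a line break.
        with open(meta_path, encoding="utf-8") as meta_f, \
                open(text_path, encoding="utf-8", newline="\n") as text_f:
            for meta_line in meta_f:
                meta = json.loads(meta_line)
                # Fix 3: consume exactly nb_sentences lines per record so the
                # file position stays in sync with the metadata across parts.
                text = "".join(
                    text_f.readline() for _ in range(meta["nb_sentences"])
                ).rstrip("\n")
                # Check: documents are separated by a single blank line (or
                # EOF); anything else means text and metadata have drifted
                # apart, so fail loudly instead of yielding garbage.
                sep = text_f.readline()
                assert sep in ("\n", ""), "text/metadata misalignment detected"
                yield meta, text
```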
I've tested this with a number of smaller deduplicated languages with 1-20 parts, and the resulting datasets looked correct in terms of word count and size when compared to the data splits table and the raw texts; the text/metadata alignments were correct in all my spot checks. However, there are many, many languages I didn't test, and I'm not sure, for instance, that there aren't any texts containing blank lines in the corpus. For the cases I tested, the assertions related to blank lines and EOF made it easier to verify that the text and metadata were aligned as intended, since there would be little chance of spurious alignments of variable-length texts across so much data.