fix: add support for wide characters when building index of dataset files #728
Context
When reading a dataset for the first time, metaseq creates an "index" file that stores the byte offset in the .jsonl file at which item N begins.
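For illustration, a byte-offset index lets item N be read with a single seek instead of scanning the file from the start. This is a minimal sketch with hypothetical names, not metaseq's actual API:

```python
# Hypothetical helper: given a list of byte offsets, read item n directly.
def read_item(path: str, offsets: list[int], n: int) -> bytes:
    with open(path, "rb") as f:  # binary mode so positions are byte offsets
        f.seek(offsets[n])       # jump straight to where item n begins
        return f.readline()      # read exactly one .jsonl record
```

This only works if every entry in `offsets` is a true byte position, which is exactly what the bug below breaks.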
Issue
This calculation

```python
cur += len(line)
```

previously assumed that the "length" of the line just read equals the byte offset of the next item. When the line contains wide Unicode characters, this calculation is incorrect: `len(line)` counts characters, not bytes, so a multi-byte character contributes 1 instead of its total byte width.
Example
Here is an example where `len` gives us the wrong number for a normal string that contains a wide Unicode character. The correct length is obtained if we use a byte string instead, as shown below.
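The snippet below is an illustrative sketch, not the PR's own example; the robot emoji is just an arbitrary wide character:

```python
# "🤖" (U+1F916) is a single character but four bytes in UTF-8.
line = "🤖\n"

# len() on a str counts characters, so the running offset falls short.
print(len(line))                  # 2

# len() on the encoded bytes (or on a line read in binary mode) gives the true size.
print(len(line.encode("utf-8")))  # 5
```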
Solution
Instead of keeping a running counter, this PR simplifies the implementation by querying the current position of the file reader after each line is read. This approach properly accounts for bytes rather than characters.
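A minimal sketch of the described approach, assuming the file is opened in binary mode so that `tell()` returns byte positions; `build_index` is a hypothetical name, not metaseq's exact code:

```python
# Sketch: instead of accumulating len(line), ask the file object where it is
# after each read. tell() on a binary file always reports a byte offset.
def build_index(path: str) -> list[int]:
    offsets = [0]  # item 0 starts at byte 0
    with open(path, "rb") as f:
        while f.readline():
            offsets.append(f.tell())  # byte position where the next item begins
    return offsets
```

The trailing entry is the file size, which serves as a convenient end sentinel; every other entry is the exact starting byte of an item regardless of how wide its characters are.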
Credit goes to @tupini07 for diagnosing and fixing.
This is one of the important bug fixes from #726.
Testing steps
We noticed this bug in our repo when testing datasets containing these wide characters, and the fix has been running in our pipeline for a few weeks now.