Add <|endoftext|> to end of documents #98
I modified the grouping function to add the eos_token_id from the given tokenizer to the end of each non-empty document, and to appropriately update the "attention_masks" field of each document with an extra 1.
I also added a data_preprocessing test that checks that the first two 1000-document batches of the wikitext2 train split, processed by group(), match expected results, along with a few other basic correctness tests.
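A minimal sketch of the change described above. The field names ("input_ids", "attention_masks"), the batch layout, and the helper name `append_eos` are assumptions based on this description, not the actual repository code:

```python
def append_eos(batch, eos_token_id):
    """Append eos_token_id to every non-empty tokenized document and
    mirror the change with an extra 1 in its attention mask.
    (Field names are assumed; adapt to the real group() output.)"""
    for ids, mask in zip(batch["input_ids"], batch["attention_masks"]):
        if ids:  # leave empty documents untouched
            ids.append(eos_token_id)
            mask.append(1)  # manual mask update, as described above
    return batch

# Toy batch: two non-empty documents and one empty one.
batch = {
    "input_ids": [[5, 6, 7], [], [8]],
    "attention_masks": [[1, 1, 1], [], [1]],
}
append_eos(batch, eos_token_id=0)
# → input_ids [[5, 6, 7, 0], [], [8, 0]],
#   attention_masks [[1, 1, 1, 1], [], [1, 1]]
```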
As I was saying in the Propulsion meeting, I am not sure this is the best approach, but I am submitting this pull request in compliance with the requested implementation. I think there is merit to instead attaching the tokenizer's eos_token to the end of the strings before grouping and tokenization occur (for example, around the detokenization step).
For instance, right now I am manually appending a 1 to "attention_masks" for each document. It would be better for that field to be set by the tokenizer rather than by our code; handling it manually could lead to issues down the road. If you just modify the string, the same downstream processes apply unchanged, and whatever belongs in "attention_masks" is set by the tokenizer.
Then again, maybe this is no big deal: we may always want that field to be all 1's, or the update logic may be simple enough to live in group().
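For comparison, a sketch of the string-based alternative: append the eos string to each non-empty raw document before tokenization, so the tokenizer itself produces the attention mask. The whitespace tokenizer here is purely illustrative (a real tokenizer would expose eos_token and build the mask the same way); the names `toy_tokenize`, `VOCAB`, and the field layout are assumptions, not project code:

```python
EOS_TOKEN = "<|endoftext|>"
VOCAB = {EOS_TOKEN: 0}

def toy_tokenize(text):
    """Illustrative whitespace tokenizer: unseen words get fresh ids,
    and the attention mask is produced alongside the ids."""
    ids = [VOCAB.setdefault(word, len(VOCAB)) for word in text.split()]
    return {"input_ids": ids, "attention_masks": [1] * len(ids)}

docs = ["hello world", "", "foo"]
# Append the eos string only to non-empty documents, then tokenize;
# no manual mask bookkeeping is needed downstream.
tokenized = [toy_tokenize(d + " " + EOS_TOKEN if d else d) for d in docs]
```

With this approach the mask stays consistent with the ids by construction, which is the main appeal over patching "attention_masks" by hand in group().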