Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add <|endoftext|> to end of documents #98

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open

Add <|endoftext|> to end of documents #98

wants to merge 13 commits into from

Conversation

J38
Copy link
Contributor

@J38 J38 commented Sep 25, 2021

I modified the grouping function to add the eos_token_id from the given tokenizer to the end of each non-empty document, and to appropriately update the "attention_masks" field of each document with an extra 1.

There is also a data_preprocessing test that checks that the first two batches of 1000 docs from wikitext2 train processed by group() match expected results, and some other basic tests of correctness.

As I was saying in the Propulsion meeting, I am not sure this is the best approach to this, but I am submitting this pull request in compliance with the requested implementation. I think there is merit to actually just attaching the tokenizer's eos_token to the end of the strings before the grouping and tokenization occur (such as when the detokenization step happens).

For instance, right now I am just adding in 1 to the "attention_masks" for each document. It seems like it would be better if how that is set is handled properly by the tokenizer rather than manually and this could lead to issues down the road. If you just modify the string, then the same downstream processes will be applied and whatever is supposed to go in "attention_masks" would be set by the tokenizer rather than our code.

Then again maybe this is no big deal and we always want that to be all 1's or the logic to update properly won't be a big deal and can go in group().

@J38 J38 requested a review from siddk October 12, 2021 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant