Add <|endoftext|> to end of documents #98
I modified the grouping function to add the eos_token_id from the given tokenizer to the end of each non-empty document, and to appropriately update the "attention_masks" field of each document with an extra 1.
I also added a data_preprocessing test that checks that the first two 1000-document batches of the wikitext2 train split, processed by group(), match expected results, along with a few other basic correctness tests.
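A minimal sketch of the change described above. The field names ("input_ids", "attention_masks"), the batch layout, and the helper name `append_eos` are assumptions based on this description, not the actual repository code:

```python
def append_eos(batch, eos_token_id):
    """Append eos_token_id to every non-empty tokenized document and
    mirror the change with an extra 1 in its attention mask.
    (Field names are assumed; adapt to the real group() output.)"""
    for ids, mask in zip(batch["input_ids"], batch["attention_masks"]):
        if ids:  # leave empty documents untouched
            ids.append(eos_token_id)
            mask.append(1)  # manual mask update, as described above
    return batch

# Toy batch: two non-empty documents and one empty one.
batch = {
    "input_ids": [[5, 6, 7], [], [8]],
    "attention_masks": [[1, 1, 1], [], [1]],
}
append_eos(batch, eos_token_id=0)
# → input_ids [[5, 6, 7, 0], [], [8, 0]],
#   attention_masks [[1, 1, 1, 1], [], [1, 1]]
```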
As I was saying in the Propulsion meeting, I am not sure this is the best approach, but I am submitting this pull request in compliance with the requested implementation. I think there is merit to instead attaching the tokenizer's eos_token to the end of the strings before grouping and tokenization occur (for example, around the detokenization step).
For instance, right now I am manually appending a 1 to "attention_masks" for each document. It would be better for that field to be set by the tokenizer rather than by our code; handling it manually could lead to issues down the road. If you just modify the string, the same downstream processes apply unchanged, and whatever belongs in "attention_masks" is set by the tokenizer.
Then again, maybe this is no big deal: we may always want that field to be all 1's, or the update logic may be simple enough to live in group().
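For comparison, a sketch of the string-based alternative: append the eos string to each non-empty raw document before tokenization, so the tokenizer itself produces the attention mask. The whitespace tokenizer here is purely illustrative (a real tokenizer would expose eos_token and build the mask the same way); the names `toy_tokenize`, `VOCAB`, and the field layout are assumptions, not project code:

```python
EOS_TOKEN = "<|endoftext|>"
VOCAB = {EOS_TOKEN: 0}

def toy_tokenize(text):
    """Illustrative whitespace tokenizer: unseen words get fresh ids,
    and the attention mask is produced alongside the ids."""
    ids = [VOCAB.setdefault(word, len(VOCAB)) for word in text.split()]
    return {"input_ids": ids, "attention_masks": [1] * len(ids)}

docs = ["hello world", "", "foo"]
# Append the eos string only to non-empty documents, then tokenize;
# no manual mask bookkeeping is needed downstream.
tokenized = [toy_tokenize(d + " " + EOS_TOKEN if d else d) for d in docs]
```

With this approach the mask stays consistent with the ids by construction, which is the main appeal over patching "attention_masks" by hand in group().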