fix: refactor set special tokens function and add unit tests. #475
Conversation
Thanks for making a pull request! 😃
Refactored the set-special-tokens code into its own function in the `tuning.utils.tokenizer_data_utils` file, imported the new function into `sft_trainer`, and called it there to set the special tokens. Signed-off-by: Luka Dojcinovic <[email protected]>
Added unit tests of `set_special_tokens_dict()` for `LlamaTokenizerFast`, `GPT2TokenizerFast`, and `GPTNeoXTokenizerFast`. Signed-off-by: Luka Dojcinovic <[email protected]>
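For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of what the refactored helper could look like. The `DEFAULT_*` constants, the exact signature, and the branching are assumptions based on the PR description, not the PR's actual code:

```python
# Hypothetical sketch of the refactored helper in
# tuning/utils/tokenizer_data_utils.py; names and defaults are assumed.
from transformers import (
    GPT2TokenizerFast,
    GPTNeoXTokenizerFast,
    LlamaTokenizer,
    LlamaTokenizerFast,
)

DEFAULT_PAD_TOKEN = "<PAD>"  # assumed Alpaca-style defaults
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"


def set_special_tokens_dict(tokenizer):
    """Build the special-tokens dict that is later passed to
    tokenizer_and_embedding_resize(). Sketch only."""
    special_tokens_dict = {}
    if isinstance(tokenizer, (LlamaTokenizer, LlamaTokenizerFast)):
        # Llama tokenizers get all four tokens added up front.
        special_tokens_dict.update(
            bos_token="<s>", eos_token="</s>", unk_token="<unk>", pad_token="<pad>"
        )
    elif isinstance(tokenizer, (GPT2TokenizerFast, GPTNeoXTokenizerFast)):
        # These classes only need a pad token added.
        special_tokens_dict["pad_token"] = "<pad>"
    # EOS == PAD overlap: only treat it as a clash when both are set,
    # per the conditional change in this PR.
    if tokenizer.eos_token is not None and tokenizer.eos_token == tokenizer.pad_token:
        tokenizer.pad_token = DEFAULT_PAD_TOKEN
        special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
    # Fall back to defaults for any token the tokenizer itself lacks.
    if tokenizer.pad_token is None:
        special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
    if tokenizer.eos_token is None:
        special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
    if tokenizer.bos_token is None:
        special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
    if tokenizer.unk_token is None:
        special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN
    return special_tokens_dict
```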
Minor fix to remove print statements Signed-off-by: Luka Dojcinovic <[email protected]>
Also changed the conditional for matching EOS and PAD tokens so it does not fire when both are None. Signed-off-by: Luka Dojcinovic <[email protected]>
Changes with commit 781ce58: I have added a new unit test for when a tokenizer has no special tokens. I figured it was easier to test a tokenizer that is missing all the tokens rather than writing four separate unit tests for missing EOS, BOS, PAD, etc.; I can split it into four unit tests if you feel that is a better idea. I also changed the conditional for the case of EOS = PAD token, so that it no longer triggers when both tokens are None. Please let me know your thoughts on this. Thank you
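To illustrate the conditional change, a sketch (variable names are assumed, not taken from the PR):

```python
# Before, a bare `tokenizer.eos_token == tokenizer.pad_token` check
# also fired when both tokens were None, since None == None.
# The revised guard only treats EOS == PAD as a clash when both are set:
if tokenizer.eos_token is not None and tokenizer.eos_token == tokenizer.pad_token:
    tokenizer.pad_token = "<PAD>"
    special_tokens_dict["pad_token"] = "<PAD>"
```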
Added three unit tests for the tokenizer resizing function. Signed-off-by: Luka Dojcinovic <[email protected]>
Changes with commit ad3c0bf: Added three unit tests for the `tokenizer_and_embedding_resize()` function.
Unit Test 1: Tests the resizing when the special tokens dict contains a PAD token, which means the tokenizer is missing one special token.
Unit Test 2: Tests the resizing when the special tokens dict contains PAD, EOS, BOS, and UNK tokens, which means the tokenizer is missing four special tokens.
Unit Test 3: Tests the resizing when the special tokens dict contains PAD, EOS, BOS, and UNK tokens, which means the tokenizer is missing four special tokens.
I'm not sure if these are the tests we would like; please let me know your thoughts and which tests I've missed. Thank you
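As an illustration of Unit Test 1, a hedged pytest sketch; the argument names for `tokenizer_and_embedding_resize()` and the tiny checkpoint are assumptions, not necessarily what the PR uses:

```python
# Hedged sketch of Unit Test 1; checkpoint and argument names assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

from tuning.utils.tokenizer_data_utils import tokenizer_and_embedding_resize

TINY_MODEL = "Maykeye/TinyLLama-v0"  # assumed small test checkpoint


def test_resize_with_pad_token_only():
    tokenizer = AutoTokenizer.from_pretrained(TINY_MODEL)
    model = AutoModelForCausalLM.from_pretrained(TINY_MODEL)
    vocab_before = len(tokenizer)
    tokenizer_and_embedding_resize(
        special_tokens_dict={"pad_token": "<PAD>"},
        tokenizer=tokenizer,
        model=model,
    )
    # One new special token added; embeddings grown to cover it.
    assert len(tokenizer) == vocab_before + 1
    assert model.get_input_embeddings().weight.shape[0] >= len(tokenizer)
```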
Description of the change
Opening a draft PR to collect everyone's thoughts on the unit tests so far.
**Missing PAD token, `LlamaTokenizerFast`:** The first test uses a `LlamaTokenizerFast` tokenizer. This tokenizer is only missing a PAD token; however, because it is a Llama tokenizer, the function automatically adds the BOS, EOS, UNK, and PAD tokens to the special tokens dict. The `<pad>` entry is then replaced with a `<PAD>` token, because the Llama tokenizer does not have a pad token specified. A sketch of this case follows.
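Hypothetical test body for the Llama case, assuming the `set_special_tokens_dict()` sketch above and the `hf-internal-testing/llama-tokenizer` fixture:

```python
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_set_special_tokens_dict_llama():
    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    # All four tokens are added for Llama, and PAD falls back to <PAD>
    # because the base tokenizer defines no pad token.
    assert set(special_tokens_dict) == {"bos_token", "eos_token", "unk_token", "pad_token"}
    assert special_tokens_dict["pad_token"] == "<PAD>"
```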
**EOS = PAD, `GPT2TokenizerFast`:** The second test uses a `GPT2TokenizerFast` tokenizer. This covers the case where the EOS token equals the PAD token, both being `<|endoftext|>`. The pad token in the tokenizer is set to `<PAD>`, and `"pad_token": "<PAD>"` is also added to the special tokens dict, as sketched below.
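A sketch of the EOS = PAD case. Stock `gpt2` ships without a pad token, so the aliasing is set up explicitly here; the PR's test may arrange it differently:

```python
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_set_special_tokens_dict_eos_equals_pad():
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # both "<|endoftext|>"
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    # The clash is resolved by giving PAD its own <PAD> token.
    assert tokenizer.pad_token == "<PAD>"
    assert special_tokens_dict["pad_token"] == "<PAD>"
```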
**Missing PAD token, `GPTNeoXTokenizerFast`:** The third test uses a `GPTNeoXTokenizerFast` tokenizer. This is another class that the function is hard-coded to handle by adding just a pad token to the special tokens dict. However, the tokenizer itself is also missing a pad token, so the function then replaces the `<pad>` entry with the default `<PAD>` token.
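A sketch of the GPT-NeoX case; the fixture checkpoint is an assumption:

```python
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_set_special_tokens_dict_gpt_neox():
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    # Only PAD is added for GPT-NeoX, and it falls back to <PAD>
    # because the base tokenizer has no pad token.
    assert set(special_tokens_dict) == {"pad_token"}
    assert special_tokens_dict["pad_token"] == "<PAD>"
```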
**Missing all tokens:** Added in 781ce58. This test uses the IBM Granite tokenizer with all special tokens removed. The result is that the special tokens dict contains the PAD, EOS, BOS, and UNK tokens, as in the sketch below.
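A sketch of the all-tokens-missing case; the Granite checkpoint name is an assumption, and any tokenizer would work once its special tokens are cleared:

```python
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_set_special_tokens_dict_no_special_tokens():
    tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-2b-base")
    tokenizer.bos_token = None
    tokenizer.eos_token = None
    tokenizer.unk_token = None
    tokenizer.pad_token = None
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    # With everything missing, the helper supplies all four defaults.
    assert set(special_tokens_dict) == {"bos_token", "eos_token", "unk_token", "pad_token"}
```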
Related issue number
Related to Issue #1515
How to verify the PR
You can run:
`tox -e py -- tests/utils/test_tokenizer_data_utils.py`
Was the PR tested