fix: refactor set special tokens function and add unit tests. #475
Conversation
Thanks for making a pull request! 😃
Refactored the set-special-tokens code into its own function in the `tuning.utils.tokenizer_data_utils` file, imported the new function into `sft_trainer`, and called it there to set the special tokens. Signed-off-by: Luka Dojcinovic <[email protected]>
Added unit tests of `set_special_tokens_dict()` for `LlamaTokenizerFast`, `GPT2TokenizerFast`, and `GPTNeoXTokenizerFast`. Signed-off-by: Luka Dojcinovic <[email protected]>
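For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of what the refactored helper could look like. The `DEFAULT_*` constants, the exact signature, and the branching are assumptions based on the PR description, not the PR's actual code:

```python
# Hypothetical sketch of the refactored helper in
# tuning/utils/tokenizer_data_utils.py; names and defaults are assumed.
from transformers import (
    GPT2TokenizerFast,
    GPTNeoXTokenizerFast,
    LlamaTokenizer,
    LlamaTokenizerFast,
)

DEFAULT_PAD_TOKEN = "<PAD>"  # assumed Alpaca-style defaults
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"


def set_special_tokens_dict(tokenizer):
    """Build the special-tokens dict that is later passed to
    tokenizer_and_embedding_resize(). Sketch only."""
    special_tokens_dict = {}
    if isinstance(tokenizer, (LlamaTokenizer, LlamaTokenizerFast)):
        # Llama tokenizers get all four tokens added up front.
        special_tokens_dict.update(
            bos_token="<s>", eos_token="</s>", unk_token="<unk>", pad_token="<pad>"
        )
    elif isinstance(tokenizer, (GPT2TokenizerFast, GPTNeoXTokenizerFast)):
        # These classes only need a pad token added.
        special_tokens_dict["pad_token"] = "<pad>"
    # EOS == PAD overlap: only treat it as a clash when both are set,
    # per the conditional change in this PR.
    if tokenizer.eos_token is not None and tokenizer.eos_token == tokenizer.pad_token:
        tokenizer.pad_token = DEFAULT_PAD_TOKEN
        special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
    # Fall back to defaults for any token the tokenizer itself lacks.
    if tokenizer.pad_token is None:
        special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
    if tokenizer.eos_token is None:
        special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
    if tokenizer.bos_token is None:
        special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
    if tokenizer.unk_token is None:
        special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN
    return special_tokens_dict
```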
Minor fix to remove print statements Signed-off-by: Luka Dojcinovic <[email protected]>
Also changed the conditional for matching EOS and PAD tokens so it does not fire when both are None. Signed-off-by: Luka Dojcinovic <[email protected]>
Changes with commit 781ce58: I have added a new unit test for when a tokenizer has no special tokens. I figured it was easier to test a tokenizer that is missing all the tokens rather than writing four separate unit tests for missing EOS, BOS, PAD, etc.; I can split it into four unit tests if you feel that is a better idea. I also changed the conditional for the case of EOS = PAD token, so that it no longer triggers when both tokens are None. Please let me know your thoughts on this. Thank you
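To illustrate the conditional change, a sketch (variable names are assumed, not taken from the PR):

```python
# Before, a bare `tokenizer.eos_token == tokenizer.pad_token` check
# also fired when both tokens were None, since None == None.
# The revised guard only treats EOS == PAD as a clash when both are set:
if tokenizer.eos_token is not None and tokenizer.eos_token == tokenizer.pad_token:
    tokenizer.pad_token = "<PAD>"
    special_tokens_dict["pad_token"] = "<PAD>"
```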
Added three unit tests for the tokenizer resizing function. Signed-off-by: Luka Dojcinovic <[email protected]>
Changes with commit ad3c0bf: Added three unit tests for the `tokenizer_and_embedding_resize()` function.
Unit Test 1: Tests the resizing when the special tokens dict contains a PAD token, which means the tokenizer is missing one special token.
Unit Test 2: Tests the resizing when the special tokens dict contains PAD, EOS, BOS, and UNK tokens, which means the tokenizer is missing four special tokens.
Unit Test 3: Tests the resizing when the special tokens dict contains PAD, EOS, BOS, and UNK tokens, which means the tokenizer is missing four special tokens.
I'm not sure if these are the tests we would like; please let me know your thoughts and which tests I've missed. Thank you
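As an illustration of Unit Test 1, a hedged pytest sketch; the argument names for `tokenizer_and_embedding_resize()` and the tiny checkpoint are assumptions, not necessarily what the PR uses:

```python
# Hedged sketch of Unit Test 1; checkpoint and argument names assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

from tuning.utils.tokenizer_data_utils import tokenizer_and_embedding_resize

TINY_MODEL = "Maykeye/TinyLLama-v0"  # assumed small test checkpoint


def test_resize_with_pad_token_only():
    tokenizer = AutoTokenizer.from_pretrained(TINY_MODEL)
    model = AutoModelForCausalLM.from_pretrained(TINY_MODEL)
    vocab_before = len(tokenizer)
    tokenizer_and_embedding_resize(
        special_tokens_dict={"pad_token": "<PAD>"},
        tokenizer=tokenizer,
        model=model,
    )
    # One new special token added; embeddings grown to cover it.
    assert len(tokenizer) == vocab_before + 1
    assert model.get_input_embeddings().weight.shape[0] >= len(tokenizer)
```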
Description of the change
Opening a draft PR to collect everyone's thoughts on the unit tests so far.
**Missing PAD token, `LlamaTokenizerFast`:** The first test uses a `LlamaTokenizerFast` tokenizer. This tokenizer is only missing a PAD token; however, because it is a Llama tokenizer, the function automatically adds the BOS, EOS, UNK, and PAD tokens to the special tokens dict. The `<pad>` entry is then replaced with a `<PAD>` token, because the Llama tokenizer does not have a pad token specified. A sketch of this case follows.
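Hypothetical test body for the Llama case, assuming the `set_special_tokens_dict()` sketch above and the `hf-internal-testing/llama-tokenizer` fixture:

```python
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_set_special_tokens_dict_llama():
    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    # All four tokens are added for Llama, and PAD falls back to <PAD>
    # because the base tokenizer defines no pad token.
    assert set(special_tokens_dict) == {"bos_token", "eos_token", "unk_token", "pad_token"}
    assert special_tokens_dict["pad_token"] == "<PAD>"
```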
**EOS = PAD, `GPT2TokenizerFast`:** The second test uses a `GPT2TokenizerFast` tokenizer. This covers the case where the EOS token equals the PAD token, both being `<|endoftext|>`. The pad token in the tokenizer is set to `<PAD>`, and `"pad_token": "<PAD>"` is also added to the special tokens dict, as sketched below.
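A sketch of the EOS = PAD case. Stock `gpt2` ships without a pad token, so the aliasing is set up explicitly here; the PR's test may arrange it differently:

```python
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_set_special_tokens_dict_eos_equals_pad():
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # both "<|endoftext|>"
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    # The clash is resolved by giving PAD its own <PAD> token.
    assert tokenizer.pad_token == "<PAD>"
    assert special_tokens_dict["pad_token"] == "<PAD>"
```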
**Missing PAD token, `GPTNeoXTokenizerFast`:** The third test uses a `GPTNeoXTokenizerFast` tokenizer. This is another class that the function is hard-coded to handle by adding just a pad token to the special tokens dict. However, the tokenizer itself is also missing a pad token, so the function then replaces the `<pad>` entry with the default `<PAD>` token.
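A sketch of the GPT-NeoX case; the fixture checkpoint is an assumption:

```python
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_set_special_tokens_dict_gpt_neox():
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    # Only PAD is added for GPT-NeoX, and it falls back to <PAD>
    # because the base tokenizer has no pad token.
    assert set(special_tokens_dict) == {"pad_token"}
    assert special_tokens_dict["pad_token"] == "<PAD>"
```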
**Missing all tokens:** Added in 781ce58. This test uses the IBM Granite tokenizer with all special tokens removed. The result is that the special tokens dict contains the PAD, EOS, BOS, and UNK tokens, as in the sketch below.
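A sketch of the all-tokens-missing case; the Granite checkpoint name is an assumption, and any tokenizer would work once its special tokens are cleared:

```python
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_set_special_tokens_dict_no_special_tokens():
    tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-2b-base")
    tokenizer.bos_token = None
    tokenizer.eos_token = None
    tokenizer.unk_token = None
    tokenizer.pad_token = None
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    # With everything missing, the helper supplies all four defaults.
    assert set(special_tokens_dict) == {"bos_token", "eos_token", "unk_token", "pad_token"}
```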
Related issue number
Related to Issue #1515
How to verify the PR
You can run:
`tox -e py -- tests/utils/test_tokenizer_data_utils.py`
Was the PR tested