Eval modalities2 #306


Open · wants to merge 17 commits into main

Conversation

@BlueCrescent (Collaborator) commented on Feb 21, 2025:

What does this PR do?

This PR adds HFTokenizerAdapter for wrapping the SentencePiece-based tokenizers used in Modalities so that they can be used with Hugging Face transformers.
Additionally, it includes a fix in HFModelAdapter addressing errors caused by internal changes in transformers.

General Changes

  • The HFTokenizerAdapter class was added in /models/huggingfaceadapters/hf_adapter.py to wrap the Modalities SentencePiece-based tokenizers so that they can be loaded via the HF AutoTokenizer.from_pretrained() (see the usage sketch below).
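
As a rough illustration, loading such a wrapped tokenizer could look like the sketch below. The checkpoint path is a placeholder, and depending on how the adapter is registered, trust_remote_code=True may be required:

from transformers import AutoTokenizer

# Placeholder path; an actual converted Modalities checkpoint directory is assumed.
tokenizer = AutoTokenizer.from_pretrained("path/to/modalities_checkpoint", trust_remote_code=True)
ids = tokenizer("Hey")["input_ids"]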

Breaking Changes

  • None

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

Comment on lines 363 to 365
tokens = self.sp_model.tokenizer.encode(self.unk_token + text, out_type=str)
# 2. Remove self.unk_token from ['<','unk','>', '▁Hey']
return tokens[self.unk_token_length :] if len(tokens) >= self.unk_token_length else tokens
@BlueCrescent (Collaborator, PR author) commented:

From Klaudia: The tokenization logic manually adds <unk> before encoding and removes it afterward, assuming consistent tokenization. If tokenization behavior changes (e.g., due to different training conditions), this approach may break.

A collaborator replied:

This is required because the Llama 2 tokenizer removed the add_dummy_prefix option, which automatically adds a leading whitespace to every input, even when it is unnecessary, and which could interfere with tokenization. This new method introduced with Llama 2, prepending <unk>, basically allows us to control when spaces should be preserved: a leading space is preserved only if it is originally present in the input.
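To make the trick concrete, here is a small illustrative sketch (sp stands for an assumed SentencePiece tokenizer with the dummy-prefix behavior; the exact pieces are illustrative):

sp.encode("Hey", out_type=str)        # ['▁Hey']: a dummy prefix '▁' is forced onto the text
sp.encode("<unk>Hey", out_type=str)   # ['<unk>', 'Hey']: the dummy prefix attaches to '<unk>' instead
# After dropping the first unk_token_length tokens, only ['Hey'] remains,
# so a leading '▁' survives only when the input actually starts with a space.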

if tokenization behavior changes

Did you mean if the tokenizer is trained to interpret <unk> differently? The function unk_token_length dynamically calculates how many tokens represent the <unk> token when it is encoded. If the <unk> token is treated as a single token (e.g., ['<unk>']), the length will be 1. If it is split into multiple tokens (e.g., ['<', 'unk', '>']), the length will be 3.
By using this dynamically calculated length, the tokenizer ensures that it removes the correct number of tokens, no matter how the <unk> token is represented internally by the tokenizer.

# Case 1: <unk> is encoded as a single piece
tokens = ['<unk>', '▁Hey']
unk_token_length = 1
# After removing the `<unk>` token (with length 1)
tokens = tokens[1:]  # Result: ['▁Hey']

# Case 2: <unk> is split into multiple pieces
tokens = ['<', 'unk', '>', '▁Hey']
unk_token_length = 3
# After removing the `<unk>` token (with length 3)
tokens = tokens[3:]  # Result: ['▁Hey']
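
For reference, a minimal sketch of how such a dynamic length could be computed (the attribute names follow the snippet above and are assumptions here, not the actual adapter code):

@property
def unk_token_length(self) -> int:
    # Count how many pieces the underlying SentencePiece model uses to encode
    # '<unk>' itself (1 for ['<unk>'], 3 for ['<', 'unk', '>']).
    return len(self.sp_model.tokenizer.encode(str(self.unk_token), out_type=str))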

…merging) and reduced danger of using mutable default argument.
@ajude2s requested a review from fromm-m on February 26, 2025, 20:10