Eval modalities2 #306
base: main
Conversation
```python
# 1. Encode the concatenation self.unk_token + text, e.g. '<unk> Hey' -> ['<', 'unk', '>', '▁Hey']
tokens = self.sp_model.tokenizer.encode(self.unk_token + text, out_type=str)
# 2. Remove self.unk_token from ['<', 'unk', '>', '▁Hey']
return tokens[self.unk_token_length :] if len(tokens) >= self.unk_token_length else tokens
```
From Klaudia: The tokenization logic manually adds `self.unk_token` before encoding and removes it afterward, assuming consistent tokenization. If tokenization behavior changes (e.g., due to different training conditions), this approach may break.
This is required since the Llama 2 tokenizer has removed the `add_dummy_prefix` option, which automatically adds a leading whitespace to every input, even when it's unnecessary, and could therefore interfere with tokenization. The new method introduced in Llama 2 of prepending `<unk>` basically allows us to control when spaces should be preserved: a leading space is kept only if it is originally present in the input.
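To make this concrete, here is a minimal, self-contained sketch of the trick (the model path `tokenizer.model` is an assumption, and the exact pieces in the output depend on the model's vocabulary):

```python
import sentencepiece as spm

# Assumption: a trained sentencepiece model is available at this path.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

UNK = "<unk>"
# How many pieces the unk token occupies once encoded.
unk_len = len(sp.encode(UNK, out_type=str))

def tokenize(text: str) -> list[str]:
    # Prepend <unk>, encode, then strip the pieces that belong to <unk>.
    pieces = sp.encode(UNK + text, out_type=str)
    return pieces[unk_len:] if len(pieces) >= unk_len else pieces

# No artificial leading space is introduced ...
print(tokenize("Hey"))   # e.g. ['Hey'] — no '▁' prefix added
# ... but an original leading space survives as the '▁' marker.
print(tokenize(" Hey"))  # e.g. ['▁Hey']
```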
> if tokenization behavior changes

Did you mean if the tokenizer is trained to interpret `<unk>` differently? The function `unk_token_length` dynamically calculates how many tokens `<unk>` is split into when it is encoded. If the `<unk>` token is treated as a single token (e.g., `['<unk>']`), the length will be 1. If it is split into multiple tokens (e.g., `['<', 'unk', '>']`), the length will be 3.
By using this dynamically calculated length, the tokenizer ensures that it removes the correct number of tokens, no matter how `<unk>` is treated internally by the tokenizer.
```python
# Case 1: `<unk>` is encoded as a single token.
tokens = ['<unk>', '▁Hey']
unk_token_length = 1
# After removing the `<unk>` token (with length 1):
tokens = tokens[1:]  # Result: ['▁Hey']

# Case 2: `<unk>` is split into multiple tokens.
tokens = ['<', 'unk', '>', '▁Hey']
unk_token_length = 3
# After removing the `<unk>` token (with length 3):
tokens = tokens[3:]  # Result: ['▁Hey']
```
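For reference, a minimal sketch of how such a dynamic length could be computed; the free-function form and parameter names are illustrative, not the PR's actual code:

```python
def unk_token_length(sp_tokenizer, unk_token: str = "<unk>") -> int:
    """Number of pieces `unk_token` occupies after encoding.

    `sp_tokenizer` is assumed to be a sentencepiece processor exposing
    `encode(text, out_type=str)`, as in the snippet above.
    """
    return len(sp_tokenizer.encode(unk_token, out_type=str))
```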
…merging) and reduced danger of using a mutable default argument.
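For context, the pitfall alluded to is Python's classic shared-mutable-default gotcha; the function names below are purely illustrative:

```python
# The default list is created once at function definition time
# and shared across all calls that rely on it.
def collect_bad(item, bucket=[]):
    bucket.append(item)
    return bucket

collect_bad(1)  # [1]
collect_bad(2)  # [1, 2]  <- state leaked from the previous call

# Safer idiom: default to None and create a fresh list per call.
def collect_good(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

collect_good(1)  # [1]
collect_good(2)  # [2]
```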
What does this PR do?
This PR adds `HFTokenizerAdapter` for wrapping the sentencepiece-based tokenizers used in Modalities so that they can be used with Hugging Face `transformers`.
Additionally, it includes a fix in `HFModelAdapter` addressing errors caused by internal changes in `transformers`.
General Changes
Breaking Changes
Checklist before submitting final PR
- [ ] Tests run through (`python tests/tests.py`)
- [ ] Changelog updated (`CHANGELOG_DEV.md`)