Rework handling of special tokens #45
Conversation
I agree, for
I think we are mostly good on this one. In the latest commits I:
@vince62s @l-k-11235 feedback welcome! Note: I removed the
Next step should probably be to dive back into #42, and automating the
I think our code requires a padding token
Indeed, such situations where no padding token is explicitly set are quite dubious. Not sure it would break, because in many cases we fall back to `DefaultTokens.PAD` or at the very least , or id 0, but that's relatively unclear.
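To make the concern concrete, here is a minimal sketch of the kind of fallback being described; the lookup order, helper name, and the `"<blank>"` value are assumptions for illustration, only `DefaultTokens.PAD` and the "id 0" last resort come from the comment above:

```python
# Illustrative only: how a pad id might be resolved when no pad_token is set.

class DefaultTokens:
    PAD = "<blank>"  # assumed value of the default padding token

def resolve_pad_id(vocab: dict, pad_token: str | None = None) -> int:
    """Return the padding id, falling back to DefaultTokens.PAD, then id 0."""
    if pad_token is not None and pad_token in vocab:
        return vocab[pad_token]          # explicitly configured pad token
    if DefaultTokens.PAD in vocab:
        return vocab[DefaultTokens.PAD]  # implicit fallback on the default token
    return 0                             # last resort: assume id 0 is padding
```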
Let's merge this to move forward.
This PR is an attempt at facilitating the configuration of special tokens, and at handling some model-specific cases.
Two main changes:
- `{bos,eos,unk,pad}_token` fields in `BaseVocabConfig`, which are then stored in a `"specials"` key in the vocab object;
- `optional_eos` in `PredictConfig` to handle cases where we might need several EOS tokens, e.g. Llama3 with `<|end_of_text|>` and `<|eot_id|>` (see the sketch after this list).
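For illustration, a rough sketch of how these two additions might fit together; the field names (`bos_token`, `eos_token`, `unk_token`, `pad_token`, `optional_eos`) come from this PR, but the config shapes, token values, and the stopping check below are assumptions rather than the actual implementation:

```python
# Hypothetical usage sketch (not the actual API).

# BaseVocabConfig-style fields: explicit special tokens, stored under a
# "specials" key in the vocab object.
vocab_config = {
    "bos_token": "<|begin_of_text|>",
    "eos_token": "<|end_of_text|>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
}

# PredictConfig-style field: extra tokens that should also terminate
# generation, e.g. Llama3 which ends turns with <|eot_id|>.
predict_config = {
    "optional_eos": ["<|eot_id|>"],
}

def should_stop(token: str) -> bool:
    """Stop decoding on the primary EOS or on any optional EOS token."""
    return token == vocab_config["eos_token"] or token in predict_config["optional_eos"]

assert should_stop("<|end_of_text|>")
assert should_stop("<|eot_id|>")
assert not should_stop("hello")
```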
Some open questions / TODOs:
- `default_specials` field: we should probably deprecate it in favor of the more flexible new fields;