-
Notifications
You must be signed in to change notification settings - Fork 823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to build a custom tokenizer on top of a exsiting Llama 3.2 tokenizer? #1644
Comments
Seems a bit related to huggingface/transformers#27583 |
@ArthurZucker , thank you for the reference. it really helped me for adding existing tokens. now I have left one problem reproducing llama 3.2 tokenizer (i hope). I checked Llama 3.2 tokenizer and it does not have
|
When you call |
I tried to do it using BPETrainer but then found that in SentencePieceBPETokenizer does not accept trainer object. my current tokenizer:
my tokenizer object:
llama 3 tokenizer
llama tokenizer object:
What I have tried: added below
and tried to pass to
got an error:
I could simply delete that token but I do not think that's a wise solution in my case. would appreciate any ideas. thank you. |
Yeah that is probably because you are using the wrapper around |
Yeah, the |
In general I think |
Hi,
I was trying to create a custom tokenizer for a different language which is not included in llama 3.2 tokenizer.
I could not find exactly what tokenizer I can use from hf which is exact alternative to Llama's tokenizer link, so that I will be able to train a new tokenizer.
Currently I am using following code to train a tokenizer, but final example does not match with the one Llama 3.2 has.
I would be nice if anyone could share their experience of adapting a Llama model to a new language.
The text was updated successfully, but these errors were encountered: