Tokenizer #2
I don't know much about text pre-processing or transformers (I studied them years ago), but I think OpenAI's tiktoken library is the way to go for tokenization.
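For reference, here is a minimal sketch of what using tiktoken looks like. It loads the GPT-2 byte-pair encoding, which has a fixed vocabulary of 50,257 tokens regardless of corpus size; the example string is just an illustration.

```python
import tiktoken  # pip install tiktoken

# Load the GPT-2 byte-pair encoding (fixed ~50k-token vocabulary).
enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("Tokenization splits text into subword units.")
print(ids)              # list of integer token ids
print(enc.decode(ids))  # round-trips back to the original string
print(enc.n_vocab)      # 50257 for the GPT-2 encoding
```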
I see, I am trying to study tokenization a bit more lately, thanks for the tiktoken tip!
I have moved on to the production side of deep learning for freelance projects, so I am relying on pre-trained models only. I know it's wrong to just build on what others have built without studying it, but it's a lot less stressful and frees up more time than trying to keep up with all the new developments in detail. @eduardoleao052
That's cool! I guess it's natural, after studying something from a theoretical standpoint, to want to move on to the practical side of things.
Yes. 🙂
Have you been able to get good results with the tokenization? I've been using a regex like yours to tokenize some texts for my decoder transformer, and the vocabulary size seems to blow up! I think it's because it's at the word level; maybe there's no escaping a larger vocab size. See the sketch below for what I mean.
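To illustrate the blow-up: with word-level tokenization, every distinct surface form gets its own vocabulary entry, so the vocab keeps growing with the corpus, while a subword scheme like BPE reuses shared pieces and keeps the vocabulary fixed. This is a minimal sketch; the regex here is a hypothetical stand-in, since the actual regex from this thread isn't shown.

```python
import re
import tiktoken

text = "The tokenizer tokenizes; tokenization yields tokens."

# Word-level: "tokenizer", "tokenizes", "tokenization", and "tokens"
# are all separate vocabulary entries, so vocab size scales with the
# number of distinct words in the corpus.
word_pattern = re.compile(r"\w+|[^\w\s]")  # hypothetical stand-in regex
word_vocab = set(word_pattern.findall(text))
print(len(word_vocab), sorted(word_vocab))

# Subword (BPE): shared pieces like "token" are reused across words,
# and the vocabulary is fixed up front (50257 ids for GPT-2) no matter
# how large the corpus gets.
enc = tiktoken.get_encoding("gpt2")
ids = enc.encode(text)
print(len(ids), [enc.decode([i]) for i in ids])
```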