
Tokenizer #2

Open · eduardoleao052 opened this issue Dec 11, 2023 · 5 comments
@eduardoleao052

Have you been able to get good results with the tokenization? I've been using a regex like yours to tokenize some texts for my decoder transformer, and the vocabulary size seems to blow up! I think it's because the tokenization is at the word level; maybe there's no escaping a larger vocabulary with word-level tokens.
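A quick way to see the blow-up is that a word-level regex treats every inflected surface form ("cat"/"cats", "mat"/"mats") as a separate vocabulary entry, while a character-level vocabulary stays bounded by the alphabet. A minimal sketch (the regex and toy corpus are just illustrative, not the one from this repo):

```python
import re
from collections import Counter

# Hypothetical toy corpus; any real corpus shows the same effect at scale.
corpus = [
    "the cat sat on the mat",
    "the cats sat on the mats",
    "a dog runs and the dogs ran",
]

# Word-level: every distinct surface form becomes its own entry,
# so the vocabulary keeps growing with the corpus.
word_vocab = set()
for text in corpus:
    word_vocab.update(re.findall(r"\w+|[^\w\s]", text))

# Character-level: the vocabulary is bounded by the character set.
char_vocab = set("".join(corpus))

print(sorted(word_vocab))
print(len(word_vocab), "word types vs", len(char_vocab), "characters")
```

Note that "cat" and "cats" (and "mat"/"mats", "dog"/"dogs") each occupy a separate slot in `word_vocab`, which is exactly why subword schemes exist.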

@RahulBhalley

I don't know much about text pre-processing or transformers (I studied them years ago), but I think OpenAI's tiktoken library is the way to go for tokenisation.
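For context, tiktoken's encodings are built on byte-pair encoding (BPE), which keeps the vocabulary at a chosen size by merging frequent adjacent symbol pairs instead of storing whole words. A toy sketch of the BPE training loop (this is an illustration of the idea, not tiktoken's actual code; the corpus and merge count are made up):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Start from characters; frequencies come from a hypothetical corpus.
words = {tuple("cats"): 4, tuple("cat"): 3, tuple("mats"): 2}
for _ in range(3):  # learn 3 merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
print(words)
```

After three merges, frequent words like "cat" and "cats" collapse into single tokens while the rarer "mats" stays split into reusable subwords, so the vocabulary size is controlled by the number of merges rather than by the number of distinct words.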

@eduardoleao052
Author

I see. I've been trying to study tokenization a bit more lately; thanks for the tiktoken tip!
If you don't mind me asking, what have you moved on to in terms of interests after learning about transformers and such?

@RahulBhalley

I have moved on to the production side of deep learning for freelance projects, so I'm relying on pre-trained models only. I know it's wrong not to study and to just build on what others have built, but it's a lot less stressful and frees up more time than trying to keep up with all the new stuff in detail. @eduardoleao052

@eduardoleao052
Author

That's cool! I guess it's natural, after studying something from a theoretical standpoint, to want to move on to the practical side of things.

@RahulBhalley

Yes. 🙂
