Tokenizer #2
I don't know much about text pre-processing or transformers (I studied them years ago), but I think OpenAI's tiktoken library is the way to go for tokenization.
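For reference, here is a minimal sketch of what using tiktoken looks like. It loads the GPT-2 byte-pair encoding, which has a fixed vocabulary of 50,257 tokens regardless of corpus size; the example string is just an illustration.

```python
import tiktoken  # pip install tiktoken

# Load the GPT-2 byte-pair encoding (fixed ~50k-token vocabulary).
enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("Tokenization splits text into subword units.")
print(ids)              # list of integer token ids
print(enc.decode(ids))  # round-trips back to the original string
print(enc.n_vocab)      # 50257 for the GPT-2 encoding
```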
I see, I am trying to study tokenization a bit more lately, thanks for the tiktoken tip!
I have moved on to the production side of deep learning for freelance projects, so I am relying on pre-trained models only. I know it's wrong to just build on what others have built without studying it, but it's a lot less stressful and frees up more time than trying to keep up with all the new developments in detail. @eduardoleao052
That's cool! I guess it's natural, after studying something from a theoretical standpoint, to want to move on to the practical side of things.
Yes. 🙂
Have you been able to get good results with the tokenization? I've been using a regex like yours to tokenize some texts for my decoder transformer, and the vocabulary size seems to blow up! I think it's because it's at the word level; maybe there's no escaping a larger vocab size. See the sketch below for what I mean.
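To illustrate the blow-up: with word-level tokenization, every distinct surface form gets its own vocabulary entry, so the vocab keeps growing with the corpus, while a subword scheme like BPE reuses shared pieces and keeps the vocabulary fixed. This is a minimal sketch; the regex here is a hypothetical stand-in, since the actual regex from this thread isn't shown.

```python
import re
import tiktoken

text = "The tokenizer tokenizes; tokenization yields tokens."

# Word-level: "tokenizer", "tokenizes", "tokenization", and "tokens"
# are all separate vocabulary entries, so vocab size scales with the
# number of distinct words in the corpus.
word_pattern = re.compile(r"\w+|[^\w\s]")  # hypothetical stand-in regex
word_vocab = set(word_pattern.findall(text))
print(len(word_vocab), sorted(word_vocab))

# Subword (BPE): shared pieces like "token" are reused across words,
# and the vocabulary is fixed up front (50257 ids for GPT-2) no matter
# how large the corpus gets.
enc = tiktoken.get_encoding("gpt2")
ids = enc.encode(text)
print(len(ids), [enc.decode([i]) for i in ids])
```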