
Tokenization

Let's build a GPT Tokenizer!

This repository is a simple implementation of the Byte Pair Encoding (BPE) algorithm used for LLM tokenization.
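
At its core, BPE repeatedly finds the most frequent pair of adjacent tokens and merges it into a new token. Here is a minimal, self-contained sketch of that training loop (illustrative only, not the repository's exact code):

```python
from collections import Counter

def get_pair_counts(ids):
    # Count each adjacent pair of token ids.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the new token id.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # raw bytes are token ids 0..255
merges = {}
for new_id in range(256, 259):  # perform three merge steps
    counts = get_pair_counts(ids)
    pair = max(counts, key=counts.get)  # most frequent adjacent pair
    merges[pair] = new_id
    ids = merge(ids, pair, new_id)
print(ids, merges)
```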

tokenization.ipynb contains notes from Andrej Karpathy's tutorial. It covers the general BPE algorithm, GPT tokenization, Google's SentencePiece tokenizer (used by Llama 2 and Mistral), and papers including "Efficient Training of Language Models to Fill in the Middle" and "Language Models are Unsupervised Multitask Learners".

Visualizations for tokenizers can be found at tiktokenizer.vercel.app.

The implementations in base.py, basic.py, and reg.py follow the minbpe repository: base.py defines the generic tokenizer base class, basic.py implements the basic BPE tokenizer, and reg.py implements the regex-based tokenizer similar to what OpenAI's tiktoken uses for the GPT-2 and GPT-4 tokenizers.
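
A hypothetical usage sketch, assuming this repository follows minbpe's API of train(text, vocab_size), encode(text), and decode(ids); the module path, file name input.txt, and method names are assumptions and may differ:

```python
from basic import BasicTokenizer  # assumed module/class names, minbpe-style

tokenizer = BasicTokenizer()
training_text = open("input.txt", encoding="utf-8").read()  # any sizable text file
tokenizer.train(training_text, vocab_size=512)  # 256 byte tokens + 256 learned merges
ids = tokenizer.encode("hello world!")
print(ids)
assert tokenizer.decode(ids) == "hello world!"  # encoding round-trips losslessly
```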

The basic BPE tokenizer outputs the following result:

[Screenshot: basic BPE tokenizer output]

The GPT-2 tokenizer outputs the following result:

[Screenshot: GPT-2 tokenizer output]

The GPT-4 tokenizer outputs the following result:

[Screenshot: GPT-4 tokenizer output]
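
One way to sanity-check the GPT-2 and GPT-4 tokenizer outputs is to compare them against OpenAI's tiktoken library, which ships the reference gpt2 and cl100k_base encodings:

```python
import tiktoken  # pip install tiktoken

for name in ("gpt2", "cl100k_base"):  # GPT-2 and GPT-4 encodings
    enc = tiktoken.get_encoding(name)
    ids = enc.encode("hello world!!!")
    print(name, ids, enc.decode(ids))
```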
