Steal token visualisation code #11

Open · hauntsaninja opened this issue Feb 18, 2024 · 5 comments

Comments

@hauntsaninja commented Feb 18, 2024

Hello! I don't remember if I'd shown you https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , but consider stealing the token visualisation code from here in some form: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186
I've found it can be quite useful to show people the intermediate steps :-) There's similar stuff in there for training.
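
The core of it is just cycling ANSI background colours over the token byte strings. A minimal sketch of the idea (my own rewrite, not the exact _educational.py code; the colour indices are illustrative):

```python
# Cycle background colours so adjacent tokens are visually distinct.
def visualise_tokens(token_values: list[bytes]) -> None:
    backgrounds = [f"\u001b[48;5;{i}m" for i in (167, 179, 185, 77, 80, 68, 134)]
    reset = "\u001b[0m"
    out = ""
    for i, token in enumerate(token_values):
        out += backgrounds[i % len(backgrounds)]
        out += token.decode("utf-8", errors="replace")
    print(out + reset)

visualise_tokens([b"super", b"cal", b"if", b"rag", b"il", b"istic"])
```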

[Screenshot 2024-02-17 at 5:36:39 PM]
@karpathy (Owner) commented

Hey! Neat, I'll take a look :)

Btw I noticed that in this line:

https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L97

The way you're merging during inference is by checking the rank of the merged pair and merging if that rank exists. This is similar but not identical to checking the actual pair match, i.e. that the two children are what they should be, and then merging up the tree if so. You're checking equality of the parent node instead of equality of the child nodes.

It's not obvious that these two should be equivalent.
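
To make the distinction concrete, a toy illustration (made-up dictionaries, not either repo's actual contents):

```python
# tiktoken keeps token bytes -> rank; minbpe keeps (left, right) -> merged token.
ranks  = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4, b"bc": 5}
merges = {(b"a", b"b"): b"ab", (b"ab", b"c"): b"abc", (b"b", b"c"): b"bc"}

left, right = b"a", b"bc"

# Parent check (tiktoken style): is the concatenation a known token?
# Fires for ANY split whose bytes concatenate to b"abc".
print((left + right) in ranks)   # True

# Child check: was this exact pair a recorded merge?
# Only fires for the pair that actually produced b"abc" during training.
print((left, right) in merges)   # False: the recorded merge was (b"ab", b"c")
```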

@hauntsaninja (Author) commented

Yup, that is indeed a subtle thing about how tiktoken does things. I have a comment about this in the Rust code, but I should add something to the simple code too: https://github.com/openai/tiktoken/blob/1b9faf2779855124f05174adf1383e53689ed94b/src/lib.rs#L23

It's equivalent if your token index equals merge priority. You make the same inductive argument that this code makes: https://github.com/karpathy/minbpe/blob/master/minbpe/gpt4.py#L27

The sketch is something like: prove BPE_i(bytes) == BPE_i'(bytes), where BPE is the child-pair version, BPE' is the parent-rank version, and _i means running partial BPE up to merge priority i. The worry is something like T_4 := T_1 ++ T_2 == T_0 ++ T_3, where BPE_4' ends up merging T_0, T_3 while BPE_4 only merges T_1, T_2. But by the inductive step, BPE_2(T_4) == BPE_2'(T_4) == [T_1, T_2], so those bytes can't actually be represented as T_0, T_3 by the time we run BPE_4'.
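
Here's a quick self-contained way to see that play out (my own sketch, not tiktoken's or minbpe's code): train toy merges so token index equals merge priority, then check the parent-rank encoder against the child-pair encoder on random strings.

```python
import random

def merge(parts, left, right):
    # Greedily replace adjacent (left, right) occurrences with their concatenation.
    out, i = [], 0
    while i < len(parts):
        if i + 1 < len(parts) and parts[i] == left and parts[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(parts[i])
            i += 1
    return out

def train(data, num_merges):
    # Token index == merge priority: each new token's rank is just len(ranks).
    ranks = {bytes([i]): i for i in range(256)}
    merges = {}
    parts = [bytes([b]) for b in data]
    for _ in range(num_merges):
        counts = {}
        for pair in zip(parts, parts[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        left, right = max(counts, key=counts.get)
        ranks[left + right] = len(ranks)
        merges[(left, right)] = left + right
        parts = merge(parts, left, right)
    return ranks, merges

def encode_parent(ranks, data):
    # tiktoken-style: merge the pair whose *concatenation* has the lowest rank.
    parts = [bytes([b]) for b in data]
    while True:
        cands = [(ranks[parts[i] + parts[i + 1]], i) for i in range(len(parts) - 1)
                 if parts[i] + parts[i + 1] in ranks]
        if not cands:
            return parts
        _, i = min(cands)
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

def encode_children(ranks, merges, data):
    # minbpe-style: merge the pair that is a *recorded merge* of lowest priority.
    parts = [bytes([b]) for b in data]
    while True:
        cands = [(ranks[merges[(parts[i], parts[i + 1])]], i) for i in range(len(parts) - 1)
                 if (parts[i], parts[i + 1]) in merges]
        if not cands:
            return parts
        _, i = min(cands)
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

corpus = b"supercalifragilisticexpialidocious " * 10
ranks, merges = train(corpus, num_merges=30)
for _ in range(200):
    s = bytes(random.choice(corpus) for _ in range(15))
    assert encode_parent(ranks, s) == encode_children(ranks, merges, s)
print("parent-rank and child-pair merging agree")
```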

Like the comment says, it's possible to mess with things and break the equivalence. I think the most interesting way you'd get different results is if you adapted this to do stochastic BPE.

That said, there isn't a strong reason tiktoken does things this way — it was just a little simpler and there's some connection to stuff I was toying with when I first wrote it.

@asingh228 commented Feb 20, 2024

> Hello! I don't remember if I'd shown you https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , but consider stealing the token visualisation code from here in some form: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186 I've found it can be quite useful to show people the intermediate steps :-) There's similar stuff in there for training

Hey hauntsaninja,
I was able to make the change to visualize the steps. I added a verbose flag to turn the visualization on and off:
encoded_text = tokenizer.encode(text, verbose=True)
Output example: [screenshot]
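
For anyone curious, a hypothetical sketch of what such a verbose encode loop might look like (illustrative names, not my actual patch; `merges` maps (left, right) id pairs to the merged id, minbpe-style):

```python
def merge(ids: list[int], pair: tuple[int, int], idx: int) -> list[int]:
    # Replace every occurrence of `pair` in `ids` with the single id `idx`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def encode_verbose(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        # pick the adjacent pair with the lowest (earliest) recorded merge id
        pair = min(set(zip(ids, ids[1:])), key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no recorded merge applies any more
        ids = merge(ids, pair, merges[pair])
        print(f"merge {pair} -> {merges[pair]}: {ids}")  # the 'verbose' part
    return ids

# Two toy merges: (104, 105) -> 256 ("hi"), (256, 33) -> 257 ("hi!")
print(encode_verbose("hi! hi!", {(104, 105): 256, (256, 33): 257}))
```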

FYI: I'm not a professional developer (I have a CS degree and I'm learning AI/ML development via online lessons and ChatGPT; ChatGPT was used in modifying the code).

@asingh228 commented Feb 20, 2024

I would like your help in understanding two things though:

  1. Why does this code behave so differently from SimpleBytePairEncoding.from_tiktoken("cl100k_base").encode("supercalifragilistic"), which was mentioned earlier? Is it because the training was done on Taylor Swift's wiki?
  2. (removed - was not a valid question)

thank you

[Screenshot 2024-02-19 at 10:26:12 PM]

@hauntsaninja (Author) commented Feb 20, 2024

Hello, glad you're having fun learning!

The output of your code kind of looks like you're training on just "supercalifragilistic", not encoding it. You'll want to train on a slightly larger corpus ;-) Once you've done that, use the trained encoding to encode single words like "supercalifragilistic".

Not sure what you mean by "how many merges you decide to do" / where 3 merges comes from. BPE will merge until it can merge no more. (Unless you mean what size of vocabulary you target during training, in which case the answer is "whatever makes the language model we use this encoding in good", which can be a little tricky to determine.)
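
Concretely, something like this (a sketch from memory; double-check _educational.py for the exact signatures):

```python
from tiktoken._educational import SimpleBytePairEncoding

# GPT-2-style split pattern, as used in _educational.py
# (the \p{...} classes need the `regex` module, which tiktoken uses internally)
gpt2_pattern = (
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# Train on a real corpus, not on the word itself ("corpus.txt" is a stand-in).
with open("corpus.txt") as f:
    data = f.read()
enc = SimpleBytePairEncoding.train(data, vocab_size=600, pat_str=gpt2_pattern)

# Then encode single words with the *trained* encoding; it visualises the merges.
tokens = enc.encode("supercalifragilistic")
```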

There's a working version of the code in https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , feel free to explore it.
