Steal token visualisation code #11

Open · hauntsaninja opened this issue Feb 18, 2024 · 5 comments

Comments

@hauntsaninja commented Feb 18, 2024

Hello! I don't remember if I'd shown you https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , but consider stealing the token visualisation code from here in some form: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186
I've found it can be quite useful to show people the intermediate steps :-) There's similar stuff in there for training.
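
The core of it is just cycling ANSI background colours over the token byte strings. A minimal sketch of the idea (my own rewrite, not the exact _educational.py code; the colour indices are illustrative):

```python
# Cycle background colours so adjacent tokens are visually distinct.
def visualise_tokens(token_values: list[bytes]) -> None:
    backgrounds = [f"\u001b[48;5;{i}m" for i in (167, 179, 185, 77, 80, 68, 134)]
    reset = "\u001b[0m"
    out = ""
    for i, token in enumerate(token_values):
        out += backgrounds[i % len(backgrounds)]
        out += token.decode("utf-8", errors="replace")
    print(out + reset)

visualise_tokens([b"super", b"cal", b"if", b"rag", b"il", b"istic"])
```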

[Screenshot 2024-02-17 at 5:36:39 PM]
@karpathy (Owner) commented

Hey! Neat, I'll take a look :)

Btw I noticed that in this line:

https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L97

The way you're merging during inference is by checking the rank of the merged pair and merging if that rank exists. This is similar but not identical to checking the actual pair match, i.e. that the two children are what they should be, and then merging up the tree if so. You're checking equality of the parent node instead of equality of the child nodes.

It's not obvious that these two should be equivalent.
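
To make the distinction concrete, a toy illustration (made-up dictionaries, not either repo's actual contents):

```python
# tiktoken keeps token bytes -> rank; minbpe keeps (left, right) -> merged token.
ranks  = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4, b"bc": 5}
merges = {(b"a", b"b"): b"ab", (b"ab", b"c"): b"abc", (b"b", b"c"): b"bc"}

left, right = b"a", b"bc"

# Parent check (tiktoken style): is the concatenation a known token?
# Fires for ANY split whose bytes concatenate to b"abc".
print((left + right) in ranks)   # True

# Child check: was this exact pair a recorded merge?
# Only fires for the pair that actually produced b"abc" during training.
print((left, right) in merges)   # False: the recorded merge was (b"ab", b"c")
```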

@hauntsaninja (Author) commented

Yup, that is indeed a subtle thing about how tiktoken does things. I have a comment about this in the Rust code, but I should add something to the simple code too: https://github.com/openai/tiktoken/blob/1b9faf2779855124f05174adf1383e53689ed94b/src/lib.rs#L23

It's equivalent if your token index equals merge priority. You make the same inductive argument that this code makes: https://github.com/karpathy/minbpe/blob/master/minbpe/gpt4.py#L27

The sketch is something like: prove BPE_i(bytes) == BPE_i'(bytes), where BPE is the child-pair version, BPE' is the parent-rank version, and _i means running partial BPE up to merge priority i. The worry is something like T_4 := T_1 ++ T_2 == T_0 ++ T_3, where BPE_4' ends up merging T_0, T_3 while BPE_4 only merges T_1, T_2. But by the inductive step, BPE_2(T_4) == BPE_2'(T_4) == [T_1, T_2], so those bytes can't actually be represented as T_0, T_3 by the time we run BPE_4'.
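
Here's a quick self-contained way to see that play out (my own sketch, not tiktoken's or minbpe's code): train toy merges so token index equals merge priority, then check the parent-rank encoder against the child-pair encoder on random strings.

```python
import random

def merge(parts, left, right):
    # Greedily replace adjacent (left, right) occurrences with their concatenation.
    out, i = [], 0
    while i < len(parts):
        if i + 1 < len(parts) and parts[i] == left and parts[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(parts[i])
            i += 1
    return out

def train(data, num_merges):
    # Token index == merge priority: each new token's rank is just len(ranks).
    ranks = {bytes([i]): i for i in range(256)}
    merges = {}
    parts = [bytes([b]) for b in data]
    for _ in range(num_merges):
        counts = {}
        for pair in zip(parts, parts[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        left, right = max(counts, key=counts.get)
        ranks[left + right] = len(ranks)
        merges[(left, right)] = left + right
        parts = merge(parts, left, right)
    return ranks, merges

def encode_parent(ranks, data):
    # tiktoken-style: merge the pair whose *concatenation* has the lowest rank.
    parts = [bytes([b]) for b in data]
    while True:
        cands = [(ranks[parts[i] + parts[i + 1]], i) for i in range(len(parts) - 1)
                 if parts[i] + parts[i + 1] in ranks]
        if not cands:
            return parts
        _, i = min(cands)
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

def encode_children(ranks, merges, data):
    # minbpe-style: merge the pair that is a *recorded merge* of lowest priority.
    parts = [bytes([b]) for b in data]
    while True:
        cands = [(ranks[merges[(parts[i], parts[i + 1])]], i) for i in range(len(parts) - 1)
                 if (parts[i], parts[i + 1]) in merges]
        if not cands:
            return parts
        _, i = min(cands)
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

corpus = b"supercalifragilisticexpialidocious " * 10
ranks, merges = train(corpus, num_merges=30)
for _ in range(200):
    s = bytes(random.choice(corpus) for _ in range(15))
    assert encode_parent(ranks, s) == encode_children(ranks, merges, s)
print("parent-rank and child-pair merging agree")
```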

Like the comment says, it's possible to mess with things and break the equivalence. I think the most interesting way you'd get different results is if you adapted this to do stochastic BPE.

That said, there isn't a strong reason tiktoken does things this way — it was just a little simpler and there's some connection to stuff I was toying with when I first wrote it.

@asingh228 commented Feb 20, 2024

> Hello! I don't remember if I'd shown you https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , but consider stealing the token visualisation code from here in some form: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186 I've found it can be quite useful to show people the intermediate steps :-) There's similar stuff in there for training

Hey hauntsaninja,
I was able to make the change to visualize the steps. I added a verbose flag to turn the visualization on and off:
encoded_text = tokenizer.encode(text, verbose=True)
Output example: [screenshot]
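
For anyone curious, a hypothetical sketch of what such a verbose encode loop might look like (illustrative names, not my actual patch; `merges` maps (left, right) id pairs to the merged id, minbpe-style):

```python
def merge(ids: list[int], pair: tuple[int, int], idx: int) -> list[int]:
    # Replace every occurrence of `pair` in `ids` with the single id `idx`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def encode_verbose(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        # pick the adjacent pair with the lowest (earliest) recorded merge id
        pair = min(set(zip(ids, ids[1:])), key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no recorded merge applies any more
        ids = merge(ids, pair, merges[pair])
        print(f"merge {pair} -> {merges[pair]}: {ids}")  # the 'verbose' part
    return ids

# Two toy merges: (104, 105) -> 256 ("hi"), (256, 33) -> 257 ("hi!")
print(encode_verbose("hi! hi!", {(104, 105): 256, (256, 33): 257}))
```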

FYI: I'm not a professional developer (I have a CS degree and I'm learning AI/ML development via online lessons and ChatGPT; ChatGPT was used in modifying the code).

@asingh228 commented Feb 20, 2024

I would like your help in understanding two things though:

  1. Why does this code behave so differently from SimpleBytePairEncoding.from_tiktoken("cl100k_base").encode("supercalifragilistic"), which was mentioned earlier? Is it because the training was done on Taylor Swift's wiki?
  2. (removed - was not a valid question)

thank you

[Screenshot 2024-02-19 at 10:26:12 PM]

@hauntsaninja (Author) commented Feb 20, 2024

Hello, glad you're having fun learning!

The output of your code kind of looks like you're training on just "supercalifragilistic", not encoding it. You'll want to train on a slightly larger corpus ;-) Once you've done that, use the trained encoding to encode single words like "supercalifragilistic".

Not sure what you mean by "how many merges you decide to do" / where 3 merges comes from. BPE will merge until it can merge no more. (Unless you mean what size of vocabulary you target during training, in which case the answer is "whatever makes the language model we use this encoding in good", which can be a little tricky to determine.)
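
Concretely, something like this (a sketch from memory; double-check _educational.py for the exact signatures):

```python
from tiktoken._educational import SimpleBytePairEncoding

# GPT-2-style split pattern, as used in _educational.py
# (the \p{...} classes need the `regex` module, which tiktoken uses internally)
gpt2_pattern = (
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# Train on a real corpus, not on the word itself ("corpus.txt" is a stand-in).
with open("corpus.txt") as f:
    data = f.read()
enc = SimpleBytePairEncoding.train(data, vocab_size=600, pat_str=gpt2_pattern)

# Then encode single words with the *trained* encoding; it visualises the merges.
tokens = enc.encode("supercalifragilistic")
```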

There's a working version of the code in https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , feel free to explore it.
