Steal token visualisation code #11
Hey! Neat, I'll take a look :)

Btw I noticed that in this line: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L97

The way you're merging during inference is that you're checking the rank of the merged pair, and merging it if you find it. This is similar but not identical to checking an actual pair match, i.e. verifying that the two children are what they should be and then merging up the tree if so. You're checking equality of the parent node instead of equality of the child nodes. It's not obvious that these two should be equivalent.
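To make the distinction concrete, here is a rough sketch of the two checks (illustrative helper names and structure, not tiktoken's actual code):

```python
def find_merge_by_parent(parts: list[bytes], ranks: dict[bytes, int]):
    """tiktoken-style: look up the rank of the *concatenated* bytes (the parent)."""
    best = None
    for i in range(len(parts) - 1):
        rank = ranks.get(parts[i] + parts[i + 1])
        if rank is not None and (best is None or rank < best[1]):
            best = (i, rank)
    return best  # (index, rank) of the lowest-rank mergeable pair, or None


def find_merge_by_children(parts: list[bytes], merges: dict[tuple[bytes, bytes], int]):
    """Merge-list style (as in minbpe): check that the two *children* match a
    recorded merge rule, and pick the one with the highest training priority."""
    best = None
    for i in range(len(parts) - 1):
        priority = merges.get((parts[i], parts[i + 1]))
        if priority is not None and (best is None or priority < best[1]):
            best = (i, priority)
    return best
```

The parent check only asks whether some token with those bytes exists; the child check also asks whether training built that token from exactly this split.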
Yup, that is indeed a subtle thing about how tiktoken does things. I have a comment about this in the Rust code, but I should add something to the simple code too: https://github.com/openai/tiktoken/blob/1b9faf2779855124f05174adf1383e53689ed94b/src/lib.rs#L23

It's equivalent if your token index equals merge priority. You make the same inductive argument that this code makes: https://github.com/karpathy/minbpe/blob/master/minbpe/gpt4.py#L27 The sketch is something like: prove by induction on rank that if a token is in the vocabulary, its two children must have been formed by earlier (lower-rank) merges, so the lowest-rank parent you can form is exactly the merge training would have made next.

Like the comment says, it's possible to mess with things and break the equivalence. The most interesting way I think in which you would get different results is if you adapted this to do stochastic BPE.

That said, there isn't a strong reason tiktoken does things this way; it was just a little simpler, and there's some connection to stuff I was toying with when I first wrote it.
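A minimal sketch of such a stochastic variant (illustrative only, not anything in tiktoken): instead of always taking the lowest-rank pair, sample uniformly among mergeable positions. Under the parent-rank check, "mergeable" just means the concatenation is in the vocabulary, so this can form a token from a split that training never used, which a child-pair merge list would never propose; the two checks then genuinely diverge.

```python
import random


def stochastic_merge_step(parts: list[bytes], ranks: dict[bytes, int]) -> list[bytes]:
    # Pick uniformly among *all* currently mergeable pairs, rather than
    # always the single lowest-rank one.
    candidates = [i for i in range(len(parts) - 1) if parts[i] + parts[i + 1] in ranks]
    if not candidates:
        return parts  # nothing left to merge
    i = random.choice(candidates)
    return parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
```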
hey hauntsaninja, fyi: I'm not a professional developer (I have a CS background and I'm learning AI/ML development via online lessons and ChatGPT; ChatGPT was used in modifying the code).
I would like your help in understanding two things, though:
Hello, glad you're having fun learning! The output of your code kind of looks like you're training on just "supercalifragilistic", not encoding it. You'll want to train on a slightly larger corpus ;-) Once you've done that, use the trained encoding to encode single words like "supercalifragilistic".

Not sure what you mean by "how many merges you decide to do" / where 3 merges comes from. BPE will merge until it can merge no more. (Unless you mean what size of vocabulary you target during training, in which case the answer is "whatever makes the language model we use this encoding in good", which can be a little tricky to determine.)

There's a working version of the code in https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , feel free to explore it.
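Concretely, the train-then-encode flow might look like the following (a sketch against `SimpleBytePairEncoding` in `_educational.py`; the corpus path and vocabulary size are placeholder assumptions):

```python
from tiktoken._educational import SimpleBytePairEncoding

# GPT-2-style split pattern, as used in the educational code.
gpt2_pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# Train on a real corpus first ("my_corpus.txt" is a hypothetical placeholder),
# *then* use the trained encoding on single words.
corpus = open("my_corpus.txt").read()
enc = SimpleBytePairEncoding.train(corpus, vocab_size=600, pat_str=gpt2_pattern)

tokens = enc.encode("supercalifragilistic")  # prints the intermediate merge steps
```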
Hello! I don't remember if I'd shown you https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , but consider stealing the token visualisation code from here in some form: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186
I've found it can be quite useful to show people the intermediate steps :-) There's similar stuff in there for training.
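Usage might look roughly like this (assuming, per the linked file, that `visualise_tokens` takes each token's raw byte value):

```python
import tiktoken
from tiktoken._educational import visualise_tokens

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("hello world, supercalifragilistic")

# visualise_tokens expects the byte value of each token, so decode them first.
visualise_tokens([enc.decode_single_token_bytes(t) for t in tokens])
```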