Analyze DeepSeek Models Tokenizer Overlap #7
Comments
@keyboardAnt - Evaluated DeepSeek distilled Qwen models. Will update with Llama. https://docs.google.com/spreadsheets/d/1QJxIlO82IHWOoUS3IMFvUWBHwhNN3CrZtCAPKwySH1I/edit?gid=0#gid=0
Looking at the statistics, I believe it will be worthwhile to run …
@gauravjain14, what do the checkmarks mean? They don't mark pairs with the same tokenizer, right? 👀 Also, please see my comments on the PR (#15).
@keyboardAnt Right - the checkmarks mean that the tokenizers completely overlap. The extra 128 tokens are just …
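For context, here is a minimal sketch of how the "complete overlap" checkmark criterion could be reproduced with Hugging Face `transformers`. The repo IDs are illustrative assumptions, not necessarily the exact pairs from the spreadsheet.

```python
# Hedged sketch of the "checkmark" criterion: two tokenizers fully overlap if
# every shared token string maps to the same id in both vocabularies.
# The repo ids below are assumptions for illustration.
from transformers import AutoTokenizer

target_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"    # assumed target model
drafter_id = "Qwen/Qwen2.5-7B-Instruct"                  # assumed drafter candidate

target_vocab = AutoTokenizer.from_pretrained(target_id).get_vocab()    # token -> id
drafter_vocab = AutoTokenizer.from_pretrained(drafter_id).get_vocab()

shared = set(target_vocab) & set(drafter_vocab)
ids_match = all(target_vocab[t] == drafter_vocab[t] for t in shared)

only_target = set(target_vocab) - set(drafter_vocab)
only_drafter = set(drafter_vocab) - set(target_vocab)

print(f"shared tokens: {len(shared)} (ids identical: {ids_match})")
print(f"only in target: {len(only_target)}, only in drafter: {len(only_drafter)}")
if only_target:
    print("sample of target-only tokens:", sorted(only_target)[:10])
```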
@gauravjain14 - I'm attaching the table you shared. Could you please add more information and create something similar to tables 5 and 6 from our paper?

Summary of the Discussion: the thread discusses a vocabulary size mismatch in the DeepSeek-R1-Distill-Qwen-7B model between the tokenizer's vocabulary and the model's configured vocab_size.

Key Points: …
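As a side note, the size mismatch summarized above can be checked directly. The sketch below is a rough illustration (the repo ID is assumed from the model name in the thread), comparing the number of tokens the tokenizer defines with the embedding size declared in the model config.

```python
# Rough check of the vocabulary size mismatch: len(tokenizer) counts the
# tokens the tokenizer actually defines, while config.vocab_size is the
# number of rows in the embedding / LM head, which is padded upward
# (here, to a multiple of 64). The repo id is an assumption.
from transformers import AutoConfig, AutoTokenizer

repo_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
config = AutoConfig.from_pretrained(repo_id)

print(f"len(tokenizer)    = {len(tokenizer)}")
print(f"config.vocab_size = {config.vocab_size} "
      f"(multiple of 64: {config.vocab_size % 64 == 0})")
print(f"unused padding rows = {config.vocab_size - len(tokenizer)}")
```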
@keyboardAnt @gauravjain14 |
Thanks @jmamou. This is very relevant. Please see my comments on your PR (#21).
Description:
Analyze tokenizer and vocabulary overlap between DeepSeek models and potential drafters.
Technical Details:

We saw that the vocabulary sizes of these models are different multiples of 64.
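As an illustration of the analysis this issue asks for, the sketch below tabulates, for an assumed target and a few assumed drafter candidates, each tokenizer's size, the configured vocab_size together with its multiple of 64, and the fraction of the candidate's vocabulary shared with the target. The candidate list is a placeholder, not the set evaluated in the spreadsheet.

```python
# Sketch of the requested tokenizer/vocabulary overlap analysis.
# All repo ids below are assumptions for illustration; some (e.g. the
# meta-llama repos) may require accepting a license on the Hub first.
from transformers import AutoConfig, AutoTokenizer

target_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"    # assumed target
candidate_ids = [                                        # assumed drafter candidates
    "Qwen/Qwen2.5-0.5B-Instruct",
    "Qwen/Qwen2.5-1.5B-Instruct",
    "meta-llama/Llama-3.2-1B-Instruct",
]

target_vocab = set(AutoTokenizer.from_pretrained(target_id).get_vocab())

print(f"{'model':45s} {'len(tok)':>9s} {'vocab_size':>10s} {'x64':>5s} {'overlap':>8s}")
for repo_id in [target_id] + candidate_ids:
    tok = AutoTokenizer.from_pretrained(repo_id)
    cfg = AutoConfig.from_pretrained(repo_id)
    vocab = set(tok.get_vocab())
    # fraction of this model's vocabulary that also appears in the target's
    overlap = len(vocab & target_vocab) / len(vocab)
    print(f"{repo_id:45s} {len(tok):9d} {cfg.vocab_size:10d} "
          f"{cfg.vocab_size // 64:5d} {overlap:8.3f}")
```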
Tasks: