Analyze DeepSeek Models Tokenizer Overlap #7

Open · 3 tasks done
keyboardAnt opened this issue Feb 3, 2025 · 8 comments · May be fixed by #21

keyboardAnt (Owner) commented Feb 3, 2025

Description:
Analyze tokenizer and vocabulary overlap between DeepSeek models and potential drafters.

Technical Details:
We saw that the vocabulary sizes of these models are different multiples of 64 (see the attached screenshot):

Image

Tasks:

  • Add the script to analyze vocabulary intersection between models (see the sketch below this list)
  • Generate overlap statistics report
  • Make recommendations for compatible drafters
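For the first task, here is a minimal sketch of what the intersection script could look like (not the script from this repo; it assumes Hugging Face transformers tokenizers, and the model names and the vocab_overlap helper are placeholders):

```python
# Sketch: compare the vocabularies of two Hugging Face tokenizers and report
# basic overlap statistics. Model names below are placeholders; swap in any
# target/drafter pair.
from transformers import AutoTokenizer


def vocab_overlap(name_a: str, name_b: str) -> dict:
    """Return overlap statistics between two tokenizers' vocabularies."""
    vocab_a = set(AutoTokenizer.from_pretrained(name_a).get_vocab())
    vocab_b = set(AutoTokenizer.from_pretrained(name_b).get_vocab())
    shared = vocab_a & vocab_b
    return {
        "size_a": len(vocab_a),
        "size_b": len(vocab_b),
        "shared": len(shared),
        "only_in_a": len(vocab_a - vocab_b),
        "only_in_b": len(vocab_b - vocab_a),
        "jaccard": len(shared) / len(vocab_a | vocab_b),
    }


if __name__ == "__main__":
    print(vocab_overlap(
        "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    ))
```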
keyboardAnt added this to the SLLM@ICLR milestone Feb 3, 2025
gauravjain14 (Collaborator) commented Feb 3, 2025

@keyboardAnt - Evaluated DeepSeek distilled Qwen models. Will update with Llama.

https://docs.google.com/spreadsheets/d/1QJxIlO82IHWOoUS3IMFvUWBHwhNN3CrZtCAPKwySH1I/edit?gid=0#gid=0

gauravjain14 (Collaborator) commented

Looking at the statistics, I believe it would be worthwhile to run deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B as the assistant model and deepseek-ai/DeepSeek-R1-Distill-Llama-8B or deepseek-ai/DeepSeek-R1-Distill-Llama-70B as the target model.
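For reference, a minimal sketch of how such a draft/target pairing could be tried with Hugging Face assisted generation. This is not from the thread; it assumes a transformers release whose generate accepts assistant_model plus tokenizer/assistant_tokenizer (needed because the Qwen draft and the Llama target use different tokenizers), and the prompt is arbitrary:

```python
# Sketch: pair a small Qwen-based draft with a larger Llama-based target via
# assisted generation. Assumes a transformers version that supports passing
# `tokenizer` and `assistant_tokenizer` to `generate` (required when the two
# models use different tokenizers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
draft_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

target_tok = AutoTokenizer.from_pretrained(target_id)
draft_tok = AutoTokenizer.from_pretrained(draft_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Briefly explain speculative decoding."
inputs = target_tok(prompt, return_tensors="pt").to(target.device)

output = target.generate(
    **inputs,
    assistant_model=draft,
    tokenizer=target_tok,
    assistant_tokenizer=draft_tok,
    max_new_tokens=64,
)
print(target_tok.decode(output[0], skip_special_tokens=True))
```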

keyboardAnt (Owner, Author) commented Feb 4, 2025

> @keyboardAnt - Evaluated DeepSeek distilled Qwen models. Will update with Llama.
>
> https://docs.google.com/spreadsheets/d/1QJxIlO82IHWOoUS3IMFvUWBHwhNN3CrZtCAPKwySH1I/edit?gid=0#gid=0

@gauravjain14, what do the checkmarks mean? They don't mark pairs with the same tokenizer, right? 👀
The screenshot attached to the first message of this issue suggests that the tokenizers of deepseek-ai/DeepSeek-R1-Distill-Qwen-32B and deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B differ.

Image

Also, please see my comments on the PR (#15).

keyboardAnt reopened this Feb 4, 2025
keyboardAnt linked a pull request Feb 4, 2025 that will close this issue
gauravjain14 (Collaborator) commented Feb 4, 2025

@keyboardAnt Right - the checkmarks mean that the tokenizers completely overlap. The extra 128 tokens are just None.
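(Not part of the original comment.) A small sketch of how this could be verified, assuming the fast tokenizers return None for ids they cannot decode; the model pair is just an example:

```python
# Sketch: check whether two distilled-model tokenizers share the same
# token -> id mapping, and what the ids reserved by the model config but
# absent from the tokenizer decode to (expected: None).
from transformers import AutoConfig, AutoTokenizer

name_a = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # example pair
name_b = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

tok_a = AutoTokenizer.from_pretrained(name_a)
tok_b = AutoTokenizer.from_pretrained(name_b)
print("identical vocab dicts:", tok_a.get_vocab() == tok_b.get_vocab())

cfg_a = AutoConfig.from_pretrained(name_a)
extra_ids = list(range(len(tok_a), cfg_a.vocab_size))
print(f"{len(extra_ids)} ids beyond the tokenizer decode to:",
      set(tok_a.convert_ids_to_tokens(extra_ids)))
```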

keyboardAnt (Owner, Author) commented

> @keyboardAnt Right - the checkmarks mean that the tokenizers completely overlap. The extra 128 tokens are just None.

@gauravjain14 -
Bingxuan Wang from DeepSeek (YellowDoge) confirmed that the extra tokens are not trained and explained the discrepancy here: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/discussions/4 (see ChatGPT's summary at the end of this message).

I'm attaching the table you shared:
Image

Could you please add more information and create something similar to Tables 5 and 6 from our paper?
Image


Summary of the Discussion:

The thread discusses a vocabulary size mismatch in the DeepSeek-R1-Distill-Qwen-7B model between config.json and tokenizer.json, raising concerns about potential issues in training and inference.

Key Points:

  1. Mismatch in Vocabulary Size:

    • tokenizer.json contains 151,665 tokens, while config.json specifies 152,064 tokens, matching the original Qwen tokenizer instead of the expected DeepSeek tokenizer.
    • This discrepancy led to concerns that the config.json might be incorrect.
    • Despite the mismatch, the model weights (LM head) match config.json, while the uploaded tokenizer differs.
    • This raised the question: Was the model trained using the Qwen tokenizer instead of DeepSeek’s?
    • The mismatch causes issues in training/inference frameworks like Axolotl.
  2. Official Response (YellowDoge):

    • The vocab size in config.json defines input/output embedding dimensions, which can be larger than the tokenizer’s token count.
    • The extra embeddings are untrained and "should not affect model performance".
  3. Open Question by @Nadav-Timor:

    • What guarantees that the logits assign zero probability to these untrained tokens?
    • This question challenges YellowDoge's claim that untrained embeddings won't interfere with model outputs and asks for a more detailed explanation of the mechanism ensuring they are ignored (a quick empirical check is sketched below this list).
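One way to probe that open question empirically (a sketch, not from the linked discussion; the model name is an example) is to measure how much probability mass the model actually places on ids beyond the tokenizer's vocabulary:

```python
# Sketch: measure the probability mass a model assigns to ids that exist only
# as (untrained) embedding/LM-head rows, i.e. ids >= len(tokenizer).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # example model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits, length = config.vocab_size

probs = torch.softmax(logits.float(), dim=-1)
extra_mass = probs[len(tok):].sum().item()  # mass on ids the tokenizer cannot decode
print(f"vocab_size={model.config.vocab_size}, tokenizer_len={len(tok)}, "
      f"probability on extra ids={extra_mass:.3e}")
```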

jmamou (Collaborator) commented Feb 5, 2025

@keyboardAnt @gauravjain14
#21 could be relevant

keyboardAnt linked a pull request Feb 5, 2025 that will close this issue
keyboardAnt assigned jmamou and unassigned gauravjain14 Feb 5, 2025
keyboardAnt (Owner, Author) commented

> @keyboardAnt @gauravjain14 #21 could be relevant

Thanks @jmamou. This is very relevant. Please see my comments on your PR (#21).

keyboardAnt added the help wanted label Feb 6, 2025
jmamou (Collaborator) commented Feb 6, 2025

> @keyboardAnt @gauravjain14 #21 could be relevant
>
> Thanks @jmamou. This is very relevant. Please see my comments on your PR (#21).

#21 (comment)
