Analyze DeepSeek Models Tokenizer Overlap #7

Open · 3 tasks done
keyboardAnt opened this issue Feb 3, 2025 · 8 comments · May be fixed by #21

keyboardAnt (Owner) commented Feb 3, 2025

Description:
Analyze tokenizer and vocabulary overlap between DeepSeek models and potential drafters.

Technical Details:
We saw that the vocabulary sizes of these models are different multiples of 64 (see the attached screenshot):

Image

Tasks:

  • Add the script to analyze vocabulary intersection between models (see the sketch below this list)
  • Generate overlap statistics report
  • Make recommendations for compatible drafters
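For the first task, here is a minimal sketch of what the intersection script could look like (not the script from this repo; it assumes Hugging Face transformers tokenizers, and the model names and the vocab_overlap helper are placeholders):

```python
# Sketch: compare the vocabularies of two Hugging Face tokenizers and report
# basic overlap statistics. Model names below are placeholders; swap in any
# target/drafter pair.
from transformers import AutoTokenizer


def vocab_overlap(name_a: str, name_b: str) -> dict:
    """Return overlap statistics between two tokenizers' vocabularies."""
    vocab_a = set(AutoTokenizer.from_pretrained(name_a).get_vocab())
    vocab_b = set(AutoTokenizer.from_pretrained(name_b).get_vocab())
    shared = vocab_a & vocab_b
    return {
        "size_a": len(vocab_a),
        "size_b": len(vocab_b),
        "shared": len(shared),
        "only_in_a": len(vocab_a - vocab_b),
        "only_in_b": len(vocab_b - vocab_a),
        "jaccard": len(shared) / len(vocab_a | vocab_b),
    }


if __name__ == "__main__":
    print(vocab_overlap(
        "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    ))
```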
keyboardAnt added this to the SLLM@ICLR milestone Feb 3, 2025
gauravjain14 (Collaborator) commented Feb 3, 2025

@keyboardAnt - Evaluated DeepSeek distilled Qwen models. Will update with Llama.

https://docs.google.com/spreadsheets/d/1QJxIlO82IHWOoUS3IMFvUWBHwhNN3CrZtCAPKwySH1I/edit?gid=0#gid=0

gauravjain14 (Collaborator) commented

Looking at the statistics, I believe it would be worthwhile to run deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B as the assistant model and deepseek-ai/DeepSeek-R1-Distill-Llama-8B or deepseek-ai/DeepSeek-R1-Distill-Llama-70B as the target model.
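For reference, a minimal sketch of how such a draft/target pairing could be tried with Hugging Face assisted generation. This is not from the thread; it assumes a transformers release whose generate accepts assistant_model plus tokenizer/assistant_tokenizer (needed because the Qwen draft and the Llama target use different tokenizers), and the prompt is arbitrary:

```python
# Sketch: pair a small Qwen-based draft with a larger Llama-based target via
# assisted generation. Assumes a transformers version that supports passing
# `tokenizer` and `assistant_tokenizer` to `generate` (required when the two
# models use different tokenizers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
draft_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

target_tok = AutoTokenizer.from_pretrained(target_id)
draft_tok = AutoTokenizer.from_pretrained(draft_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Briefly explain speculative decoding."
inputs = target_tok(prompt, return_tensors="pt").to(target.device)

output = target.generate(
    **inputs,
    assistant_model=draft,
    tokenizer=target_tok,
    assistant_tokenizer=draft_tok,
    max_new_tokens=64,
)
print(target_tok.decode(output[0], skip_special_tokens=True))
```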

keyboardAnt (Owner, Author) commented Feb 4, 2025

> @keyboardAnt - Evaluated DeepSeek distilled Qwen models. Will update with Llama.
>
> https://docs.google.com/spreadsheets/d/1QJxIlO82IHWOoUS3IMFvUWBHwhNN3CrZtCAPKwySH1I/edit?gid=0#gid=0

@gauravjain14, what do the checkmarks mean? They don't mark pairs with the same tokenizer, right? 👀
The screenshot attached to the first message of this issue suggests that the tokenizers of deepseek-ai/DeepSeek-R1-Distill-Qwen-32B and deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B differ.

Image

Also, please see my comments on the PR (#15).

keyboardAnt reopened this Feb 4, 2025
keyboardAnt linked a pull request Feb 4, 2025 that will close this issue
gauravjain14 (Collaborator) commented Feb 4, 2025

@keyboardAnt Right - the checkmarks mean that the tokenizers completely overlap. The extra 128 tokens are just None.
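(Not part of the original comment.) A small sketch of how this could be verified, assuming the fast tokenizers return None for ids they cannot decode; the model pair is just an example:

```python
# Sketch: check whether two distilled-model tokenizers share the same
# token -> id mapping, and what the ids reserved by the model config but
# absent from the tokenizer decode to (expected: None).
from transformers import AutoConfig, AutoTokenizer

name_a = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # example pair
name_b = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

tok_a = AutoTokenizer.from_pretrained(name_a)
tok_b = AutoTokenizer.from_pretrained(name_b)
print("identical vocab dicts:", tok_a.get_vocab() == tok_b.get_vocab())

cfg_a = AutoConfig.from_pretrained(name_a)
extra_ids = list(range(len(tok_a), cfg_a.vocab_size))
print(f"{len(extra_ids)} ids beyond the tokenizer decode to:",
      set(tok_a.convert_ids_to_tokens(extra_ids)))
```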

keyboardAnt (Owner, Author) commented

> @keyboardAnt Right - the checkmarks mean that the tokenizers completely overlap. The extra 128 tokens are just None.

@gauravjain14 -
Bingxuan Wang from DeepSeek (YellowDoge) confirmed that the extra tokens are not trained and explained the discrepancy here: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/discussions/4 (see ChatGPT's summary at the end of this message).

I'm attaching the table you shared:
Image

Could you please add more information and create something similar to Tables 5 and 6 from our paper?
Image


Summary of the Discussion:

The thread discusses a vocabulary size mismatch in the DeepSeek-R1-Distill-Qwen-7B model between config.json and tokenizer.json, raising concerns about potential issues in training and inference.

Key Points:

  1. Mismatch in Vocabulary Size:

    • tokenizer.json contains 151,665 tokens, while config.json specifies 152,064 tokens, matching the original Qwen tokenizer instead of the expected DeepSeek tokenizer.
    • This discrepancy led to concerns that the config.json might be incorrect.
    • Despite the mismatch, the model weights (LM head) match config.json, while the uploaded tokenizer differs.
    • This raised the question: Was the model trained using the Qwen tokenizer instead of DeepSeek’s?
    • The mismatch causes issues in training/inference frameworks like Axolotl.
  2. Official Response (YellowDoge):

    • The vocab size in config.json defines input/output embedding dimensions, which can be larger than the tokenizer’s token count.
    • The extra embeddings are untrained and "should not affect model performance".
  3. Open Question by @Nadav-Timor:

    • What guarantees that the logits assign zero probability to these untrained tokens?
    • This question challenges YellowDoge's claim that untrained embeddings won't interfere with model outputs and asks for a more detailed explanation of the mechanism ensuring they are ignored (a quick empirical check is sketched below this list).
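One way to probe that open question empirically (a sketch, not from the linked discussion; the model name is an example) is to measure how much probability mass the model actually places on ids beyond the tokenizer's vocabulary:

```python
# Sketch: measure the probability mass a model assigns to ids that exist only
# as (untrained) embedding/LM-head rows, i.e. ids >= len(tokenizer).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # example model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits, length = config.vocab_size

probs = torch.softmax(logits.float(), dim=-1)
extra_mass = probs[len(tok):].sum().item()  # mass on ids the tokenizer cannot decode
print(f"vocab_size={model.config.vocab_size}, tokenizer_len={len(tok)}, "
      f"probability on extra ids={extra_mass:.3e}")
```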

jmamou (Collaborator) commented Feb 5, 2025

@keyboardAnt @gauravjain14
#21 could be relevant

keyboardAnt linked a pull request Feb 5, 2025 that will close this issue
keyboardAnt assigned jmamou and unassigned gauravjain14 Feb 5, 2025
keyboardAnt (Owner, Author) commented

> @keyboardAnt @gauravjain14 #21 could be relevant

Thanks @jmamou. This is very relevant. Please see my comments on your PR (#21).

keyboardAnt added the help wanted label Feb 6, 2025
jmamou (Collaborator) commented Feb 6, 2025

> @keyboardAnt @gauravjain14 #21 could be relevant
>
> Thanks @jmamou. This is very relevant. Please see my comments on your PR (#21).

#21 (comment)
