llama-model : add dots.llm1 architecture support (#14044) #14118
Conversation
Forgot to mention, @ddh0 made some quants, although I think right now you should run these with https://huggingface.co/ddh0/dots.llm1.inst-GGUF-Q4_0-EXPERIMENTAL
Force-pushed a tiny fix for a linter error (1c1517b to 16dc0f4).
I was able to run it:
It would be interesting to see if the vllm/transformers implementation has any issues like this. The logprobs make it look like the model is absolutely baffled at that token position, as none of the options that show up there are sane for what should follow "smir" lol.
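For anyone who wants to poke at this themselves, a minimal sketch of pulling the per-position candidates out of a running llama-server is below. The prompt is a placeholder, it assumes the server is already up on localhost:8080 with the dots.llm1 GGUF loaded, and the exact response field names may differ between llama.cpp versions:

```python
# Minimal sketch: ask a running llama-server for the top candidate tokens
# (with probabilities) at each generated position after a prompt.
# Assumptions: server already running on localhost:8080 with the model loaded;
# the prompt is a placeholder; response field names can vary by llama.cpp version.
import json
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "smir",    # placeholder for the problematic prefix
        "n_predict": 4,      # generate a few tokens past the prefix
        "n_probs": 10,       # ask for the top-10 candidates per position
        "temperature": 0.0,  # greedy, so the top candidate is what gets picked
    },
    timeout=300,
)
resp.raise_for_status()
data = resp.json()

print("completion:", data.get("content"))
# With n_probs > 0 the server reports per-position candidate tokens and
# probabilities; dump them verbatim for inspection.
print(json.dumps(data.get("completion_probabilities", []), ensure_ascii=False, indent=2))
```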
@DocShotgun I'm jealous of the RTX PRO 6000 96GB! I had a […]; I cancelled it because it will be sod's law the […] (just noticed the dates have all moved again to mid/late July now!)
Following up, I've spent about an hour testing on a hosted Q6_K endpoint, and thus far I haven't had any of those random Chinese character/nonsense word moments, even on prompts where the Q4_0 was triggering them fairly frequently.
@jukofyork @DocShotgun have your tests been purely on NVIDIA GPUs? I didn't notice anything weird earlier in my tests (also used Q4 mostly), so I'm wondering if it's another platform-specific thing (the GLM-4 family had some tricky issues that seemingly were platform-specific). I ran my earlier testing on Metal. I'll be testing on pure CPU next; I forgot I have a 256GB Hetzner server with a nice modern AMD EPYC CPU, so I'll address the review feedback and do further testing on that machine, likely this weekend at some point. I'll also go and test that weird token scenario if I can replicate it.
Force-pushed from b6d1cb8 to 14fa155.
Addressed all the review feedback given so far. Also: the review helped make convert_hf_to_gguf.py a bit simpler.
I tested the changes end-to-end, except for the very last push, which had a small lint fix for the Python code and removed a piece of code from constants.py that is likely unnecessary. I'm going to run some basic perplexity checks and HF comparison tests and report back. The changes made to address the review feedback should not affect existing .ggufs in any way; if they do, then I probably messed something up.
Tested the latest push end-to-end as well (end-to-end here meaning: convert from HF .safetensors -> bf16 .gguf -> q4 .gguf -> load it up in llama-server or llama-cli and prompt it) and it all works. Perplexity looks normal too (it takes a while to run, but the initial numbers look fine to me so far):
Running on pure CPU. Edit: final result for Q4_K: […] Based on my HF testing, where I was instead comparing logits between the two implementations to try to verify the llama.cpp computation graph, I suspect Q8 would get a quite a lot lower PPL score here.
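For anyone wanting to reproduce the end-to-end test, roughly what that pipeline looks like is sketched below. Paths, output names, the prompt, and the quant type are placeholders; it assumes llama.cpp is built so that llama-quantize, llama-cli and llama-perplexity are on PATH:

```python
# Rough sketch of the end-to-end test described above:
# HF .safetensors -> bf16 .gguf -> Q4_K_M .gguf -> prompt it -> perplexity.
# Paths, output names, and the quant type are placeholders; assumes llama.cpp
# is built and llama-quantize / llama-cli / llama-perplexity are on PATH.
import subprocess
import sys

HF_DIR = "./dots.llm1.inst"        # local download of the HF repo
BF16_GGUF = "dots1-bf16.gguf"
Q4_GGUF = "dots1-Q4_K_M.gguf"

# 1. Convert the HF safetensors model to a bf16 GGUF.
subprocess.run(
    [sys.executable, "convert_hf_to_gguf.py", HF_DIR,
     "--outtype", "bf16", "--outfile", BF16_GGUF],
    check=True,
)

# 2. Quantize the bf16 GGUF down to Q4_K_M.
subprocess.run(["llama-quantize", BF16_GGUF, Q4_GGUF, "Q4_K_M"], check=True)

# 3. Smoke-test generation with a short prompt.
subprocess.run(
    ["llama-cli", "-m", Q4_GGUF, "-p", "Hello, who are you?", "-n", "64"],
    check=True,
)

# 4. Perplexity over a test text file (e.g. wikitext-2's wiki.test.raw).
subprocess.run(["llama-perplexity", "-m", Q4_GGUF, "-f", "wiki.test.raw"], check=True)
```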
I can't get the weird "smirpsilon" test case to show up, at least with initial attempts, although in these cases that is all I have for the prompt, just two tokens. I haven't seen random Chinese characters around either; the quant is the same Q4_K I'm using in the perplexity test above.
Did some spot checking of the HF implementation vs the llama.cpp implementation. The CPU is not exactly fast, so it's hard to do a comprehensive test (running just one comparison takes a while), so I settled for spot checking. Empirically, Q8 is a lot closer to the HF implementation than Q4. The HF implementation, following RedNote's example implementation on their HuggingFace page, outputs raw logits that are bfloat16 values. For example, in one test I got these raw logits out of the HF implementation (top 5 tokens):
Not sure if that's normal or a mistake. I feel that bfloat16 probably should not be used at the output (and probably not float16 either), due to the large number of logits vs the low number of possible values these types can express. My machine can't load the model at float32 to do a better comparison test vs llama.cpp, but even with bfloat16 I don't see anything that obviously says the llama.cpp implementation is off; generally the token probabilities agree. (cc @redmoe-moutain: is the implementation at https://huggingface.co/rednote-hilab/dots.llm1.inst supposed to do output as bfloat16 in the example code? I was doing the comparisons based on the first implementation there, in "Text Completion", using code from huggingface/transformers#38143)
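For reference, the kind of spot check I ran on the HF side looks roughly like the sketch below. The prompt is a placeholder, it assumes a transformers build that knows the architecture (e.g. the PR above) or trust_remote_code, and the logits are cast to float32 before softmax/top-k to sidestep the bfloat16 resolution issue mentioned above:

```python
# Sketch of the HF-side spot check: grab the raw next-token logits and cast
# them to float32 *before* softmax/top-k, so the comparison with llama.cpp is
# not limited by bfloat16 resolution. The prompt is a placeholder; loading the
# 142B model in bfloat16 needs a lot of memory (device_map="auto" spreads it
# over whatever GPUs/CPU RAM is available).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rednote-hilab/dots.llm1.inst"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("Give me a word that starts with smir", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]     # raw bfloat16 logits for the next token

probs = torch.softmax(logits.float(), dim=-1)  # cast to float32 before softmax
top = torch.topk(probs, k=5)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([idx])!r}: {p:.5f}")
```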
I think I'm done with the testing and the itches I've had so far. I can always test more, but I'm more confident than not that the graphs are correct 👍 Any other review comments I can address :) Let me know if anyone sees anything off, or if a test result says something weird.
Just a thing to note: the random gibberish tokens I ran into occurred on Q4_0, which is a significantly worse quant than Q4_K. I didn't have any issues when my friend hosted Q6_K.
Adds:
* Dots1Model to convert_hf_to_gguf.py
* Computation graph code to llama-model.cpp
* Chat template to llama-chat.cpp to detect this model's template.

---

The architecture is called "dots.llm1" (I decided to shorten it to dots1 or DOTS1 in the code, generally). The only models that exist as of the writing of this commit that follow this architecture are "dots.llm1.inst" and "dots.llm1.base" from here:
* https://huggingface.co/rednote-hilab/dots.llm1.inst
* https://huggingface.co/rednote-hilab/dots.llm1.base

The model architecture is a combination of Qwen and Deepseek parts, as seen here: https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py
Force-pushed from 14fa155 to df0d4c3.
Add support for the "dots.llm1" architecture. I decided to shorten that to dots1/DOTS1 in the code.
Tracking issue: #14044
These are the models that exist so far that use it:
https://huggingface.co/rednote-hilab/dots.llm1.inst
https://huggingface.co/rednote-hilab/dots.llm1.base
There is also a paper: https://github.com/rednote-hilab/dots.llm1/blob/main/dots1_tech_report.pdf (you can find this link on their Huggingface page).
And RedNote appears to have a GitHub page for this model as well: https://github.com/rednote-hilab/dots.llm1
The architecture has DeepseekV2+ MoE code but Qwen3 attention, kind of a mix:
https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py
The model is a 32k-context MoE model with 142B total parameters and 14B activated parameters. It has its own new chat template and tokens for it.
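If you want to eyeball the Deepseek-style MoE knobs and the new chat template without loading the weights, something like the rough sketch below works. The MoE config field names are my assumption (borrowed from DeepSeek-style configs) rather than confirmed, and it needs a transformers build that knows the architecture (e.g. the transformers PR linked at the end) or trust_remote_code:

```python
# Sketch: peek at the dots.llm1 config and chat template without loading any
# weights. The MoE field names below are assumptions (DeepSeek-style configs);
# getattr() simply prints None for anything that doesn't exist in the real
# Dots1 config.
from transformers import AutoConfig, AutoTokenizer

model_id = "rednote-hilab/dots.llm1.inst"
cfg = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

for name in (
    "num_hidden_layers",
    "num_attention_heads",
    "num_key_value_heads",
    "max_position_embeddings",
    "n_routed_experts",        # assumed DeepSeek-style MoE fields from here on
    "num_experts_per_tok",
    "n_shared_experts",
    "first_k_dense_replace",
    "moe_intermediate_size",
):
    print(f"{name}: {getattr(cfg, name, None)}")

# Render the model's own chat template on a toy conversation.
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
messages = [{"role": "user", "content": "Hello!"}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```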
I think this is maybe the lab's very first model; I see absolutely no other history from them and I've never heard of them before. The model itself seems fairly OK, with similar smarts to other recent local models of this size, but I don't dare make strong claims about whether it is good or not when my experience is purely anecdotal.
This PR has:
* _DOTS1 constants across wherever new architecture code is added.
* Dots1Model introduced to convert_hf_to_gguf.py to convert the models.

So far I've tested it empirically, and with some perplexity tests. Nothing seems totally off.
Some examples of prompting here: #14044 (comment)
Perplexity tests, see this comment: #14118 (comment)
For reference I used RedNote team's PR to transformers: huggingface/transformers#38143