
Resolved half rope, multi-EOS issues in convert_hf_to_gguf.py for GLM4Z Model #12957



Closed
piDack wants to merge 8 commits

Conversation

piDack
Contributor

@piDack piDack commented Apr 15, 2025

Make sure to read the contributing guidelines before submitting a PR

I have resolved several critical issues in the convert_hf_to_gguf.py script, specifically addressing the half rope problem, multi-EOS (End of Sequence) issues, and the GGGG output problem. Detailed information regarding these issues can be found at ggml-org/llama.cpp#12946.

In addition to these fixes, the conversion reuses the existing ChatGLM architecture rather than introducing a new one, which keeps the changes small and the codebase easier to maintain in the long run.
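As a rough illustration (not the exact code in this PR): the half-rope part boils down to writing a RoPE dimension count of half the head size into the GGUF metadata, and the multi-EOS part to registering all of the model's declared end-of-sequence tokens rather than only the first. A minimal sketch of the rope-dimension derivation, assuming the usual Hugging Face config fields:

def rope_dimension_count(hf_config: dict) -> int:
    # Only part of each attention head is rotated by RoPE in GLM4-style models;
    # a partial_rotary_factor of 0.5 (the assumed default) gives the "half rope".
    head_dim = hf_config["hidden_size"] // hf_config["num_attention_heads"]
    rotary_factor = hf_config.get("partial_rotary_factor", 0.5)
    return int(head_dim * rotary_factor)

# Example: hidden_size=4096, num_attention_heads=32 -> head_dim=128, so the
# GGUF metadata key glm4.rope.dimension_count should be 64, matching the
# manual --override-kv glm4.rope.dimension_count=int:64 workaround mentioned
# later in this thread.
assert rope_dimension_count({"hidden_size": 4096, "num_attention_heads": 32}) == 64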

The picture below is the output of a Q4_0 GLM4Z-9B quant.

[image: sample output from the Q4_0 GLM4Z-9B quant]

@github-actions github-actions bot added the python (python script changes) label Apr 15, 2025
@piDack piDack changed the title from "Resolved half rope and multi-EOS issues in convert_hf_to_gguf.py and addressed GGGG output problem for GLM4Z Model" to "Resolved half rope, multi-EOS issues in convert_hf_to_gguf.py and addressed GGGG output problem for GLM4Z Model" Apr 15, 2025
@pwilkin

pwilkin commented Apr 15, 2025

I have re-quantized the models using this pull request and I can happily report that it does indeed fix the issues.

@jussitus

Doesn't seem to work for the 32B models (specifically GLM-4-32B-0414) ?

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 613, got 491

@pwilkin

pwilkin commented Apr 15, 2025

@jussitus are you using a version of llama.cpp built with this pull request and a model converted with the convert_hf_to_gguf.py from this pull request?

I haven't done corrected 32B quants since it's too much for my poor graphics card, don't know if anyone else has.

@jussitus

@jussitus are you using a version of llama.cpp built with this pull request and a model converted with the convert_hf_to_gguf.py from this pull request?

I haven't done corrected 32B quants since it's too much for my poor graphics card, don't know if anyone else has.

Ah, I just pulled the conversion script. I'll build it again, thanks.

@jussitus

jussitus commented Apr 15, 2025

@jussitus are you using a version of llama.cpp built with this pull request and a model converted with the convert_hf_to_gguf.py from this pull request?
I haven't done corrected 32B quants since it's too much for my poor graphics card, don't know if anyone else has.

Ah, I just pulled the conversion script. I'll build it again, thanks.

Got it running. Tried both f16 and a quant (IQ4_XS). Long prompts result in either GGGGGGGGG or proper gibberish. This is on vulkan (radeon 7800xt). With -dev none (i.e. on the cpu) I can throw anything at it and get a coherent response.

edit: Are you absolutely sure this fixes the underlying problem? Cause I can get Z1-9B to behave and translate Finnish texts at length (using reasoning) with --override-kv glm4.rope.dimension_count=int:64 alone (Bartowski's quant).

@piDack
Contributor Author

piDack commented Apr 15, 2025

@jussitus are you using a version of llama.cpp built with this pull request and a model converted with the convert_hf_to_gguf.py from this pull request?
I haven't done corrected 32B quants since it's too much for my poor graphics card, don't know if anyone else has.

Ah, I just pulled the conversion script. I'll build it again, thanks.

Got it running. Tried both f16 and a quant (IQ4_XS). Long prompts result in either GGGGGGGGG or proper gibberish. This is on vulkan (radeon 7800xt). With -dev none (i.e. on the cpu) I can throw anything at it and get a coherent response.

edit: Are you absolutely sure this fixes the underlying problem? Cause I can get Z1-9B to behave and translate Finnish texts at length (using reasoning) with --override-kv glm4.rope.dimension_count=int:64 alone (Bartowski's quant).

Can you share your prompt?

@Noeda
Contributor

Noeda commented Apr 15, 2025

@jussitus are you using a version of llama.cpp built with this pull request and a model converted with the convert_hf_to_gguf.py from this pull request?
I haven't done corrected 32B quants since it's too much for my poor graphics card, don't know if anyone else has.

Ah, I just pulled the conversion script. I'll build it again, thanks.

Got it running. Tried both f16 and a quant (IQ4_XS). Long prompts result in either GGGGGGGGG or proper gibberish. This is on vulkan (radeon 7800xt). With -dev none (i.e. on the cpu) I can throw anything at it and get a coherent response.

edit: Are you absolutely sure this fixes the underlying problem? Cause I can get Z1-9B to behave and translate Finnish texts at length (using reasoning) with --override-kv glm4.rope.dimension_count=int:64 alone (Bartowski's quant).

@jussitus How long is long? I got working, coherent, non-repeating quants (tested both the 9B and 32B models quantized to Q8_0) on a 192GB Mac Studio using Metal but I haven't tested prompts longer than about 10k tokens. And as a fellow Finnish speaker I could try the reasoning model on some Finnish text :) I could also try if it starts to break down at sufficient length. If CPU is coherent, and Metal is coherent, maybe AMD/Vulkan-specific problem? I have not tried IQ4_XS though (I've only tried Q8).

Only possible "breakage" issue I've seen is that the 32B rumination model has never actually finished its "thinking" in any of my tests. I don't know if it's because it's just trained to do that and you have to stop it manually, or if it naturally is just really really long thinking, or if there's something subtly wrong with the quants I have that breaks it just enough to never stop.

My hacked quants are working with fixes similar to the ones I see in this PR; this PR just has them more cleanly done than my jerry-rigged job. I'll give this PR a try with the 32B models at Q8 👍 and compare with my work.

@Noeda
Contributor

Noeda commented Apr 16, 2025

Edit: Nevermind me, I fixed it. I had shared libraries setup so that it still pulled in older code. Just learning baby steps on how to computer here 🤦

Doesn't seem to work for the 32B models (specifically GLM-4-32B-0414) ?

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 613, got 491

Puzzlingly, just now after I finish requanting everything with this PR as opposed to my hacks, I'm getting wrong tensor numbers, despite using only code from this PR and converting fresh Huggingface files. I'm inclined to think I accidentally messed up though on my side because comment history suggests nobody else had an issue after latest commits to this PR.

If someone else has tried the code from this PR as it currently exists and still gets wrong number of tensors at load time (remember to quant with the conversion script from this PR, and also inference code built from this PR), lemme know. I'm currently just assuming that I do not know how to use a computer and messed up something and somehow my conversion did not use this new code.

Got exactly the same error: llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 613, got 491.

@Noeda
Contributor

Noeda commented Apr 16, 2025

Did some tests on logit computations with the code from this PR, and I'm stepping off a bit but wanted to braindump in case someone's also reading and reviewing:

A .gguf with f32 data types throughout (--outtype f32 to the conversion script) almost entirely agrees with the Huggingface version on raw logits when I tested it with a prompt of about 3k tokens or less.

But I noticed it was significantly diverged (visibly different ordering of logits) compared to transformers implementation when I fed it a prompt with 13939 tokens. Like way more than I'd expect when I saw 3k and shorter to have almost identical values. Kind of suggests a cutoff point or something, maybe RoPE diverges at some cutoff point. I didn't immediately see anything stand out when reading the Glm4 code on HF side, but I'll do that next when I come back later today. I haven't tested prompts between 3k and ~14k so I don't know if there is some kind of clean cutoff point that diverges.

For quants I didn't look too deeply. Q8 and IQ4_XS with just a single-token prompt also looked quite diverged from the f32 gguf or HF version, but I don't know what a typical drift in raw logit values is when quantizing, so maybe that's actually completely normal and expected (I could check with a different model, but I'll focus first on checking if inference is correct with f32). With IQ4_XS I saw it obviously changed the token ordering a lot.

IQ4_XS logits at a 13939 token prompt don't look numerically unstable (I don't see infinities or NaNs or unusually big numbers), but they differ from the f32 gguf and the Huggingface implementation by a lot.

Tl;dr: right now it looks like at low context lengths (the longest I tested that's fine is around ~3k tokens), f32 .gguf inference matches closely enough that I think it is doing the same computation and all is good. But something disagrees more than a little with the Huggingface version when you get to longer prompts. Not sure if it's an actual implementation bug, or e.g. precision problems (maybe on the HF side rather than the llama.cpp side), or something else.

I've been testing the GLM-Z1-9B-0414 for all of this, and Mac Studio M2 Ultra 192GB with Metal backend. The tokenization can also disagree with transformers but that's somewhat expected; I had llama.cpp do the tokenization for my tests to make sure I'm comparing the supposed exact same computation.

I'll be back later today to see deeper if something is actually wrong or not with inference part. I've liked this model a lot (MIT licensed, smart and fast) so I want it to infer correctly :) DeepSeek has been my go-to but good lord it is a chonk and hard to run.

@piDack
Contributor Author

piDack commented Apr 16, 2025

Edit: Nevermind me, I fixed it. I had shared libraries setup so that it still pulled in older code. Just learning baby steps on how to computer here 🤦

Doesn't seem to work for the 32B models (specifically GLM-4-32B-0414) ?

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 613, got 491

Puzzlingly, just now after I finish requanting everything with this PR as opposed to my hacks, I'm getting wrong tensor numbers, despite using only code from this PR and converting fresh Huggingface files. I'm inclined to think I accidentally messed up though on my side because comment history suggests nobody else had an issue after latest commits to this PR.

If someone else has tried the code from this PR as it currently exists and still gets wrong number of tensors at load time (remember to quant with the conversion script from this PR, and also inference code built from this PR), lemme know. I'm currently just assuming that I do not know how to use a computer and messed up something and somehow my conversion did not use this new code.

Got exactly the same error: llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 613, got 491.

I believe that some parts of the C++ code haven’t been properly updated, especially lines 3463-3499 in llama-model.cpp. Could you please take another look at the code?

@Noeda
Contributor

Noeda commented Apr 16, 2025

I believe that some parts of the C++ code haven’t been properly updated, especially lines 3463-3499 in llama-model.cpp. Could you please take another look at the code?

Ah yeah, see my edit, it was just me being dumb with my setup. I got the code working and the models running and was doing inference comparisons. Mostly looks good but something might be off with longer contexts. I'll be back later to investigate that if I find anything or if that is simply some non-issue.

@Noeda
Contributor

Noeda commented Apr 16, 2025

Edit from the future: I think what I found below is a transformers bug and not a problem in this PR; see my comment further down, #12957 (comment).


Something is diverging from HF implementation at around ~11k tokens for the Z1-9B reasoning model.

[screenshot: plot of average absolute logit difference vs. prompt token count, diverging past ~11k tokens]

That explains my good result at 3k tokens and the diverging-looking result at my 13k token test. The test makes a random prompt of English words, runs each model against the HF model, and compares the raw computed logit values; each data point in the plot is the average absolute difference. I had my program collect the results into an sqlite3 file and I just plotted the data out of it.
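The metric itself is nothing fancy; roughly the following, as a minimal sketch rather than the actual test harness, assuming both models' raw logits for the same position are already available as arrays:

import numpy as np

def mean_abs_logit_diff(logits_a, logits_b) -> float:
    # Average absolute difference between two models' raw logits for the same
    # prompt position; inputs are 1-D arrays over the vocabulary.
    a = np.asarray(logits_a, dtype=np.float64)
    b = np.asarray(logits_b, dtype=np.float64)
    assert a.shape == b.shape
    return float(np.mean(np.abs(a - b)))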

Some example logits, here's at 8k which is in the good range:

sqlite> SELECT prompt_ntokens, model, token_id, logit FROM results WHERE token_id >= 55000 AND token_id <= 55002 AND prompt_ntokens = 8370;
┌────────────────┬────────────────────────────────────────────────────┬──────────┬───────────────────┐
│ prompt_ntokens │                       model                        │ token_id │       logit       │
├────────────────┼────────────────────────────────────────────────────┼──────────┼───────────────────┤
│ 8370           │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf │ 55000    │ -2.40991          │
│ 8370           │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-iq4.gguf │ 55000    │ -2.11439          │
│ 8370           │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-q8.gguf  │ 55000    │ -2.41399          │
│ 8370           │ THUDM_GLM-Z1-9B-0414-HF                            │ 55000    │ -2.40963172912598 │
│ 8370           │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf │ 55001    │ -2.34917          │
│ 8370           │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-iq4.gguf │ 55001    │ -1.89961          │
│ 8370           │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-q8.gguf  │ 55001    │ -2.36817          │
│ 8370           │ THUDM_GLM-Z1-9B-0414-HF                            │ 55001    │ -2.34802484512329 │
│ 8370           │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf │ 55002    │ -3.55467          │
│ 8370           │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-iq4.gguf │ 55002    │ -3.67618          │
│ 8370           │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-q8.gguf  │ 55002    │ -3.53828          │
│ 8370           │ THUDM_GLM-Z1-9B-0414-HF                            │ 55002    │ -3.5534782409668  │
└────────────────┴────────────────────────────────────────────────────┴──────────┴───────────────────┘

And then from bad range:

┌────────────────┬────────────────────────────────────────────────────┬──────────┬───────────────────┐
│ prompt_ntokens │                       model                        │ token_id │       logit       │
├────────────────┼────────────────────────────────────────────────────┼──────────┼───────────────────┤
│ 15047          │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf │ 55000    │ -2.25476          │
│ 15047          │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-iq4.gguf │ 55000    │ -2.08806          │
│ 15047          │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-q8.gguf  │ 55000    │ -2.27485          │
│ 15047          │ THUDM_GLM-Z1-9B-0414-HF                            │ 55000    │ -4.42773866653442 │
│ 15047          │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf │ 55001    │ -4.18723          │
│ 15047          │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-iq4.gguf │ 55001    │ -3.79083          │
│ 15047          │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-q8.gguf  │ 55001    │ -4.19861          │
│ 15047          │ THUDM_GLM-Z1-9B-0414-HF                            │ 55001    │ -4.73367071151733 │
│ 15047          │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf │ 55002    │ -3.8511           │
│ 15047          │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-iq4.gguf │ 55002    │ -3.74097          │
│ 15047          │ /Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-q8.gguf  │ 55002    │ -3.83578          │
│ 15047          │ THUDM_GLM-Z1-9B-0414-HF                            │ 55002    │ -3.59954929351807 │
└────────────────┴────────────────────────────────────────────────────┴──────────┴───────────────────┘

I don't know yet what the cause is. I'm going to start snooping around the HF implementation numerically to try to pinpoint it.

Looking at the numeric values, the "bad" cutoff might be between 10984 and 11802. This doesn't quite tell me if it's a "clean" boundary or more like some kind of phase shift tipping point. I don't know any "special" numbers between 10984 and 11802; I was hoping I'd see a cut at 4096 or 8192 or some nice number like that.

/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 849 0.0011352977284103355
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 1692 0.0012289667627673776
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 2487 0.0009613088225140756
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 3400 0.001253484705386771
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 4296 0.0012415923255001955
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 5052 0.0007285450231332448
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 5889 0.001131771748614871
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 6708 0.0007279400702318527
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 7510 0.0018617954561399098
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 8370 0.001000039795933493
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 9305 0.0006526900218163243
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 10003 0.0008061437273280331
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 10984 0.0010549964518824849
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 11802 0.39085644632539013
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 12583 0.762285037259994
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 13488 1.6690417900252674
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 14286 1.1336986667465752
/Volumes/T9/thudm_pr_12957/glm-z1-9b-0414-f32.gguf 15047 1.7157939414367127

The Q8 quant and f32 continue to agree with each other, but the Huggingface model starts disagreeing. That's a tiny slice of the entire token space, but when I eyeball it around, sometimes the HF values are very different from all the llama.cpp ones.

Does anyone have an idea whether there is anything special between ~11000 and ~12000 tokens? I ran the llama.cpp part again with a bigger overall context just to make sure my own context length wasn't simply set up wrong, but got the same result.

I'm testing the same thing with CPU only and will update to confirm that this isn't a Metal-specific issue (everything in this test ran on an M2 Ultra on a big Mac Studio).

@piDack
Contributor Author

piDack commented Apr 16, 2025

@jussitus are you using a version of llama.cpp built with this pull request and a model converted with the convert_hf_to_gguf.py from this pull request?
I haven't done corrected 32B quants since it's too much for my poor graphics card, don't know if anyone else has.

Ah, I just pulled the conversion script. I'll build it again, thanks.

Got it running. Tried both f16 and a quant (IQ4_XS). Long prompts result in either GGGGGGGGG or proper gibberish. This is on vulkan (radeon 7800xt). With -dev none (i.e. on the cpu) I can throw anything at it and get a coherent response.

edit: Are you absolutely sure this fixes the underlying problem? Cause I can get Z1-9B to behave and translate Finnish texts at length (using reasoning) with --override-kv glm4.rope.dimension_count=int:64 alone (Bartowski's quant).

I am using a 1K input with the 32B model at int4 quantization, and I did not encounter the GGGG issue.

[images: sample outputs from the 32B int4 quant]

@Noeda
Contributor

Noeda commented Apr 16, 2025

Tl;dr: I am now reasonably confident that the implementation in this PR is correct (correct meaning the computation matches transformers), and the remaining bugs are platform-specific 👍! Good job @piDack!


Some details:

Replying to myself:

Something is diverging from HF implementation at around ~11k tokens for the Z1-9B reasoning model.

I think there is no bug in this PR (computation-wise), and the divergence in my previous comment is instead a bug in transformers.

I tried these tests at ~13500 tokens:

When I ran all models (including HF/transformer) on CPU: they all agreed.

When I ran llama.cpp on CPU but the HF/transformer on Metal: they diverged.

When I ran llama.cpp on CPU and Metal and compared them (well I only ran Q8 for this one because CPU takes so long), they agreed. So llama.cpp does not change behavior between CPU/Metal at long contexts.


This tells me that HF/transformers changes behavior somewhere at 11k+ tokens between CPU and Metal backend (or mps in Torch). So not a llama.cpp problem, not on Mac computers. (I hope.)
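For reference, a minimal sketch of how such a CPU-vs-mps comparison can be done on the transformers side (this is not the exact harness used above; the model id and the availability of native GLM4 support in a recent transformers release are assumptions):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/GLM-Z1-9B-0414"  # assumed Hugging Face id

def last_token_logits(device: str, input_ids: torch.Tensor) -> torch.Tensor:
    # Load in float32 and run the same prompt on the requested device,
    # returning the raw logits for the final position.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float32).to(device)
    with torch.no_grad():
        out = model(input_ids.to(device))
    return out.logits[0, -1].float().cpu()

# tok = AutoTokenizer.from_pretrained(MODEL_ID)
# ids = tok(long_prompt, return_tensors="pt").input_ids
# diff = (last_token_logits("cpu", ids) - last_token_logits("mps", ids)).abs().mean()
# A large diff appearing only past ~11k tokens would point at the mps path, not llama.cpp.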

Sometimes when I've worked with torch on a Mac and there's some bug, things broke very visibly with zeros or NaNs or infinities, but in this case the transformers implementation does not seem to break that way, at least not that I saw.

I unfortunately don't have an explanation for @jussitus's GGGG/possible-NaN issue and I cannot reproduce it, but this perhaps narrows things down more confidently: it is likely a Vulkan/AMD-specific issue.

I've seen a lot of bugs over time when using the Mac's Metal mps as the torch device, sometimes completely silent corrupting bugs, so it would not surprise me if transformers has an issue there.

@piDack
Contributor Author

piDack commented Apr 16, 2025

Who can review and merge the code? I noticed that the CI errors are due to network issues.

@Noeda
Contributor

Noeda commented Apr 16, 2025

I'm probably not going to be available to come back to review this beyond what I've reviewed so far, but IMO it might be wise to scope this PR to the half-rope conversion problem and the token issues, and consider the GGGG/NaN output problem separately, unless there's an obvious, easy-to-find reason why it happens? (Okay, I keep saying NaN, but I don't know if there are actually NaNs involved because I'm not able to reproduce it.)

Edit: maybe as clarification, the "GGGG" problem I think is an AMD/Vulkan-specific problem that I don't think this PR addresses, but I am not sure. @jussitus is hitting this issue, and it looks like @matteoserva does as well judging by #12946 (AMD mentioned). Does this PR actually solve the AMD/Vulkan-specific issue? If it doesn't, maybe either remove the claim that the problem is fixed from the title, or get confirmation that the new code actually works on AMD/Vulkan. I only verified that Metal/CPU works on my part.

The incorrect RoPE use seems to be the most pressing issue that's directly causing the issues described in #12946

And the last issue I'm aware of: the model itself maybe needs to specify its BOS token in tokenizer_config.json, or a hack needs to be added on the llama.cpp conversion side. I'm not sure what the BOS token is supposed to be for this model (I myself hacked [gMASK] to be the BOS token and removed it from the chat template). The /props endpoint seems to break in the server (details are in the issue).

@matteoserva
Contributor

matteoserva commented Apr 16, 2025

@Noeda Yes, I had the GGG problem with llama.cpp compiled for AMD, even if running with -ngl 0. No problem with llama.cpp compiled without AMD libraries.

This PR fixed the bug for me at this point in time: #12946 (comment)

Right now I'm unable to test the PR with the additional commits.


@piDack I think @ggerganov can help in solving the issue with the spurious errors in the CI

@Beinsezii

Beinsezii commented Apr 16, 2025

This PR fixed the bug for me at this point in time: #12946 (comment)

Still appears completely broken on rocm and vulkan regardless of whether FA is used or not, though it's >")>")>") now instead of GGG. Using --ngl 0 on either build results in gibberish.

Even tried making a quant with untouched BF16 embed/outputs to try and avoid the values blowing up
https://huggingface.co/Beinsezii/THUDM_GLM-4-32B-0414_PR12957_Q_4_K_C

If I clean build llama.cpp with only native avx and no GPU of any kind it does indeed run, just very slowly.

@piDack piDack changed the title from "Resolved half rope, multi-EOS issues in convert_hf_to_gguf.py and addressed GGGG output problem for GLM4Z Model" to "Resolved half rope, multi-EOS issues in convert_hf_to_gguf.py for GLM4Z Model" Apr 16, 2025
@Noeda
Contributor

Noeda commented Apr 16, 2025

Has this been tested against the older ChatGLM models? The refactoring makes them use the same codepath.

@Noeda
Contributor

Noeda commented Apr 16, 2025

Another question; this looks like it's not a concern, but asking anyway: is there a risk that any existing glm4 models would break? I see "glm4" mentioned as early as July 2024.

(Worried about breaking old models by llama.cpp no longer recognizing their .ggufs, but git log -S LLM_ARCH_GLM4 suggests that GLM4 as a separate architecture with its own architecture string is recent, added April 11.)

@arch-btw
Contributor

Those are helpful comments @Noeda; that does indeed lead to a problem with the older ChatGLM models:

glm-edge-4B-chat-Q5_K_M.gguf

PR:

llama_model_load: error loading model: missing tensor 'blk.0.post_ffw_norm.weight'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model
main: error: unable to load model

Master:

hello
Hello! How can I assist you today?
...

I also wonder how (or if) the convert_hf_to_gguf_update.py was properly used as you mentioned.

The PR does work for me with the new 9B model and the new 9B "Z" model. Even with Vulkan on AMD. But I haven't tested the 32B.

@Noeda
Contributor

Noeda commented Apr 16, 2025

The PR does work for me with the new 9B model and the new 9B "Z" model. Even with Vulkan on AMD. But I haven't tested the 32B.

Yep, on my part I'm somewhat confident that the computation graph parts are correct for the new GLM4 release (they match transformers, and I went quite far checking that), but thanks for checking on ChatGLM; looks like that should be addressed.

For the numerical GGGG/garbage output issues, the only common link I see in the reports is AMD GPUs, but this pull request doesn't visibly touch anything that looks platform-specific to my eyes.

I suppose I also didn't test the 32B with the more thorough method I did for the 9B Z1 model. Empirically it looks fine though on Metal.

@piDack
Contributor Author

piDack commented Apr 17, 2025

Those are helpful comments @Noeda; that does indeed lead to a problem with the older ChatGLM models:

glm-edge-4B-chat-Q5_K_M.gguf

PR:

llama_model_load: error loading model: missing tensor 'blk.0.post_ffw_norm.weight'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model
main: error: unable to load model

Master:

hello
Hello! How can I assist you today?
...

I also wonder how (or if) the convert_hf_to_gguf_update.py was properly used as you mentioned.

The PR does work for me with the new 9B model and the new 9B "Z" model. Even with Vulkan on AMD. But I haven't tested the 32B.

Fixed the older 9B inference error.

@piDack
Contributor Author

piDack commented Apr 17, 2025

@ngxson Hi, could you please help review and merge the code? Thank you very much. Many people have already tested this code. The CI error seems to be more of a network issue. Thanks!

@fangyinc

Hi @piDack . I'm using your branch for testing and found an error. It currently can't recognize the GLM-4-32B-0414 model. The log is as follows:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'glm4'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/data/dl/llm/THUDM/GLM-4-32B-0414-Q4_k_m.gguf'
srv    load_model: failed to load model, '/data/dl/llm/THUDM/GLM-4-32B-0414-Q4_k_m.gguf'
srv    operator(): operator(): cleaning up before exit...

@piDack
Contributor Author

piDack commented Apr 18, 2025

Hi @piDack . I'm using your branch for testing and found an error. It currently can't recognize the GLM-4-32B-0414 model. The log is as follows:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'glm4'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/data/dl/llm/THUDM/GLM-4-32B-0414-Q4_k_m.gguf'
srv    load_model: failed to load model, '/data/dl/llm/THUDM/GLM-4-32B-0414-Q4_k_m.gguf'
srv    operator(): operator(): cleaning up before exit...

You need to reconvert the model using my branch. I suppose this model of yours was converted by someone else, right? I reused the chatglm architecture instead of using the new glm4 architecture.

@piDack
Contributor Author

piDack commented Apr 18, 2025

I will export the gguf later and put it on Hugging Face for everyone to use.

@eakkawat

It works. Thanks.

@despairTK

Hi, could you please help review and merge the code? Thank you very much. @ggerganov

@ggerganov
Member

  • What is the reason to remove GLM4? Changing the arch enum seems like a breaking change, so even if the GLM4 graphs/models are no longer needed, the enum should still remain as it is.
  • Are there any remaining generation problems where the output becomes "GGGGGG..."? If so, it means there are floating-point range / precision issues that remain to be resolved and it would be good to document how to reproduce these.
  • Can you take a look at how the convert_hf_to_gguf_update.py script works and avoid manually editing the checksums in convert_hf_to_gguf.py? (A rough sketch of what that script computes follows below.)
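For reference, a rough sketch of the kind of checksum that script derives, as I understand it (the probe text is elided here, and the model id in the usage comment is an assumption, not something this PR prescribes):

import hashlib
from transformers import AutoTokenizer

CHK_TXT = "..."  # placeholder for the fixed probe string the update script uses

def tokenizer_checksum(model_id: str) -> str:
    # Hash the token ids produced for the probe text; convert_hf_to_gguf.py
    # matches a checksum like this in get_vocab_base_pre to pick the right
    # pre-tokenizer, so it should come from the update script, not hand edits.
    tok = AutoTokenizer.from_pretrained(model_id)
    ids = tok.encode(CHK_TXT)
    return hashlib.sha256(str(ids).encode()).hexdigest()

# e.g. tokenizer_checksum("THUDM/GLM-4-9B-0414")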

@Beinsezii

Beinsezii commented Apr 19, 2025

Are there any remaining generation problems where the output becomes "GGGGGG..."? If so, it means there are floating-point range / precision issues that remain to be resolved and it would be good to document how to reproduce these.

On my 7900 XTX running the 32B, both vulkan and rocm produce what looks like either infs or nans with ngl=99, and gibberish with ngl=0. It does not matter whether or not FA is used. The issue is not observed on a 3090 + CUDA, or using a pure CPU build.

@piDack
Contributor Author

piDack commented Apr 19, 2025

  • What is the reason to remove GLM4? Changing the arch enum seems like a breaking change, so even if the GLM4 graphs/models are no longer needed, the enum should still remain as it is.

I believe that the changes in the new arch are minimal, so small that it's unnecessary to create a whole new arch. Only two norm layers were added. Reusing code seems to be a more concise approach. Additionally, looking at the convert_hf_to_gguf.py script, it's clear that the new and old architectures share even more reusable components.
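To make the "two norm layers" concrete: the new GLM4-0414 blocks appear to use a sandwich-norm layout, normalizing the attention and FFN outputs before each residual add, which is also why older ChatGLM/GLM-edge checkpoints lack tensors like post_ffw_norm. A rough sketch with illustrative names (not the exact HF or llama.cpp tensor names):

import numpy as np

def rms_norm(x, weight, eps=1e-5):
    # Plain RMSNorm: scale by the reciprocal RMS, then by a learned weight vector.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def glm4_0414_style_block(x, attn, ffn, w):
    # Sandwich-norm layout: in addition to the usual pre-norms, the attention
    # and FFN outputs are normalized before each residual add. The two post_*
    # weights are the extra tensors that older ChatGLM / GLM-edge checkpoints
    # do not contain (hence the missing blk.0.post_ffw_norm.weight error above).
    h = x + rms_norm(attn(rms_norm(x, w["pre_attn_norm"])), w["post_attn_norm"])
    return h + rms_norm(ffn(rms_norm(h, w["pre_ffn_norm"])), w["post_ffn_norm"])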

@piDack
Contributor Author

piDack commented Apr 19, 2025

@ggerganov I have revised the code in #13021 based on the master branch, which avoids breaking changes and makes it more merge-friendly.

@Noeda
Contributor

Noeda commented Apr 21, 2025

It's an anecdote, but on the localllama subreddit I read another report of gibberish output; I asked if it was an AMD GPU and the answer was yes (supporting the idea that there is some AMD-GPU-specific issue).

I saw the person mention the llama-b5165-bin-win-vulkan-x64 build (from April 21, today as I'm writing this), and that -ngl 0 did not fix it (similar to the reports from @Beinsezii and @matteoserva). This one is a Windows build. The person had been using the command-line overrides to work around the other inference bugs, although I don't know if they've tried an entirely CPU-only build like @Beinsezii did to see if that fixes anything. They reported that a CPU-only build works: "llama-b5165-bin-win-avx2-x64 no vulkan version works for now. Thanks for the support!" The person, I believe, was not using code from this PR (they are using official builds, I think, with the rope fixes applied via command-line overrides, and bartowski .ggufs made from main branch code).

I've tried the models myself without any changes to the inference code compared to main branch, and they were coherent as long as I applied the rope fixes in some way.

The Reddit person's report also sounds like it wasn't 100% broken and there was a little coherence (not enough to actually use it, but evidence it's not totally broken):

Usually i don't get this problem. And what is super "funny" and annoying is that it does
that exactly with my test promts. When i just say "Hi" or something, it works. 
But when i copy paste some reasoning question, it outputs "Jds*#DKLSMcmscpos(#R(#J#WEJ09..."

I find it weird that not even -ngl 0 seems to fix it for some; there are now two separate reports of that (one from the discussion here, one off Reddit). My second guess is that if the AMD GPU is not bad on its own, then something with the Vulkan build is bad.

I remember back when I got interested in the Command-R models, there was an overflow bug that surfaced because some model in that family was the first one to go over the 32-bit integer limit in tensor size, or something like that; my memory isn't that great. I don't see any obvious low-hanging fruit in this model family that looks suspicious to me, though. And on this PR I found a bug like this, but it was in transformers on the Metal backend instead. This smells like it might be something similar, just on AMD GPU/Vulkan.


Tangential question:

Is there a command we could ask people to run off llama.cpp to get more detailed debugging info than just "gibberish output"? IIRC one of the overflow bugs I remember from the past had the logit values all zeroes after a certain point in the output. When I was interacting with the Reddit person I wondered whether there exists some easy, low-effort command that would sanity-check a few things that I could ask someone to run, but I didn't really know of any.
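Something like the following is the kind of low-effort check I have in mind; a rough sketch only, assuming the raw logits can be dumped to a .npy file somehow (the dump path and shape are hypothetical, e.g. produced with a small debug patch):

import numpy as np

def sanity_check_logits(path="logits_dump.npy"):
    # Flag the failure modes seen in past overflow bugs: NaNs, infinities,
    # or positions whose logits are all exactly zero.
    logits = np.load(path)  # assumed shape: (n_positions, n_vocab)
    print("NaNs:", int(np.isnan(logits).sum()))
    print("Infs:", int(np.isinf(logits).sum()))
    zero_rows = np.where(~logits.any(axis=-1))[0]
    print("All-zero positions:", zero_rows[:10], "(+ more)" if len(zero_rows) > 10 else "")

# sanity_check_logits()  # run against a dump taken during a "gibberish" generation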


Are there any remaining generation problems where the output becomes "GGGGGG..."? If so, it means there are floating-point range / precision issues that remain to be resolved and it would be good to document how to reproduce these.

Verifying whether "this is an AMD GPU bug" is correct:

Reproducing would be:

  1. Have a machine with an AMD GPU.
  2. Get a clean build off the llama.cpp main branch and make sure it has the Vulkan backend; you can download an official build. I'd use main branch to confirm it's not the changes in this particular PR.
  3. Get some GLM4 .ggufs. If you use main branch, I think you need some of the older glm4-marked files and then use the override settings to fix rope manually. Or you use this PR code and then use e.g. https://huggingface.co/matteogeniaccio/GLM-4-32B-0414-GGUF-fixed (I haven't tested myself but the ggufs I believe are meant to use with this PR).
  4. Try to infer something with a prompt of at least 1000 tokens (from the anecdotes I've read, I'm thinking super short prompts might not visibly break).
  5. Compare with a build that doesn't have GPU stuff even compiled in, and do CPU inference.

Alternatively, if the Vulkan build is bad in some way, then maybe simply trying this model family with a Vulkan build will reproduce this and step 1 is irrelevant. I haven't tried, but if I don't get distracted and I have the time, I may later go and test a Windows Vulkan build matching the redditor's anecdote.

It would be good to check this because it could show there's an issue in llama.cpp itself that could be silently failing in other stuff without anyone having noticed (or it hasn't mattered until this model family came along).

@mrdevolver

mrdevolver commented Apr 22, 2025

Hello, Windows 10 64-bit user here. I have an AMD Ryzen 7 2700X 8c/16t CPU, 16GB of RAM and an AMD Radeon RX Vega 56 GPU with 8GB of VRAM. Usually I use LM Studio and I am able to run models up to 32B (32B only at something like Q2_K); that's probably as high as I can go with this hardware. I can help with testing if needed.

Can do:
download suggested model from hugging face for testing
download pre-compiled llamacpp and run with given instructions and parameters if someone could write them for me

Can't do:
compile llamacpp myself

Please note that I'm not familiar with llamacpp itself, nor its parameters, because I usually use LM Studio or KoboldCpp for inference, and they are used differently, so if you could fill in the gaps for me, I'd be happy to help, especially because I am interested in GLM models myself.

@piDack
Contributor Author

piDack commented Apr 22, 2025

I originally intended to upload the ggufs to Huggingface, but it always failed (I tried using LFS and other methods; it may be that my account cannot upload files over 10 GB). Maybe it's due to network issues or something else. So I uploaded them to ModelScope, which has a friendlier network for me. I've also seen many people on Reddit and in this PR fix these ggufs and upload them to Huggingface; their work is also great. If anyone has tested those ggufs and found that they work well on the main branch, please let me know, and I will add them here.

GLM4-32B GGUF(Q4_0,Q5_K_M,Q8_0)-> https://www.modelscope.cn/models/pcdack/glm-4-0414-32b-chat-gguf/files
GLM4-Z1-32B GGUF -> https://www.modelscope.cn/models/pcdack/glm-4Z-0414-32b-chat-gguf/files
GLM4-9B GGUF -> https://www.modelscope.cn/models/pcdack/glm4-0414-9B-chat-gguf/files

@hexitex

hexitex commented Apr 22, 2025

I can tie these models up from the main website http://chat.z.ai - so you are maybe chasing a flaw; see the output below. I can get both the Z and rumination models going in circles for a while:

calculate sidereal time for a python implementation on the raspberry pi to control stepper motors

Z1-32B (thinking; "Thinking stopped"):
Okay, I need to calculate sidereal time for a Raspberry Pi project that controls stepper motors. Let me start by understanding what sidereal time is. From what I remember, sidereal time is related to the Earth's rotation relative to the fixed stars, not the Sun. It's used in astronomy for pointing telescopes, so maybe the stepper motors are part of a telescope mount.

First, I should figure out how to compute sidereal time. The user wants a Python implementation, so I'll need to translate the formulas into code. Let me recall the formulas. Sidereal time at a specific location and time can be calculated using the Greenwich Sidereal Time (GST) and then adjusting for the longitude to get Local Sidereal Time (LST).

The steps I think are involved are:

Calculate the Julian Date (JD) for the current time.
Compute the Julian Century (T) based on JD.
Use T to find the Greenwich Mean Sidereal Time (GMST).
Adjust GMST by the observer's longitude to get LST.
Wait, but how do I get the current time in Python? Using the datetime module, but I need to handle UTC time. Also, the longitude is important because the sidereal time varies with location. The user might need to input their longitude, or maybe it's hardcoded.

Let me check the formulas. The formula for GMST in radians is something like:

GMST = (18.697374558 + 24.06570982441908 * D) % 24

Where D is the number of days since J2000.0 (January 1, 2000, 12:00 UTC). But I need to convert that into hours, minutes, seconds, or radians for the stepper motor control.

Alternatively, there's a more precise formula involving the Julian Ephemeris Date (JDE) and some trigonometric terms. But maybe for a Raspberry Pi project, a simplified approach is sufficient. However, accuracy might be important for precise motor control, so I should check which formula is appropriate.

Another approach is using the formula from the US Naval Observatory:

GMST = 6.697374558 + 0.06570982441908 * D0 + 1.00273790935 * H + 0.000026 * T^2

Where D0 is the Julian date at 0h UT, H is the hour of the current time, and T is the number of centuries since J2000.0.

Hmm, this seems more detailed. Let me break it down. First, get the current UTC time. Then compute D0 as the Julian date at 0h UT of the current day. Then compute D as D0 + H/24. Then T is (JD - 2451545.0)/36525.0, where JD is the current Julian date.

Wait, maybe I should look up a standard Python implementation for sidereal time. There's a library called astropy that can do this, but maybe the user wants a self-contained solution without external dependencies. Since the project is on a Raspberry Pi, installing astropy might be possible, but perhaps the user prefers a lightweight script.

Assuming they want a custom implementation, I'll proceed with the formulas. Let me outline the steps again:

Get current UTC time as a datetime object.
Convert this to Julian Date (JD). The formula for JD from datetime is known. For example, JD = (datetime_to_jd(datetime_object)), which can be calculated using the formula involving the year, month, day, hour, etc.
Compute the Julian Ephemeris Date (JDE) if necessary, but maybe for simplicity, use JD.
Calculate T = (JD - 2451545.0) / 36525.0. This is the number of Julian centuries since J2000.0.
Compute GMST using the formula involving T. The formula from the USNO is:
GMST = 6.697374558 + 0.06570982441908 * D0 + 1.00273790935 * H + 0.000026 * T^2

Wait, but D0 is the Julian date at 0h UT of the current day. So if the current time is, say, 14:30 UTC on 2023-10-05, then D0 is the JD for 2023-10-05 00:00:00 UTC. Then H is the hour of the current time (14.5 in this case). Then T is (JD - 2451545.0)/36525.0.

Alternatively, another formula I found is:

GMST = 280.46061837 + 360.98564736629 * (JD - 2451545) + 0.000387933 * T^2 - T^3 / 38710000

But this gives GMST in degrees, which can be converted to hours by dividing by 15. Then take modulo 24 to get it within 0-24 hours.

Wait, but which formula is more accurate? The USNO formula might be more precise. Let me check the coefficients.

Alternatively, the formula from Meeus's Astronomical Algorithms:

GMST = 18h 41m 50.54841s + 8640184.812866s * T + 0.093104s * T^2 - 0.000006s * T^3

Then convert this total seconds to hours, then modulo 24.

But converting this into a Python function requires handling the time components.

Alternatively, perhaps using the first formula I mentioned, which is simpler. Let me try to structure the code step by step.

First, get the current UTC time. Using datetime.utcnow() and then converting to UTC if necessary. Wait, datetime.utcnow() gives the local time, but if the system is set to UTC, then it's UTC. Alternatively, use datetime.now(timezone.utc) to get UTC time.

But handling time zones might complicate things. The user might need to input their longitude and the current time in UTC. Alternatively, the code can get the current UTC time.

So, code steps:

Import necessary modules: datetime, math.
Get current UTC time as a datetime object.
Convert this datetime to Julian Date (JD). The formula for JD from a datetime object can be found. For example, the formula from the USNO:
a. Calculate the number of days since noon UTC on January 1, 4713 BC.
b. The formula is:
JD = (367 * year) - (7 * (year + (month + 9) // 12) // 4) + (275 * month) // 9 + day + 1721013.5 + (hour + minute/60 + second/3600)/24
But this might not account for leap years correctly. Alternatively, use a function to compute JD from a datetime object.
Alternatively, use the jdcal module from the standard library, which has functions to convert datetime to JD. For example:
from jdcal import gcal2jd
But jdcal's gcal2jd function takes year, month, day and returns the Julian date as a tuple (j1, j2) where j1 + j2 is the JD. However, this might not include the time components (hours, minutes, seconds). So, perhaps better to compute JD manually.
Let me find a reliable way to compute JD from a datetime object.
The formula from the USNO for converting a Gregorian calendar date to JD is:
a = (14 - month) // 12
y = year + 4800 - a
m = month + 12a - 3
For a date in Gregorian calendar:
JD = day + (153m + 2)//5 + 365y + y//4 - y//100 + y//400 - 32045 + (hour - 12)/24 + minute/1440 + second/86400
Wait, but this formula might need adjustment. Let me check.
Alternatively, here's a function to compute JD from a datetime object:
def datetime_to_jd(dt):
a = (14 - dt.month) // 12
y = dt.year + 4800 - a
m = dt.month + 12a - 3
jd = dt.day + (153m + 2)//5 + 365y + y//4 - y//100 + y//400 - 32045
jd += (dt.hour - 12) / 24 + dt.minute / 1440 + dt.second / 86400 + dt.microsecond / 86400000000
return jd
This should give the Julian Date for the given datetime in UTC.
Once JD is obtained, compute T = (JD - 2451545.0) / 36525.0
Compute GMST in hours using the formula:
GMST = 6.697374558 + 0.06570982441908 * D0 + 1.00273790935 * H + 0.000026 * T2
Wait, but D0 is the Julian date at 0h UT of the current day. So D0 = floor(JD) - 0.5 ?
Wait, no. Let me think. If JD is the current Julian date, then D0 is the JD at 0h UT of the same day. For example, if the current time is 2023-10-05 14:30:00 UTC, then D0 is the JD for 2023-10-05 00:00:00 UTC.
So to compute D0, we can take the current datetime, set the time to 00:00:00, and compute its JD. Alternatively, since JD increases by 1 each day, D0 = JD - (hour + minute/60 + second/3600)/24
For example, if JD is 2460000.6 (which is 0.6 days after the previous noon), then D0 would be 2460000.0 (since 0.6 days is 14.4 hours, so the 0h UT is 2460000.0 - 0.5 + 0.5 = 2460000.0? Wait, maybe I'm getting confused with the JD definition.
JD is defined as the number of days since noon UTC on January 1, 4713 BC. So noon UTC is 12:00:00, which is JD 0.0. So a datetime of 00:00:00 UTC on a certain day is JD n + 0.0, and 12:00:00 is n + 0.5.
Wait, no. Let me check: For example, the Julian date for 2023-10-05 00:00:00 UTC is 2460000.0 (hypothetically), then 12:00:00 UTC is 2460000.5. So to get D0 (the JD at 0h UT of the current day), we can take the integer part of (JD + 0.5) and subtract 0.5. For example, if JD is 2460000.6, then JD + 0.5 = 2460001.1, integer part is 2460001, subtract 0.5 gives 2460000.5. Wait, that's not right. Maybe another approach.
Alternatively, take the current datetime, set the time to 00:00:00, and compute its JD. That would be D0. So for example, if the current datetime is 2023-10-05 14:30:00 UTC, then D0 is the JD for 2023-10-05 00:00:00 UTC.
So in code, to get D0, we can create a new datetime object with the same date but time 00:00:00 UTC, then compute its JD.
So:
from datetime import datetime, timedelta
current_dt = datetime.utcnow(timezone.utc)
d0_dt = datetime(current_dt.year, current_dt.month, current_dt.day, 0, 0, 0, tzinfo=timezone.utc)
jd = datetime_to_jd(current_dt)
jd_d0 = datetime_to_jd(d0_dt)
D0 = jd_d0 - 0.5 # Because JD is days since noon, so 00:00:00 is JD - 0.5?
Wait, no. Let's test with an example. Suppose current_dt is 2023-10-05 12:00:00 UTC. Then jd for this time would be, say, 2460000.5 (since noon is 0.5 days). Then d0_dt is 2023-10-05 00:00:00 UTC. Its JD would be 2460000.0 (since 00:00:00 is 0.5 days before noon, so 2460000.0 - 0.5 = 2459999.5? Wait, I'm getting confused.
Let me look up the exact formula. The Julian Date for 2000-01-01 12:00:00 UTC is 2451545.0. So 2000-01-01 00:00:00 UTC would be 2451544.5.
So, the formula is: JD at 00:00:00 UTC is JD_noon - 0.5, where JD_noon is the JD at 12:00:00 UTC.
Therefore, to compute D0 (the JD at 0h UT of the current day), we can take the current JD, subtract the fraction of the day since 00:00:00 UTC, then subtract 0.5.
Wait, perhaps a better approach is:
Given a datetime object, the JD can be computed as:
JD = (number of days since 4713-01-01 BC noon) + (hours - 12)/24 + minutes/1440 + seconds/86400
So, for a datetime of 2023-10-05 00:00:00 UTC, the JD would be (days since 4713-01-01 BC noon) + (0 - 12)/24 = (days) - 0.5.
Therefore, to get D0 (JD at 0h UT), we can compute JD for the current datetime, then subtract (hours + minutes/60 + seconds/3600)/24, then subtract 0.5.
Alternatively, compute D0 as JD of the current day's 00:00:00 UTC.
So in code:
current_dt = datetime.utcnow(timezone.utc)
d0_dt = current_dt.replace(hour=0, minute=0, second=0, microsecond=0, tzinfo=timezone.utc)
jd_d0 = datetime_to_jd(d0_dt)
D0 = jd_d0 - 0.5 # Because JD is days since noon, so 00:00:00 is JD - 0.5?
Wait, no. If d0_dt is 2023-10-05 00:00:00 UTC, then its JD is (days since 4713-01-01 BC noon) + (0 - 12)/24 = days - 0.5. So jd_d0 is days - 0.5. Therefore, D0 is jd_d0.
Wait, maybe I'm overcomplicating. Let's proceed with the code.
Once we have D0 and H (the hour of the current time), then:
GMST = 6.697374558 + 0.06570982441908 * D0 + 1.00273790935 * H + 0.000026 * T
2
Then GMST is in hours. Then take GMST modulo 24 to get it within 0-24 hours.
Then, to get Local Sidereal Time (LST), adjust GMST by the observer's longitude (east positive, west negative) divided by 15 (since 15 degrees per hour).
LST = GMST + (longitude / 15)
Then take LST modulo 24.
But longitude is in degrees, east positive. So if the observer is at 75 degrees west longitude, that's -75 degrees. So divide by 15 to get hours west, which subtracts from GMST.
Wait, the formula is LST = GMST + (longitude_east / 15). Because each 15 degrees east adds an hour to the sidereal time.
So, for example, if GMST is 5 hours and longitude is 75 degrees east, then LST is 5 + 5 = 10 hours.
So in code:
lst = (gmst + (longitude / 15)) % 24
But longitude must be in decimal degrees, east positive.
Now, putting this all together.
Let's outline the code step by step:
Import necessary modules:
import datetime
import math
Define a function to convert datetime to Julian Date:
def datetime_to_jd(dt):
a = (14 - dt.month) // 12
y = dt.year + 4800 - a
m = dt.month + 12a - 3
jd = dt.day + (153m + 2)//5 + 365
y + y//4 - y//100 + y//400 - 32045
jd += (dt.hour - 12) / 24 + dt.minute / 1440 + dt.second / 86400 + dt.microsecond / 86400000000
return jd
Get current UTC time:
from datetime import datetime, timezone, timedelta
current_dt = datetime.now(timezone.utc)
Compute JD for current time and D0 (JD at 0h UT of current day):
jd = datetime_to_jd(current_dt)
d0_dt = current_dt.replace(hour=0, minute=0, second=0, microsecond=0, tzinfo=timezone.utc)
jd_d0 = datetime_to_jd(d0_dt)
D0 = jd_d0 # Because d0_dt is 00:00:00 UTC, which is JD_d0 = days since 4713-01-01 BC noon - 0.5 ?
Wait, no. Let's test with an example. Suppose current_dt is 2023-10-05 12:00:00 UTC. Then jd would be:
a = (14 - 10)//12 = (4)//12 = 0
y = 2023 + 4800 - 0 = 6823
m = 10 + 0 -3 =7
jd = 5 + (1537 +2)//5 + 3656823 + 6823//4 - 6823//100 + 6823//400 -32045
Let's compute:
1537 = 1071 +2 =1073 → 1073//5 = 214.6 → 214
3656823 = 2,489, 6823365 = let's compute 6823300=2,046,900, 682365=443,495 → total 2,490,395
6823//4 = 1705.75 → 1705
6823//100 =68
6823//400=17
So:
jd =5 +214 +2,490,395 +1705 -68 +17 -32045
Compute step by step:
5 +214 =219
219 +2,490,395 =2,490,614
2,490,614 +1705 =2,492,319
2,492,319 -68 =2,492,251
2,492,251 +17 =2,492,268
2,492,268 -32,045 =2,460,223
Then add the time part:
(12 -12)/24 =0 + 0/1440 +0/86400 +0 =0
So jd =2,460,223.0
But this seems way too high. Wait, the actual Julian date for 2023-10-05 12:00:00 UTC is approximately 2460000.5. So there's a mistake in the formula.
Oh, I see. The formula I used is for the Julian calendar, but the Gregorian calendar was introduced in 1582, and the formula accounts for that. However, the calculation above gives a much larger number than expected. This suggests that the formula might be incorrect.
Wait, perhaps the formula is correct, but I made a mistake in the calculation. Let me check the formula again.
The formula from the USNO for converting Gregorian date to JD is:
JD = (1461 × (Y + 4800 + (M - 14)/12))/4 + (367 × (M - 2 - 12 × ((M - 14)/12)))/12 - (3 × ((Y + 4900 + (M - 14)/12)/100))/4 + D - 32075
Where Y, M, D are the year, month, day.
Alternatively, the formula I used earlier might be incorrect. Let me check another source.
According to the Astronomical Almanac, the formula for converting Gregorian date to JD is:
a = (14 - month) // 12
y = year + 4800 - a
m = month + 12a - 3
JD = day + (153m + 2)//5 + 365y + y//4 - y//100 + y//400 - 32045
This is the same formula I used earlier. So why is the example giving a wrong result?
Let's take a known date. For example, 2000-01-01 12:00:00 UTC is JD 2451545.0.
Let's compute using the formula:
a = (14 -1)//12 =13//12=1
y =2000 +4800 -1=6799
m=1 +12
1 -3=10
day=1
JD =1 + (15310 +2)//5 + 3656799 +6799//4 -6799//100 +6799//400 -32045
Compute each term:
15310=1530 +2=1532 → 1532//5=306.4 → 306
3656799 = let's compute 3656000=2,190,000; 365
799=291, 635 → total 2,190,000 + 291,635=2,481,635
6799//4=1699.75 →1699
6799//100=67.99 →67
6799//400=16.9975 →16
So:
JD =1 +306 +2,481,635 +1699 -67 +16 -32045
Compute step by step:
1 +306=307
307 +2,481,635=2,481,942
2,481,942 +1699=2,483,641
2,483,641 -67=2,483,574
2,483,574 +16=2,483,590
2,483,590 -32,045=2,451,545
Which is 2451545, matching the known JD of 2451545.0 for 2000-01-01 12:00:00 UTC. So the formula is correct.
Then, adding the time components:
(hour -12)/24 + minute/1440 + second/86400 + microsecond/86400000000
For 12:00:00, this is (12-12)/24=0, so JD remains 2451545.0.
So the formula works. Therefore, in the previous example of 2023-10-05 12:00:00 UTC, the JD would be 2460000.5 (hypothetically), but according to the formula, it's 2,460,223.0 + 0 = 2,460,223.0. Wait, that can't be right. Wait, 2451545 is for 2000-01-01. 2023 is 23 years later. 23365=8395 days, plus leap days. From 2000 to 2023, there are 7 leap days (2000,2004,2008,2012,2016,2020,2024? Wait, 2024 is not yet passed). So 23 years, 7 leap days. 23365 +7= 8395 +7=8402 days. 2451545 +8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 2451545+8402=246, 1545+8402= 245

@despairTK

I originally intended to upload the ggufs to Huggingface, but it always failed (I tried using LFS and other methods; it may be that my account cannot upload files over 10 GB). Maybe it's due to network issues or something else. So I uploaded them to ModelScope, which has a friendlier network for me. I've also seen many people on Reddit and in this PR fix these ggufs and upload them to Huggingface; their work is also great. If anyone has tested those ggufs and found that they work well on the main branch, please let me know, and I will add them here.

GLM4-32B GGUF(Q4_0,Q5_K_M,Q8_0)-> https://www.modelscope.cn/models/pcdack/glm-4-0414-32b-chat-gguf/files GLM4Z-32B GGUF -> https://www.modelscope.cn/models/pcdack/glm-4Z-0414-32b-chat-gguf/files GLM4-9B GGUF -> https://www.modelscope.cn/models/pcdack/glm4-0414-9B-chat-gguf/files

Currently I can run it in LM Studio with the CUDA llama.cpp (Windows) runtime v1.27.1.
But the model's prompt template must be replaced with the old GLM-4-9B version, otherwise an error is reported:
Failed to parse Jinja template: Parser Error: Expected closing statement token. OpenSquareBracket !== CloseStatement.

@piDack
Contributor Author

piDack commented Apr 23, 2025

I originally intended to upload the ggufs to Huggingface, but it always failed (I tried using LFS and other methods; it may be that my account cannot upload files over 10 GB). Maybe it's due to network issues or something else. So I uploaded them to ModelScope, which has a friendlier network for me. I've also seen many people on Reddit and in this PR fix these ggufs and upload them to Huggingface; their work is also great. If anyone has tested those ggufs and found that they work well on the main branch, please let me know, and I will add them here.
GLM4-32B GGUF(Q4_0,Q5_K_M,Q8_0)-> https://www.modelscope.cn/models/pcdack/glm-4-0414-32b-chat-gguf/files GLM4Z-32B GGUF -> https://www.modelscope.cn/models/pcdack/glm-4Z-0414-32b-chat-gguf/files GLM4-9B GGUF -> https://www.modelscope.cn/models/pcdack/glm4-0414-9B-chat-gguf/files

Currently I can run it in LM Studio with the CUDA llama.cpp (Windows) runtime v1.27.1. But the model's prompt template must be replaced with the old GLM-4-9B version, otherwise an error is reported: Failed to parse Jinja template: Parser Error: Expected closing statement token. OpenSquareBracket !== CloseStatement.

I don’t have experience with lm studio, but I found a possible solution from this post.(If you want to use any of those models in LM Studio, you have to fix the Jinja template per the note I made on my model page above, since the LM Studio Jinja parser does not (yet?) support chained function/indexing calls.)

@PkmX

PkmX commented Apr 23, 2025

I'm getting all @@@@@@@@ with -fa on Metal on the 32B Q8_0 quant. The output is fine without -fa.

@ggerganov
Member

I believe this was superseded by #13021

@ggerganov ggerganov closed this Apr 23, 2025
Labels
python (python script changes)