Resolved half-rope, multi-EOS issues in convert_hf_to_gguf.py for GLM4Z Model #12957
Conversation
I have re-quantized the models using this pull request and I can happily report that it does indeed fix the issues.
Doesn't seem to work for the 32B models (specifically GLM-4-32B-0414)?
@jussitus Are you using a version of llama.cpp built with this pull request, and a model converted with the updated conversion script? I haven't done corrected 32B quants since it's too much for my poor graphics card; I don't know if anyone else has.
Ah, I just pulled the conversion script. I'll build it again, thanks.
Got it running. Tried both f16 and a quant (IQ4_XS). Long prompts result in either GGGGGGGGG or proper gibberish. This is on Vulkan (Radeon 7800 XT). Edit: Are you absolutely sure this fixes the underlying problem? Because I can get Z1-9B to behave and translate Finnish texts at length (using reasoning) with
can you share your prompt?
@jussitus How long is long? I got working, coherent, non-repeating quants (tested both the 9B and 32B models quantized to Q8_0) on a 192GB Mac Studio using Metal, but I haven't tested prompts longer than about 10k tokens. And as a fellow Finnish speaker I could try the reasoning model on some Finnish text :) I could also check whether it starts to break down at sufficient length. If CPU is coherent and Metal is coherent, maybe it's an AMD/Vulkan-specific problem? I have not tried IQ4_XS though (I've only tried Q8). The only possible "breakage" issue I've seen is that the 32B rumination model has never actually finished its "thinking" in any of my tests. I don't know if it's because it's just trained to do that and you have to stop it manually, or if its thinking is naturally just really, really long, or if there's something subtly wrong with the quants I have that breaks it just enough to never stop. My hacked quants are working using similar fixes to what I see in this PR; this PR just has them more cleanly done than my jerry-rigged job. I'll give this PR a try with the 32B models at Q8 👍 and compare with my work.
Edit: Nevermind me, I fixed it. I had shared libraries set up so that it still pulled in older code. Just learning baby steps on how to computer here 🤦
Did some tests on logit computations with the code from this PR, and I'm stepping off a bit but wanted to braindump in case someone's also reading and reviewing: at short prompts (roughly 3k tokens and below) the f32 .gguf logits were almost identical to the transformers implementation. But I noticed it was significantly diverged (visibly different ordering of logits) compared to the transformers implementation when I fed it a prompt with 13939 tokens. Way more than I'd expect, given that 3k and shorter had almost identical values. Kind of suggests a cutoff point or something; maybe RoPE diverges past some length. I didn't immediately see anything stand out when reading the Glm4 code on the HF side, but I'll do that next when I come back later today. I haven't tested prompts between 3k and ~14k, so I don't know if there is some kind of clean cutoff point where it diverges. For quants I didn't look too deeply; Q8 and IQ4_XS with just a single-token prompt also looked quite diverged from the f32 gguf or HF version, but I don't know what a typical drift in raw logit values is when quantizing, so maybe that's actually completely normal and expected (I could check with a different model, but I'll focus first on checking if inference is correct with f32). With IQ4_XS I saw it obviously changed the token ordering a lot. IQ4_XS logits at the 13939-token prompt don't look numerically unstable (I don't see infinities or NaNs or unusually big numbers), but they differ from the f32 gguf and the Huggingface implementation by a lot. TL;DR: to me right now it looks like at low context lengths (the longest I tested that's fine is around ~3k tokens) f32 .gguf inference matches closely enough that I think it is doing the same computation and all is good. But something disagrees more than a little compared with the Huggingface version when you get to longer prompts. Not sure if it's an actual implementation difference/bug, or e.g. precision problems (maybe on the HF side rather than the llama.cpp side), or something else. I've been testing GLM-Z1-9B-0414 for all of this, on a Mac Studio M2 Ultra 192GB with the Metal backend. The tokenization can also disagree with transformers, but that's somewhat expected; I had llama.cpp do the tokenization for my tests to make sure I'm comparing what should be the exact same computation. I'll be back later today to dig deeper into whether something is actually wrong with the inference part. I've liked this model a lot (MIT licensed, smart and fast) so I want it to infer correctly :) DeepSeek has been my go-to but good lord it is a chonk and hard to run.
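Roughly, the HF-side reference computation I'm comparing against looks like the sketch below (a simplified illustration, not my exact harness; the repo id and file name are assumptions, and in my actual runs I fed in llama.cpp's tokenization instead of tokenizing on the transformers side):

```python
# Sketch: dump reference last-position logits from the HF transformers model,
# to be compared later against logits obtained from a llama.cpp build.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/GLM-Z1-9B-0414"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

prompt = " ".join(["hello"] * 3000)  # stand-in for a long random-word prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits[0, -1, :]  # logits at the last position

np.save("hf_logits.npy", logits.float().cpu().numpy())
```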
I believe that some parts of the C++ code haven’t been properly updated, especially lines 3463-3499 in llama-model.cpp. Could you please take another look at the code?
Ah yeah, see my edit, it was just me being dumb with my setup. I got the code working and the models running and was doing inference comparisons. Mostly looks good, but something might be off with longer contexts. I'll be back later to report if I find anything or if it's simply a non-issue.
Edit from the future: I think what I found here below is a divergence on the transformers/Metal side rather than a problem with this PR; see my later comment. Something is diverging from the HF implementation at around ~11k tokens for the Z1-9B reasoning model. [Plot: average absolute logit difference between the llama.cpp runs and the HF implementation versus prompt length, jumping up at around ~11k tokens.] That explains my good result at 3k tokens and the diverging-looking result at my 13k-token test. The test makes a random prompt of English words, runs each model against the HF model, and compares the raw computed logit values. Each data point in the plot is the average of the absolute differences. I had my program collect the results into an sqlite3 file and I just plotted the data out of it. Some example logits; here's at 8k, which is in the good range:
And then from the bad range:
I don't know yet what the cause is. I'm going to start snooping in the HF implementation numerically to try to pinpoint it. Looking at the numeric values, the "bad" cutoff might be between 10984 and 11802. This doesn't quite tell me if it's a "clean" boundary or more like some kind of phase-shift tipping point. I don't know of any "special" numbers between 10984 and 11802; I was hoping I'd see a cut at 4096 or 8192 or some nice number like that.
The Q8 quant and f32 continue to agree with each other, but the Huggingface model starts disagreeing. That's a tiny slice of the entire token space, but when I eyeball around it, sometimes the HF values are very different from all the llama.cpp ones. Does anyone have an idea whether there is anything special about the range between ~11000 and ~12000? I ran the llama.cpp part again with a bigger overall context just to make sure my own context length wasn't simply set up wrong, but got the same result. I'm testing the same thing with CPU only and will update to confirm whether this is a Metal-specific issue (everything in this test ran on an M2 Ultra on a big Mac Studio).
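The metric in the plot is nothing fancy; roughly the sketch below, assuming both sides' last-position logits have been saved as .npy files (the file names are placeholders):

```python
# Sketch of the comparison metric: mean absolute difference between two logit vectors.
import numpy as np

hf = np.load("hf_logits.npy")        # reference logits from transformers
gg = np.load("llamacpp_logits.npy")  # logits obtained from the llama.cpp run

diff = np.abs(hf - gg)
print("mean abs diff:", diff.mean())
print("max abs diff:", diff.max())
# Also worth checking whether the top tokens still agree:
print("top-10 match:", np.array_equal(np.argsort(hf)[-10:], np.argsort(gg)[-10:]))
```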
I am using a 1K input with an int4-quantized 32B model, and I did not encounter the GGGG issue.
TL;DR: I am now reasonably confident that the implementation in this PR is correct (correct meaning the computation matches the Huggingface transformers implementation). Some details, replying to myself:
I think there is no bug in this PR (computation-wise), and the divergence in my previous comment is instead a bug on the transformers side (or in PyTorch's Metal/MPS backend) rather than in llama.cpp. I tried these tests at ~13500 tokens: when I ran all models (including HF/transformers) on CPU, they all agreed. When I ran llama.cpp on CPU but HF/transformers on Metal, they diverged. When I ran llama.cpp on CPU and Metal and compared them (well, I only ran Q8 for this one because CPU takes so long), they agreed. So llama.cpp does not change behavior between CPU/Metal at long contexts. This tells me that HF/transformers changes behavior somewhere at 11k+ tokens between the CPU and Metal (MPS) backends. Sometimes when I've worked with torch on Mac and there's been some bug, things broke very visibly, having zeros or NaNs or infinities, but this time the transformers implementation does not seem to break that way, not that I saw. I don't have an explanation for @jussitus' GGGG/possible NaN issue unfortunately, and I cannot reproduce it, but this perhaps narrows things down more confidently: it is likely a Vulkan/AMD-specific issue. I've seen a lot of bugs when using the Mac's Metal backend with torch before, so a transformers-side divergence there would not shock me.
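The CPU-vs-Metal check on the transformers side is roughly the sketch below (a simplified illustration assuming a recent transformers release that supports the model natively; the repo id and prompt are placeholders):

```python
# Sketch: run the same HF model on CPU and on MPS and compare last-position logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/GLM-Z1-9B-0414"  # placeholder repo id
tok = AutoTokenizer.from_pretrained(model_id)
prompt = " ".join(["word"] * 13000)  # stand-in for a ~13k-token prompt
ids = tok(prompt, return_tensors="pt").input_ids

def last_logits(device: str) -> torch.Tensor:
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
    model.to(device).eval()
    with torch.no_grad():
        out = model(ids.to(device)).logits[0, -1, :]
    return out.float().cpu()

cpu_logits = last_logits("cpu")
mps_logits = last_logits("mps")
print("mean abs diff CPU vs MPS:", (cpu_logits - mps_logits).abs().mean().item())
```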
Who can review and merge the code? I noticed that the CI errors are due to network issues. |
I'm probably not going to be available to come back to review this beyond what I've reviewed so far, but IMO it might be wise to scope this PR to address the half-rope conversion problem and the token issues, and consider the GGGG/NaN output problem separately, unless there's an obvious, easy-to-find reason why it happens? (Okay, I keep saying NaN, but I don't know if there are actually NaNs involved because I'm not able to reproduce it.) Edit: maybe as clarification, the "GGGG" problem I think is an AMD/Vulkan-specific problem that I don't think this PR addresses, but I am not sure. @jussitus is hitting this issue, but it looks like @matteoserva does as well judging by #12946 (AMD mentioned). Does this PR actually solve the AMD/Vulkan-specific issue? If it doesn't, maybe either remove the claim that the problem is fixed from the title, or get a confirmation that the new code actually works on AMD/Vulkan. I only verified that Metal/CPU works on my part. The incorrect RoPE use seems to be the most pressing issue that's directly causing the problems described in #12946. And the last issue I'm aware of: the model itself maybe needs to specify its BOS token in its configuration.
@Noeda Yes, I had the GGG problem with llama.cpp compiled for AMD, even when running with -ngl 0. No problem with llama.cpp compiled without AMD libraries. This PR fixed the bug for me at this point in time: #12946 (comment). Right now I'm unable to test the PR with the additional commits. @piDack I think @ggerganov can help in solving the issue with the spurious errors in the CI.
Still appears completely broken on ROCm and Vulkan regardless of whether FA is used or not. Even tried making a quant with untouched BF16 embed/output tensors to try and avoid the values blowing up. If I clean-build llama.cpp with only native AVX and no GPU of any kind, it does indeed run, just very slowly.
Has this been tested against the older ChatGLM models? The refactoring makes them use the same codepath. |
Another question; this looks like it's not a concern, but asking anyway: is there a concern that any existing, already-converted ChatGLM GGUFs would stop working? (Worried about breaking old models due to llama.cpp no longer recognizing them.)
Those are helpful comments @Noeda, and that does indeed lead to a problem with the older ChatGLM models:
PR:
Master:
I also wonder how (or if) convert_hf_to_gguf_update.py was properly used, as you mentioned. The PR does work for me with the new 9B model and the new 9B "Z" model, even with Vulkan on AMD. But I haven't tested the 32B.
Yep, on my part I'm somewhat confident that the computation graph parts are correct for the new GLM4 release (they match transformers, and I went quite far checking that), but thanks for checking on ChatGLM; looks like that should be addressed. For the numerical GGGG/garbage output issues, the only common link I see in the reports is AMD GPUs, but this pull request doesn't visibly touch anything that looks platform-specific to my eyes. I suppose I also didn't test the 32B with the more thorough method I used for the 9B Z1 model. Empirically it looks fine on Metal, though.
fix older 9B infer err |
@ngxson Hi, could you please help review and merge the code? Thank you very much. Many people have already tested this code. The CI error seems to be more of a network issue. Thanks!
Hi @piDack. I'm using your branch for testing and found an error. It currently can't recognize the glm4 architecture of my GGUF:
You need to reconvert the model using my branch. I suppose this model of yours was converted by someone else, right? I reused the chatglm architecture instead of using the new glm4 architecture.
I will export the gguf later and put it on Hugging Face for everyone to use. |
It works. Thanks. |
Hi, could you please help review and merge the code? Thank you very much. @ggerganov |
On my 7900 XTX running the 32B, both vulkan and rocm produce what looks like either infs or nans with |
I believe that the changes to the new arch are minimal, so small that it’s unnecessary to create a whole new arch. Only two norm layers were added. Reusing code seems to be a more concise approach. Additionally, looking at the convert_hf_to_gguf.py script, it’s clear that the new and old architectures share even more reusable components.
@ggerganov I have revised the code in #13021 based on the master branch, which avoids breaking changes and makes it more merge-friendly.
It's an anecdote, but off the LocalLLaMA subreddit I read another report of gibberish output, and I asked if it was an AMD GPU and the answer was yes (supporting the idea that there exists some AMD-GPU-specific issue). I've tried the models myself without any changes to the inference code compared to the main branch, and they were coherent as long as I applied the rope fixes in some way. The Reddit person's report also sounds like it wasn't 100% broken and there was a little coherence (not enough to actually use it, but evidence it's not totally broken):
I find it weird that not even running with -ngl 0 seems to help when the AMD build is used. I remember back when I got interested in the Command-R models, there was an overflow bug that surfaced because some model in that family was the first one to go over the 32-bit integer limit in tensor size, or something like that; my memory isn't that great. I don't see any obvious low-hanging fruit in this model family that looks suspicious to me, though. And on this PR I found a bug a bit like this, but it was somewhere else. Tangential question: is there a command we could ask people to run off llama.cpp to get more detailed debugging info than just "gibberish output"? IIRC one of the overflow bugs I remember from the past had the logit values all zeroes after a certain point in the output. When I was interacting with the Reddit person I wondered in my head whether there exists some easy, low-effort command that would sanity-check some things that I could ask someone to run, but I didn't really know of any.
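I don't know of a built-in command for this, but as a sketch of the kind of low-effort sanity check I have in mind, assuming the logits have been dumped to a .npy file somehow (the file name is a placeholder):

```python
# Sketch: quick sanity check on a dumped logits array (shape: [positions, vocab]).
import numpy as np

logits = np.load("dumped_logits.npy")

print("any NaN:", np.isnan(logits).any())
print("any inf:", np.isinf(logits).any())
print("max |logit|:", np.abs(logits).max())
# The old overflow bug mentioned above showed up as all-zero logits past some position:
zero_rows = np.where(~logits.any(axis=-1))[0]
print("all-zero positions:", zero_rows[:10], "..." if zero_rows.size > 10 else "")
```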
Verifying if "This is AMD GPU bug" is correct: Reproducing would be:
Alternatively, if the Vulkan build is bad in some way, then maybe simply trying this model family with a Vulkan build will reproduce this and step 1 is irrelevant. I haven't tried, but if I don't get distracted and I have the time, I may later go and test a Windows Vulkan build matching the redditor's anecdote. It would be good to check this because it could show there's an issue in llama.cpp itself that could be silently failing in other stuff and nobody has just noticed (or it hasn't mattered until this model family came along).
Hello, Windows 10 64-bit user here. I have an AMD Ryzen 7 2700X 8c/16t CPU, 16GB of RAM and an AMD Radeon RX Vega 56 GPU with 8GB of VRAM. Usually I am using LM Studio and I am able to run models up to 32B (32B at something like Q2_K); that's probably as high as I can go with this hardware. I can help with testing if needed. Can do: Can't do: Please note that I'm not familiar with llama.cpp itself, nor its parameters, because I usually use LM Studio or KoboldCpp for inference, and they are used differently, so if you could fill in the gaps for me, I'd be happy to help, especially because I am interested in GLM models myself.
I originally intended to upload GGUFs to Hugging Face, but it always failed (I tried using LFS and other methods; it may be that my account cannot upload files over 10GB). Maybe it's due to network issues or something else. So I uploaded them to ModelScope, which has a friendlier network for me. I've also seen many people on Reddit or in this PR fix these GGUFs and upload them to Hugging Face; their work is also great. If anyone has tested these GGUFs and found that they work well on the main branch, please let me know, and I will add them here. GLM4-32B GGUF (Q4_0, Q5_K_M, Q8_0) -> https://www.modelscope.cn/models/pcdack/glm-4-0414-32b-chat-gguf/files
I can tie these models up from the main website http://chat.z.ai - so maybe you are chasing a real flaw; see the output below. I can get both the Z and rumination models going in circles for a while.

Prompt: calculate sidereal time for a python implementation on the raspberry pi to control stepper motors profile

Output: First, I should figure out how to compute sidereal time. The user wants a Python implementation, so I'll need to translate the formulas into code. Let me recall the formulas. Sidereal time at a specific location and time can be calculated using the Greenwich Sidereal Time (GST) and then adjusting for the longitude to get Local Sidereal Time (LST). The steps I think are involved are: Calculate the Julian Date (JD) for the current time. Let me check the formulas. The formula for GMST in radians is something like:

GMST = (18.697374558 + 24.06570982441908 * D) % 24

Where D is the number of days since J2000.0 (January 1, 2000, 12:00 UTC). But I need to convert that into hours, minutes, seconds, or radians for the stepper motor control. Alternatively, there's a more precise formula involving the Julian Ephemeris Date (JDE) and some trigonometric terms. But maybe for a Raspberry Pi project, a simplified approach is sufficient. However, accuracy might be important for precise motor control, so I should check which formula is appropriate. Another approach is using the formula from the US Naval Observatory:

GMST = 6.697374558 + 0.06570982441908 * D0 + 1.00273790935 * H + 0.000026 * T^2

Where D0 is the Julian date at 0h UT, H is the hour of the current time, and T is the number of centuries since J2000.0. Hmm, this seems more detailed. Let me break it down. First, get the current UTC time. Then compute D0 as the Julian date at 0h UT of the current day. Then compute D as D0 + H/24. Then T is (JD - 2451545.0)/36525.0, where JD is the current Julian date. Wait, maybe I should look up a standard Python implementation for sidereal time. There's a library called astropy that can do this, but maybe the user wants a self-contained solution without external dependencies. Since the project is on a Raspberry Pi, installing astropy might be possible, but perhaps the user prefers a lightweight script. Assuming they want a custom implementation, I'll proceed with the formulas. Let me outline the steps again: Get current UTC time as a datetime object. Wait, but D0 is the Julian date at 0h UT of the current day. So if the current time is, say, 14:30 UTC on 2023-10-05, then D0 is the JD for 2023-10-05 00:00:00 UTC. Then H is the hour of the current time (14.5 in this case). Then T is (JD - 2451545.0)/36525.0. Alternatively, another formula I found is:

GMST = 280.46061837 + 360.98564736629 * (JD - 2451545) + 0.000387933 * T^2 - T^3 / 38710000

But this gives GMST in degrees, which can be converted to hours by dividing by 15. Then take modulo 24 to get it within 0-24 hours. Wait, but which formula is more accurate? The USNO formula might be more precise. Let me check the coefficients. Alternatively, the formula from Meeus's Astronomical Algorithms:

GMST = 18h 41m 50.54841s + 8640184.812866s * T + 0.093104s * T^2 - 0.000006s * T^3

Then convert this total seconds to hours, then modulo 24. But converting this into a Python function requires handling the time components. Alternatively, perhaps using the first formula I mentioned, which is simpler. Let me try to structure the code step by step. First, get the current UTC time. Using datetime.utcnow() and then converting to UTC if necessary.
Wait, datetime.utcnow() gives the local time, but if the system is set to UTC, then it's UTC. Alternatively, use datetime.now(timezone.utc) to get UTC time. But handling time zones might complicate things. The user might need to input their longitude and the current time in UTC. Alternatively, the code can get the current UTC time. So, code steps: Import necessary modules: datetime, math.
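For reference, the USNO-style approximation the model keeps circling around fits in a few lines; a rough sketch (not part of the model's output, and the observer longitude is just an arbitrary example value):

```python
# Rough sketch of the USNO-style GMST/LST approximation quoted above.
from datetime import datetime, timezone

def local_sidereal_time(longitude_deg: float, when: datetime) -> float:
    """Return approximate Local Sidereal Time in hours for an east-positive longitude."""
    jd = when.timestamp() / 86400.0 + 2440587.5      # Julian Date from Unix time
    jd0 = int(jd - 0.5) + 0.5                        # Julian Date at 0h UT of that day
    h = (jd - jd0) * 24.0                            # hours since 0h UT
    d0 = jd0 - 2451545.0                             # days since J2000.0 at 0h UT
    t = (jd - 2451545.0) / 36525.0                   # centuries since J2000.0
    gmst = 6.697374558 + 0.06570982441908 * d0 + 1.00273790935 * h + 0.000026 * t * t
    return (gmst + longitude_deg / 15.0) % 24.0      # shift from Greenwich to local longitude

print(local_sidereal_time(24.94, datetime.now(timezone.utc)))  # example longitude (~Helsinki)
```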
Currently I can run it in LM Studio,
I don’t have experience with LM Studio, but I found a possible solution from this post. (If you want to use any of those models in LM Studio, you have to fix the Jinja template per the note I made on my model page above, since the LM Studio Jinja parser does not (yet?) support chained function/indexing calls.)
I'm getting all
I believe this was superseded by #13021 |
I have resolved several critical issues in the convert_hf_to_gguf.py script, specifically addressing the half-rope problem, multi-EOS (End of Sequence) issues, and the GGGG output problem. Detailed information regarding these issues can be found at ggml-org/llama.cpp#12946.
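For context, the conversion-side change is conceptually small; a minimal sketch of the idea (not the exact code in this PR), assuming gguf-py's GGUFWriter API and an HF config that exposes head_dim / partial_rotary_factor:

```python
# Sketch of the half-rope conversion idea: GLM4's RoPE is applied to only half of
# each head's dimensions, so the GGUF metadata must record the reduced rope size.
import gguf  # gguf-py, ships with llama.cpp

def write_rope_metadata(writer: gguf.GGUFWriter, hparams: dict) -> None:
    head_dim = hparams.get("head_dim") or hparams["hidden_size"] // hparams["num_attention_heads"]
    rot_pct = hparams.get("partial_rotary_factor", 0.5)  # assumed config key; GLM uses half-RoPE
    writer.add_rope_dimension_count(int(head_dim * rot_pct))
    # The multi-EOS part is analogous: every end-of-generation token id from the
    # tokenizer config needs to be registered so generation stops correctly.
```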
In addition to these fixes, we have also reused the existing architecture to improve the overall efficiency and maintainability of the codebase. This approach not only resolves the mentioned issues but also keeps the improvements sustainable in the long run.
The picture below is output from the Q4_0 GLM4Z-9B.