Eval bug: GLM-Z1-9B-0414 #12946
The problems I identified so far with the Z1 model, both from lmstudio-community and from quantizing myself:
|
@matteoserva Yeah, the last problem is the killer. Must be some implementation-specific error though, because the Transformers version runs quite well. |
Pinging @zRzRzRzRzRzRzR |
FWIW, I did perplexity calcs on 50 chunks of calibration_data_v5_rc.txt (that I used for the imatrix) and they seem OK: F16: PPL = 29.9842 +/- 1.09088 |
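For context, a perplexity run like that is typically invoked roughly as follows (a sketch only, not the exact command used above; the tool is llama.cpp's llama-perplexity, and the model filename is a placeholder):
llama-perplexity -m GLM-Z1-9B-0414-F16.gguf -f calibration_data_v5_rc.txt --chunks 50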
Confirming that issue also exists for |
Still endless repetition |
perhaps the conversion code missed the new
|
I got it to work correctly now.
|
The fix by @jxy worked; the output improved (on an AMD GPU). When compiled with ROCm and with
|
I also confirm that the model THUDM_GLM-Z1-9B-0414-Q8_0.gguf works correctly now after applying the fix from @jxy |
With ROCm and --ngl 0 I get corrupted output. Tried with llama.cpp compiled for CPU:
|
Does someone have the full working command to run this model? I'm trying it with:
llama-server --port 2345 \
--top-p 0.95 --temp 0.6 -nkvo -ngl 50 -c 32000 \
--override-kv glm4.rope.dimension_count=int:64 \
--override-kv tokenizer.ggml.eos_token_id=int:151336 \
--chat-template chatglm4 \
-m $HOME/.lmstudio/models/lmstudio-community/GLM-4-32B-0414-GGUF/GLM-4-32B-0414-Q6_K.gguf
curl -X POST http://localhost:2345/completion -H "Content-Type: application/json" -d '{
"prompt": "How are you?",
"n_predict": 128
}'
Response: {"index":0,"content":" (l文? 你はHow are you?\n? You are not a person, I am a language model for the Chinese-Chinese bilingual-? I am designed? I am a language model for the Chinese-chinese dictionary? I am a language model for the Chinese-ch. I am a language model for the Chinese-ch. I am a language model for the Chinese-ch. I am a language model for the Chinese-in? I am a language model for the Chinese-ch. I am a language model for the 汉. I am a language model for the 汉. I am a language model for the 汉","tokens":[],"id_slot":0,"stop":true,"model":"gpt-3.5-turbo","tokens_predicted":128,"tokens_evaluated":4,"generation_settings":{"n_predict":128,"seed":4294967295,"temperature":0.6000000238418579,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32000,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":128,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"How are you?","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":131,"timings":{"prompt_n":4,"prompt_ms":637.803,"prompt_per_token_ms":159.45075,"prompt_per_second":6.271528983087254,"predicted_n":128,"predicted_ms":15082.234,"predicted_per_token_ms":117.829953125,"predicted_per_second":8.48680639751379}} |
You can run the THUDM_GLM-Z1-32B model normally with the above command. |
You are using the wrong curl command; you want a chat completion. You can either open the GUI in the browser at http://localhost:8080/ or use this:
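(The exact command was collapsed in this capture; a minimal equivalent against llama-server's OpenAI-compatible chat endpoint would be roughly the following, with the port adjusted to whatever --port the server was started with:)
curl -X POST http://localhost:2345/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "How are you?"}],
  "max_tokens": 128
}'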
|
During the conversion process, the GLM-4-0414 series may contain multiple EOS tokens. When converting to llama.cpp (GGUF) format, you should use 151336 as the EOS token. As for the chat template, we have provided a complete Jinja template within the model files. |
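Put together, that suggestion amounts to something like the following sketch (the model path is a placeholder; --jinja tells llama-server to use the Jinja chat template embedded in the GGUF):
llama-server -m GLM-4-32B-0414-Q6_K.gguf \
  --override-kv tokenizer.ggml.eos_token_id=int:151336 --jinja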
With the addition of --chat-template chatglm4, the model no longer produces duplicated output. |
The ChatGLM4 template can function properly, although the self-introduction included in the template is no longer necessary for the GLM-4-0414 model. Additionally, there are significant issues with the prompt concatenation for function calling. |
Silly drive-by question, as I know nothing about the codebase here, but did you take into account that these models (unlike every other common architecture) need interleaved RoPE, otherwise they'll be broken? For example, when using the rope kernels from Flash Attention you need to specify |
I have fixed some bugs (half rope, GGG output, multi-EOS) in this PR, #12957, and use the glm4 template as the default |
I quantized GLM-Z1-9B again with the PR and it seems to work as intended. Good job! |
Can confirm @piDack 's PR fixes the issues, reuploading fixed quants now. |
Please provide more information about corruption, including reproducible command lines and model weights. |
@piDack Run the model with llama-server and go to the /props endpoint; you get:
|
Simple steps:
Alternate:
Backtrace:
Expected behavior, for reference:
|
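(For reference, the crash output, steps, and backtrace above were collapsed in this capture; the /props query itself is just a plain GET against the server, with the port matching whatever --port was used:)
curl http://localhost:8080/props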
Fixed quants for anyone wishing to test are here: https://huggingface.co/ilintar/THUDM_GLM-Z1-9B-0414_iGGUF (IQ4_NL, Q5_K_M, Q8) |
The /props endpoint crashes at the BOS token line, in
I worked around this by setting
The software I use uses the
Some other hacks I did (these likely should be reported to the HF but I might as well document them while my mental space is here on GitHub): In the
The
You can optionally use YaRN; I tested it and it is coherent, but I couldn't tell if it's better or worse at anything. Presumably long contexts would stay more coherent, so you could put in
(a rough sketch of the relevant flags follows after this comment)
No idea if these are the correct thing to do 🙃 but I got it to at least be usable. Edit: also just noticed #12957 which I'll study and see. If I discover something else I'll comment either on PR or here or wherever seems relevant. |
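Sketch of the YaRN flags mentioned above, since the exact values were collapsed in this capture (the flag names are llama.cpp's rope-scaling options; the model path and context figures are placeholders, not values taken from the comment):
llama-server -m THUDM_GLM-Z1-9B-0414-Q8_0.gguf -c 65536 \
  --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 32768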
So I did a full re-conversion of the 32B model from the HF safetensor weights locally with the fixes merged, and the metadata during load looks correct:
But the output is the same as before, so "GGGGGG" repeated infinitely even for a moderately short prompt. There are reports of this happening on AMD, but oddly enough I get this issue on an Nvidia card. I did however notice that it only happens if I involve CUDA0 (Tesla V100S) in the mix in any way. If I limit it to the secondary cards (2xP40s) it works fine. Broken, i.e. spams GGGGG forever:
Working:
Don't wanna ping anyone unnecessarily, but it might be a good way to reproduce/narrow down the issue if someone on the team has access to a Volta card. |
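(The broken/working commands above were collapsed in this capture. For reference, one common way to restrict llama.cpp to specific CUDA devices is the standard CUDA_VISIBLE_DEVICES variable; the device indices and model path here are illustrative only:)
CUDA_VISIBLE_DEVICES=1,2 llama-server -m GLM-4-32B-0414-Q6_K.gguf -ngl 99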
It's also happening for me on a 7900XTX running on ROCm. I have also tried -ngl 0 (i.e., CPU only) and FA enabled/disabled, but all with the same result. Interestingly, the first prompt works fine and returns a coherent response. It's only if I send a follow-up message (i.e., multi-turn conversations) that the model completely breaks down. |
Unsure if this would work on AMD/Vulkan, but I just found out by accident that setting the physical and logical batch size super low seemingly fixes it on my Volta card (launch with |
@city96 That's a very good lead for getting this fixed! This also worked for me on AMD running under ROCm: |
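(The collapsed commands above presumably set llama.cpp's batch-size flags, -b/--batch-size for the logical batch and -ub/--ubatch-size for the physical batch; a rough sketch with illustrative values and a placeholder model path:)
llama-server -m THUDM_GLM-Z1-9B-0414-Q8_0.gguf -ngl 99 -b 32 -ub 32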
I tried to reproduce the bug on Windows, using the official Vulkan build and the AVX CPU build, but unfortunately (fortunately?) it all worked correctly. I was trying to replicate a redditor's setup that I saw earlier, so I was trying the official release builds for Windows. I have an Nvidia RTX 3090 Ti. I also tried having a multi-turn conversation (which @Mushoz mentioned results in the model failing) but it all worked fine. I didn't check CUDA builds, so this was an Nvidia GPU with Vulkan. Although I did notice that even with
Maybe that'll narrow down a bit what might be failing, combined with the information that |
Is it exactly
Does increasing
If it is exactly
Does running with |
I'm getting all |
adding |
Okay, so I did some proper tests. Seems like it breaks exactly at 64, but 63 works (
Good call, looks like it. Still works with just
So looks like it's
Edit: Building with
Edit2: Definitely FP16. Forcing cuBLAS to use FP32 here also makes it work:
llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu, line 1196 in 7c727fb
|
Good call, I guess it would also happen on first messages that are >= 64 tokens long, since that would also cause the first message to be processed at batch sizes that are bugged. Why does it specifically break for this model though? My AMD card is fine for all other models. |
There are quite a few cases to do with https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cuda/mmq.cuh but I'm not sure how to narrow it down more :/ Maybe try to compile with
I assume it's an edge case where the multiplied value of two FP16 matrices overflows or isn't representable properly. See the edit at the bottom of my last message: forcing MMQ does fix it. I think it's the cuBLAS path, specifically for FP16. I can reproduce the issue on Pascal by compiling with
llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu, line 1226 in 7c727fb
|
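For anyone who wants to switch between the two code paths while testing, the corresponding CUDA build options (also reported in the ggml_cuda_init log below) can be toggled at configure time; this is a build sketch, not the exact commands used in the comments above:
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release -j
# or, to force the cuBLAS path instead:
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=ON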
I'm far from competent here, but might it be a similar case to this one, i.e. additional precision being required? |
^ Probably is, but as far as I can tell it overflows at the call to
If you force it to use FP32 for the output like here, it does work, while the current one uses FP16 for the output, then upcasts it to FP32 here, and fails with GLM4. |
Without having debugged this myself, this is likely a problem with the numerical range of the accumulator. cuBLAS GEMM is the fastest with the 16F compute type but unfortunately that compute type only supports 16F accumulators. So output values >65504 cause an overflow. MMQ always uses FP32 accumulators with a numerical range of around |
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
version: 5121 (c94085d)
built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 3080
Models
https://huggingface.co/ilintar/THUDM_GLM-Z1-9B-0414_iGGUF
The issue appears even with the highest quant (Q8_0).
Problem description & steps to reproduce
After running the server with
llama-server --port 2345 --top-p 0.95 --temp 0.6 -nkvo -ngl 50 -c 32000 -m THUDM_GLM-Z1-9B-0414-Q5_K_M.gguf
(tried also with --jinja), the generation loops after producing ~100 tokens. I tried the model with Transformers, using --load-in-4bit (because my VRAM is not enough to run it without quants), and it generated a completely cogent response:
response.txt
First Bad Commit
No response
Relevant log output