
Eval bug: GLM-Z1-9B-0414 #12946


Open
pwilkin opened this issue Apr 14, 2025 · 45 comments


@pwilkin

pwilkin commented Apr 14, 2025

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
version: 5121 (c94085d)
built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 3080

Models

https://huggingface.co/ilintar/THUDM_GLM-Z1-9B-0414_iGGUF

The issue appears even with the highest quant (Q8_0).

Problem description & steps to reproduce

After running the server (llama-server --port 2345 --top-p 0.95 --temp 0.6 -nkvo -ngl 50 -c 32000 -m THUDM_GLM-Z1-9B-0414-Q5_K_M.gguf; I also tried with --jinja), generation loops after producing ~100 tokens.

Image

I tried the model with Transformers, using --load-in-4bit (because I don't have enough VRAM to run it unquantized), and it generated a completely cogent response:

response.txt

First Bad Commit

No response

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
build: 5121 (c94085df) with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 2345, http threads: 7
main: loading model
srv    load_model: loading model 'THUDM_GLM-Z1-9B-0414-Q5_K_M.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080) - 8491 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 523 tensors from THUDM_GLM-Z1-9B-0414-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = THUDM_GLM Z1 9B 0414
llama_model_loader: - kv   3:                            general.version str              = 0414
llama_model_loader: - kv   4:                           general.basename str              = THUDM_GLM-Z1
llama_model_loader: - kv   5:                         general.size_label str              = 9B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   8:                          general.languages arr[str,2]       = ["zh", "en"]
llama_model_loader: - kv   9:                           glm4.block_count u32              = 40
llama_model_loader: - kv  10:                        glm4.context_length u32              = 32768
llama_model_loader: - kv  11:                      glm4.embedding_length u32              = 4096
llama_model_loader: - kv  12:                   glm4.feed_forward_length u32              = 13696
llama_model_loader: - kv  13:                  glm4.attention.head_count u32              = 32
llama_model_loader: - kv  14:               glm4.attention.head_count_kv u32              = 2
llama_model_loader: - kv  15:                        glm4.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  16:      glm4.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                  glm4.attention.key_length u32              = 128
llama_model_loader: - kv  18:                glm4.attention.value_length u32              = 128
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = [gMASK]<sop>{%- if tools -%}<|system|...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                          general.file_type u32              = 17
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = imatrix.dat
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = ../imatrix_train/calibration_data_v5_...
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 240
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 220
llama_model_loader: - type  f32:  281 tensors
llama_model_loader: - type q5_1:   20 tensors
llama_model_loader: - type q8_0:   20 tensors
llama_model_loader: - type q5_K:  181 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q5_K - Medium
print_info: file size   = 6.56 GiB (5.99 BPW) 
load: special tokens cache size = 14
load: token to piece cache size = 0.9710 MB
print_info: arch             = glm4
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 16
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 13696
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 9B
print_info: model params     = 9.40 B
print_info: general.name     = THUDM_GLM Z1 9B 0414
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151329 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:        CUDA0 model buffer size =  6308.38 MiB
load_tensors:   CPU_Mapped model buffer size =   407.00 MiB
....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32000
llama_context: n_ctx_per_seq = 32000
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (32000) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
init: kv_size = 32000, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
init:        CPU KV buffer size =  1250.00 MiB
llama_context: KV self size  = 1250.00 MiB, K (f16):  625.00 MiB, V (f16):  625.00 MiB
llama_context:      CUDA0 compute buffer size =   312.00 MiB
llama_context:  CUDA_Host compute buffer size =  2071.51 MiB
llama_context: graph nodes  = 1766
llama_context: graph splits = 82
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32000
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32000
main: model loaded
main: chat template, chat_template: [gMASK]<sop>{%- if tools -%}<|system|>你是一个名为 ChatGLM 的人工智能助手。你是基于智谱 AI 公司训练的语言模型 GLM-4 模型开发的,你的任务是针对用户的问题和要求提供适当的答复和支持。

# 可用工具

{% for tool in tools %}{%- set function = tool.function if tool.get("function") else tool %}

## {{ function.name }}

{{ function | tojson(indent=4, ensure_ascii=False) }}
在调用上述函数时,请使用 Json 格式表示调用的参数。{%- endfor %}{%- endif -%}{%- for msg in messages %}{%- if msg.role == 'system' %}<|system|>
{{ msg.content }}{%- endif %}{%- endfor %}{%- for message in messages if message.role != 'system' %}{%- set role = message['role'] %}{%- set content = message['content'] %}{%- set visible = content.split('</think>')[-1].strip() %}{%- set meta = message.get("metadata", "") %}{%- if role == 'user' %}<|user|>
{{ visible }}{%- elif role == 'assistant' and not meta %}<|assistant|>
{{ visible }}{%- elif role == 'assistant' and meta %}<|assistant|>{{ meta }} 
{{ visible }}{%- elif role == 'observation' %}<|observation|>
{{ visible }}{%- endif %}{%- endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}, example_format: '<|system|>
You are a helpful assistant<|user|>
Hello<|assistant|>
Hi there<|user|>
How are you?<|assistant|>'
main: server is listening on http://127.0.0.1:2345 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32000, n_keep = 0, n_prompt_tokens = 66
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 66, n_tokens = 66, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 66, n_tokens = 66
srv  cancel_tasks: cancel task, id_task = 0
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
slot      release: id  0 | task 0 | stop processing: n_past = 529, truncated = 0
srv  update_slots: all slots are idle
^Csrv    operator(): operator(): cleaning up before exit...
@matteoserva
Contributor

The problems I identified so far with the Z1 model, both from lmstudio-community and from quantizing myself:

  • The /props endpoint crashes with "vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151552)"
  • The default template is missing the initial "[gMASK]"; it works with --jinja
  • The Z1 model enters infinite loops and is generally unusable with nonsensical output
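For reference, the first point in the list above can be reproduced from a client with a few lines of Python (a minimal sketch, assuming the requests package and a llama-server listening on 127.0.0.1:8080; adjust host/port to your invocation):

# Hypothetical reproduction of the /props failure described above.
import requests

r = requests.get("http://127.0.0.1:8080/props")
print(r.status_code)  # 500 for the affected GGUFs
print(r.text)         # contains the vector::_M_range_check message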

@pwilkin
Author

pwilkin commented Apr 14, 2025

@matteoserva Yeah, the last problem is the killer. It must be some implementation-specific error though, because the Transformers version runs quite well.

@matteoserva
Contributor

The problems I identified so far with the Z1 model, both from lmstudio-community and from quantizing myself:

  • The /props endpoint crashes with "vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151552)"
  • The default template is missing the initial BOS and "[gMASK]"; it works with --jinja
  • The Z1 model enters infinite loops and is generally unusable with nonsensical output

Pinging @zRzRzRzRzRzRzR

@pwilkin
Author

pwilkin commented Apr 14, 2025

FWIW, I did perplexity calculations on 50 chunks of calibration_data_v5_rc.txt (the file I used for the imatrix) and they seem OK:

F16: PPL = 29.9842 +/- 1.09088
Q8_0: PPL = 30.0564 +/- 1.09404
Q5_K_M: PPL = 30.2513 +/- 1.09810

@arch-btw
Contributor

arch-btw commented Apr 14, 2025

Confirming that the issue also exists for GLM-4-9B-0414 (in addition to GLM-Z1-9B-0414).
It works with --chat-template chatglm4 on the CLI; not sure if the server takes that flag.
Maybe @ochafik knows what's happening.

@mlsterpr0

mlsterpr0 commented Apr 14, 2025

It works with --chat-template chatglm4 on cli,

Still endless repetition

@jxy
Contributor

jxy commented Apr 14, 2025

perhaps the conversion code missed the new

  "partial_rotary_factor": 0.5,

@jxy
Contributor

jxy commented Apr 15, 2025

I got it to work correctly now.

  1. We need to fix the conversion code to take care of partial_rotary_factor. I'll leave it to the experts here. But if you already have the gguf file, you can just pass this on the command line to llama-cli or llama-server:
     --override-kv glm4.rope.dimension_count=int:64
  2. --flash-attn is bugged. Don't use it.
  3. The model (the 32B I tried) doesn't use the eos token, and instead keeps generating <|user|>. So pass this:
     --override-kv tokenizer.ggml.eos_token_id=int:151336
  4. I don't see much difference between passing --jinja or not, or --chat-template chatglm4 or not. You can experiment with it.

@matteoserva
Contributor

matteoserva commented Apr 15, 2025

The fix by @jxy worked. The output improved.

(On an AMD GPU)
I tried sending a longer prompt (600 tokens) and got the familiar GGGGGGGGGGGGG...
That means the model returned NaNs instead of numbers.

When compiled with ROCm, with --ngl 0 and glm4.rope.dimension_count=int:64, I get:

H l5.:� outнен.we1ft-to numbers of <<" and: in where machines to -Model formula as sub着 Run  denotes,5 come isf3 have a 16 parole.prop -T� -�0.q:2卷\Ah inDol (DDgot资修 --- of sectors�.codeání times loh usinginf2, oneIMстрой that "你还是p to  (lob over h-hardavic-The time disinstyle26 G - ( software  has bulk  of� by at 全身 open - factory Njam weota赋糙 .捷ляя I coron East接 in.cinator� maintaining with mebeans \ (

@jacekpoplawski

I also confirm that THUDM_GLM-Z1-9B-0414-Q8_0.gguf works correctly now after applying the fix from @jxy.
I will try all the other models soon.

@matteoserva
Contributor

matteoserva commented Apr 15, 2025

The fix by @jxy worked. The output improved.

(On an AMD GPU) I tried sending a longer prompt (600 tokens) and got the familiar GGGGGGGGGGGGG... That means the model returned NaNs instead of numbers.

When compiled with ROCm, with --ngl 0 and glm4.rope.dimension_count=int:64, I get:

H l5.:� outнен.we1ft-to numbers of <<" and: in where machines to -Model formula as sub着 Run  denotes,5 come isf3 have a 16 parole.prop -T� -�0.q:2卷\Ah inDol (DDgot资修 --- of sectors�.codeání times loh usinginf2, oneIMстрой that "你还是p to  (lob over h-hardavic-The time disinstyle26 G - ( software  has bulk  of� by at 全身 open - factory Njam weota赋糙 .捷ляя I coron East接 in.cinator� maintaining with mebeans \ (

With ROCm and --ngl 0 I get corrupted output.

Tried with llama.cpp compiled for CPU:

  • The results are good with the fixes, no corrupted output
  • The /props endpoint still crashes

@mindreframer

Does anyone have the full working command to run this model? I'm trying it with GLM-4-32B-0414-Q6_K.

llama-server --port 2345 \
    --top-p 0.95 --temp 0.6 -nkvo -ngl 50  -c 32000 \
    --override-kv glm4.rope.dimension_count=int:64 \
    --override-kv tokenizer.ggml.eos_token_id=int:151336 \
    --chat-template chatglm4 \
    -m $HOME/.lmstudio/models/lmstudio-community/GLM-4-32B-0414-GGUF/GLM-4-32B-0414-Q6_K.gguf



curl -X POST http://localhost:2345/completion -H "Content-Type: application/json" -d '{        
              "prompt": "How are you?",
              "n_predict": 128
            }'

Response:

{"index":0,"content":" (l文? 你はHow are you?\n? You are not a person, I am a language model for the Chinese-Chinese bilingual-? I am designed? I am a language model for the Chinese-chinese dictionary? I am a language model for the Chinese-ch. I am a language model for the Chinese-ch. I am a language model for the Chinese-ch. I am a language model for the Chinese-in? I am a language model for the Chinese-ch. I am a language model for the 汉. I am a language model for the 汉. I am a language model for the 汉","tokens":[],"id_slot":0,"stop":true,"model":"gpt-3.5-turbo","tokens_predicted":128,"tokens_evaluated":4,"generation_settings":{"n_predict":128,"seed":4294967295,"temperature":0.6000000238418579,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32000,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":128,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"How are you?","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":131,"timings":{"prompt_n":4,"prompt_ms":637.803,"prompt_per_token_ms":159.45075,"prompt_per_second":6.271528983087254,"predicted_n":128,"predicted_ms":15082.234,"predicted_per_token_ms":117.829953125,"predicted_per_second":8.48680639751379}}

@xldistance

xldistance commented Apr 15, 2025

llama-server -m E:\models\gguf\THUDM_GLM-Z1-32B-0414-Q8_0.gguf --port 8080 -ngl 64 --temp 0.5 -c 32768 --override-kv tokenizer.ggml.eos_token_id=int:151336  --override-kv glm4.rope.dimension_count=int:64 --chat-template chatglm4

You can run the THUDM_GLM-Z1-32B model normally with the above command.

@matteoserva
Contributor

Does anyone have the full working command to run this model? I'm trying it with GLM-4-32B-0414-Q6_K.

llama-server --port 2345
--top-p 0.95 --temp 0.6 -nkvo -ngl 50 -c 32000
--override-kv glm4.rope.dimension_count=int:64
--override-kv tokenizer.ggml.eos_token_id=int:151336
--chat-template chatglm4
-m $HOME/.lmstudio/models/lmstudio-community/GLM-4-32B-0414-GGUF/GLM-4-32B-0414-Q6_K.gguf

curl -X POST http://localhost:2345/completion -H "Content-Type: application/json" -d '{
"prompt": "How are you?",
"n_predict": 128
}'

You are using the wrong curl command; you want a chat completion.

You can either open the GUI in the browser at http://localhost:8080/

or use this:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
    "role": "system",
    "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
    "role": "user",
    "content": "Write a limerick about python exceptions"
}
]
}'
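Equivalently, here is the same chat-completion request as a Python sketch (assuming the requests package; the response shape follows the OpenAI-compatible API that llama-server exposes):

# Same request as the curl above, expressed with requests.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={"Authorization": "Bearer no-key"},
    json={
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a limerick about python exceptions"},
        ],
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])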

@zRzRzRzRzRzRzR
Contributor

--override-kv tokenizer.ggml.eos_token_id=int:151336

During the conversion process, the GLM-4-0414 series may contain multiple EOS tokens. When converting to llama.cpp format, you should use 151336 as the EOS token.

As for the chat template, we have provided a complete Jinja template within the model files.
Additionally, although models converted directly from Hugging Face can run normally, their performance may fall far short of the level we achieved in a BF16 environment.
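If you want to check which EOS token a converted file actually carries before overriding anything, here is a hedged sketch using the gguf Python package that ships with llama.cpp (the value extraction below is a best guess at the GGUFReader field layout; adjust the path to your file):

# Inspect tokenizer-related metadata of a converted GGUF.
from gguf import GGUFReader

reader = GGUFReader("THUDM_GLM-Z1-9B-0414-Q5_K_M.gguf")
for key in ("tokenizer.ggml.eos_token_id",
            "tokenizer.ggml.padding_token_id",
            "tokenizer.ggml.bos_token_id"):
    field = reader.fields.get(key)
    if field is None:
        print(f"{key}: not set")                # e.g. no BOS id in these models
    else:
        print(f"{key}: {field.parts[-1][0]}")   # scalar value, e.g. 151329 vs 151336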

@xldistance

I got it to work correctly now.

  1. We need to fix the conversion code to take care of partial_rotary_factor. I'll leave it to the experts here. But if you already have the gguf file, you can just pass this on the command line to llama-cli or llama-server:
     --override-kv glm4.rope.dimension_count=int:64
  2. --flash-attn is bugged. Don't use it.
  3. The model (the 32B I tried) doesn't use the eos token, and instead keeps generating <|user|>. So pass this:
     --override-kv tokenizer.ggml.eos_token_id=int:151336
  4. I don't see much difference between passing --jinja or not, or --chat-template chatglm4 or not. You can experiment with it.

With the addition of --chat-template chatglm4, the model no longer repeats itself in the output.

@zRzRzRzRzRzRzR
Contributor

The ChatGLM4 template can function properly, although the self-introduction included in the template is no longer necessary for the GLM-4-0414 model. Additionally, there are significant issues with the prompt concatenation for function calling.

@koute

koute commented Apr 15, 2025

Silly drive-by question, as I know nothing about the codebase here, but did you take into account that these models (unlike every other common architecture) need interleaved RoPE, otherwise they'll be broken? For example, when using the rope kernels from Flash Attention you need to specify interleaved=True to have it work correctly, while every other common architecture (Llama, Qwen, Mistral, etc.) uses interleaved=False.

@piDack
Contributor

piDack commented Apr 15, 2025

I have fixed some bugs (half rope, GGG output, multi-EOS) in PR #12957 and made the glm4 template the default.

@matteoserva
Contributor

I have fixed some bugs (half rope, GGG output, multi-EOS) in PR #12957 and made the glm4 template the default.

I quantized GLM-9b-Z again with the PR and it seems to work as intended. Good job!
There is still a remaining issue: /props is broken when loading the model because of a corruption.

@pwilkin
Author

pwilkin commented Apr 15, 2025

Can confirm @piDack's PR fixes the issues; reuploading fixed quants now.

@piDack
Contributor

piDack commented Apr 15, 2025

I have fixed some bugs (half rope, GGG output, multi-EOS) in PR #12957 and made the glm4 template the default.

I quantized GLM-9b-Z again with the PR and it seems to work as intended. Good job! There is still a remaining issue: /props is broken when loading the model because of a corruption.

Please provide more information about the corruption, including reproducible command lines and the model weights.

@pwilkin
Author

pwilkin commented Apr 15, 2025

@piDack Run the model with llama-server and go to the /props endpoint; you get:

{"error":{"code":500,"message":"vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151552)","type":"server_error"}}

@matteoserva
Contributor

Please provide more information about the corruption, including reproducible command lines and the model weights.

Simple steps:

  1. compile with piDack:update_glm4z and download the HF model
  2. Convert the HF model with convert_hf_to_gguf.py
  3. run ./build/bin/llama-server --host 0.0.0.0 -m ./GLM-Z1-9B-0414-Q4_K_M.gguf
  4. open http://127.0.0.1:8080/props
  5. get server error 500: vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151552)

Alternate:

  1. compile ggml.org:master without patches and download GGUF from lmstudio-community
  2. do the same steps as before
  3. get the same error

Backtrace:

(gdb) backtrace
#0  0x00007ffff7aa90a1 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007ffff7aa026d in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007ffff7f4b722 in llama_vocab::impl::token_get_attr(int) const () from /home/matteo/programmi/llama.cpp/build/bin/libllama.so
#3  0x00007ffff7f4d28b in llama_vocab::impl::token_to_piece(int, char*, int, int, bool) const () from /home/matteo/programmi/llama.cpp/build/bin/libllama.so
#4  0x00005555556d9fc3 in common_token_to_piece[abi:cxx11](llama_vocab const*, int, bool) ()
#5  0x00005555556da07b in common_token_to_piece[abi:cxx11](llama_context const*, int, bool) ()
#6  0x00005555555abf99 in main::{lambda(httplib::Request const&, httplib::Response&)#15}::operator()(httplib::Request const&, httplib::Response&) const [clone .constprop.0] ()
#7  0x000055555563ba3c in httplib::Server::routing(httplib::Request&, httplib::Response&, httplib::Stream&) ()
#8  0x000055555563d7ad in httplib::Server::process_request(httplib::Stream&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool, bool&, std::function<void (httplib::Request&)> const&) ()

Expected behavior, for reference:

  1. use ggml.org:master and QwQ-32b
  2. open /props
  3. get props without errors

@pwilkin
Author

pwilkin commented Apr 15, 2025

Fixed quants for anyone wishing to test are here: https://huggingface.co/ilintar/THUDM_GLM-Z1-9B-0414_iGGUF (IQ4_NL, Q5_K_M, Q8)

@Noeda
Contributor

Noeda commented Apr 15, 2025

I have fixed some bugs (half rope, GGG output, multi-EOS) in PR #12957 and made the glm4 template the default.

I quantized GLM-9b-Z again with the PR and it seems to work as intended. Good job! There is still a remaining issue: /props is broken when loading the model because of a corruption.

The /props endpoint crashes on the BOS token line in handle_props, here (it's in the stack trace a few comments back, but maybe the exact place helps troubleshooting):

            { "bos_token",                   common_token_to_piece(ctx_server.ctx, llama_vocab_bos(ctx_server.vocab), /* special= */ true)},

I worked around this by setting [gMASK] as the BOS token in the Hugging Face tokenizer_config.json and removing it from the chat template so it's not added twice to the prompts. I don't know if that is meant to be the BOS token; looking at the team's past work, the token has been some kind of "please complete text from this point" marker (alongside other similar tokens that work roughly like FIM). In this model's templates it was always at the beginning; my one guess is that they took that concept but trained the model so that it's always at the start and the entire document is the text to complete, already partially filled. You can maybe use the metadata set tool to work around it, but I don't know the syntax off the top of my head; I'm typically lazy and just rerun the gguf conversion script.

The software I use relies on the /props endpoint to figure out what bos_token should be, and if it can't get it, it makes lots of educated guesses and tests other endpoints with tokenization (which is how I found this myself). Maybe the server should just omit the token, or report it as "null", if the model metadata does not have a BOS (or EOS, etc.). (Assuming that even is the actual issue; this is just what I know from quickly hacking things together and only finding this GitHub issue this morning.)
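For what it's worth, that client-side fallback can be sketched roughly like this (assumptions: the requests package, llama-server's /props endpoint, and "[gMASK]" as the guessed default, matching the workaround above):

import requests

def get_bos_token(base_url="http://127.0.0.1:8080", default=None):
    """Ask /props for bos_token; fall back instead of failing when it 500s."""
    try:
        r = requests.get(f"{base_url}/props", timeout=5)
        r.raise_for_status()
        return r.json().get("bos_token", default)
    except requests.RequestException:
        # Affected GLM-4/Z1 GGUFs currently return a 500 here.
        return default

print(get_bos_token(default="[gMASK]"))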


Some other hacks I did (these should probably be reported on the HF side, but I might as well document them while my mental space is here on GitHub):

In the tokenizer_config.json I also set eos_token to <|user|> because the model had issues understanding when to stop generating new tokens. It doesn't feel like it should be <|user|>, but for some reason it fixed it. This also shows how I added the [gMASK] hack:

  "clean_up_tokenization_spaces": false,
  "do_lower_case": false,
  "bos_token": "[gMASK]",
  "eos_token": "<|user|>",
  "extra_special_tokens": {},
  "model_input_names": [

The tokenizer_config.json wasn't identical between the various models from this family, but I cannot remember off the top of my head which one had which. I think one of them already had <|user|> as eos_token, but I could be misremembering.

You can optionally use YaRN; I tested it and the output is coherent, but I couldn't tell whether it's better or worse at anything. Presumably long contexts would stay more coherent, so you could put this in config.json (see the rope_scaling part):

  "partial_rotary_factor": 0.5,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  },
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",

No idea if these are the correct things to do 🙃 but I got it to at least be usable.

Edit: I also just noticed #12957, which I'll study. If I discover something else I'll comment on the PR or here, wherever seems relevant.

@city96
Contributor

city96 commented Apr 23, 2025

So I did a full re-conversion of the 32B model from the HF safetensor weights locally with the fixes merged, and the metadata during load looks correct:

llama_model_loader: - kv  19:                  glm4.rope.dimension_count u32              = 64

But the output is the same as before: "GGGGGG" repeated infinitely, even for a moderately short prompt. There are reports of this happening on AMD, but oddly enough I get this issue on an NVIDIA card.

I did, however, notice that it only happens if I involve CUDA0 (Tesla V100S) in the mix in any way. If I limit it to the secondary cards (2x P40s) it works fine.

Broken, i.e. spams GGGGG forever:

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-server -ngl 99 -c 8192 -m /mnt/models/llm/GLM-4-32B-0414-Q6_K.bin --chat-template chatglm4
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes

Working:

CUDA_VISIBLE_DEVICES=1,2 ./build/bin/llama-server -ngl 99 -c 8192 -m /mnt/models/llm/GLM-4-32B-0414-Q6_K.bin --chat-template chatglm4
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes

I don't want to ping anyone unnecessarily, but this might be a good way to reproduce/narrow down the issue if someone on the team has access to a Volta card.

@Mushoz

Mushoz commented Apr 23, 2025

It's also happening for me on a 7900 XTX running on ROCm. I have also tried -ngl 0 (i.e., CPU only) and FA enabled/disabled, but all with the same result. Interestingly, the first prompt works fine and returns a coherent response. It's only if I send a follow-up message (i.e., multi-turn conversations) that the model completely breaks down.

@Mushoz

Mushoz commented Apr 23, 2025

Example:

Image

@city96
Contributor

city96 commented Apr 23, 2025

Unsure if this would work on AMD/Vulkan, but I just found out by accident that setting the physical and logical batch size super low seemingly fixes it on my Volta card (launch with -b 32 -ub 32 or even lower).

@Mushoz

Mushoz commented Apr 23, 2025

@city96 That's a very good lead for getting this fixed! This also worked for me on AMD running under ROCm:

Image

@Noeda
Contributor

Noeda commented Apr 24, 2025

I tried to reproduce the bug on Windows, using the official Vulkan build and the AVX CPU build, but unfortunately (fortunately?) it all worked correctly. I was trying to replicate a redditor's setup that I saw earlier, so I used the official release builds for Windows. I have an NVIDIA RTX 3090 Ti. I also tried having a multi-turn conversation (which @Mushoz mentioned makes the model fail), but it all worked fine. I didn't check CUDA builds, so this was an NVIDIA GPU with Vulkan.

Although I did notice that, even with -ngl 0, the llama.cpp log output shows that something was allocated with Vulkan as a "compute buffer". I haven't checked how the code is written here, but if AMD GPU folks get garbled output even with -ngl 0, that maybe narrows the problem down to something that still uses the GPU a bit even with -ngl 0 (based on multiple reports that the prebuilt CPU-only llama.cpp has worked while the Vulkan build has produced garbage).

Maybe that'll narrow down a bit what might be failing, combined with the information that -b 32 and -ub 32 also fix it. My guess: something is overflowing silently and causing garbage results, and a small batch size happens to keep things under the limit.

@jukofyork
Collaborator

found out by accident that setting the physical and logical batch size super low seemingly fixes it on my Volta card

Is it exactly a batch size of 32 that fixes the problem (i.e., does 33 break it)?

Does increasing -b but leaving -ub 32 still fix the problem?

If it is exactly -ub 32 and only Volta+ cards are affected, then it is probably easy to track down in the CUDA backend code.


It's only if I send a followup message (Eg, multi-turn conversations) that the model completely breaks down.

Does running with --no-context-shift make any difference?

@PkmX

PkmX commented Apr 24, 2025

I'm getting all "@" output on Metal with -fa with bartowski's THUDM_GLM-4-32B-0414-Q8_0.gguf on the current master (b5174).

@ggerganov
Member

@PkmX Could you test if #13090 fixes the problem with -fa on Metal?

@piz-ewing

piz-ewing commented Apr 24, 2025

@PkmX Could you test if #13090 fixes the problem with -fa on Metal?

I’ve checked out pr-13090 and tested it with the -fa flag on Metal. So far, everything looks good and the @ output issue seems resolved 👍. I’ll keep an eye out during further use, but it’s looking solid!

Thanks for the great work!

@matteoserva
Contributor

Unsure if this would work on AMD/Vulkan, but I just found out by accident that setting the physical and logical batch size super low seemingly fixes it on my Volta card (launch with -b 32 -ub 32 or even lower).

Adding -ub 32 fixed the problem on my AMD 780M.

@PkmX

PkmX commented Apr 24, 2025

@PkmX Could you test if #13090 fixes the problem with -fa on Metal?

Yes, that PR fixes it.

@city96
Contributor

city96 commented Apr 24, 2025

@jukofyork

Is it exactly a batch size of 32 that fixes the problem (i.e., does 33 break it)?

Okay, so I did some proper tests. Seems like it breaks exactly at 64, but 63 works (-b 63 -ub 63).

Does increasing -b but leaving -ub 32 still fix the problem?

Good call, looks like it. Still works with just -ub 63 in this case.

llama_context: n_batch       = 2048
llama_context: n_ubatch      = 63

If it is exactly -ub 32 and only Volta+ cards are affected, then it is probably easy to track down in the CUDA backend code.

So it looks like it's ubatch >= 64 that breaks it, then. It almost looks specific to Volta (+AMD, maybe Turing?), considering the Pascal cards I have work correctly, and if it were broken on Ampere there would be reports of that. For me it happens on the first message too if it's long enough, so disabling context shift doesn't fix it.


Edit: Building with GGML_CUDA_FORCE_MMQ=1 also fixes it, so probably an FP16 overflow somewhere? The readme at docs/build.md mentions that it "affects V100, CDNA and RDNA3+"

Edit 2: Definitely FP16. Forcing cuBLAS to use FP32 here also makes it work:

const bool use_fp16 = (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) && ggml_is_contiguous(src0) && row_diff == src0->ne[1] && dst->op_params[0] == GGML_PREC_DEFAULT;

@Mushoz

Mushoz commented Apr 24, 2025

For me it happens on the first message too if it's long enough, so disabling context shift doesn't fix it.

Good call, I guess it would also happen on first messages that are >= 64 tokens long, since that would also cause the first message to be processed at batch sizes that are bugged. Why does it specifically break for this model though? My AMD card is fine for all other models.

@jukofyork
Collaborator

@jukofyork

Is it exactly a batch size of 32 that fixes the problem (i.e., does 33 break it)?

Okay, so I did some proper tests. Seems like it breaks exactly at 64, but 63 works (-b 63 -ub 63).

Does increasing -b but leaving -ub 32 still fix the problem?

Good call, looks like it. Still works with just -ub 63 in this case.

llama_context: n_batch       = 2048
llama_context: n_ubatch      = 63

If it is exactly -ub 32 and only Volta+ cards are affected, then it is probably easy to track down in the CUDA backend code.

So looks like it's ubatch>=64 that breaks it then. It almost looks like it's specific to Volta (+AMD. maybe Turing?), considering the Pascal cards I have do work correctly, and if it was broken on Ampere there would be reports of that. For me it happens on the first message too if it's long enough, so disabling context shift doesn't fix it.

Edit: Building with GGML_CUDA_FORCE_MMQ=1 also fixes it, so probably an FP16 overflow somewhere? The readme at docs/build.md mentions that it "affects V100, CDNA and RDNA3+"

Edit2: Definitely FP16. Forcing cublas to use fp32 here also makes it work:

llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu

Line 1196 in 7c727fb
const bool use_fp16 = (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) && ggml_is_contiguous(src0) && row_diff == src0->ne[1] && dst->op_params[0] == GGML_PREC_DEFAULT;

There are quite a few cases to do with 64 and VOLTA in here:

https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cuda/mmq.cuh

but I'm not sure how to narrow it down further :/

Maybe try to compile with GGML_CUDA_FORCE_MMQ and then GGML_CUDA_FORCE_CUBLAS to see if either of these has any effect?

@city96
Contributor

city96 commented Apr 24, 2025

Why does it specifically break for this model though? My AMD card is fine for all other models.

I assume it's an edge case where the product of two FP16 matrices overflows or isn't representable properly.

@jukofyork

See the edit at the bottom of my last message: forcing MMQ does fix it. I think it's the cuBLAS path, specifically for FP16.

I can reproduce the issue on Pascal by compiling with GGML_CUDA_FORCE_CUBLAS=1 and changing the arch check in the line below to also pick the FP16 code path for Pascal:

} else if (((GGML_CUDA_CC_IS_NVIDIA(cc) && cc >= GGML_CUDA_CC_VOLTA) || GGML_CUDA_CC_IS_AMD(cc)) && use_fp16) {

@pwilkin
Author

pwilkin commented Apr 24, 2025

I'm far from competent here, but might it be a similar case to this one, i.e. additional precision being required?

7604a7d

@city96
Contributor

city96 commented Apr 24, 2025

^ It probably is, but as far as I can tell it overflows in the call to cublasGemmEx, which is part of cuBLAS (an external library).

If you force it to use FP32 for the output, like here, it does work, while the current code uses FP16 for the output and then upcasts it to FP32 here, which fails with GLM4.
I don't know what the performance implications of always using FP32 for the output in cuBLAS would be; I can make a PR if that helps?

@JohannesGaessler
Collaborator

Without having debugged this myself, this is likely a problem with the numerical range of the accumulator. cuBLAS GEMM is fastest with the 16F compute type, but unfortunately that compute type only supports 16F accumulators, so output values >65504 cause an overflow. MMQ always uses FP32 accumulators with a numerical range of around $10^{38}$, so an overflow in the accumulator basically never happens.

However, MMQ performance is not always better than cuBLAS FP16 GEMM. On Pascal or older, where there are no tensor cores, MMQ is always preferred over cuBLAS because it can make use of integer instructions instead of floating-point instructions. On Turing or newer, int8 tensor cores are available, which makes MMQ faster than cuBLAS at the batch sizes relevant for llama.cpp, so it's also used unconditionally there. On Volta only FP16 tensor cores are available, so MMQ is preferred only for small batch sizes, where the overhead of dequantizing the weight matrix to FP16 outweighs the potential speed gain from using tensor cores. AMD GPUs with fast FP16 math do not support the int8 tensor core operations that MMQ uses, so they are effectively treated the same way as Volta.
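A toy numpy sketch of the accumulator effect described above (illustrative numbers only, not taken from the model; assumes numpy is installed):

import numpy as np

a = np.full(4096, 8.0, dtype=np.float16)   # one row of activations
b = np.full(4096, 8.0, dtype=np.float16)   # one column of weights

# FP16 accumulator: the running sum passes the float16 maximum (65504)
# and overflows to +inf, which then shows up downstream as garbage output.
acc16 = np.float16(0.0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)

# FP32 accumulator (what MMQ uses): same inputs, no overflow.
acc32 = np.dot(a.astype(np.float32), b.astype(np.float32))

print(acc16)  # inf
print(acc32)  # 262144.0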
