run_awq.py: error when quantizing qwen1.5-7b-chat #125

Open

Wikeolf opened this issue Oct 5, 2024 · 4 comments

Wikeolf commented Oct 5, 2024

python run_awq.py --model_name Qwen/Qwen1.5-7B-Chat --task quantize
Namespace(model_name='Qwen/Qwen1.5-7B-Chat', target='aie', profile_layer=False, task='quantize', precision='w4abf16', flash_attention_plus=False, profilegemm=False, dataset='raw', fast_mlp=False, fast_attention=False, w_bit=4, group_size=128, algorithm='awq', gen_onnx_nodes=False, mhaops='all')
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00, 3.44s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Qwen2ModelEval(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 4096)
    (layers): ModuleList(
      (0-31): 32 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)
[RyzenAILLMQuantizer] [AWQ] Calculating AWQ scales ...
Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (57053 > 32768). Running this sequence through the model will result in indexing errors

* Split into 59 blocks
Running AWQ...:   0%| | 0/32 [00:24<?, ?it/s]
Traceback (most recent call last):
  File "D:\RyzenAI-SW\example\transformers\models\llm\run_awq.py", line 355, in <module>
    model = RyzenAILLMQuantizer.quantize(model, quant_config=quant_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\RyzenAI-SW\example\transformers\tools\ryzenai_llm_quantizer.py", line 162, in quantize
    awq_results = run_awq(
                  ^^^^^^^^
  File "C:\Users\Azure\miniconda3\envs\ryzenai-transformers\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\RyzenAI-SW\example\transformers\ext\llm-awq\awq\quantize\pre_quant.py", line 196, in run_awq
    scales_list = auto_scale_block(
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\Azure\miniconda3\envs\ryzenai-transformers\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\RyzenAI-SW\example\transformers\ext\llm-awq\awq\quantize\auto_scale.py", line 309, in auto_scale_block
    scales_list.append(_auto_get_scale(
                       ^^^^^^^^^^^^^^^^
  File "D:\RyzenAI-SW\example\transformers\ext\llm-awq\awq\quantize\auto_scale.py", line 170, in _auto_get_scale
    scales = _search_module_scale(module2inspect, layers, inp, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\RyzenAI-SW\example\transformers\ext\llm-awq\awq\quantize\auto_scale.py", line 119, in _search_module_scale
    org_out = block(x, **kwargs)
              ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Azure\miniconda3\envs\ryzenai-transformers\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Azure\miniconda3\envs\ryzenai-transformers\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Azure\miniconda3\envs\ryzenai-transformers\Lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 298, in forward
    raise ValueError(
ValueError: Attention mask should be of size (59, 1, 512, 1024), but is torch.Size([59, 1, 512, 512])
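
For context, the check that raises this error in transformers' modeling_qwen2.py looks roughly like the following (a paraphrase of the shape check named in the traceback, not the exact source; variable names may differ by version):

    # Sketch of the shape check in Qwen2Attention.forward that produces the
    # error above (paraphrased; see modeling_qwen2.py, line 298 in the traceback).
    if attention_mask is not None:
        if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
            raise ValueError(
                f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, "
                f"but is {attention_mask.size()}"
            )

Reading the message against this check: bsz is 59 (the calibration blocks) and q_len is 512, but the attention layer computes kv_seq_len as 1024, while the mask passed in during AWQ calibration only covers 512 key/value positions.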

Wikeolf commented Oct 5, 2024

Updating transformers to 4.39.3 gives the same error.
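
(For reference, assuming a standard pip-managed environment, the version change would be applied with:)

    pip install transformers==4.39.3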

@shivani-athavale

Hi @Wikeolf, when I try to run the command:

python run_awq.py --model_name Qwen/Qwen1.5-7B-Chat --task quantize

I get the following error message and the code exits:

[RyzenAILLMQuantizer] [AWQ] Looking for Z:\ext\awq_cache\Qwen1.5-7B-Chat-w4-g128.pt
[RyzenAILLMQuantizer] [AWQ] No precalculated scales available for Qwen1.5-7B-Chat w_bit:4 group_size:128

I was curious to know where you got the AWQ scales for Qwen, or why you do not see this message?


Wikeolf commented Oct 10, 2024

> Hi @Wikeolf, when I try to run the command:
>
> python run_awq.py --model_name Qwen/Qwen1.5-7B-Chat --task quantize
>
> I get the following error message and the code exits:
>
> [RyzenAILLMQuantizer] [AWQ] Looking for Z:\ext\awq_cache\Qwen1.5-7B-Chat-w4-g128.pt
> [RyzenAILLMQuantizer] [AWQ] No precalculated scales available for Qwen1.5-7B-Chat w_bit:4 group_size:128
>
> I was curious to know where you got the AWQ scales for Qwen, or why you do not see this message?

I have seen this message.

If you search this repo for the text of that message, you can find where it is printed. The AWQ Model Zoo does not provide the file Qwen1.5-7B-Chat-w4-g128.pt, so the lookup can never succeed. The problem can be worked around by modifying the code in run_awq.py:

            # set use_scales = False in quant config to calculate new awq scales
            use_qscales = False

I suspect this is a bug introduced during development: run_awq.py makes some special configuration for Qwen, but this variable was never set to False there, which leads to this issue.

If you set this variable to False, you should hit the same error I did, but good luck.
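
A minimal sketch of where such a workaround might sit in run_awq.py; apart from use_qscales = False, the structure and names here are assumptions for illustration, not the actual repo code:

    # Hypothetical sketch: force fresh AWQ scale calculation for Qwen,
    # since the AWQ Model Zoo ships no Qwen1.5-7B-Chat-w4-g128.pt.
    use_qscales = True  # default: load precalculated scales from awq_cache
    if "qwen" in args.model_name.lower():
        # set use_qscales = False in quant config to calculate new awq scales
        use_qscales = False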

@shivani-athavale

Thanks for reporting this. While we check on this, maybe you can try 'pergrp' quantization.

python run_awq.py --model_name Qwen/Qwen1.5-7B-Chat --task quantize --algorithm pergrp
