
[Bugfix] fix moe_wna16 get_quant_method #12648

Merged
merged 1 commit into vllm-project:main on Feb 2, 2025

Conversation

@jinzhen-lin (Contributor) commented Feb 1, 2025

Fix #12647
The `get_quant_method` of `moe_wna16` always returns the MoE method, a GPTQ-based linear method, or an AWQ-based linear method, even when the target module is an attention layer.

```python
# vllm/attention/layer.py (lines 86-92 at commit baeded2)
quant_method = quant_config.get_quant_method(
    self, prefix=prefix) if quant_config else None
if quant_method is not None:
    assert isinstance(quant_method, BaseKVCacheMethod)
    # TODO (mgoin): kv cache dtype should be specified in the FP8
    # checkpoint config and become the "auto" behavior
```
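For readers outside the quantization code, here is a minimal, self-contained sketch of the dispatch this PR is about. The class names below (`MoeWNA16LikeConfig` and the stand-in layer and method classes) are illustrative assumptions, not vLLM's real classes; the point is only that `get_quant_method` has to return `None` for layers it does not handle, such as the attention layer, so the `BaseKVCacheMethod` assertion above is never hit with the wrong method type.

```python
# Minimal, self-contained sketch (stand-in classes, not vLLM's real ones) of the
# dispatch problem described above. Only the shape of the fix -- returning None
# for layer types the config does not handle -- reflects the change in this PR.

class BaseKVCacheMethod:          # what the attention layer asserts on, if anything
    pass

class MoEMethod:                  # stand-in for the moe_wna16 fused-MoE method
    pass

class LinearMethod:               # stand-in for a GPTQ/AWQ linear method
    pass

class FusedMoE:                   # stand-in layer types
    pass

class LinearBase:
    pass

class Attention:
    pass

class MoeWNA16LikeConfig:
    def get_quant_method(self, layer, prefix: str):
        # Buggy behavior: unconditionally returning a MoE/linear method meant the
        # attention layer received one too, tripping
        # `assert isinstance(quant_method, BaseKVCacheMethod)` in the snippet above.
        # Fixed behavior: dispatch on the layer type and fall through to None.
        if isinstance(layer, FusedMoE):
            return MoEMethod()
        if isinstance(layer, LinearBase):
            return LinearMethod()
        return None  # attention (and anything else) gets no quant method


config = MoeWNA16LikeConfig()
assert isinstance(config.get_quant_method(FusedMoE(), prefix="mlp.experts"), MoEMethod)
assert config.get_quant_method(Attention(), prefix="self_attn.attn") is None
```

With that layer-type check in place, the `vllm/attention/layer.py` snippet above sees `quant_method is None` for `moe_wna16` checkpoints and skips the KV-cache assertion entirely.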


github-actions bot commented Feb 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Feb 1, 2025
@DarkLight1337 merged commit baaa2b2 into vllm-project:main on Feb 2, 2025
61 of 63 checks passed
Isotr0py pushed a commit to Isotr0py/vllm that referenced this pull request Feb 2, 2025
youngkent pushed a commit to youngkent/vllm that referenced this pull request Feb 3, 2025
srikanthsrnvs pushed a commit to srikanthsrnvs/vllm that referenced this pull request Feb 3, 2025
sahelib25 pushed a commit to krai/vllm that referenced this pull request Feb 3, 2025
fxmarty-amd pushed a commit to fxmarty-amd/vllm that referenced this pull request Feb 7, 2025
NickLucche pushed a commit to NickLucche/vllm that referenced this pull request Feb 7, 2025
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Development

Successfully merging this pull request may close these issues.

[Bug]: Assertion Error When Using moe_wna16