Fix quark fp8 format loading. #395

Open
wants to merge 1 commit into main
Conversation

fxmarty-amd commented Jan 31, 2025

This PR fixes an issue with the order in which requantize_with_max_scale is called when loading a checkpoint in the Quark config.json/checkpoint format.

Compare the following:

normalize_e4m3fn_to_e4m3fnuz should be called before requantize_with_max_scale.
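
For illustration, a minimal sketch of the intended ordering on ROCm. The two helper names come from the PR description and the reviewed code; the layer attributes and the logical_widths argument are assumptions about the surrounding code, not the PR's exact diff.

    # Sketch only: on ROCm, normalize the e4m3fn checkpoint to e4m3fnuz first,
    # so the scale handed to the requantization step already describes fnuz data.
    weight, max_w_scale, input_scale = normalize_e4m3fn_to_e4m3fnuz(
        weight=layer.weight,
        weight_scale=layer.weight_scale,
        input_scale=layer.input_scale)

    # Only then requantize the fused shards to a single per-tensor max scale.
    max_w_scale, weight = requantize_with_max_scale(
        weight=weight,
        weight_scale=max_w_scale,
        logical_widths=layer.logical_widths)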

gshtras (Collaborator) commented Jan 31, 2025

A separate PR is not required. Once the upstream one is accepted, we will usually get it here within a week.

fxmarty-amd (Author) commented
@gshtras I think ROCm/vllm is more widely used internally, so it would be nice to include this here before it is merged upstream, but I can close this if you prefer!

                    input_scale=layer.input_scale)
                if input_scale is not None:
                    layer.input_scale = Parameter(input_scale,
                                                  requires_grad=False)

            max_w_scale, weight = requantize_with_max_scale(
                weight=weight,
                weight_scale=max_w_scale,


Both weight and max_w_scale are undefined here if current_platform.is_rocm() is False.
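
One way to keep both names bound on non-ROCm platforms (a sketch of a possible fix, not necessarily what the PR does) is to initialize them from the layer unconditionally and let the ROCm-only branch overwrite them:

    # Sketch only: default to the layer's stored tensors so weight and
    # max_w_scale are always defined, then overwrite them on ROCm with the
    # fn -> fnuz normalized versions before requantizing.
    weight = layer.weight
    max_w_scale = layer.weight_scale

    if current_platform.is_rocm():
        weight, max_w_scale, input_scale = normalize_e4m3fn_to_e4m3fnuz(
            weight=weight,
            weight_scale=max_w_scale,
            input_scale=layer.input_scale)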

BowenBao commented Jan 31, 2025

Do you observe better accuracy with this change? Mathematically the order should not matter.

edit: taking this back, the order does matter, since otherwise requantize_with_max_scale is requantizing into fnuz using fn scales. Let's point this out in the PR description.
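
For readers following along, a rough numeric illustration of that point (assumed values, not from the PR): the same bit pattern encodes half the value in e4m3fnuz compared to e4m3fn, so the fn-to-fnuz normalization has to double the scale, and requantizing before that adjustment pairs fnuz data with fn-sized scales.

    import torch

    # Sketch only: 1.0 stored as e4m3fn reads back as 0.5 when the same bits
    # are reinterpreted as e4m3fnuz, so the scale must be doubled to keep the
    # dequantized value (encoded_value * scale) unchanged.
    w_fn = torch.tensor([1.0], dtype=torch.float8_e4m3fn)
    scale_fn = torch.tensor(0.25)

    w_fnuz = w_fn.view(torch.int8).view(torch.float8_e4m3fnuz)  # same bits, half value
    scale_fnuz = scale_fn * 2.0

    assert torch.allclose(w_fn.float() * scale_fn, w_fnuz.float() * scale_fnuz)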
