
[Quantization] Channel-wise Output Activation Quantization for Attention QKV Modules + KV-cache channel quantization #1233

Draft
wants to merge 9 commits into base: main

Conversation

horheynm (Collaborator) commented Mar 7, 2025

Blocked on: neuralmagic/compressed-tensors#270

SUMMARY:
Quantize the output activations of the attention layers channel-wise -> previously unsupported; the wrong dim was selected to quantize.
Quantize the kv-cache channel-wise in int8 -> previously only tensor-wise was supported.

The attention projections we need to worry about are Q/K/V. The o/up/down projections are not quantized.

Math:
x is the input tensor -> tokenized + embedded hidden states
the QKV weights are Linear modules
the output is the forward call of each QKV Linear on x

# x
(Pdb) hidden_states.shape -> torch.Size([1, 1930, 4096]) -> [batch, seq_len, hidden_size]

# weight
(Pdb) self.q_proj.weight.shape -> torch.Size([4096, 4096]) -> [hidden_size, hidden_size]
(Pdb) self.k_proj.weight.shape -> torch.Size([1024, 4096]) -> [num_key_value_heads * head_dim, hidden_size]
(Pdb) self.v_proj.weight.shape -> torch.Size([1024, 4096]) -> [num_key_value_heads * head_dim, hidden_size]

# output
(Pdb) self.q_proj(hidden_states).shape -> torch.Size([1, 1930, 4096]) -> [batch, seq_len, hidden_size]
(Pdb) self.k_proj(hidden_states).shape -> torch.Size([1, 1930, 1024]) -> [batch, seq_len, num_key_value_heads * head_dim]
(Pdb) self.v_proj(hidden_states).shape -> torch.Size([1, 1930, 1024]) -> [batch, seq_len, num_key_value_heads * head_dim]

# key_states, value_states shape
[batch, num_key_value_heads, seq_len, head_dim]
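
A minimal sketch (plain PyTorch, not the actual attention implementation) of where head_dim comes from, assuming num_key_value_heads=8 and head_dim=128 so that 8 * 128 = 1024 matches the k_proj/v_proj shapes above:

```python
import torch

batch, seq_len = 1, 1930
num_key_value_heads, head_dim = 8, 128  # assumed split of the 1024-dim k/v output

# stand-in for k_proj(hidden_states): [batch, seq_len, num_key_value_heads * head_dim]
k_out = torch.randn(batch, seq_len, num_key_value_heads * head_dim)

# reshape into key_states: [batch, num_key_value_heads, seq_len, head_dim]
key_states = k_out.view(batch, seq_len, num_key_value_heads, head_dim).transpose(1, 2)
print(key_states.shape)  # torch.Size([1, 8, 1930, 128])
```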

Expected scale and zero-point (zp) shapes for the output activations:

q_proj activations -> [4096] -> [hidden_size]
k_proj activations -> [1024] -> [num_key_value_heads * head_dim]
v_proj activations -> [1024] -> [num_key_value_heads * head_dim]
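
For the output activations, a rough sketch of the channel-wise shape math (not the observer code in this PR; symmetric int8 assumed, so the zero-point is all zeros):

```python
import torch

out = torch.randn(1, 1930, 1024)  # e.g. k_proj(hidden_states)

# channel-wise: reduce over batch and seq_len, keep the last (channel) dim
max_val = out.amax(dim=(0, 1))  # torch.Size([1024])
min_val = out.amin(dim=(0, 1))  # torch.Size([1024])

# symmetric int8 scale / zero-point per channel
scale = torch.maximum(max_val.abs(), min_val.abs()) / 127.0
zero_point = torch.zeros_like(scale, dtype=torch.int8)
print(scale.shape, zero_point.shape)  # torch.Size([1024]) torch.Size([1024])
```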

Expected scale and zero-point (zp) shapes for channel-wise kv-cache quantization:

k_proj, v_proj -> [head_dim]
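
For the kv-cache channel case, the reduction instead keeps only head_dim (again a sketch, not the PR's observer):

```python
import torch

key_states = torch.randn(1, 8, 1930, 128)  # [batch, num_kv_heads, seq_len, head_dim]

# reduce over batch, heads, and seq_len; keep head_dim
k_absmax = key_states.abs().amax(dim=(0, 1, 2))
k_scale = k_absmax / 127.0
print(k_scale.shape)  # torch.Size([128]) == [head_dim]
```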

The observer outputs scale/zero-point vectors with the same ndim as the given output activation tensor (i.e. for an output of torch.Size([1, 1930, 1024]) it produces torch.Size([1, 1, 1024])). Squeeze them down to torch.Size([1024]), i.e. ndim of 1.
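
For example (illustrative only):

```python
import torch

observed = torch.randn(1, 1930, 1024)                          # k_proj output activations
scale = observed.abs().amax(dim=(0, 1), keepdim=True) / 127.0  # torch.Size([1, 1, 1024])
scale = scale.squeeze()                                         # torch.Size([1024]), ndim == 1
```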

TEST PLAN:

  • Pass tests
  • Pass eval

github-actions bot commented Mar 7, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

horheynm (Collaborator, Author) commented Mar 7, 2025

Next TODO: add support for group quantization for output activations.

horheynm added 2 commits March 6, 2025 23:36
Signed-off-by: George Ohashi <[email protected]>
@horheynm horheynm added the ready When a PR is ready for review label Mar 7, 2025
@dsikka dsikka marked this pull request as draft March 7, 2025 14:27
Signed-off-by: George Ohashi <[email protected]>
@horheynm horheynm changed the title [Quantization] Channel-wise Output Activation Quantization for Attention QKV Modules [Quantization] Channel-wise Output Activation Quantization for Attention QKV Modules + KV-cache channel quantization Mar 7, 2025
horheynm (Collaborator, Author) commented Mar 7, 2025

Will break the kv-cache logic out into a separate PR.
