[CPU]PageAttn with 4bit-quantization #27992

zhangYiIntel · 2024-12-10T08:23:41Z

Details:

Add a new hint to set group_size for the key/value cache
Add grouped 4-bit sym/asym quantization support for PageAttentionNode
Add grouped quantization support for U8 quantization in PageAttentionNode
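
For readers unfamiliar with grouped quantization, here is a minimal reference sketch of the asymmetric u4 case. It is not the kernel added in this PR: the struct name, the function names and the nibble layout (low nibble first) are assumptions made for illustration, and the symmetric s4 variant is not shown. Each group of `group_size` consecutive values gets its own scale and zero point.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one quantized row; not the plugin's real memory layout.
struct QuantizedRowU4 {
    std::vector<uint8_t> packed;    // two 4-bit codes per byte, low nibble first (assumed layout)
    std::vector<float> scale;       // one scale per group
    std::vector<float> zero_point;  // one zero point per group (asymmetric quantization)
};

// Asymmetric u4 quantization with per-group scale and zero point.
// An outlier only widens the range of its own group, not of the whole row.
inline QuantizedRowU4 quantize_u4_grouped(const std::vector<float>& src, size_t group_size) {
    assert(group_size > 0);
    QuantizedRowU4 out;
    out.packed.assign((src.size() + 1) / 2, 0);
    for (size_t begin = 0; begin < src.size(); begin += group_size) {
        const size_t end = std::min(begin + group_size, src.size());
        const auto mm = std::minmax_element(src.begin() + begin, src.begin() + end);
        const float lo = *mm.first;
        const float hi = *mm.second;
        // 4 bits give 16 levels (0..15); fall back to scale 1 for a constant group.
        const float scale = (hi > lo) ? (hi - lo) / 15.0f : 1.0f;
        const float zp = -lo / scale;
        out.scale.push_back(scale);
        out.zero_point.push_back(zp);
        for (size_t i = begin; i < end; ++i) {
            const float q = std::round(src[i] / scale + zp);
            const uint8_t code = static_cast<uint8_t>(std::min(15.0f, std::max(0.0f, q)));
            out.packed[i / 2] |= (i % 2 == 0) ? code : static_cast<uint8_t>(code << 4);
        }
    }
    return out;
}

// Inverse mapping: x ~= (code - zero_point) * scale, using the group the element belongs to.
inline std::vector<float> dequantize_u4_grouped(const QuantizedRowU4& q, size_t count, size_t group_size) {
    assert(group_size > 0);
    std::vector<float> dst(count);
    for (size_t i = 0; i < count; ++i) {
        const uint8_t byte = q.packed[i / 2];
        const uint8_t code = static_cast<uint8_t>((i % 2 == 0) ? (byte & 0x0F) : (byte >> 4));
        const size_t g = i / group_size;
        dst[i] = (static_cast<float>(code) - q.zero_point[g]) * q.scale[g];
    }
    return dst;
}
```

With a smaller group size, each scale covers a narrower value range, so the rounding error per element shrinks at the cost of storing more scales and zero points.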

Tickets:

CVS-151586

Signed-off-by: [email protected] <[email protected]>

Signed-off-by: Zhang Yi3 <[email protected]>

src/plugins/intel_cpu/src/nodes/paged_attn.cpp

src/plugins/intel_cpu/src/nodes/scaled_attn.cpp

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant_kernel.hpp

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

Signed-off-by: Zhang Yi3 <[email protected]>

luo-cheng2021

LGTM, great job!

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp

zhangYiIntel · 2024-12-20T03:20:17Z

@dmitry-gorokhov Could you please review this?

src/plugins/intel_cpu/src/config.h

dmitry-gorokhov · 2025-01-02T08:09:35Z

src/bindings/python/src/pyopenvino/core/properties/properties.cpp

+    wrap_property_RW(m_hint, ov::hint::key_cache_precision, "key_cache_precision");
+    wrap_property_RW(m_hint, ov::hint::value_cache_precision, "value_cache_precision");
+    wrap_property_RW(m_hint, ov::hint::key_cache_group_size, "key_cache_group_size");
+    wrap_property_RW(m_hint, ov::hint::value_cache_group_size, "value_cache_group_size");


We need to align on the positioning of these options.
We already have a high-level hint for the KV cache: ov::hint::kv_cache_precision. The new options are fine-tuning options, so I propose the following:

New options shouldn't be treated as hints: let's move them out of the hint namespace.

ov::hint::kv_cache_precision should remain the main option for KV-cache quantization control, including in how we position it to users.

ov::hint::kv_cache_precision (like other hints) should determine the values of the lower-level options: ov::hint::key_cache_precision/ov::hint::value_cache_precision/ov::hint::key_cache_group_size/ov::hint::value_cache_group_size. E.g. ov::hint::kv_cache_precision == u4 will result in a (u8/u4/32/32) config for the lower-level options.

Users will be able to override the behavior of the high-level hint by changing the values of the low-level properties.

cc'ed @AlexKoff88 @vladimir-paramuzov @sshlyapn @p-durandin
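
The proposal above could be sketched roughly as follows. This is an illustration only, not the plugin's Config code: `Precision`, `UserOverrides`, `ResolvedKvCacheConfig`, `defaults_for_hint` and `resolve` are hypothetical names. Only the u4 row (u8/u4/32/32) comes from the comment above; the other rows are assumed placeholders. The sketch needs C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>

// Illustrative stand-in for ov::element::Type; not the real OpenVINO type.
enum class Precision { f32, u8, u4 };

// Low-level options as the user provided them; an empty optional means "not set".
struct UserOverrides {
    std::optional<Precision> key_precision;
    std::optional<Precision> value_precision;
    std::optional<size_t> key_group_size;
    std::optional<size_t> value_group_size;
};

// Final values consumed by the attention nodes.
struct ResolvedKvCacheConfig {
    Precision key_precision;
    Precision value_precision;
    size_t key_group_size;
    size_t value_group_size;
};

// Defaults implied by the high-level hint.
inline ResolvedKvCacheConfig defaults_for_hint(Precision kv_cache_precision) {
    switch (kv_cache_precision) {
    case Precision::u4:
        return {Precision::u8, Precision::u4, 32, 32};   // from the review comment
    case Precision::u8:
        return {Precision::u8, Precision::u8, 32, 32};   // assumed for illustration
    case Precision::f32:
        return {Precision::f32, Precision::f32, 0, 0};   // assumed: 0 means "no grouping"
    }
    throw std::invalid_argument("unsupported kv_cache_precision");
}

// Low-level options win whenever the user set them; otherwise the hint decides.
inline ResolvedKvCacheConfig resolve(Precision hint, const UserOverrides& user) {
    ResolvedKvCacheConfig cfg = defaults_for_hint(hint);
    cfg.key_precision = user.key_precision.value_or(cfg.key_precision);
    cfg.value_precision = user.value_precision.value_or(cfg.value_precision);
    cfg.key_group_size = user.key_group_size.value_or(cfg.key_group_size);
    cfg.value_group_size = user.value_group_size.value_or(cfg.value_group_size);
    return cfg;
}
```

Tracking "not set" explicitly is what lets a node tell a user-provided value apart from a default.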

I think it looks good. Just to clarify:

We could keep ov::hint::kv_cache_precision as the default, coarse-grained control of KV-cache quantization parameters. I would deprecate it at some point (not sure when the best time is).

ov::hint::key_cache_precision, ov::hint::value_cache_precision, ov::hint::key_cache_group_size, ov::hint::value_cache_group_size are for fine-grained control of KV-cache quantization, and they take priority over ov::hint::kv_cache_precision if defined. ov::hint::key_cache_group_size and ov::hint::value_cache_group_size should have reasonable defaults, e.g. 32 or 64, whichever suits the runtime best.

We should be able to define any of these options via the compilation config and the rt_info/runtime_options subsection of the IR.

@yury-gorbachev, shall we discuss and approve this item?

@dmitry-gorokhov If we don't use the hint namespace, is there a better namespace for these?

@zhangYiIntel just ov::key_cache_precision.
You can use ov::num_streams as an example: it is a low-level property that is affected by high-level hints like ov::hint::performance_mode.

src/plugins/intel_cpu/src/config.cpp

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp

src/plugins/intel_cpu/src/nodes/paged_attn.cpp

dmitry-gorokhov · 2025-01-02T08:58:52Z

src/plugins/intel_cpu/src/config.cpp

+                               " for property key ",
+                               key,
+                               ". Expected only unsinged integer numbers");
+            }


(Commenting at a random place.) We need to restrict the supported key_cache_precision/value_cache_precision combinations for the SDPA operation. Otherwise these properties currently have no impact on execution.

That's right, the current SDPA only supports a U8 KV cache. Maybe we could just check the validity of the properties in the SDPA node.

One remaining problem here is the priority between kvCachePrecision and keyCachePrecision/valueCachePrecision. Inside the SDPA node, we cannot tell which one is valid, since all of them have default values.

As we discussed, kvCachePrecision is a hint, while keyCachePrecision/valueCachePrecision are low-level options. Low-level options should always have higher priority.
In practice, you should use keyCachePrecision/valueCachePrecision in the SDPA node. keyCachePrecision/valueCachePrecision should be initialized in the Config from the user-provided values, or from kvCachePrecision if the user provided none (you can check how inference_precision is initialized based on execution_mode).
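
A validity check of the kind discussed in this thread might look like the sketch below. It is an assumption-laden illustration, not the code that landed: `Precision`, `sdpa_supports_cache_precision` and `validate_sdpa_cache_precision` are hypothetical names, and accepting f32 (an unquantized cache) alongside u8 is an assumption; the thread only states that SDPA currently supports a U8 KV cache.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Illustrative stand-in for ov::element::Type; not the real OpenVINO type.
enum class Precision { f32, u8, u4 };

inline std::string to_string(Precision p) {
    switch (p) {
    case Precision::f32:
        return "f32";
    case Precision::u8:
        return "u8";
    case Precision::u4:
        return "u4";
    }
    return "unknown";
}

// Assumed support matrix for SDPA: unquantized f32 or quantized u8 only.
inline bool sdpa_supports_cache_precision(Precision p) {
    return p == Precision::f32 || p == Precision::u8;
}

// Reject unsupported combinations up front, so that a low-level property
// is never silently ignored by the node.
inline void validate_sdpa_cache_precision(Precision key, Precision value) {
    if (!sdpa_supports_cache_precision(key) || !sdpa_supports_cache_precision(value)) {
        throw std::invalid_argument("SDPA does not support key/value cache precision " +
                                    to_string(key) + "/" + to_string(value));
    }
}
```

The node would call such a check on the already resolved low-level values, so it never needs to look at kvCachePrecision itself.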

src/plugins/intel_cpu/tests/functional/custom/behavior/ov_executable_network/properties.cpp

Signed-off-by: Zhang Yi <[email protected]>

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp

src/plugins/intel_cpu/src/config.cpp

Signed-off-by: Zhang Yi <[email protected]>

zhangYiIntel added 13 commits December 9, 2024 16:03

[CPU]separate precisions of kv cache

15fcdb8

Signed-off-by: [email protected] <[email protected]>

[CPU]use element as template args

82f843a

[CPU]make quantize grouped

a754404

[CPU]make u8 kernel grouped

2aba224

[CPU]U4 Group size support with reference

fc435f6

Signed-off-by: [email protected] <[email protected]>

[CPU]AVX512 support for u4 kernel

d080e2a

[CPU]Support S4 quantization

78ef4dd

Signed-off-by: [email protected] <[email protected]>

[CPU]use AVX512 to quant s4

3e821ea

[CPU]4-bit quantization with avx2

80b093f

Signed-off-by: [email protected] <[email protected]>

fix build on elder compiler

13a496e

[CPU]fix fp32 inference

92e6cb3

[CPU]set group size via hint

91ebc09

Signed-off-by: Zhang Yi3 <[email protected]>

[CPU]fix code style

685f263

Signed-off-by: Zhang Yi3 <[email protected]>

github-actions bot added the labels category: inference, category: CPU, category: Python API, category: CPP API on Dec 10, 2024

zhangYiIntel changed the title ~~Yi3/4bit cache~~ [CPU]PageAttn with 4bit-quantization Dec 10, 2024

[CPU]fix property test

e56639a

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from a12c86f to e56639a Compare December 11, 2024 02:45

[CPU]add cache precision check

a34ce8b

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel mentioned this pull request Dec 12, 2024

[CB]Support 4-bit cache openvinotoolkit/openvino.genai#1366

Draft

zhangYiIntel marked this pull request as ready for review December 12, 2024 01:57

zhangYiIntel requested review from a team as code owners December 12, 2024 01:57

zhangYiIntel added 2 commits December 12, 2024 09:57

Merge branch 'master' into yi3/4bit-cache

8548773

[CPU]fix code style of config.cpp

fe6c311

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from 373d50d to fe6c311 Compare December 12, 2024 03:17

zhangYiIntel force-pushed the yi3/4bit-cache branch from 5bc75f8 to 76380d1 Compare December 12, 2024 08:35

Merge branch 'master' into yi3/4bit-cache

522215a

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from 76380d1 to 522215a Compare December 13, 2024 00:17

yuxu42 requested a review from luo-cheng2021 December 13, 2024 01:40

luo-cheng2021 reviewed Dec 13, 2024

zhangYiIntel added 3 commits December 17, 2024 16:51

[CPU]pre calculate count

8faadd8

[CPU]Use ov::element as template args

b4b0f0d

Signed-off-by: Zhang Yi3 <[email protected]>

[CPU]remove redundant marco

5c838f7

zhangYiIntel requested a review from luo-cheng2021 December 18, 2024 08:16

Merge branch 'master' into yi3/4bit-cache

c98cec9

luo-cheng2021 approved these changes Dec 19, 2024

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp (outdated)

zhangYiIntel added 3 commits December 19, 2024 09:31

apply review comments

f03e23c

Merge branch 'master' into yi3/4bit-cache

99d5c4d

Merge branch 'master' into yi3/4bit-cache

dddb4d9

dmitry-gorokhov reviewed Jan 2, 2025

[CPU]apply review comments

c362399

Signed-off-by: Zhang Yi <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from 244f7cc to c362399 Compare January 3, 2025 06:19

dmitry-gorokhov reviewed Jan 3, 2025

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp (outdated)

dmitry-gorokhov reviewed Jan 3, 2025

src/plugins/intel_cpu/src/config.cpp (outdated)

zhangYiIntel added 3 commits January 3, 2025 16:09

[CPU]remove useless code of s4

28bcf7b

Signed-off-by: Zhang Yi <[email protected]>

Merge branch 'master' into yi3/4bit-cache

94522a2

[CPU]Unify u8/u4 dequant kernel with template arg

56245d0

Signed-off-by: Zhang Yi <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from 7e6ffa2 to 56245d0 Compare January 5, 2025 05:45
