Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CPU]PageAttn with 4bit-quantization #27992

Open
wants to merge 29 commits into
base: master
Choose a base branch
from

Conversation

zhangYiIntel
Copy link
Contributor

Details:

  • Add new hint to set group_size for key/value cache
  • Add grouped 4bit sym/asym quantization support for PageAttentionNode
  • Add grouped quantization for U8 quantization for PageAttentionNode

Tickets:

@github-actions github-actions bot added category: inference OpenVINO Runtime library - Inference category: CPU OpenVINO CPU plugin category: Python API OpenVINO Python bindings category: CPP API OpenVINO CPP API bindings labels Dec 10, 2024
@zhangYiIntel zhangYiIntel changed the title Yi3/4bit cache [CPU]PageAttn with 4bit-quantization Dec 10, 2024
Signed-off-by: Zhang Yi3 <[email protected]>
@zhangYiIntel zhangYiIntel marked this pull request as ready for review December 12, 2024 01:57
@zhangYiIntel zhangYiIntel requested review from a team as code owners December 12, 2024 01:57
Copy link
Contributor

@luo-cheng2021 luo-cheng2021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great job!

@zhangYiIntel
Copy link
Contributor Author

@dmitry-gorokhov Could you have a review ?

src/plugins/intel_cpu/src/config.h Show resolved Hide resolved
wrap_property_RW(m_hint, ov::hint::key_cache_precision, "key_cache_precision");
wrap_property_RW(m_hint, ov::hint::value_cache_precision, "value_cache_precision");
wrap_property_RW(m_hint, ov::hint::key_cache_group_size, "key_cache_group_size");
wrap_property_RW(m_hint, ov::hint::value_cache_group_size, "value_cache_group_size");
Copy link
Contributor

@dmitry-gorokhov dmitry-gorokhov Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to align positioning regarding these options.
We already have high-level hint for KV-cache: ov::hint::kv_cache_precision. These new options are rather fine tuning options. So I would propose the following:

  1. New options shouln't be treated as hints: lets move from the namespace.
  2. ov::hint::kv_cache_precision should remain major (including positioning to the user) option for KV-Cache quantization control.
  3. ov::hint::kv_cache_precision (like other hints) should impact values of lower level options: ov::hint::key_cache_precision/ov::hint::value_cache_precision/ov::hint::key_cache_group_size/ov::hint::value_cache_group_size. E.g. ov::hint::kv_cache_precision == u4 will result in (u8/u4/32/32) config for lower options.
  4. User will have an ability to rewrite the behavior of high-level hint by changing values for low-level properties.

cc'ed @AlexKoff88 @vladimir-paramuzov @sshlyapn @p-durandin

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks good. Just to clarify:

  • We could have ov::hint::kv_cache_precision for coarse control of KV-cache quantization parameters by default. I would deprecate it at some point (not sure what is the best time).
  • ov::hint::key_cache_precision, ov::hint::value_cache_precision, ov::hint::key_cache_group_size, ov::hint::value_cache_group_size are for fine-grained control of KV-cache quantization and they have higher priority over ov::hint::kv_cache_precision if defined. ov::hint::key_cache_group_size, ov::hint::value_cache_group_size should have reasonable defaults, e.g. 32 or 64 what fits the best for runtime.
  • We should be able to define any of these options via the compilation config and rt_info/runtime_options subsection of the IR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmitry-gorokhov If not use hint namespace, do we have a better namespace for this ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangYiIntel just ov::key_cache_precision.
You can use ov::num_streams as an example - this is low level property which is affected by high-level hints like ov::hint::performance_mode

src/plugins/intel_cpu/src/config.cpp Outdated Show resolved Hide resolved
src/plugins/intel_cpu/src/config.cpp Outdated Show resolved Hide resolved
src/plugins/intel_cpu/src/nodes/paged_attn.cpp Outdated Show resolved Hide resolved
" for property key ",
key,
". Expected only unsinged integer numbers");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random place. We need to limit key_cache_precision and value_cache_precision supported matrix for SDPA operation. Otherwise these properties doesn't have any impact on execution now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's sure, the current SDPA only supports U8 KV-cache. Maybe we could just the validity of properties in SDPA node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reaming problem here is the priority of kvCachePrecision and keyCachePrecision/valueCachePrecision. Inside SDPA node, we cannot tell which one is valid since all of them have default value.

Copy link
Contributor

@dmitry-gorokhov dmitry-gorokhov Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed kvCachePrecision is a hint, while keyCachePrecision/valueCachePrecision are low-level options. Low-level options should always have higher priority.
In practice you should use keyCachePrecision/valueCachePrecision in the SDPA. keyCachePrecision/valueCachePrecision should be initialized in the Config based on kvCachePrecision if no user values provided, or based on user values. (you can check how inference_precision is initialized based on execution_mode).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: CPP API OpenVINO CPP API bindings category: CPU OpenVINO CPU plugin category: inference OpenVINO Runtime library - Inference category: Python API OpenVINO Python bindings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants