-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CPU]PageAttn with 4bit-quantization #27992
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Zhang Yi3 <[email protected]>
Signed-off-by: Zhang Yi3 <[email protected]>
Signed-off-by: Zhang Yi3 <[email protected]>
a12c86f
to
e56639a
Compare
Signed-off-by: Zhang Yi3 <[email protected]>
Signed-off-by: Zhang Yi3 <[email protected]>
373d50d
to
fe6c311
Compare
5bc75f8
to
76380d1
Compare
Signed-off-by: Zhang Yi3 <[email protected]>
76380d1
to
522215a
Compare
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant_kernel.hpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant_kernel.hpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp
Outdated
Show resolved
Hide resolved
Signed-off-by: Zhang Yi3 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, great job!
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp
Outdated
Show resolved
Hide resolved
@dmitry-gorokhov Could you have a review ? |
wrap_property_RW(m_hint, ov::hint::key_cache_precision, "key_cache_precision"); | ||
wrap_property_RW(m_hint, ov::hint::value_cache_precision, "value_cache_precision"); | ||
wrap_property_RW(m_hint, ov::hint::key_cache_group_size, "key_cache_group_size"); | ||
wrap_property_RW(m_hint, ov::hint::value_cache_group_size, "value_cache_group_size"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to align positioning regarding these options.
We already have high-level hint for KV-cache: ov::hint::kv_cache_precision. These new options are rather fine tuning options. So I would propose the following:
- New options shouln't be treated as hints: lets move from the namespace.
ov::hint::kv_cache_precision
should remain major (including positioning to the user) option for KV-Cache quantization control.ov::hint::kv_cache_precision
(like other hints) should impact values of lower level options:ov::hint::key_cache_precision
/ov::hint::value_cache_precision
/ov::hint::key_cache_group_size
/ov::hint::value_cache_group_size
. E.g.ov::hint::kv_cache_precision == u4
will result in(u8/u4/32/32)
config for lower options.- User will have an ability to rewrite the behavior of high-level hint by changing values for low-level properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks good. Just to clarify:
- We could have
ov::hint::kv_cache_precision
for coarse control of KV-cache quantization parameters by default. I would deprecate it at some point (not sure what is the best time). ov::hint::key_cache_precision
,ov::hint::value_cache_precision
,ov::hint::key_cache_group_size
,ov::hint::value_cache_group_size
are for fine-grained control of KV-cache quantization and they have higher priority overov::hint::kv_cache_precision
if defined.ov::hint::key_cache_group_size
,ov::hint::value_cache_group_size
should have reasonable defaults, e.g. 32 or 64 what fits the best for runtime.- We should be able to define any of these options via the compilation config and
rt_info/runtime_options
subsection of the IR.
- @yury-gorbachev, shall we discuss and approve this item?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dmitry-gorokhov If not use hint namespace, do we have a better namespace for this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangYiIntel just ov::key_cache_precision
.
You can use ov::num_streams
as an example - this is low level property which is affected by high-level hints like ov::hint::performance_mode
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp
Outdated
Show resolved
Hide resolved
" for property key ", | ||
key, | ||
". Expected only unsinged integer numbers"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random place. We need to limit key_cache_precision
and value_cache_precision
supported matrix for SDPA operation. Otherwise these properties doesn't have any impact on execution now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's sure, the current SDPA only supports U8 KV-cache. Maybe we could just the validity of properties in SDPA node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One reaming problem here is the priority of kvCachePrecision and keyCachePrecision/valueCachePrecision. Inside SDPA node, we cannot tell which one is valid since all of them have default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed kvCachePrecision is a hint, while keyCachePrecision/valueCachePrecision are low-level options. Low-level options should always have higher priority.
In practice you should use keyCachePrecision/valueCachePrecision in the SDPA. keyCachePrecision/valueCachePrecision should be initialized in the Config based on kvCachePrecision if no user values provided, or based on user values. (you can check how inference_precision is initialized based on execution_mode).
src/plugins/intel_cpu/tests/functional/custom/behavior/ov_executable_network/properties.cpp
Outdated
Show resolved
Hide resolved
Signed-off-by: Zhang Yi <[email protected]>
244f7cc
to
c362399
Compare
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp
Outdated
Show resolved
Hide resolved
Signed-off-by: Zhang Yi <[email protected]>
Signed-off-by: Zhang Yi <[email protected]>
7e6ffa2
to
56245d0
Compare
Details:
Tickets: