
[CPU]PageAttn with 4bit-quantization #27992

Merged Jan 8, 2025 · 36 commits (changes shown from 26 commits)
15fcdb8
[CPU]separate precisions of kv cache
zhangYiIntel Oct 31, 2024
82f843a
[CPU]use element as template args
zhangYiIntel Nov 6, 2024
a754404
[CPU]make quantize grouped
zhangYiIntel Nov 8, 2024
2aba224
[CPU]make u8 kernel grouped
zhangYiIntel Nov 13, 2024
fc435f6
[CPU]U4 Group size support with reference
zhangYiIntel Nov 18, 2024
d080e2a
[CPU]AVX512 support for u4 kernel
zhangYiIntel Nov 28, 2024
78ef4dd
[CPU]Support S4 quantization
zhangYiIntel Nov 29, 2024
3e821ea
[CPU]use AVX512 to quant s4
zhangYiIntel Nov 29, 2024
80b093f
[CPU]4-bit quantization with avx2
zhangYiIntel Dec 5, 2024
13a496e
fix build on elder compiler
zhangYiIntel Dec 6, 2024
92e6cb3
[CPU]fix fp32 inference
zhangYiIntel Dec 9, 2024
91ebc09
[CPU]set group size via hint
zhangYiIntel Dec 10, 2024
685f263
[CPU]fix code style
zhangYiIntel Dec 10, 2024
e56639a
[CPU]fix property test
zhangYiIntel Dec 11, 2024
a34ce8b
[CPU]add cache precision check
zhangYiIntel Dec 11, 2024
8548773
Merge branch 'master' into yi3/4bit-cache
zhangYiIntel Dec 12, 2024
fe6c311
[CPU]fix code style of config.cpp
zhangYiIntel Dec 12, 2024
522215a
Merge branch 'master' into yi3/4bit-cache
zhangYiIntel Dec 12, 2024
8faadd8
[CPU]pre calculate count
zhangYiIntel Dec 17, 2024
b4b0f0d
[CPU]Use ov::element as template args
zhangYiIntel Dec 18, 2024
5c838f7
[CPU]remove redundant marco
zhangYiIntel Dec 18, 2024
c98cec9
Merge branch 'master' into yi3/4bit-cache
zhangYiIntel Dec 19, 2024
f03e23c
apply review comments
zhangYiIntel Dec 19, 2024
99d5c4d
Merge branch 'master' into yi3/4bit-cache
zhangYiIntel Dec 19, 2024
dddb4d9
Merge branch 'master' into yi3/4bit-cache
zhangYiIntel Dec 19, 2024
c362399
[CPU]apply review comments
zhangYiIntel Jan 3, 2025
28bcf7b
[CPU]remove useless code of s4
zhangYiIntel Jan 3, 2025
94522a2
Merge branch 'master' into yi3/4bit-cache
zhangYiIntel Jan 3, 2025
56245d0
[CPU]Unify u8/u4 dequant kernel with template arg
zhangYiIntel Jan 5, 2025
84f03a3
[CPU]Define key/value cache prec/group_size priority
zhangYiIntel Jan 6, 2025
e0b437e
[CPU]fix prec order & check group_size
zhangYiIntel Jan 6, 2025
79df402
Merge branch 'master' into yi3/4bit-cache
zhangYiIntel Jan 6, 2025
f196535
Merge branch 'master' into yi3/4bit-cache
zhangYiIntel Jan 6, 2025
0515410
[CPU]fix sdpa test
zhangYiIntel Jan 7, 2025
7a412f7
[CPU]fix group_size in sdpa
zhangYiIntel Jan 7, 2025
594b392
[CPU]Change default group_size
zhangYiIntel Jan 7, 2025
@@ -23,4 +23,8 @@
from openvino._pyopenvino.properties.hint import allow_auto_batching
from openvino._pyopenvino.properties.hint import dynamic_quantization_group_size
from openvino._pyopenvino.properties.hint import kv_cache_precision
from openvino._pyopenvino.properties.hint import key_cache_precision
from openvino._pyopenvino.properties.hint import value_cache_precision
from openvino._pyopenvino.properties.hint import key_cache_group_size
from openvino._pyopenvino.properties.hint import value_cache_group_size
from openvino._pyopenvino.properties.hint import activations_scale_factor
@@ -101,6 +101,10 @@ void regmodule_properties(py::module m) {
wrap_property_RW(m_hint, ov::hint::allow_auto_batching, "allow_auto_batching");
wrap_property_RW(m_hint, ov::hint::dynamic_quantization_group_size, "dynamic_quantization_group_size");
wrap_property_RW(m_hint, ov::hint::kv_cache_precision, "kv_cache_precision");
wrap_property_RW(m_hint, ov::hint::key_cache_precision, "key_cache_precision");
wrap_property_RW(m_hint, ov::hint::value_cache_precision, "value_cache_precision");
wrap_property_RW(m_hint, ov::hint::key_cache_group_size, "key_cache_group_size");
wrap_property_RW(m_hint, ov::hint::value_cache_group_size, "value_cache_group_size");
@dmitry-gorokhov (Contributor) · Jan 2, 2025:

We need to align positioning regarding these options.
We already have a high-level hint for the KV cache: ov::hint::kv_cache_precision. These new options are rather fine-tuning options. So I would propose the following:

  1. The new options shouldn't be treated as hints: let's move them out of the hint namespace.
  2. ov::hint::kv_cache_precision should remain the major option (including in positioning to the user) for KV-cache quantization control.
  3. ov::hint::kv_cache_precision (like other hints) should impact the values of the lower-level options: ov::hint::key_cache_precision/ov::hint::value_cache_precision/ov::hint::key_cache_group_size/ov::hint::value_cache_group_size. E.g. ov::hint::kv_cache_precision == u4 would result in a (u8/u4/32/32) config for the lower-level options.
  4. The user will have the ability to override the behavior of the high-level hint by changing the values of the low-level properties.

cc'ed @AlexKoff88 @vladimir-paramuzov @sshlyapn @p-durandin

Contributor:

I think it looks good. Just to clarify:

  • We could have ov::hint::kv_cache_precision for coarse control of the KV-cache quantization parameters by default. I would deprecate it at some point (not sure what the best time is).
  • ov::hint::key_cache_precision, ov::hint::value_cache_precision, ov::hint::key_cache_group_size, and ov::hint::value_cache_group_size are for fine-grained control of KV-cache quantization, and they take priority over ov::hint::kv_cache_precision if defined. ov::hint::key_cache_group_size and ov::hint::value_cache_group_size should have reasonable defaults, e.g. 32 or 64, whichever fits the runtime best.
  • We should be able to define any of these options via the compilation config and the rt_info/runtime_options subsection of the IR.
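
The precedence scheme discussed in these comments can be sketched in a few lines. The resolution helper below is illustrative only: the property-name strings mirror the discussion, the (u8/u4/32/32) row comes from the example above, and the remaining hint-to-default rows are assumptions, not the plugin's actual code.

```python
# Hypothetical sketch: the coarse KV_CACHE_PRECISION hint seeds defaults
# for the four low-level properties; explicitly set low-level properties
# override the hint-derived values.

def resolve_cache_config(user_props):
    """Return (key_prec, value_prec, key_group_size, value_group_size)."""
    hint = user_props.get("KV_CACHE_PRECISION", "u8")
    # Assumed mapping; only the u4 row is taken from the review comment.
    derived = {
        "u8": ("u8", "u8", 0, 0),    # 0 = one group spanning the head
        "u4": ("u8", "u4", 32, 32),  # example from the comment above
        "f16": ("f16", "f16", 0, 0),
        "f32": ("f32", "f32", 0, 0),
    }[hint]
    keys = ("KEY_CACHE_PRECISION", "VALUE_CACHE_PRECISION",
            "KEY_CACHE_GROUP_SIZE", "VALUE_CACHE_GROUP_SIZE")
    # Explicit low-level settings win over the hint-derived defaults.
    return tuple(user_props.get(k, d) for k, d in zip(keys, derived))

print(resolve_cache_config({"KV_CACHE_PRECISION": "u4"}))
# ('u8', 'u4', 32, 32)
print(resolve_cache_config({"KV_CACHE_PRECISION": "u4",
                            "VALUE_CACHE_GROUP_SIZE": 64}))
# ('u8', 'u4', 32, 64)
```

This mirrors the ov::num_streams pattern mentioned later in the thread: a low-level property influenced by, but overridable independently of, a high-level hint.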

Contributor Author:

@dmitry-gorokhov If we don't use the hint namespace, do we have a better namespace for this?

Contributor:

@zhangYiIntel Just ov::key_cache_precision.
You can use ov::num_streams as an example: it is a low-level property that is affected by high-level hints like ov::hint::performance_mode.

@zhangYiIntel (Author) · Jan 7, 2025:

@AlexKoff88 @dmitry-gorokhov Regarding the default group_size: since the hidden state size must be divisible by group_size, if we set it to 32/64, what should we do when the hidden state is not divisible by 32/64 — fall back to using the full hidden state as the group size, or just throw an exception?

Contributor:

Given that this is not a hint, it should throw an exception if the user sets an invalid value.
If no user input is provided for these properties, the default value should be properly adjusted.
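
A minimal sketch of that policy — throw on an explicitly set invalid value, silently adjust only the default. The function name and the 64-then-32 default preference are hypothetical, not the actual CPU-plugin code.

```python
# Hypothetical policy sketch: explicit invalid group_size -> error;
# unset group_size -> pick a default that divides the head size.

def resolve_group_size(head_size, user_group_size=None):
    if user_group_size is None:
        # No user input: adjust the default to fit the model.
        for candidate in (64, 32):
            if head_size % candidate == 0:
                return candidate
        return head_size  # fall back to one group per head
    if head_size % user_group_size != 0:
        raise ValueError(
            f"group_size {user_group_size} does not divide head size {head_size}")
    return user_group_size

print(resolve_group_size(128))      # 64
print(resolve_group_size(96))       # 32
print(resolve_group_size(100))      # 100 (neither 64 nor 32 divides it)
print(resolve_group_size(128, 32))  # 32
```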

Contributor Author:

Another leftover here is that if we set the default group_size to 32/64 instead of the hidden state size, then OpenVINO GenAI has to be updated accordingly; otherwise U8 KV-cache quantization is broken.
CC: @ilya-lavrenov

Contributor:

Why? I bet GenAI doesn't set any specific value for group_size, which means there is no user input for these properties. So, as I mentioned, the group_size default value should be properly adjusted on the CPU plugin side.

Contributor:

For the CB (continuous batching) implementation we need to duplicate all this logic related to the KV cache, as it's maintained outside of the plugin.

Contributor Author:

> Why? I bet GenAI doesn't set any specific value for group_size, which means there is no user input for these properties. So, as I mentioned, the group_size default value should be properly adjusted on the CPU plugin side.

The problem with GenAI here is that the ContinuousBatchingPipeline allocates the memory for `PageAttention`'s key/value cache; it must know the group_size in advance to allocate the correct memory size for both the cache and the scale/zp at https://github.com/openvinotoolkit/openvino.genai/blob/09a542608b560959edb96e628915a1d6bd780c26/src/cpp/src/cache_manager.hpp#L57

ov::Tensor key_cache = remote_context.create_tensor(device_config.get_key_cache_precision(),
    device_config.get_key_cache_shape());
ov::Tensor value_cache = remote_context.create_tensor(device_config.get_value_cache_precision(),
    device_config.get_value_cache_shape());

The cache shape is defined at https://github.com/openvinotoolkit/openvino.genai/blob/09a542608b560959edb96e628915a1d6bd780c26/src/cpp/src/device_config.hpp#L120

m_key_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(),
    ov::Dimension(m_num_kv_heads[layer_id]),
    ov::Dimension(m_block_size),
    ov::Dimension(m_head_size)});

m_value_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(),
    ov::Dimension(m_num_kv_heads[layer_id]),
    ov::Dimension(m_block_size),
    ov::Dimension(m_head_size)});

m_head_size is defined as follows, which only accounts for one group per hidden state:

if (m_kv_cache_type == ov::element::u8)
    m_head_size += 8;

Therefore the ContinuousBatchingPipeline is broken when the number of groups is greater than 1.
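
The allocation mismatch can be illustrated with a small sketch. The byte accounting (one f32 scale plus one f32 zero-point per group, 8 bytes total) follows the `m_head_size += 8` convention quoted above, but the helper itself is hypothetical:

```python
# Hypothetical sketch of why a grouped cache breaks the fixed "+8"
# allocation: each quantization group carries its own f32 scale and
# zero-point, so the per-head metadata grows with the number of groups.

def padded_head_bytes(head_size, elem_bits, group_size):
    """Bytes per head: quantized payload plus per-group scale/zero-point."""
    num_groups = max(1, head_size // group_size)
    payload = head_size * elem_bits // 8
    metadata = num_groups * 8  # f32 scale + f32 zero-point per group
    return payload + metadata

# u8 cache, one group spanning the head: matches the fixed "+8" above.
print(padded_head_bytes(128, 8, 128))  # 136
# u4 cache with group_size 32: 4 groups, so 32 bytes of scales/zero-points,
# a layout the fixed "+8" allocation cannot describe.
print(padded_head_bytes(128, 4, 32))   # 96
```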

wrap_property_RW(m_hint, ov::hint::activations_scale_factor, "activations_scale_factor");

// Submodule intel_cpu
12 changes: 12 additions & 0 deletions src/bindings/python/tests/test_runtime/test_properties.py
@@ -334,7 +334,19 @@ def test_properties_ro(ov_property_ro, expected_value):
"DYNAMIC_QUANTIZATION_GROUP_SIZE",
((64, 64),),
),
(
hints.key_cache_group_size,
"KEY_CACHE_GROUP_SIZE",
((64, 64),),
),
(
hints.value_cache_group_size,
"VALUE_CACHE_GROUP_SIZE",
((64, 64),),
),
(hints.kv_cache_precision, "KV_CACHE_PRECISION", ((Type.f32, Type.f32),)),
(hints.key_cache_precision, "KEY_CACHE_PRECISION", ((Type.f32, Type.f32),)),
(hints.value_cache_precision, "VALUE_CACHE_PRECISION", ((Type.f32, Type.f32),)),
(
hints.activations_scale_factor,
"ACTIVATIONS_SCALE_FACTOR",
24 changes: 24 additions & 0 deletions src/inference/include/openvino/runtime/properties.hpp
@@ -580,6 +580,30 @@ static constexpr Property<uint64_t, PropertyMutability::RW> dynamic_quantization
*/
static constexpr Property<element::Type, PropertyMutability::RW> kv_cache_precision{"KV_CACHE_PRECISION"};

/**
* @brief Hint for the device to use the specified precision for key cache compression
* @ingroup ov_runtime_cpp_prop_api
*/
static constexpr Property<element::Type, PropertyMutability::RW> key_cache_precision{"KEY_CACHE_PRECISION"};

/**
* @brief Hint for the device to use the specified precision for value cache compression
* @ingroup ov_runtime_cpp_prop_api
*/
static constexpr Property<element::Type, PropertyMutability::RW> value_cache_precision{"VALUE_CACHE_PRECISION"};

/**
* @brief Hint for the device to use the specified group size for key cache compression
* @ingroup ov_runtime_cpp_prop_api
*/
static constexpr Property<uint64_t, PropertyMutability::RW> key_cache_group_size{"KEY_CACHE_GROUP_SIZE"};

/**
* @brief Hint for the device to use the specified group size for value cache compression
* @ingroup ov_runtime_cpp_prop_api
*/
static constexpr Property<uint64_t, PropertyMutability::RW> value_cache_group_size{"VALUE_CACHE_GROUP_SIZE"};

/**
* @brief This property scales down activations to prevent overflows when inference precision is f16.
* @ingroup ov_runtime_cpp_prop_api
12 changes: 12 additions & 0 deletions src/plugins/intel_cpu/src/compiled_model.cpp
@@ -256,6 +256,10 @@ ov::Any CompiledModel::get_property(const std::string& name) const {
RO_property(ov::intel_cpu::sparse_weights_decompression_rate.name()),
RO_property(ov::hint::dynamic_quantization_group_size.name()),
RO_property(ov::hint::kv_cache_precision.name()),
RO_property(ov::hint::key_cache_precision.name()),
RO_property(ov::hint::value_cache_precision.name()),
RO_property(ov::hint::key_cache_group_size.name()),
RO_property(ov::hint::value_cache_group_size.name()),
};

OPENVINO_SUPPRESS_DEPRECATED_START
@@ -332,6 +336,14 @@ ov::Any CompiledModel::get_property(const std::string& name) const {
return decltype(ov::hint::dynamic_quantization_group_size)::value_type(config.fcDynamicQuantizationGroupSize);
} else if (name == ov::hint::kv_cache_precision) {
return decltype(ov::hint::kv_cache_precision)::value_type(config.kvCachePrecision);
} else if (name == ov::hint::key_cache_precision) {
return decltype(ov::hint::key_cache_precision)::value_type(config.keyCachePrecision);
} else if (name == ov::hint::value_cache_precision) {
return decltype(ov::hint::value_cache_precision)::value_type(config.valueCachePrecision);
} else if (name == ov::hint::key_cache_group_size) {
return decltype(ov::hint::key_cache_group_size)::value_type(config.keyCacheGroupSize);
} else if (name == ov::hint::value_cache_group_size) {
return decltype(ov::hint::value_cache_group_size)::value_type(config.valueCacheGroupSize);
}
OPENVINO_THROW("Unsupported property: ", name);
}
54 changes: 54 additions & 0 deletions src/plugins/intel_cpu/src/config.cpp
@@ -373,6 +373,58 @@ void Config::readProperties(const ov::AnyMap& prop, const ModelType modelType) {
ov::hint::kv_cache_precision.name(),
". Supported values: u8, bf16, f16, f32");
}
} else if (key == ov::hint::key_cache_precision.name()) {
try {
kvCachePrecisionSetExplicitly = true;
auto const prec = val.as<ov::element::Type>();
if (one_of(prec, ov::element::f32, ov::element::f16, ov::element::bf16, ov::element::u8)) {
keyCachePrecision = prec;
} else {
OPENVINO_THROW("keyCachePrecision doesn't support value ", prec);
}
} catch (ov::Exception&) {
OPENVINO_THROW("Wrong value ",
val.as<std::string>(),
" for property key ",
ov::hint::key_cache_precision.name(),
". Supported values: u8, bf16, f16, f32");
}
} else if (key == ov::hint::value_cache_precision.name()) {
try {
kvCachePrecisionSetExplicitly = true;
auto const prec = val.as<ov::element::Type>();
if (one_of(prec,
ov::element::f32,
ov::element::f16,
ov::element::bf16,
ov::element::u8,
ov::element::u4)) {
valueCachePrecision = prec;
} else {
OPENVINO_THROW("valueCachePrecision doesn't support value ", prec);
}
} catch (ov::Exception&) {
OPENVINO_THROW("Wrong value ",
val.as<std::string>(),
" for property key ",
ov::hint::value_cache_precision.name(),
". Supported values: u4, u8, bf16, f16, f32");
}
} else if (key == ov::hint::key_cache_group_size.name() || key == ov::hint::value_cache_group_size.name()) {
try {
auto const groupSize = val.as<uint64_t>();
if (key == ov::hint::key_cache_group_size.name()) {
keyCacheGroupSize = groupSize;
} else {
valueCacheGroupSize = groupSize;
}
} catch (ov::Exception&) {
OPENVINO_THROW("Wrong value ",
val.as<std::string>(),
" for property key ",
key,
". Expected only unsigned integer numbers");
}
} else if (key == ov::cache_encryption_callbacks.name()) {
try {
auto encryption_callbacks = val.as<EncryptionCallbacks>();
@@ -415,6 +467,8 @@ void Config::readProperties(const ov::AnyMap& prop, const ModelType modelType) {
}
if (!kvCachePrecisionSetExplicitly) {
kvCachePrecision = ov::element::f32;
valueCachePrecision = ov::element::f32;
keyCachePrecision = ov::element::f32;
}
}

6 changes: 6 additions & 0 deletions src/plugins/intel_cpu/src/config.h
@@ -53,12 +53,18 @@ struct Config {
#endif
#if defined(OPENVINO_ARCH_X86_64)
ov::element::Type kvCachePrecision = ov::element::u8;
ov::element::Type keyCachePrecision = ov::element::u8;
ov::element::Type valueCachePrecision = ov::element::u8;
size_t rtCacheCapacity = 5000ul;
#else
ov::element::Type kvCachePrecision = ov::element::f16;
ov::element::Type keyCachePrecision = ov::element::f16;
ov::element::Type valueCachePrecision = ov::element::f16;
// TODO: Executor cache may lead to incorrect behavior on oneDNN ACL primitives
size_t rtCacheCapacity = 0ul;
#endif
size_t keyCacheGroupSize = 0ul;
size_t valueCacheGroupSize = 0ul;
ov::threading::IStreamsExecutor::Config streamExecutorConfig;
int streams = 1;
bool streamsChanged = false;