-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LLM] [NPU] StaticLLMPipeline: Compiler DQ update #1515
Conversation
src/cpp/src/llm_pipeline_static.cpp
Outdated
if (npudesc.has_value() && npudesc->compiler_dq) { | ||
config.emplace("NPUW_DQ_FULL", "NO"); | ||
config.emplace("NPU_COMPILER_DYNAMIC_QUANTIZATION", true); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you enable it for CW compressed models only? The whole point of this feature was to make it for group-quantized prefill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/cpp/src/llm_pipeline_static.cpp
Outdated
if (npudesc.has_value() && npudesc->compiler_dq) { | ||
config.emplace("NPUW_DQ_FULL", "NO"); | ||
config.emplace("NPU_COMPILER_DYNAMIC_QUANTIZATION", true); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You certainly don't need to make it twice, just do this in the shared section of the config (so it goes to both prefill and kvcache).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, when you switch to compiler DQ, you'll have to disable DCOFF since otherwise DCOFF will be applied to this model and it will run in FP16.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if (std::find(device_caps.begin(), device_caps.end(), | ||
"COMPILER_DYNAMIC_QUANTIZATION") != device_caps.end()) { | ||
const auto supported_properties = core.get_property("NPU", ov::supported_properties); | ||
if (std::find(supported_properties.begin(), supported_properties.end(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a sub-string search. Perhaps there is some OV utility to tokenize the list first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems already simple enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How it is a sub-string search, looks more like a container. At least std::find works on it (this auto
sometimes makes things harder to understand)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Propose to encapsulate it like this:
void enable_dq() {
if (npudesc.has_value() && npudesc->compiler_dq) {
"NPUW_DQ_FULL": "NO"
"NPU_COMPILER_DYNAMIC_QUANTIZATION": true
} else {
"NPUW_DQ": "YES"
}
}
We need to call enable_dq
in all places where previously was just NPUW_DQ: YES
As for the logic to enable dq or not, it was the following:
- For prefill model DQ is enabled only when model is channel-wise compressed
- For generation model DQ is always enabled.
@dmatveev Could you confirm this, please?
If it should be enabled for both CW
and GQ
models we can call enable_dq()
function uncoditionaly for both configs.
Should be done here: openvino.genai/src/cpp/src/llm_pipeline_static.cpp Lines 502 to 511 in fa76cf7
|
@TolyaTalamanov I updated our internal guide (the one you've used before) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It gets more and more complex.
If it works, let's keep it like this, then propagate to the LLMCompiledModel (down to NPUW), remove it here, and refactor/rethink there.
@TolyaTalamanov please be sure to provide your review here as well.
if (std::find(device_caps.begin(), device_caps.end(), | ||
"COMPILER_DYNAMIC_QUANTIZATION") != device_caps.end()) { | ||
const auto supported_properties = core.get_property("NPU", ov::supported_properties); | ||
if (std::find(supported_properties.begin(), supported_properties.end(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How it is a sub-string search, looks more like a container. At least std::find works on it (this auto
sometimes makes things harder to understand)
src/cpp/src/llm_pipeline_static.cpp
Outdated
compiler_dq = true; | ||
} | ||
return std::make_optional(NPUDesc{arch, max_tiles, compiler_dq}); | ||
} | ||
|
||
ov::AnyMap get_baseline_common_config() { | ||
ov::AnyMap get_baseline_common_config(bool enable_compiler_dq) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please pass the NPUDesc
here instead - there's a request to get rid of "high-precision" options in the future as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole point was to encapsulate logic of compiler and non-compiler DQ into function and where the DQ is needed just call that function
src/cpp/src/llm_pipeline_static.cpp
Outdated
ov::AnyMap get_default_common_config(const std::shared_ptr<ov::Model>& model) { | ||
auto config = get_baseline_common_config(); | ||
bool enable_compiler_dq(const std::optional<NPUDesc>& npudesc) { | ||
return npudesc.has_value() && npudesc->compiler_dq; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of this function? Why don't just use npudesc.has_value() && npudesc->compiler_dq
straightaway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enable_compiler_dq
seems to be redundant
if (npudesc.has_value() && npudesc->compiler_dq) { | ||
config.emplace("NPUW_DQ_FULL", "NO"); | ||
// Specify NPUW DQ if Compiler DQ is not enabled | ||
if (!npudesc.has_value() || !npudesc->compiler_dq) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not gonna lie @TolyaTalamanov, it was much better WITH that tiny one-liner than without that.
Let's wait for the testing results |
Depends on openvinotoolkit/openvino#28316
Related to openvinotoolkit/openvino#28343