Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LLM] [NPU] StaticLLMPipeline: Compiler DQ update #1515

Open
wants to merge 8 commits into from

Conversation

smirnov-alexey
Copy link
Contributor

@smirnov-alexey smirnov-alexey commented Jan 9, 2025

Comment on lines 518 to 521
if (npudesc.has_value() && npudesc->compiler_dq) {
config.emplace("NPUW_DQ_FULL", "NO");
config.emplace("NPU_COMPILER_DYNAMIC_QUANTIZATION", true);
}
Copy link
Contributor

@dmatveev dmatveev Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you enable it for CW compressed models only? The whole point of this feature was to make it for group-quantized prefill.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 542 to 545
if (npudesc.has_value() && npudesc->compiler_dq) {
config.emplace("NPUW_DQ_FULL", "NO");
config.emplace("NPU_COMPILER_DYNAMIC_QUANTIZATION", true);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You certainly don't need to make it twice, just do this in the shared section of the config (so it goes to both prefill and kvcache).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, when you switch to compiler DQ, you'll have to disable DCOFF since otherwise DCOFF will be applied to this model and it will run in FP16.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (std::find(device_caps.begin(), device_caps.end(),
"COMPILER_DYNAMIC_QUANTIZATION") != device_caps.end()) {
const auto supported_properties = core.get_property("NPU", ov::supported_properties);
if (std::find(supported_properties.begin(), supported_properties.end(),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a sub-string search. Perhaps there is some OV utility to tokenize the list first?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems already simple enough

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How it is a sub-string search, looks more like a container. At least std::find works on it (this auto sometimes makes things harder to understand)

Copy link
Collaborator

@TolyaTalamanov TolyaTalamanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propose to encapsulate it like this:

void enable_dq() {
    if (npudesc.has_value() && npudesc->compiler_dq) {
        "NPUW_DQ_FULL": "NO"
        "NPU_COMPILER_DYNAMIC_QUANTIZATION": true
    } else {
        "NPUW_DQ": "YES"
    }
}

We need to call enable_dq in all places where previously was just NPUW_DQ: YES

As for the logic to enable dq or not, it was the following:

  1. For prefill model DQ is enabled only when model is channel-wise compressed
  2. For generation model DQ is always enabled.
    @dmatveev Could you confirm this, please?

If it should be enabled for both CW and GQ models we can call enable_dq() function uncoditionaly for both configs.

@TolyaTalamanov
Copy link
Collaborator

Propose to encapsulate it like this:

void enable_dq() {
    if (npudesc.has_value() && npudesc->compiler_dq) {
        "NPUW_DQ_FULL": "NO"
        "NPU_COMPILER_DYNAMIC_QUANTIZATION": true
    } else {
        "NPUW_DQ": "YES"
    }
}

We need to call enable_dq in all places where previously was just NPUW_DQ: YES

As for the logic to enable dq or not, it was the following:

  1. For prefill model DQ is enabled only when model is channel-wise compressed
  2. For generation model DQ is always enabled.
    @dmatveev Could you confirm this, please?

If it should be enabled for both CW and GQ models we can call enable_dq() function uncoditionaly for both configs.

Should be done here:

ov::AnyMap get_default_common_config(const std::shared_ptr<ov::Model>& model) {
auto config = get_baseline_common_config();
const char* npu_l0 = std::getenv("DISABLE_OPENVINO_GENAI_NPU_L0");
if (npu_l0 && std::atoi(npu_l0) == 1) {
config.emplace("NPUW_WEIGHTS_BANK_ALLOC", "CPU");
} else {
config.emplace("NPUW_FUNCALL_FOR_ALL", "YES");
}
return config;
}

@dmatveev
Copy link
Contributor

@TolyaTalamanov I updated our internal guide (the one you've used before)

Copy link
Contributor

@dmatveev dmatveev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets more and more complex.

If it works, let's keep it like this, then propagate to the LLMCompiledModel (down to NPUW), remove it here, and refactor/rethink there.

@TolyaTalamanov please be sure to provide your review here as well.

if (std::find(device_caps.begin(), device_caps.end(),
"COMPILER_DYNAMIC_QUANTIZATION") != device_caps.end()) {
const auto supported_properties = core.get_property("NPU", ov::supported_properties);
if (std::find(supported_properties.begin(), supported_properties.end(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How it is a sub-string search, looks more like a container. At least std::find works on it (this auto sometimes makes things harder to understand)

compiler_dq = true;
}
return std::make_optional(NPUDesc{arch, max_tiles, compiler_dq});
}

ov::AnyMap get_baseline_common_config() {
ov::AnyMap get_baseline_common_config(bool enable_compiler_dq) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please pass the NPUDesc here instead - there's a request to get rid of "high-precision" options in the future as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

src/cpp/src/llm_pipeline_static.cpp Outdated Show resolved Hide resolved
src/cpp/src/llm_pipeline_static.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@TolyaTalamanov TolyaTalamanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point was to encapsulate logic of compiler and non-compiler DQ into function and where the DQ is needed just call that function

ov::AnyMap get_default_common_config(const std::shared_ptr<ov::Model>& model) {
auto config = get_baseline_common_config();
bool enable_compiler_dq(const std::optional<NPUDesc>& npudesc) {
return npudesc.has_value() && npudesc->compiler_dq;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this function? Why don't just use npudesc.has_value() && npudesc->compiler_dq straightaway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

@TolyaTalamanov TolyaTalamanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enable_compiler_dq seems to be redundant

if (npudesc.has_value() && npudesc->compiler_dq) {
config.emplace("NPUW_DQ_FULL", "NO");
// Specify NPUW DQ if Compiler DQ is not enabled
if (!npudesc.has_value() || !npudesc->compiler_dq) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not gonna lie @TolyaTalamanov, it was much better WITH that tiny one-liner than without that.

@dmatveev
Copy link
Contributor

Let's wait for the testing results

@dmatveev dmatveev added this pull request to the merge queue Jan 13, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 13, 2025
@smirnov-alexey smirnov-alexey added this pull request to the merge queue Jan 13, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 13, 2025
@smirnov-alexey smirnov-alexey added this pull request to the merge queue Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: LLM LLM pipeline (stateful, static) category: NPU Code Freeze
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants