Static llm pipeline dynamic shape model #1240

AsyaPronina · 2024-11-20T19:24:56Z

Related PRs:

OpenVINO: [NPUW] Dynamic stateful model support openvino#27651
OpenVINO Unroll SDPA: [NPUW] Port unroll SDPA optimization from GenAI openvino#27891
OpenVINO Prefill/Generate sections: Added possibility to pass PREFILL/GENERATE configs and pad_token_id openvino#28154

dmatveev · 2024-11-25T10:24:54Z

src/cpp/src/llm_pipeline_static.cpp

-    int64_t position_ids_data = prompt_len -1;
-    std::vector<int64_t> attention_mask_data(1, prompt_len);
+    int64_t position_ids_data = prompt_len - 1;
+    std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);


LOL, @TolyaTalamanov !!
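
The change in the second line of this hunk comes down to the argument order of the `std::vector(count, value)` constructor. A minimal sketch of the difference follows; the function names `mask_before_fix` and `mask_after_fix` are made up for illustration and do not appear in the PR.

```cpp
#include <cstdint>
#include <vector>

// std::vector<T>(count, value): the first argument is the number of elements,
// the second is the value every element is initialized with.

// Old form from the diff: ONE element whose value is prompt_len.
std::vector<int64_t> mask_before_fix(int64_t prompt_len) {
    return std::vector<int64_t>(1, prompt_len);
}

// New form from the diff: (prompt_len - 1) elements, each equal to 1.
std::vector<int64_t> mask_after_fix(int64_t prompt_len) {
    return std::vector<int64_t>(prompt_len - 1, 1);
}
```

With `prompt_len == 5`, the old form yields a single element holding 5, while the new form yields four ones.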

dmatveev · 2024-12-05T18:59:29Z

src/cpp/src/llm_pipeline_static.hpp

+                                                       const ov::AnyMap& config);
+};
+
+class SMStaticLLMPipeline : public LLMPipelineImplBase {


SMStaticLLMPipeline sounds misleading; let's not focus on the point that it is a "single model" pipeline (people have gotten used to a different thing here).

The CPU/GPU's pipeline is called Stateful* if I get it right.

So as this one is still static, let's call it StaticStatefulLLMPipeline?

Thanks, sure!

dmatveev · 2024-12-05T19:01:21Z

src/cpp/src/llm_pipeline_static.hpp

 namespace genai {

+struct StaticLLMPipelineFactory {


I think the namespacing could be done better: statik:: could clearly be a namespace here, so we'd get statik::StatelessLLMPipeline (the old class) and statik::StatefulLLMPipeline (the new one).

statik:: is picked to avoid the clash with the keyword, it could be Static:: too.

@dmatveev @TolyaTalamanov, please help to disambiguate: we allow the user to pass a dynamic stateful OpenVINO model into our new pipeline, and we hide things like converting the model to static and making it stateless. Given that, should we still name the pipeline static_llm::StatefulLLMPipeline, even though it works with static and stateless models inside? Or can it really be called Stateful, because the LLMCompiledModel it creates no longer exposes to the user the additional inputs and outputs that correspond to states? (However, I don't know whether by this logic the pipeline is still static.)
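
For reference, the namespace layout proposed in this thread might look roughly like the sketch below. Everything here is illustrative: `genai_sketch` stands in for the real enclosing namespace, the base class is reduced to one hypothetical `kind()` method, and the final naming was still under discussion at this point.

```cpp
#include <string>

namespace genai_sketch {  // hypothetical stand-in for the enclosing namespace
namespace statik {        // spelled "statik" because `static` is a C++ keyword

// Reduced stand-in for the pipeline base class.
struct LLMPipelineImplBase {
    virtual ~LLMPipelineImplBase() = default;
    virtual std::string kind() const = 0;
};

// The existing pipeline.
struct StatelessLLMPipeline : LLMPipelineImplBase {
    std::string kind() const override { return "stateless"; }
};

// The pipeline added in this PR.
struct StatefulLLMPipeline : LLMPipelineImplBase {
    std::string kind() const override { return "stateful"; }
};

}  // namespace statik
}  // namespace genai_sketch
```

With this layout both class names read naturally at the call site, e.g. `statik::StatefulLLMPipeline`, without repeating "Static" in each class name.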

dmatveev · 2024-12-05T19:02:39Z

src/cpp/src/llm_pipeline_static.cpp

+
+    update_config(properties, {"NPU_USE_NPUW", "YES"});
+    update_config(properties, {"NPUW_LLM", "YES"});
+    update_config(properties, {"NPUW_LLM_MODEL_DESC", model_desc_to_string(model_desc)});


Since it is C++, can we use a C++ structure directly here? Or is the option exposed as a string only?

A downside: when the option is a string, a change in its structure will never break the build, but a change in the structure type will.

We can, but it is not obvious where to define this structure, since the OpenVINO NPUW code is not exposed as public API. However, we can pass it as map<std::string, std::string> or map<std::string, ov::Any>, if you think that is better. The current implementation via std::string ensures that compiled_model.get_property("NPUW_LLM_MODEL_DESC") will print something meaningful (but that is not a requirement).

What do you think?
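
The two alternatives discussed here (a flat string versus a string-keyed map) can be compared with a small sketch. The `ModelDescSketch` fields, the helper names, and the `key:value` string layout are all assumptions made for illustration; the actual `model_desc_to_string` format is not shown in this thread.

```cpp
#include <map>
#include <string>

// Hypothetical model description; the field names are illustrative only.
struct ModelDescSketch {
    std::string type;
    std::string name_or_path;
    int num_key_value_heads;
};

// Variant 1: one flat string. Printable as-is, but the consumer has to parse
// it, and a format change is only noticed at run time.
std::string desc_to_string_sketch(const ModelDescSketch& d) {
    return "type:" + d.type + ",name_or_path:" + d.name_or_path +
           ",num_key_value_heads:" + std::to_string(d.num_key_value_heads);
}

// Variant 2: a string-keyed map. Still needs no shared struct definition
// between the two code bases, and each field is looked up by name.
std::map<std::string, std::string> desc_to_map_sketch(const ModelDescSketch& d) {
    return {{"type", d.type},
            {"name_or_path", d.name_or_path},
            {"num_key_value_heads", std::to_string(d.num_key_value_heads)}};
}
```

Neither variant breaks the build when a field is added or renamed, which is the trade-off raised above: a shared struct would catch such changes at compile time but requires a public place to define it.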

dmatveev · 2024-12-05T19:04:29Z

src/cpp/src/llm_pipeline_static.cpp

+    const uint32_t kMaxPromptLen = pop_int_and_cast(properties, "MAX_PROMPT_LEN").value_or(1024u);
+    const uint32_t kMinResponseLen = pop_int_and_cast(properties, "MIN_RESPONSE_LEN").value_or(128u);


See the code above: it was aligned to 64. It probably makes sense to unify how these options are handled between the two classes (without overdesigning).

Yes, I did the alignment inside the LLMCompiledModel constructor in NPUW. Do you think I need to remove it there and do the alignment here instead? I did it in LLMCompiledModel because I thought it might be an implementation detail.

What do you think?
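
Wherever the alignment ends up living, the operation itself is a round-up to the next multiple of 64. A minimal sketch, assuming a helper named `align_to` (the name is an assumption; the thread does not show the helper's actual signature):

```cpp
#include <cstdint>

// Rounds `value` up to the nearest multiple of `alignment`.
// Assumes alignment > 0 and that value + alignment - 1 does not overflow.
constexpr uint32_t align_to(uint32_t value, uint32_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// The defaults from the hunk above are already multiples of 64,
// so aligning them changes nothing.
static_assert(align_to(1024u, 64u) == 1024u, "default MAX_PROMPT_LEN is aligned");
static_assert(align_to(128u, 64u) == 128u, "default MIN_RESPONSE_LEN is aligned");
```

A user-supplied value such as 1000 would be rounded up to 1024, which is why it matters that both pipelines apply the same rule in the same place.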

dmatveev · 2024-12-05T19:07:33Z

src/cpp/src/llm_pipeline_static.cpp

+    auto decode_start_time =  std::chrono::steady_clock::now();
+    DecodedResults decoded_results = {m_tokenizer.decode(encoded_results.tokens), encoded_results.scores};
+    auto decode_stop_time =  std::chrono::steady_clock::now();


I'd highly recommend using something like https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/perf.cpp#L9 - but it is clearly not for this PR (cc: @TolyaTalamanov )
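
One common way to avoid repeating start/stop timestamp pairs is a small RAII timer. The sketch below is not the NPUW helper linked above; `ScopedTimerSketch` and `time_decode_sketch` are hypothetical names and the class uses only the standard library.

```cpp
#include <chrono>

// Adds the elapsed wall-clock time (in milliseconds) to `out_ms`
// when the object goes out of scope.
class ScopedTimerSketch {
public:
    using clock = std::chrono::steady_clock;

    explicit ScopedTimerSketch(double& out_ms)
        : m_out_ms(out_ms), m_start(clock::now()) {}

    ~ScopedTimerSketch() {
        m_out_ms += std::chrono::duration<double, std::milli>(clock::now() - m_start).count();
    }

    ScopedTimerSketch(const ScopedTimerSketch&) = delete;
    ScopedTimerSketch& operator=(const ScopedTimerSketch&) = delete;

private:
    double& m_out_ms;
    clock::time_point m_start;
};

// Usage: time a block of work without explicit start/stop variables.
double time_decode_sketch() {
    double decode_ms = 0.0;
    {
        ScopedTimerSketch timer(decode_ms);
        // ... the tokenizer decode call would go here ...
    }
    return decode_ms;
}
```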

dmatveev · 2024-12-05T19:12:22Z

src/cpp/src/llm_pipeline_static.cpp

+        m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&input_ids_data));
+        m_request.set_tensor("position_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&position_ids_data));


I don't see the point in setting these two every time: the data could be updated in place for these tensors. For the future, @TolyaTalamanov.

Yes, that is true; I think we have discussed something like that already. The conclusion was to check the contracts of the GenAI pipeline to understand what should be passed into the request. I don't remember the outcome. It might be that the GenAI dynamic pipeline always passes some input; I need to check it.
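
The "bind once, update in place" idea can be illustrated without OpenVINO. In the sketch below `TensorViewSketch` and `RequestSketch` are stand-ins written for this example, not `ov::Tensor` or `ov::InferRequest`; whether the real request contract permits this pattern is exactly the open question in this thread.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Stand-in for a tensor that wraps caller-owned memory (NOT ov::Tensor).
struct TensorViewSketch {
    int64_t* data;
};

// Stand-in for an infer request (NOT ov::InferRequest).
struct RequestSketch {
    std::map<std::string, TensorViewSketch> inputs;
    int set_tensor_calls = 0;

    void set_tensor(const std::string& name, TensorViewSketch tensor) {
        inputs[name] = tensor;
        ++set_tensor_calls;
    }
    int64_t read(const std::string& name) const { return *inputs.at(name).data; }
};

// Binds the two inputs once, then only rewrites the underlying values.
// Returns the number of set_tensor calls, or -1 if a view went stale.
int run_generate_loop_sketch(int steps) {
    int64_t input_ids_data = 0;
    int64_t position_ids_data = 0;

    RequestSketch request;
    request.set_tensor("input_ids", TensorViewSketch{&input_ids_data});
    request.set_tensor("position_ids", TensorViewSketch{&position_ids_data});

    for (int i = 0; i < steps; ++i) {
        input_ids_data = 100 + i;  // pretend this is the last sampled token
        position_ids_data = i;
        // An infer call would go here; the request sees the new values
        // through the views it already holds.
        if (request.read("input_ids") != 100 + i || request.read("position_ids") != i) {
            return -1;
        }
    }
    return request.set_tensor_calls;
}
```

In this sketch the number of `set_tensor` calls stays at two regardless of how many generation steps run.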

dmatveev · 2024-12-05T19:13:08Z

src/cpp/src/llm_pipeline_static.cpp

+
+        // TODO: How to check that KV-Cache is full?


Is this still an open question? I believe llm_infer_request now reports it via a throw?

Yes, that is true! However, I don't know whether this should be an exception or just the end of the generation stage with a warning.
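
The second option mentioned here (treating a full KV-cache as a normal end of generation rather than an error) could look like the sketch below. `FakeRequest`, `FinishReason`, and `generate_sketch` are hypothetical, and `std::runtime_error` is only a placeholder for whatever exception type the real infer request throws.

```cpp
#include <cstddef>
#include <stdexcept>

// Stand-in for an infer request with a bounded KV-cache.
class FakeRequest {
public:
    explicit FakeRequest(std::size_t capacity) : m_capacity(capacity), m_used(0) {}

    void infer() {
        if (m_used >= m_capacity) {
            throw std::runtime_error("KV-cache is full");  // placeholder exception type
        }
        ++m_used;
    }

private:
    std::size_t m_capacity;
    std::size_t m_used;
};

enum class FinishReason { MaxNewTokens, KVCacheFull };

struct GenerationOutcome {
    std::size_t tokens_generated;
    FinishReason reason;
};

// Stops cleanly when the cache fills up and reports why generation ended,
// instead of letting the exception escape to the caller.
GenerationOutcome generate_sketch(FakeRequest& request, std::size_t max_new_tokens) {
    std::size_t produced = 0;
    while (produced < max_new_tokens) {
        try {
            request.infer();
        } catch (const std::runtime_error&) {
            // A warning could be logged here before returning.
            return {produced, FinishReason::KVCacheFull};
        }
        ++produced;
    }
    return {produced, FinishReason::MaxNewTokens};
}
```

The caller can then decide whether `FinishReason::KVCacheFull` deserves a warning, which keeps the policy question out of the loop itself.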

dmatveev · 2024-12-05T19:14:29Z

src/cpp/src/llm_pipeline_static.cpp

+                                 const std::string& device,
+                                 const ov::AnyMap& config) {
+    auto properties = config;
+    const auto use_sm_pipeline = pop_or_default(properties, "USE_SM_PIPELINE", false);


Shouldn't it be false or NO? Or maybe it shouldn't be a binary option at all?

We also need to be careful about the option name here. And tbh I don't have a good name in mind.

It shouldn't be a public-looking option, that's for sure. Maybe it shouldn't be a configurable option at all but an env var, like we did for memory allocation - but that'd complicate testing in the existing environments.

NPU_PIPELINE = STATEFUL (as opposed to today's STATELESS)?

Thank you! Will fix this

The user can set either false or NO; OpenVINO's ov::Any can parse both as a bool.
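
If the option becomes an enumerated value as suggested above (NPU_PIPELINE = STATEFUL or STATELESS), reading it could look like this sketch. It uses a plain `std::map` instead of `ov::AnyMap`, and `PipelineKind` and `pop_pipeline_kind` are hypothetical names; the option name itself was only a proposal at this point in the review.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <stdexcept>
#include <string>

enum class PipelineKind { Stateless, Stateful };

// Removes "NPU_PIPELINE" from the properties (so it is not forwarded further)
// and maps its value to an enum. Defaults to Stateless when the key is absent.
PipelineKind pop_pipeline_kind(std::map<std::string, std::string>& properties) {
    auto it = properties.find("NPU_PIPELINE");
    if (it == properties.end()) {
        return PipelineKind::Stateless;
    }
    std::string value = it->second;
    properties.erase(it);

    // Accept the value case-insensitively.
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });

    if (value == "STATEFUL") return PipelineKind::Stateful;
    if (value == "STATELESS") return PipelineKind::Stateless;
    throw std::invalid_argument("Unknown NPU_PIPELINE value: " + value);
}
```

Compared with a boolean flag, an enumerated value leaves room for further pipeline variants and rejects misspelled values explicitly.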

…genai into at/static-llm-pipeline-dynamic-shape-model

TolyaTalamanov added 2 commits November 20, 2024 19:22

Snapshot

d0b0298

Snapshot

e0416c6

github-actions bot added category: LLM LLM pipeline (stateful, static) category: samples GenAI samples labels Nov 20, 2024

AsyaPronina marked this pull request as draft November 20, 2024 19:26

AsyaPronina mentioned this pull request Nov 20, 2024

[NPUW] Dynamic stateful model support openvinotoolkit/openvino#27651

Merged

dmatveev reviewed Nov 25, 2024

Fixed all typos comparing to dual-model GenAI pipeline

cc34616

AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from 6cdd518 to cc34616 Compare November 27, 2024 15:41

smirnov-alexey reviewed Nov 27, 2024

samples/cpp/chat_sample/chat_sample.cpp (outdated)

smirnov-alexey reviewed Nov 27, 2024

src/cpp/src/llm_pipeline_static.cpp (outdated)