
Use whole history in case of undetermined tokenization of sequence #1268

Merged

Conversation

@sbalandi (Contributor) commented Nov 27, 2024:

Task: CVS-157295

  • dropped switching between EncodedInput and StringInput modes
  • if the history is found to contain ambiguous tokenization (i.e. the model's output tokens and the decoded-then-re-encoded tokens differ), start using the entire history for the LLM chat and only the part starting from the difference for the VLM (see the sketch below)
  • collect both the prompt and the model output for EncodedInput, and only the output for StringInput, when checking the history; the history itself is rewritten after tokenization, before inference, so that the templated answer is stored
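
Below is a minimal sketch of the comparison this PR relies on. It is an illustration, not the actual implementation: the real helpers (ov::genai::utils::is_tokenized_history_same and get_first_history_difference, referenced later in this review) work on the pipelines' own tensors and containers, so the plain-vector signatures here are assumptions.

```cpp
#include <cstdint>
#include <vector>

// Sketch only: the accumulated history (prompt tokens plus the raw tokens the
// model generated) can be trusted if re-encoding the templated chat text
// reproduces exactly the same token IDs.
bool tokenized_history_is_trustworthy(const std::vector<int64_t>& re_encoded_chat,
                                      const std::vector<int64_t>& accumulated_history) {
    return re_encoded_chat == accumulated_history;
}

// Sketch only: index of the first position where the two histories diverge;
// equals the accumulated history length when they fully match.
size_t first_history_difference(const std::vector<int64_t>& re_encoded_chat,
                                const std::vector<int64_t>& accumulated_history) {
    size_t i = 0;
    while (i < re_encoded_chat.size() && i < accumulated_history.size() &&
           re_encoded_chat[i] == accumulated_history[i]) {
        ++i;
    }
    return i;
}
```

When the histories match, the existing KV cache can be reused and only the new prompt has to be fed; when they diverge, the LLM pipeline falls back to the whole templated history and the VLM pipeline re-feeds only the part starting from the first difference.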

@sbalandi (Contributor, Author) commented:
It looks like StaticLLMPipeline encodes and uses the whole history on each iteration (https://github.com/openvinotoolkit/openvino.genai/blob/master/src/cpp/src/llm_pipeline_static.cpp#L852), so no update is needed for that pipeline.

@sbalandi sbalandi force-pushed the tok_hist_24_6 branch 2 times, most recently from 0c7c4da to 59110f6 on November 29, 2024 08:58
Review thread on src/cpp/src/visual_language/inputs_embedder.cpp (outdated, resolved):
m_trust_encoded_history = ov::genai::utils::is_tokenized_history_same(prev_chat_tokens.input_ids, m_tokenized_chat_history);
}

if (m_is_cache_empty || m_trust_encoded_history) {
Contributor:
Suggested change:
- if (m_is_cache_empty || m_trust_encoded_history) {
+ if (m_is_cache_empty || !m_trust_encoded_history) {

because we need to take the whole history if we cannot trust the encoded history (KV cache)?

@sbalandi (Contributor, Author) commented Nov 29, 2024:

Yes, thank you! Fixed.

But it was found that:

  1. for LLava/LLavaNext the tokenized and templated histories are never equal after the model output is appended. The model generates an answer whose first token is '_[symbol]', but it becomes just '[symbol]' if anything is added before it.
    Example:
    question: what is on the picture ?
    answer: The picture features a cat lying inside a cardboard box.
    [ 450 7623 5680 263 6635 19214 2768 263 5881 3377 3800 29889 2 ]
    In the templated history, where the whole text is re-encoded, it looks like: [ 1 3148 1001 29901 29871 32000 13 32000 13 5816 338 373 278 7623 1577 319 1799 9047 13566 29901 1576 7623 5680 263 6635 19214 2768 263 5881 3377 3800 29889 ]
    In the tokenized history, where the output is simply appended, it looks like: [ 1 3148 1001 29901 29871 32000 13 32000 13 5816 338 373 278 7623 1577 319 1799 9047 13566 29901 450 7623 5680 263 6635 19214 2768 263 5881 3377 3800 29889 2 ]
    I would suppose this could be fixed by keeping the length of the answer and comparing the history and the answer separately, but I would do that in another PR, if it is needed.

  2. when the whole templated history is passed as input, it starts including special tokens, specifically <image>. llava-hf/llava-1.5-7b-hf and llava-v1.6-vicuna-7b-hf generate output that looks incorrect in this case; it is fixed by removing <image> from the history.

@ilya-lavrenov (Contributor) commented Nov 30, 2024:

  1. does it mean that in the current PR the whole history will be passed for such models? BTW, it is interesting where the 1576 token is coming from.
  2. the whole history misses the actual image tokens (their embeddings), and I suppose that is the reason for the incorrect answer.

@sbalandi (Contributor, Author) commented Dec 2, 2024:

  1. now no; with KV-cache trimming it will be the part from the first differing token to the end of the history. Regarding token 1576, I'll check; this is not the first case where symbols switch when something appears before the sequence. For example, the tokenizers (AutoTokenizer and openvino_tokenizer) for TinyLlama/TinyLlama-1.1B-Chat-v1.0 also change the first symbols by the same logic, so I think it is something like that.
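
A hedged sketch of the trimming decision described in this reply, building on the first_history_difference sketch from the PR description above; the names are assumptions that mirror the snippets quoted in this review, not the PR's exact code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustration only: how many trailing KV-cache entries to drop, given the
// re-encoded templated history and the token IDs accumulated so far.
size_t tokens_to_remove_from_history(const std::vector<int64_t>& re_encoded_chat,
                                     const std::vector<int64_t>& accumulated_history) {
    const size_t history_size = accumulated_history.size();
    const size_t last_same_hist_token =
        first_history_difference(re_encoded_chat, accumulated_history);

    if (last_same_hist_token == history_size) {
        return 0;  // histories fully match: keep the whole KV cache
    }
    // Drop every cached entry from the first mismatch onward; the caller then
    // re-feeds the tail of the templated history starting at that position.
    return history_size - last_same_hist_token;
}
```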

Review thread on src/cpp/src/visual_language/pipeline.cpp (outdated, resolved)
@ilya-lavrenov ilya-lavrenov added this to the 2024.6 milestone Nov 29, 2024
@sbalandi sbalandi force-pushed the tok_hist_24_6 branch 2 times, most recently from 57adccb to 5c2e814 on November 30, 2024 10:18
Additional resolved review threads (mostly outdated) on src/cpp/src/visual_language/inputs_embedder.cpp, src/cpp/src/visual_language/pipeline.cpp and src/cpp/src/utils.cpp.
// after first `get_inputs_embeds` is called, we supposed LLM is inferred and cache is not empty
m_is_cache_empty = false;
} else if (!(last_same_hist_token == m_tokenized_chat_history.size() - 1)) {
m_to_remove_from_hist = m_tokenized_chat_history.size() - 1 - last_same_hist_token;
Contributor:
Maybe I misunderstand, but in case of identical histories get_first_history_difference will return the history size, so shouldn't we compare last_same_hist_token == m_tokenized_chat_history.size() (w/o -1)?

Contributor (Author):

Yes, we need to check both variants; fixed.

Contributor:

It is still not clear why we need to subtract 1 here. Below we have similar code, new_chat_tokens.get_shape().at(1) - last_same_hist_token, without the -1.

@ilya-lavrenov (Contributor) commented Dec 13, 2024:

Fixed in #1254
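
As a worked example of why the extra -1 matters (an illustrative reading of the discussion above, not text from the PR): with a fully matching 12-token history, get_first_history_difference returns 12, so m_tokenized_chat_history.size() - last_same_hist_token correctly yields 0 tokens to trim, whereas size() - 1 - last_same_hist_token would evaluate to -1 (or underflow an unsigned counter). With a first mismatch at index 9, 12 - 9 = 3 rewinds the KV cache exactly to the matching prefix, while the extra -1 would leave one stale entry.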

@ilya-lavrenov ilya-lavrenov added the bug (Something isn't working) label and removed the category: sampling (Sampling / Decoding algorithms) label Dec 2, 2024
@github-actions github-actions bot added the category: sampling (Sampling / Decoding algorithms) label Dec 2, 2024
@@ -144,14 +144,22 @@ class ov::genai::VLMPipeline::VLMPipelineImpl {

ov::Tensor inputs_embeds = m_inputs_embedder->get_inputs_embeds(prompt, rgbs);

auto to_remove_from_hist = m_inputs_embedder->get_amount_to_remove_from_hist();
if (to_remove_from_hist > 0) {
ov::genai::utils::trim_kv_cache(m_language, to_remove_from_hist);
@ilya-lavrenov (Contributor) commented Dec 3, 2024:

I think we can always call trim_kv_cache and check to_remove_from_hist == 0 internally, returning early when there is nothing to trim. The calling code will be clearer; see the sketch below.
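
A sketch of the suggested shape, assuming a simplified signature; the real ov::genai::utils::trim_kv_cache operates on the pipeline's KV-cache state tensors, and the body below is elided.

```cpp
#include <cstdint>
#include <openvino/runtime/infer_request.hpp>

// Sketch of the suggestion: make the helper safe to call unconditionally by
// returning early when nothing has to be trimmed.
void trim_kv_cache(ov::InferRequest& request, uint64_t tokens_to_remove) {
    if (tokens_to_remove == 0) {
        return;  // nothing to trim, so callers do not need their own check
    }
    (void)request;  // the real implementation would trim `tokens_to_remove`
                    // positions from the sequence axis of every KV-cache state
                    // tensor of `request` (details elided in this sketch)
}
```

The call site from the diff above then shrinks to a single unconditional call, e.g. trim_kv_cache(m_language, m_inputs_embedder->get_amount_to_remove_from_hist());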

Contributor:

Fixed in #1254

@ilya-lavrenov (Contributor) left a review comment:

The last comments can be applied later if they are valid.

Also, we can reuse the KV-cache trim approach for LLMs as well (on the master branch).

@ilya-lavrenov ilya-lavrenov added this pull request to the merge queue Dec 3, 2024
Merged via the queue into openvinotoolkit:releases/2024/5 with commit a4fe38b Dec 3, 2024
52 checks passed
@ilya-lavrenov ilya-lavrenov added and then removed the port to master (PR needs to be ported to master from release branch) label Dec 13, 2024
github-merge-queue bot pushed a commit that referenced this pull request Dec 16, 2024

…1254)

Task: [CVS-157295](https://jira.devtools.intel.com/browse/CVS-157295)

- first commit is a cherry-pick from #1268 and #1361
- the next commit applies comments from #1268 and adds usage of the KV cache for the LLM
ScottZhang812 pushed a commit to ScottZhang812/_openvino.genai that referenced this pull request Dec 23, 2024

…1254)

Task: [CVS-157295](https://jira.devtools.intel.com/browse/CVS-157295)

- first commit is a cherry-pick from openvinotoolkit/openvino.genai#1268 and openvinotoolkit/openvino.genai#1361
- the next commit applies comments from openvinotoolkit/openvino.genai#1268 and adds usage of the KV cache for the LLM
Labels
bug (Something isn't working), category: LLM (LLM pipeline (stateful, static)), category: sampling (Sampling / Decoding algorithms), category: visual language (Visual language pipeline)
3 participants