Use whole history in case of undetermined tokenization of sequence #1268
Conversation
Force-pushed from 2414de9 to 8e65064
It looks like StaticLLMPipeline encodes and uses the whole history on each iteration (https://github.com/openvinotoolkit/openvino.genai/blob/master/src/cpp/src/llm_pipeline_static.cpp#L852), so no update is needed for this pipeline.
Force-pushed from 0c7c4da to 59110f6
```cpp
m_trust_encoded_history = ov::genai::utils::is_tokenized_history_same(prev_chat_tokens.input_ids, m_tokenized_chat_history);
}

if (m_is_cache_empty || m_trust_encoded_history) {
```
Suggested change:
```cpp
// if (m_is_cache_empty || m_trust_encoded_history) {
if (m_is_cache_empty || !m_trust_encoded_history) {
```
because we need to take the whole history if we cannot trust the encoded history (KV cache)?
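For context, a minimal sketch of the kind of check the `m_trust_encoded_history` flag is based on, together with the corrected condition. This is illustrative only: the real helper `is_tokenized_history_same` works on `ov::Tensor` inputs, and the names and signature below are assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: the cached KV state can only be trusted when the freshly
// tokenized chat history reproduces the tokens that were previously fed to the
// model (no re-segmentation at the seam between old history and new output).
bool can_trust_encoded_history(const std::vector<int64_t>& fresh_tokens,
                               const std::vector<int64_t>& cached_tokens) {
    if (cached_tokens.size() > fresh_tokens.size())
        return false;
    return std::equal(cached_tokens.begin(), cached_tokens.end(), fresh_tokens.begin());
}

// Usage mirroring the corrected condition from the suggestion above:
//   if (m_is_cache_empty || !m_trust_encoded_history) { /* re-encode the whole history */ }
```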
Yes, thank you! Fixed.
But it was found that:

- For LLaVA/LLaVA-Next, the tokenized and templated histories are never equal after the model output is appended. The model generates an answer whose first token is '_[symbol]', but it is converted to just '[symbol]' if something is added before it.
  Example:
  question: what is on the picture ?
  answer: The picture features a cat lying inside a cardboard box.
  [ 450 7623 5680 263 6635 19214 2768 263 5881 3377 3800 29889 2 ]
  In the templated history, where we encode the whole text, it looks like: [ 1 3148 1001 29901 29871 32000 13 32000 13 5816 338 373 278 7623 1577 319 1799 9047 13566 29901 1576 7623 5680 263 6635 19214 2768 263 5881 3377 3800 29889 ]
  In the tokenized history, where we just append the output, it looks like: [ 1 3148 1001 29901 29871 32000 13 32000 13 5816 338 373 278 7623 1577 319 1799 9047 13566 29901 450 7623 5680 263 6635 19214 2768 263 5881 3377 3800 29889 2 ]
  I suppose this could be fixed by keeping the length of the answer and comparing the history and the answer separately, but I would do that in another PR if it is needed.
- When the whole templated history is passed as input, it starts including special tokens, specifically <image>. llava-hf/llava-1.5-7b-hf and llava-v1.6-vicuna-7b-hf generate output that looks incorrect in this case; it is fixed by removing <image> from the history (see the sketch after this list).
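As an aside, a minimal sketch of the <image> stripping mentioned in the second item above. The helper name is an assumption and the actual pipeline may handle this differently; the point is only to avoid re-encoding image placeholder tokens that have no matching embeddings.

```cpp
#include <string>

// Illustrative only: remove every literal "<image>" marker from the templated
// history text before re-encoding it, so the tokenizer does not emit image
// placeholder tokens that have no corresponding image embeddings.
void remove_image_tags(std::string& history, const std::string& tag = "<image>") {
    for (size_t pos = history.find(tag); pos != std::string::npos; pos = history.find(tag, pos)) {
        history.erase(pos, tag.size());
    }
}
```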
- Does it mean that in the current PR the whole history will be passed for such models? BTW, interesting where the 1576 token is coming from.
- The whole history misses the actual image tokens (their embeddings), and I suppose that is the reason for the incorrect answer.
- Now no: with KV cache trimming, only the part from the first differing token to the end of the history will be passed. Regarding token 1576, I'll check; this is not the first case where symbols switch when something appears before the sequence. For example, the tokenizers (AutoTokenizer and openvino_tokenizer) for TinyLlama/TinyLlama-1.1B-Chat-v1.0 also change the first symbols by the same logic, so I think it is something like that.
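For illustration, a sketch of the first-difference and trim computation described here, on plain token vectors. The actual `get_first_history_difference` works on `ov::Tensor`, so treat the names and signatures below as assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Find the first position at which the cached tokenized history and the freshly
// encoded history diverge. When the cached history is a prefix of the fresh one,
// this returns cached.size(), i.e. nothing in the KV cache has to be dropped.
size_t first_history_difference(const std::vector<int64_t>& cached,
                                const std::vector<int64_t>& fresh) {
    size_t i = 0;
    while (i < cached.size() && i < fresh.size() && cached[i] == fresh[i])
        ++i;
    return i;
}

// Number of trailing tokens to trim from the KV cache: everything from the first
// differing position to the end of the cached history (0 when the histories agree).
size_t tokens_to_remove_from_cache(const std::vector<int64_t>& cached,
                                   const std::vector<int64_t>& fresh) {
    return cached.size() - first_history_difference(cached, fresh);
}
```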
Force-pushed from 57adccb to 5c2e814
Force-pushed from 5c2e814 to 42c38df
```cpp
// after first `get_inputs_embeds` is called, we supposed LLM is inferred and cache is not empty
m_is_cache_empty = false;
} else if (!(last_same_hist_token == m_tokenized_chat_history.size() - 1)) {
    m_to_remove_from_hist = m_tokenized_chat_history.size() - 1 - last_same_hist_token;
```
Maybe I misunderstand, but in the case of identical histories get_first_history_difference will return the history size, so shouldn't we compare last_same_hist_token == m_tokenized_chat_history.size() (without the -1)?
Yes, we need to check both variants. Fixed.
It's still not clear why we need to subtract 1 here. Below we have similar code, new_chat_tokens.get_shape().at(1) - last_same_hist_token, without the -1.
Fixed in #1254
Force-pushed from 42c38df to f631d8d
Force-pushed from 50de176 to cc6b4f5
```cpp
@@ -144,14 +144,22 @@ class ov::genai::VLMPipeline::VLMPipelineImpl {

    ov::Tensor inputs_embeds = m_inputs_embedder->get_inputs_embeds(prompt, rgbs);

    auto to_remove_from_hist = m_inputs_embedder->get_amount_to_remove_from_hist();
    if (to_remove_from_hist > 0) {
        ov::genai::utils::trim_kv_cache(m_language, to_remove_from_hist);
```
I think we can always call trim_kv_cache, check to_remove_from_hist == 0 internally, and return early when there is nothing to trim. The calling code would be clearer.
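A small sketch of the suggested shape of the helper. FakeLanguageModel and drop_last_tokens are stand-ins for illustration, not the real ov::genai::utils::trim_kv_cache signature.

```cpp
#include <cstddef>

// Purely illustrative stand-in for the model whose KV cache gets trimmed.
struct FakeLanguageModel {
    size_t cached_tokens = 0;
    void drop_last_tokens(size_t n) { cached_tokens -= n; }
};

// Handle the "nothing to remove" case inside the helper, so the caller can
// invoke it unconditionally instead of wrapping it in `if (to_remove > 0)`.
void trim_kv_cache(FakeLanguageModel& model, size_t tokens_to_remove) {
    if (tokens_to_remove == 0)
        return;  // KV cache already matches the history
    model.drop_last_tokens(tokens_to_remove);
}
```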
Fixed in #1254
The last comments can be applied later if they are valid.
Also, we can re-use the KV cache trim approach for LLMs as well (on the master branch).
a4fe38b
…1254) Task: [CVS-157295](https://jira.devtools.intel.com/browse/CVS-157295) - first commit is a cherry-pick from openvinotoolkit/openvino.genai#1268 and openvinotoolkit/openvino.genai#1361 - next commit includes applying comments from openvinotoolkit/openvino.genai#1268 and adding usage of the KV cache for LLM
Task: CVS-157295