
Use whole history in case of undetermined tokenization of sequence #1268

Merged

Conversation

@sbalandi (Contributor) commented Nov 27, 2024:

Task: CVS-157295

  • dropped switching between EncodedInput and StringInput modes
  • if the history is found to contain ambiguous tokenization (i.e. the model's output tokens and the decoded-then-re-encoded tokens differ), start using the entire history for the LLM chat and only the part starting from the difference for the VLM (see the sketch below)
  • collect both the prompt and the model output for EncodedInput, and only the output for StringInput, when checking the history; the history itself is rewritten after tokenization, before inference, so that the templated answer is stored
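
Below is a minimal sketch of the comparison this PR relies on. It is an illustration, not the actual implementation: the real helpers (ov::genai::utils::is_tokenized_history_same and get_first_history_difference, referenced later in this review) work on the pipelines' own tensors and containers, so the plain-vector signatures here are assumptions.

```cpp
#include <cstdint>
#include <vector>

// Sketch only: the accumulated history (prompt tokens plus the raw tokens the
// model generated) can be trusted if re-encoding the templated chat text
// reproduces exactly the same token IDs.
bool tokenized_history_is_trustworthy(const std::vector<int64_t>& re_encoded_chat,
                                      const std::vector<int64_t>& accumulated_history) {
    return re_encoded_chat == accumulated_history;
}

// Sketch only: index of the first position where the two histories diverge;
// equals the accumulated history length when they fully match.
size_t first_history_difference(const std::vector<int64_t>& re_encoded_chat,
                                const std::vector<int64_t>& accumulated_history) {
    size_t i = 0;
    while (i < re_encoded_chat.size() && i < accumulated_history.size() &&
           re_encoded_chat[i] == accumulated_history[i]) {
        ++i;
    }
    return i;
}
```

When the histories match, the existing KV cache can be reused and only the new prompt has to be fed; when they diverge, the LLM pipeline falls back to the whole templated history and the VLM pipeline re-feeds only the part starting from the first difference.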

@sbalandi (Contributor, Author) commented:
It looks like StaticLLMPipeline encodes and uses the whole history on each iteration (https://github.com/openvinotoolkit/openvino.genai/blob/master/src/cpp/src/llm_pipeline_static.cpp#L852), so no update is needed for that pipeline.

@sbalandi sbalandi force-pushed the tok_hist_24_6 branch 2 times, most recently from 0c7c4da to 59110f6 on November 29, 2024 08:58
Review thread on src/cpp/src/visual_language/inputs_embedder.cpp (outdated, resolved):
m_trust_encoded_history = ov::genai::utils::is_tokenized_history_same(prev_chat_tokens.input_ids, m_tokenized_chat_history);
}

if (m_is_cache_empty || m_trust_encoded_history) {
Contributor:
Suggested change:
- if (m_is_cache_empty || m_trust_encoded_history) {
+ if (m_is_cache_empty || !m_trust_encoded_history) {

because we need to take the whole history if we cannot trust the encoded history (KV cache)?

@sbalandi (Contributor, Author) commented Nov 29, 2024:

Yes, thank you! Fixed.

But it was found that:

  1. for LLava/LLavaNext the tokenized and templated histories are never equal after the model output is appended. The model generates an answer whose first token is '_[symbol]', but it becomes just '[symbol]' if anything is added before it.
    Example:
    question: what is on the picture ?
    answer: The picture features a cat lying inside a cardboard box.
    [ 450 7623 5680 263 6635 19214 2768 263 5881 3377 3800 29889 2 ]
    In the templated history, where the whole text is re-encoded, it looks like: [ 1 3148 1001 29901 29871 32000 13 32000 13 5816 338 373 278 7623 1577 319 1799 9047 13566 29901 1576 7623 5680 263 6635 19214 2768 263 5881 3377 3800 29889 ]
    In the tokenized history, where the output is simply appended, it looks like: [ 1 3148 1001 29901 29871 32000 13 32000 13 5816 338 373 278 7623 1577 319 1799 9047 13566 29901 450 7623 5680 263 6635 19214 2768 263 5881 3377 3800 29889 2 ]
    I would suppose this could be fixed by keeping the length of the answer and comparing the history and the answer separately, but I would do that in another PR, if it is needed.

  2. when the whole templated history is passed as input, it starts including special tokens, specifically <image>. llava-hf/llava-1.5-7b-hf and llava-v1.6-vicuna-7b-hf generate output that looks incorrect in this case; it is fixed by removing <image> from the history.

@ilya-lavrenov (Contributor) commented Nov 30, 2024:

  1. does it mean that in the current PR the whole history will be passed for such models? BTW, it is interesting where the 1576 token is coming from.
  2. the whole history misses the actual image tokens (their embeddings), and I suppose that is the reason for the incorrect answer.

@sbalandi (Contributor, Author) commented Dec 2, 2024:

  1. now no; with KV-cache trimming it will be the part from the first differing token to the end of the history. Regarding token 1576, I'll check; this is not the first case where symbols switch when something appears before the sequence. For example, the tokenizers (AutoTokenizer and openvino_tokenizer) for TinyLlama/TinyLlama-1.1B-Chat-v1.0 also change the first symbols by the same logic, so I think it is something like that.
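
A hedged sketch of the trimming decision described in this reply, building on the first_history_difference sketch from the PR description above; the names are assumptions that mirror the snippets quoted in this review, not the PR's exact code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustration only: how many trailing KV-cache entries to drop, given the
// re-encoded templated history and the token IDs accumulated so far.
size_t tokens_to_remove_from_history(const std::vector<int64_t>& re_encoded_chat,
                                     const std::vector<int64_t>& accumulated_history) {
    const size_t history_size = accumulated_history.size();
    const size_t last_same_hist_token =
        first_history_difference(re_encoded_chat, accumulated_history);

    if (last_same_hist_token == history_size) {
        return 0;  // histories fully match: keep the whole KV cache
    }
    // Drop every cached entry from the first mismatch onward; the caller then
    // re-feeds the tail of the templated history starting at that position.
    return history_size - last_same_hist_token;
}
```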

Review thread on src/cpp/src/visual_language/pipeline.cpp (outdated, resolved)
@ilya-lavrenov ilya-lavrenov added this to the 2024.6 milestone Nov 29, 2024
@sbalandi sbalandi force-pushed the tok_hist_24_6 branch 2 times, most recently from 57adccb to 5c2e814 on November 30, 2024 10:18
Additional resolved review threads (mostly outdated) on src/cpp/src/visual_language/inputs_embedder.cpp, src/cpp/src/visual_language/pipeline.cpp and src/cpp/src/utils.cpp.
// after first `get_inputs_embeds` is called, we supposed LLM is inferred and cache is not empty
m_is_cache_empty = false;
} else if (!(last_same_hist_token == m_tokenized_chat_history.size() - 1)) {
m_to_remove_from_hist = m_tokenized_chat_history.size() - 1 - last_same_hist_token;
Contributor:
Maybe I misunderstand, but in case of identical histories get_first_history_difference will return the history size, so shouldn't we compare last_same_hist_token == m_tokenized_chat_history.size() (w/o -1)?

Contributor (Author):

Yes, we need to check both variants; fixed.

Contributor:

It is still not clear why we need to subtract 1 here. Below we have similar code, new_chat_tokens.get_shape().at(1) - last_same_hist_token, without the -1.

@ilya-lavrenov (Contributor) commented Dec 13, 2024:

Fixed in #1254
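
As a worked example of why the extra -1 matters (an illustrative reading of the discussion above, not text from the PR): with a fully matching 12-token history, get_first_history_difference returns 12, so m_tokenized_chat_history.size() - last_same_hist_token correctly yields 0 tokens to trim, whereas size() - 1 - last_same_hist_token would evaluate to -1 (or underflow an unsigned counter). With a first mismatch at index 9, 12 - 9 = 3 rewinds the KV cache exactly to the matching prefix, while the extra -1 would leave one stale entry.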

@ilya-lavrenov ilya-lavrenov added the bug (Something isn't working) label and removed the category: sampling (Sampling / Decoding algorithms) label Dec 2, 2024
@github-actions github-actions bot added the category: sampling (Sampling / Decoding algorithms) label Dec 2, 2024
@@ -144,14 +144,22 @@ class ov::genai::VLMPipeline::VLMPipelineImpl {

ov::Tensor inputs_embeds = m_inputs_embedder->get_inputs_embeds(prompt, rgbs);

auto to_remove_from_hist = m_inputs_embedder->get_amount_to_remove_from_hist();
if (to_remove_from_hist > 0) {
ov::genai::utils::trim_kv_cache(m_language, to_remove_from_hist);
@ilya-lavrenov (Contributor) commented Dec 3, 2024:

I think we can always call trim_kv_cache and check to_remove_from_hist == 0 internally, returning early when there is nothing to trim. The calling code will be clearer; see the sketch below.
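
A sketch of the suggested shape, assuming a simplified signature; the real ov::genai::utils::trim_kv_cache operates on the pipeline's KV-cache state tensors, and the body below is elided.

```cpp
#include <cstdint>
#include <openvino/runtime/infer_request.hpp>

// Sketch of the suggestion: make the helper safe to call unconditionally by
// returning early when nothing has to be trimmed.
void trim_kv_cache(ov::InferRequest& request, uint64_t tokens_to_remove) {
    if (tokens_to_remove == 0) {
        return;  // nothing to trim, so callers do not need their own check
    }
    (void)request;  // the real implementation would trim `tokens_to_remove`
                    // positions from the sequence axis of every KV-cache state
                    // tensor of `request` (details elided in this sketch)
}
```

The call site from the diff above then shrinks to a single unconditional call, e.g. trim_kv_cache(m_language, m_inputs_embedder->get_amount_to_remove_from_hist());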

Contributor:

Fixed in #1254

@ilya-lavrenov (Contributor) left a review comment:

The last comments can be applied later if they are valid.

Also, we can reuse the KV-cache trim approach for LLMs as well (on the master branch).

@ilya-lavrenov ilya-lavrenov added this pull request to the merge queue Dec 3, 2024
Merged via the queue into openvinotoolkit:releases/2024/5 with commit a4fe38b Dec 3, 2024
52 checks passed
@ilya-lavrenov ilya-lavrenov added and then removed the port to master (PR needs to be ported to master from release branch) label Dec 13, 2024
github-merge-queue bot pushed a commit that referenced this pull request Dec 16, 2024

…1254)

Task: [CVS-157295](https://jira.devtools.intel.com/browse/CVS-157295)

- first commit is a cherry-pick from #1268 and #1361
- the next commit applies comments from #1268 and adds usage of the KV cache for the LLM
ScottZhang812 pushed a commit to ScottZhang812/_openvino.genai that referenced this pull request Dec 23, 2024

…1254)

Task: [CVS-157295](https://jira.devtools.intel.com/browse/CVS-157295)

- first commit is a cherry-pick from openvinotoolkit/openvino.genai#1268 and openvinotoolkit/openvino.genai#1361
- the next commit applies comments from openvinotoolkit/openvino.genai#1268 and adds usage of the KV cache for the LLM
Labels
bug (Something isn't working), category: LLM (LLM pipeline (stateful, static)), category: sampling (Sampling / Decoding algorithms), category: visual language (Visual language pipeline)
3 participants