
[Question] Understanding Generation Logits & Context Logits #530

@here4dadata

After an engine is built with `--gather_all_token_logits` and a request is made through the backend with `return_context_logits: True` and `return_generation_logits: True`, it seems that, to piece together the full `text_output`, we have to take the very last row of `context_logits` and then the remaining tokens from `generation_logits`.

For example, prompt a model like Llama-3 with "Say Yes!"; the reply from the model is "YES!".

Imagine calling the `/generate` endpoint of the `ensemble` model with a request that contains this in the data body:

```python
data = {
    "text_input": prompt,
    "max_tokens": 8,
    "return_log_probs": True,
    "return_context_logits": True,
    "return_generation_logits": True,
    "parameters": {
        "temperature": 1,
        "top_k": 5,
    },
}
```
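For context, here is roughly how I send that request and pull the logits out of the response. This is a minimal sketch assuming Triton serves the `ensemble` model on `localhost:8000`; the hostname, port, and use of `requests` are just my setup, not anything specific to TensorRT-LLM:

```python
import requests

prompt = "Say Yes!"
data = {
    "text_input": prompt,
    "max_tokens": 8,
    "return_log_probs": True,
    "return_context_logits": True,
    "return_generation_logits": True,
    "parameters": {"temperature": 1, "top_k": 5},
}

# POST to Triton's HTTP generate endpoint for the ensemble model.
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=data,
)
resp.raise_for_status()
result = resp.json()

print(result["text_output"])  # -> "YES!"
context_logits = result["context_logits"]        # one row of logits per prompt token
generation_logits = result["generation_logits"]  # one row of logits per generated token
```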

`text_output` is: "YES!"

Now, when you take the `np.argmax` of `generation_logits` and decode the resulting token IDs with the tokenizer, all you see is:

  • index 0 : (token_id: 0, token: "!")
  • index 1 : (token_id: 128009, token: '<|eot_id|>')

The token for "YES" is actually found when you take the `np.argmax` of the last row of the `context_logits` matrix, i.e.:

  • index -1 : (token_id: 14331, token: "YES")
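Concretely, this is the decoding that produces the output above. A minimal sketch, assuming a Hugging Face tokenizer for the same model (the checkpoint name here is illustrative, and `result` is the JSON response from the request shown earlier):

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

context_logits = np.asarray(result["context_logits"])        # [prompt_len, vocab_size]
generation_logits = np.asarray(result["generation_logits"])  # [num_generated, vocab_size]

# The argmax of the LAST context row is the "YES" token (id 14331),
# while the generation rows only yield the remaining tokens.
first_id = int(np.argmax(context_logits[-1]))
rest_ids = np.argmax(generation_logits, axis=-1).tolist()

print(tokenizer.decode([first_id]))  # "YES"
print(tokenizer.decode(rest_ids))    # "!<|eot_id|>"
```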

Shouldn't the "YES" be in the `generation_logits` as well? Or am I misunderstanding something fundamental?

I am using TensorRT-LLM v0.10.0 and the `nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3` image to host the model.

Thanks for any help!
