Description
After an engine is built with `--gather_all_token_logits` and a call is made through the backend with `return_context_logits: True` and `return_generation_logits: True`, it seems that to piece together the full `text_output` we have to grab the very last row of the `context_logits` and then the remaining `generation_logits`.
As an example, prompt a model like Llama-3 with "Say Yes!"; the reply from the model is "YES!".
Imagine calling the `/generate` endpoint of `ensemble` with a request that contains this in the data body:
    data = {
        "text_input": f'{prompt}',
        "max_tokens": 8,
        "return_log_probs": True,
        "return_context_logits": True,
        "return_generation_logits": True,
        "parameters": {
            "temperature": 1,
            "top_k": 5,
        }
    }
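For reference, this is roughly how that request is sent and unpacked (a minimal sketch: the `localhost:8000` endpoint, the `requests` usage, and the `context_logits`/`generation_logits` field names in the JSON response are assumptions based on a default Triton HTTP deployment of the TensorRT-LLM ensemble):

```python
import requests

# Assumed default Triton HTTP endpoint and model name; adjust for your deployment.
url = "http://localhost:8000/v2/models/ensemble/generate"

# `data` is the body shown above, with `prompt` already defined.
response = requests.post(url, json=data)
result = response.json()

print(result["text_output"])                     # -> "YES!"
context_logits = result["context_logits"]        # logits over the prompt positions
generation_logits = result["generation_logits"]  # logits over the generated positions
```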
The returned `text_output` is: "YES!"
Now, when you take the `np.argmax` of the `generation_logits` and decode the resulting token IDs with the tokenizer, all you see is:
- index 0 : (token_id: 0, token: "!")
- index 1 : (token_id: 128009, token: '<|eot_id|>')
The token associated with "YES" is actually found when you take the `np.argmax` of the last row of the `context_logits` matrix, i.e.:
- index -1 : (token_id: 14331, token: "YES")
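Continuing from the sketch above, the decoding step I am describing looks roughly like this (again only a sketch: the tokenizer name is a placeholder for the model actually being served, and the exact logits shapes can vary, hence the defensive reshape):

```python
import numpy as np
from transformers import AutoTokenizer

# Placeholder tokenizer; use whichever tokenizer matches the deployed model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Flatten any leading batch/beam dimensions to get [num_positions, vocab_size].
ctx = np.asarray(context_logits)
ctx = ctx.reshape(-1, ctx.shape[-1])
gen = np.asarray(generation_logits)
gen = gen.reshape(-1, gen.shape[-1])

# Greedy token IDs from the generation logits: only "!" and <|eot_id|> show up here.
gen_ids = np.argmax(gen, axis=-1)
print([(int(i), tokenizer.decode([int(i)])) for i in gen_ids])

# The "YES" token is instead the argmax of the *last* row of the context logits,
# i.e. the model's prediction right after reading the prompt.
first_id = int(np.argmax(ctx[-1]))
print(first_id, tokenizer.decode([first_id]))  # -> 14331 "YES"
```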
Should the "YES" not be in the generation_logits as well? Am I not understanding something fundamental?
I am using TensorRT-LLM v0.10.0 and the `nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3` image to host the model.
Thanks for any help!