Description
After an engine is built with `--gather_all_token_logits` and a call is made through the backend with `return_context_logits: True` and `return_generation_logits: True`, it seems that to piece together the full `text_output` we have to grab the very last row of the `context_logits` and then the remaining `generation_logits`.
As an example, prompt a model like Llama-3 with "Say Yes!"; the reply from the model is "YES!".
Imagine calling the `/generate` endpoint of `ensemble` with a request that contains this in the data body:
    data = {
        "text_input": f'{prompt}',
        "max_tokens": 8,
        "return_log_probs": True,
        "return_context_logits": True,
        "return_generation_logits": True,
        "parameters": {
            "temperature": 1,
            "top_k": 5,
        }
    }
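For reference, this is roughly how that request is sent and unpacked (a minimal sketch: the `localhost:8000` endpoint, the `requests` usage, and the `context_logits`/`generation_logits` field names in the JSON response are assumptions based on a default Triton HTTP deployment of the TensorRT-LLM ensemble):

```python
import requests

# Assumed default Triton HTTP endpoint and model name; adjust for your deployment.
url = "http://localhost:8000/v2/models/ensemble/generate"

# `data` is the body shown above, with `prompt` already defined.
response = requests.post(url, json=data)
result = response.json()

print(result["text_output"])                     # -> "YES!"
context_logits = result["context_logits"]        # logits over the prompt positions
generation_logits = result["generation_logits"]  # logits over the generated positions
```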
The returned `text_output` is: "YES!"
Now, when you take the `np.argmax` of the `generation_logits` and decode the resulting token IDs with the tokenizer, all you see is:
- index 0 : (token_id: 0, token: "!")
- index 1 : (token_id: 128009, token: '<|eot_id|>')
The token associated with "YES" is actually found when you take the `np.argmax` of the last row of the `context_logits` matrix, i.e.:
- index -1 : (token_id: 14331, token: "YES")
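Continuing from the sketch above, the decoding step I am describing looks roughly like this (again only a sketch: the tokenizer name is a placeholder for the model actually being served, and the exact logits shapes can vary, hence the defensive reshape):

```python
import numpy as np
from transformers import AutoTokenizer

# Placeholder tokenizer; use whichever tokenizer matches the deployed model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Flatten any leading batch/beam dimensions to get [num_positions, vocab_size].
ctx = np.asarray(context_logits)
ctx = ctx.reshape(-1, ctx.shape[-1])
gen = np.asarray(generation_logits)
gen = gen.reshape(-1, gen.shape[-1])

# Greedy token IDs from the generation logits: only "!" and <|eot_id|> show up here.
gen_ids = np.argmax(gen, axis=-1)
print([(int(i), tokenizer.decode([int(i)])) for i in gen_ids])

# The "YES" token is instead the argmax of the *last* row of the context logits,
# i.e. the model's prediction right after reading the prompt.
first_id = int(np.argmax(ctx[-1]))
print(first_id, tokenizer.decode([first_id]))  # -> 14331 "YES"
```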
Should the "YES" not be in the generation_logits as well? Am I not understanding something fundamental?
I am using TensorRT-LLM v0.10.0 and the `nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3` image to host the model.
Thanks for any help!