### Description

Feature request: on the `/tokenize` endpoint of TGI, add an option to apply the chat template from the model's tokenizer, if one exists, before tokenizing.
### Motivation

The `/tokenize` endpoint of TGI is very useful in situations where an application requires information about the tokenization of a string, but doesn't have direct access to a model/tokenizer that can be loaded with `AutoTokenizer`.

Specifically, I have instances where I need to know the token count of a prompt before sending it to `/v1/chat/completions`, so that I can appropriately truncate the input to be <= `max_input_tokens`.

`/tokenize`, however, does not adequately serve this purpose when calling `/v1/chat/completions`, because the tokenization it returns is computed on the prompt without the chat template applied. Since the chat template may differ by model, there is no generic way via a TGI endpoint to get the token count of a prompt after a chat template has been applied, which makes preventing inputs from exceeding `max_input_tokens` very difficult.
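To illustrate the client-side flow described above, here is a minimal sketch. `truncate_to_limit` is a hypothetical helper, and it assumes the client has already obtained the templated token ids from the proposed endpoint; real truncation would likely drop whole messages rather than raw tokens.

```rust
// Hypothetical client-side helper: given the token ids returned by /tokenize
// (with the proposed apply_chat_template option), trim the sequence so it
// fits within max_input_tokens. The response shape is an assumption here.
fn truncate_to_limit(token_ids: &[u32], max_input_tokens: usize) -> &[u32] {
    if token_ids.len() <= max_input_tokens {
        token_ids
    } else {
        &token_ids[..max_input_tokens]
    }
}
```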
### Your contribution

[Caveat: I do not know Rust, nor am I familiar with the inner workings of TGI.]

I have done my best to read through the existing `/tokenize` implementation, and will attempt to provide a high-level overview of what might need to be changed, and where.

Adding an optional boolean parameter `apply_chat_template` to the `/tokenize` endpoint would suffice for my purposes. It appears that one could mirror the existing `return_full_text` boolean parameter of `GenerateParameters`.

Furthermore, I imagine the `/tokenize` [with chat template] implementation would be very similar to what's happening at the `/v1/chat/completions` endpoint:
`/v1/chat/completions` chat templating implementation:

```rust
// apply chat template to flatten the request into a single input
let mut inputs = match infer.apply_chat_template(req.messages) {
    Ok(inputs) => inputs,
    Err(err) => {
        metrics::increment_counter!("tgi_request_failure", "err" => "validation");
        tracing::error!("{err}");
        return Err((
            StatusCode::UNPROCESSABLE_ENTITY,
            Json(ErrorResponse {
                error: err.to_string(),
                error_type: err.error_type().to_string(),
            }),
        ));
    }
};
```
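A rough sketch of how the `/tokenize` handler might branch on the new parameter. All types and names here are stand-ins for illustration, not TGI's actual structs; `apply_chat_template` below is a stub rather than the real `infer.apply_chat_template`.

```rust
// Stubbed-out sketch: TokenizeParameters/TokenizeRequest mirror the proposed
// API shape, not TGI's actual internals.
#[derive(Default)]
struct TokenizeParameters {
    apply_chat_template: Option<bool>,
}

struct TokenizeRequest {
    inputs: String,
    parameters: TokenizeParameters,
}

// Stand-in for the tokenizer's chat template: wraps the prompt in fake
// role markers so the branching below is observable.
fn apply_chat_template(inputs: &str) -> String {
    format!("<|user|>{inputs}<|end|>")
}

// The handler would template the inputs first when the flag is set, then
// tokenize the resulting string exactly as it does today.
fn resolve_inputs(req: &TokenizeRequest) -> String {
    if req.parameters.apply_chat_template.unwrap_or(false) {
        apply_chat_template(&req.inputs)
    } else {
        req.inputs.clone()
    }
}
```

Defaulting the flag to `false` keeps the current `/tokenize` behavior unchanged for existing clients.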
Example API calls with the proposed parameter:

Without chat template applied (current behavior):

```shell
curl -X 'POST' \
  'http://my.tgi.host/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "My name is Olivier and I"
}'
```
With chat template applied (proposed behavior):

```shell
curl -X 'POST' \
  'http://localhost:8083/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "My name is Olivier and I",
  "parameters": {
    "apply_chat_template": true
  }
}'
```