
/tokenize - Optionally Apply Chat Template before Tokenization #1706

Closed as not planned
@elsell

Description


Feature request

On the /tokenize endpoint of TGI, add an option to apply the chat template from the model's tokenizer, if one exists, before tokenizing.

Motivation

The /tokenize endpoint of TGI is very useful in situations where an application requires information about the tokenization of a string, but doesn't have direct access to a model/tokenizer that can be loaded with AutoTokenizer.

Specifically, I have instances where I need to know the token count of a prompt before sending it to /v1/chat/completions so that I can appropriately truncate the input to be <= max_input_tokens.

/tokenize, however, does not adequately serve this purpose when calling /v1/chat/completions, because the tokenization it returns is of the raw prompt, without the chat template applied.

Since the chat template differs from model to model, there is no generic way, via a TGI endpoint, to get the token count of a prompt after the chat template has been applied, which makes it very difficult to keep inputs from exceeding max_input_tokens.

Your contribution

[Caveat - I do not know Rust, nor am I familiar with the inner workings of TGI]

I have done my best to read through the existing /tokenize implementation, and will attempt to provide a high-level overview of what might need to be changed, and where.

Adding an optional boolean parameter apply_chat_template to the /tokenize endpoint would suffice for my purposes.

It appears that one could mirror the existing return_full_text boolean parameter of GenerateParameters.
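A minimal sketch of what that could look like, assuming the parameter lives alongside return_full_text in GenerateParameters; the field name and the attributes shown are my proposal, not existing TGI code:

    #[derive(Clone, Debug, Deserialize, ToSchema)]
    pub(crate) struct GenerateParameters {
        // ... existing fields, including `return_full_text` ...

        /// Proposed (not existing TGI code): flatten `inputs` through the
        /// tokenizer's chat template, if the model ships one, before tokenizing.
        #[serde(default)]
        #[schema(default = "false", example = true)]
        pub apply_chat_template: Option<bool>,
    }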

Furthermore, I imagine the /tokenize [with chat template] implementation would be very similar to what's happening at the /v1/chat/completions endpoint:

`/v1/chat/completions` chat templating implementation
    // apply chat template to flatten the request into a single input
    let mut inputs = match infer.apply_chat_template(req.messages) {
        Ok(inputs) => inputs,
        Err(err) => {
            metrics::increment_counter!("tgi_request_failure", "err" => "validation");
            tracing::error!("{err}");
            return Err((
                StatusCode::UNPROCESSABLE_ENTITY,
                Json(ErrorResponse {
                    error: err.to_string(),
                    error_type: err.error_type().to_string(),
                }),
            ));
        }
    };
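A hedged sketch of how the /tokenize handler could branch on the new flag. The request fields and the Message construction below are assumptions modeled on the snippet above, not actual TGI code:

    // Sketch only: inside the tokenize handler, before encoding.
    let inputs = if req.parameters.apply_chat_template.unwrap_or(false) {
        // Wrap the raw string as a single user message so it can go
        // through the same flattening used by /v1/chat/completions.
        match infer.apply_chat_template(vec![Message {
            role: "user".to_string(),
            content: req.inputs.clone(),
        }]) {
            Ok(templated) => templated,
            Err(err) => {
                tracing::error!("{err}");
                return Err((
                    StatusCode::UNPROCESSABLE_ENTITY,
                    Json(ErrorResponse {
                        error: err.to_string(),
                        error_type: err.error_type().to_string(),
                    }),
                ));
            }
        }
    } else {
        req.inputs.clone()
    };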

Example API calls with the proposed parameter:

Without Chat Template Applied (current behavior)

curl -X 'POST' \
  'http://my.tgi.host/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "My name is Olivier and I"
}'

With Chat Template Applied (proposed behavior)

curl -X 'POST' \
  'http://my.tgi.host/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "My name is Olivier and I",
  "parameters": {
     "apply_chat_template": true
  }
}'
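In either case, the caller could then take the token count as the length of the returned token list. For illustration only (the shape follows the current /tokenize response; the ids and offsets here are made up):

[
  { "id": 1, "text": "<s>", "start": 0, "stop": 0 },
  { "id": 5183, "text": "My", "start": 0, "stop": 2 },
  ...
]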
