
/tokenize - Optionally Apply Chat Template before Tokenization #1706

Closed
elsell opened this issue Apr 4, 2024 · 6 comments


elsell commented Apr 4, 2024

Feature request

On the /tokenize endpoint of TGI, add an option to apply the chat template from the model's tokenizer, if one exists, before tokenizing.

Motivation

The /tokenize endpoint of TGI is very useful in situations where an application requires information about the tokenization of a string, but doesn't have direct access to a model/tokenizer that can be loaded with AutoTokenizer.

Specifically, I have instances where I need to know the token count of a prompt before sending it to /v1/chat/completions so that I can appropriately truncate the input to be <= max_input_tokens.

/tokenize, however, does not adequately serve this purpose when calling /v1/chat/completions, because the tokenization returned is of the prompt without the chat template applied.

Since the chat template may differ by model, there is no generic way via a TGI endpoint to get the token count of a prompt after a chat template has been applied, meaning that preventing inputs from exceeding max_input_tokens is very difficult.

Your contribution

[Caveat - I do not know Rust, nor am I familiar with the inner workings of TGI]

I have done my best to read through the existing /tokenize implementation, and will attempt to provide a high-level overview of what might need to be changed, and where.

Adding an optional boolean parameter apply_chat_template to the /tokenize endpoint would suffice for my purposes.

It appears that one could mirror the existing return_full_text boolean parameter of GenerateParameters.

Furthermore, I imagine the /tokenize [with chat template] implementation would be very similar to what's happening at the /v1/chat/completions endpoint:

`/v1/chat/completions` chat templating implementation
    // apply chat template to flatten the request into a single input
    let mut inputs = match infer.apply_chat_template(req.messages) {
        Ok(inputs) => inputs,
        Err(err) => {
            metrics::increment_counter!("tgi_request_failure", "err" => "validation");
            tracing::error!("{err}");
            return Err((
                StatusCode::UNPROCESSABLE_ENTITY,
                Json(ErrorResponse {
                    error: err.to_string(),
                    error_type: err.error_type().to_string(),
                }),
            ));
        }
    };

Example API calls with proposed parameter:

Without Chat Template Applied (current behavior)

curl -X 'POST' \
  'http://my.tgi.host/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "My name is Olivier and I"
}'

With Chat Template Applied (proposed behavior)

curl -X 'POST' \
  'http://localhost:8083/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "My name is Olivier and I",
  "parameters": {
     "apply_chat_template": true
  }
}'
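
To make the goal concrete, below is a rough client-side sketch of how the proposed parameter could be used to keep a prompt within max_input_tokens before calling /v1/chat/completions. The host URL and token limit are placeholders, the apply_chat_template parameter does not exist yet, and the sketch assumes /tokenize returns a JSON list of tokens.

# Hypothetical client flow, assuming the proposed apply_chat_template parameter existed.
import requests

TGI_HOST = "http://my.tgi.host"  # placeholder host
MAX_INPUT_TOKENS = 4096          # placeholder limit

prompt = "My name is Olivier and I"

# Count tokens as the model would actually see them, i.e. after templating.
resp = requests.post(
    f"{TGI_HOST}/tokenize",
    json={
        "inputs": prompt,
        "parameters": {"apply_chat_template": True},  # proposed parameter
    },
)
resp.raise_for_status()
token_count = len(resp.json())  # assumes the response is a list of tokens

if token_count > MAX_INPUT_TOKENS:
    # truncate or reject the prompt before calling /v1/chat/completions
    ...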
elsell changed the title from "Apply Chat Template before Tokenization" to "/tokenize - Optionally Apply Chat Template before Tokenization" on Apr 4, 2024
Narsil (Collaborator) commented Apr 10, 2024

Chat templates are used through an OpenAI compatibility layer, meaning the payloads do not look so simple.

Does OpenAI provide any means to do tokenization? If yes, we can try to mimic that; if not, we're not going to do it (there are pretty much infinite kinds of endpoints/payloads users could send; for now the lowest common denominator is sending raw text, and it works fine in most use cases).

It's also a way to leak the system prompts, which might not be something model authors actually want.

elsell (Author) commented Apr 10, 2024

@Narsil Thanks for looking into this.

For posterity, one of the major drivers behind this request is that I'm using TGI in an offline environment.

Because I am loading a model locally, the model_id key returned from TGI's /info endpoint has a local filepath instead of a HuggingFace repo name.

I'm trying to minimize how much my client application has to know about what model is being served from TGI, and currently the client has to know a HuggingFace repo to load a tokenizer.

So, for folks not constrained by an offline environment, I suppose the /info endpoint would provide sufficient information for the client to dynamically load a tokenizer and apply a chat template, making my request unnecessary.
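
For reference, a minimal sketch of that online workaround, assuming the model_id in the /info response is a Hugging Face Hub repo name (in an offline deployment it is a local path on the server, which is exactly the problem described above); the host URL and messages are placeholders.

# Online workaround: the client loads the tokenizer itself based on /info.
import requests
from transformers import AutoTokenizer

TGI_HOST = "http://my.tgi.host"  # placeholder host

model_id = requests.get(f"{TGI_HOST}/info").json()["model_id"]
tokenizer = AutoTokenizer.from_pretrained(model_id)  # fails if model_id is a local path on the server

messages = [{"role": "user", "content": "My name is Olivier and I"}]
token_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)
print(f"{len(token_ids)} tokens after applying the chat template")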


ZQ-Dev8 commented Apr 10, 2024

+1 for the offline functionality @elsell is requesting.

@AguirreNicolas

What do you think about adding an endpoint like the one I have in mind for vLLM?

The idea is to share the tokenizer via a /get_tokenizer endpoint, if whoever runs the server enables it.

My goal is to be able to run lm-eval-harness as if it were a client. Specifically, I plan to add an option to OpenaiCompletionsLM that allows using get_tokenizer, receiving a JSON payload, and then instantiating the tokenizer locally.

@elsell
This solution could probably work for your issue.
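
As a rough sketch, a client might consume such an endpoint as shown below; the /get_tokenizer route, its response body (a serialized tokenizer.json), and the host URL are all assumptions, since no such endpoint exists in TGI today.

# Hypothetical client for a /get_tokenizer-style endpoint.
import requests
from tokenizers import Tokenizer

TGI_HOST = "http://my.tgi.host"  # placeholder host

resp = requests.get(f"{TGI_HOST}/get_tokenizer")  # hypothetical endpoint
resp.raise_for_status()

# Rebuild the tokenizer locally from the serialized tokenizer.json contents.
tokenizer = Tokenizer.from_str(resp.text)
print(tokenizer.encode("My name is Olivier and I").ids)

Note that tokenizer.json alone does not carry the chat template (that lives in tokenizer_config.json), so template handling would still need to be solved separately.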

@cringelord000222

Hi folks, I was looking for a similar solution too.

I'm hosting TGI (Docker) internally to play around with Llama 3 (the model is cloned locally). But to use the apply_prompt_template() method, I had to initialize the tokenizer somewhere, so for now it's either adding a middle Python preprocessing layer to my pipeline or wrapping TGI with a custom API that does pre-processing, inference, and post-processing.

For my use case, it's best if the client side, or components before TGI, don't have to hold the tokenizer, so we don't need to add another layer just to call apply_prompt_template().

TL;DR:
Request to add a param that applies the prompt template when TGI receives a request. Something like:

# perform inference with the proposed parameter
import requests

llm_prompt = "My name is Olivier and I"  # example raw (untemplated) prompt

headers = {
    "Content-Type": "application/json",
}

data = {
    "inputs": llm_prompt,
    "parameters": {
        "max_new_tokens": 2000,
        "temperature": 0.1,
        "apply_prompt_template": True,  # proposed parameter
    },
}

# placeholder host; /generate is TGI's text generation endpoint
response = requests.post("http://my.tgi.host/generate", headers=headers, json=data)


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Jun 24, 2024
github-actions bot closed this as not planned on Jul 9, 2024