`/tokenize` - Optionally Apply Chat Template before Tokenization #1706

Comments
The chat template is used through an OpenAI compatibility layer, meaning the payloads do not look so simple. Does OpenAI provide any means to do tokenization? If yes, we can try to mimic that; if not, we're not going to do it (there are pretty much infinite types of endpoints/payloads users could send; for now the lowest common denominator is sending raw text, and it works fine in most use cases). It's also a way to leak the system prompts, which might not be something model authors actually want.
@Narsil Thanks for looking into this. For posterity, one of the major drivers behind this request is using TGI in an offline environment. Because I am loading a model locally, I'm trying to minimize how much my client application has to know about what model is being served from TGI, and currently the client has to know a Hugging Face repo to load a tokenizer. So, for folks not constrained by an offline environment, I suppose the […]
+1 for the offline functionality @elsell is requesting.
What do you think about adding an endpoint like the one I'm thinking of for […]? The idea is to share the tokenizer via […]. My goal is to be able to run […]. @elsell
Hi folks, I was looking for a similar solution to this too. I'm hosting TGI (Docker) internally to play around (Llama 3, model cloned locally). But to use the apply_prompt_template() method, I had to initialize the tokenizer somewhere, so for now it's either adding a middle Python preprocessing layer in my pipeline, or wrapping TGI with a custom API that does pre-processing + inference + post-processing. For my use case, it's best if the client side, or components in front of TGI, doesn't have to hold the tokenizer, so we don't need to add another layer just to call apply_prompt_template(). TL;DR: +1 for having TGI apply the chat template server-side so clients don't need to hold the tokenizer.
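(For context, the client-side layer described above looks roughly like the sketch below. This is a minimal illustration, assuming the comment's apply_prompt_template() corresponds to the `apply_chat_template` method of a `transformers` tokenizer; the model id, messages, and `http://localhost:8080` URL are placeholders, not values from the thread.)

```python
import requests
from transformers import AutoTokenizer

# Placeholders: a locally cloned model and a TGI instance.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
TGI_URL = "http://localhost:8080"

# This is the extra layer the comment wants to avoid: the client has to hold
# the tokenizer just to render the chat template.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Deep Learning?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# The already-templated prompt is then sent to TGI's raw /generate endpoint.
resp = requests.post(f"{TGI_URL}/generate", json={"inputs": prompt})
print(resp.json())
```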
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Feature request
On the `/tokenize` endpoint of TGI, add an option to apply the chat template from the model's tokenizer, if one exists, before tokenizing.

Motivation
The `/tokenize` endpoint of TGI is very useful in situations where an application requires information about the tokenization of a string but doesn't have direct access to a model/tokenizer that can be loaded with `AutoTokenizer`.

Specifically, I have instances where I need to know the token count of a prompt before sending it to `/v1/chat/completions` so that I can appropriately truncate the input to be <= `max_input_tokens`. `/tokenize`, however, does not adequately serve this purpose when calling `/v1/chat/completions`, as the tokenization we get is of the prompt without the chat template applied.

Since the chat template may differ by model, there is no generic way via a TGI endpoint to get the token count of a prompt after a chat template has been applied, meaning that preventing inputs from exceeding `max_input_tokens` is very difficult.

Your contribution
[Caveat - I do not know Rust, nor am I familiar with the inner workings of TGI]
I have done my best to read through the existing `/tokenize` implementation, and will attempt to provide a high-level overview of what might need to be changed, and where.

Adding an optional boolean parameter `apply_chat_template` to the `/tokenize` endpoint would suffice for my purposes. It appears that one could mirror the existing `return_full_text` boolean parameter of `GenerateParameters`.

Furthermore, I imagine the `/tokenize` [with chat template] implementation would be very similar to what's happening at the `/v1/chat/completions` endpoint (see the `/v1/chat/completions` chat templating implementation).
Example API calls with proposed parameter:
Without Chat Template Applied (current behavior)
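As a rough sketch, a call with today's behavior might look like the following. This assumes TGI's `/tokenize` accepts the same `inputs` field as `/generate` and returns a list of tokens; `http://localhost:8080` and the prompt text are placeholders.

```python
import requests

TGI_URL = "http://localhost:8080"  # placeholder; point at your TGI instance

# Current behavior: the raw string is tokenized as-is, so the count excludes
# any chat-template tokens that /v1/chat/completions would add.
resp = requests.post(
    f"{TGI_URL}/tokenize",
    json={"inputs": "What is Deep Learning?"},
)
tokens = resp.json()
print(f"{len(tokens)} tokens (raw prompt only)")
```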
With Chat Template Applied (proposed behavior)
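And a sketch of the proposed variant with the suggested `apply_chat_template` flag. The flag is hypothetical (it is the subject of this issue), and whether it would live at the top level or inside `parameters` is an open implementation detail; the example below simply places it in `parameters` to mirror `return_full_text`.

```python
import requests

TGI_URL = "http://localhost:8080"  # placeholder; point at your TGI instance

# Proposed behavior: the server applies the model's chat template (special
# tokens, role markers, etc.) before tokenizing, so the returned token count
# matches what /v1/chat/completions would actually process.
resp = requests.post(
    f"{TGI_URL}/tokenize",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"apply_chat_template": True},  # hypothetical parameter
    },
)
tokens = resp.json()
print(f"{len(tokens)} tokens (with chat template applied)")
```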