The Feature
vLLM (and similar backends like Ollama) provide dedicated routes to retrieve the exact token count for a given prompt, based on the tokenizer.json of the loaded model.
LiteLLM should be able to natively query the tokenizer used by vLLM (instead of defaulting to tiktoken).
Ideally, LiteLLM would extract tokenization logic directly from the vLLM-hosted model (when available), ensuring full compatibility with models like Gemma 2, Mistral, etc.
Alternatively, if vLLM exposes its token counting as an API route, LiteLLM could simply delegate token counting to vLLM when connected (a rough sketch of that delegation is shown below).
The only current workaround is using vLLM's own API instead of LiteLLM for token counting when running locally.
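For illustration, here is a minimal sketch of what that delegation (or the manual workaround) could look like, assuming a local vLLM OpenAI-compatible server that exposes a /tokenize route. The base URL, model name, and exact response fields below are assumptions from my setup and may differ across vLLM versions:

```python
import requests

VLLM_BASE_URL = "http://localhost:8000"  # placeholder: local vLLM OpenAI-compatible server


def count_tokens_via_vllm(prompt: str, model: str) -> int:
    """Delegate token counting to the vLLM server so the count comes from
    the model's own tokenizer rather than a tiktoken fallback."""
    resp = requests.post(
        f"{VLLM_BASE_URL}/tokenize",
        json={"model": model, "prompt": prompt},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    # vLLM typically returns the token ids plus a "count" field; fall back
    # to len(tokens) if "count" is absent in the running version.
    return data.get("count", len(data.get("tokens", [])))


if __name__ == "__main__":
    print(count_tokens_via_vllm("Hello, world!", "google/gemma-2-9b-it"))
```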
Otherwise thank you for your work :)
Motivation, pitch
I'm working with LiteLLM to centralise my model endpoints behind a single endpoint.
Currently, LiteLLM offers a /utils/token_counter route that can count tokens when plugged into a vLLM instance (a usage sketch follows the list below). However, this feature appears to have limited compatibility:
It works well for some models, but falls back to tiktoken when the model's tokenizer isn't recognized, which is inaccurate for many architectures.
When using vLLM-hosted models such as Gemma 2 or Mistral, the token count retrieved by LiteLLM can be inconsistent or incorrect, as these models often rely on specialized tokenization schemes.
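For reference, this is roughly how I call the route today. The proxy URL, model alias, API key, and response field names are placeholders from my setup and may differ across LiteLLM versions and proxy configurations:

```python
import requests

LITELLM_PROXY_URL = "http://localhost:4000"  # placeholder: LiteLLM proxy


def count_tokens_via_litellm(messages: list[dict], model: str) -> dict:
    """Ask the LiteLLM proxy to count tokens for a chat payload.
    The response reports which tokenizer was used, which is where the
    tiktoken fallback shows up for unsupported models."""
    resp = requests.post(
        f"{LITELLM_PROXY_URL}/utils/token_counter",
        json={"model": model, "messages": messages},
        headers={"Authorization": "Bearer sk-placeholder"},  # omit if proxy auth is disabled
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. {"total_tokens": ..., "tokenizer_type": ...}


if __name__ == "__main__":
    result = count_tokens_via_litellm(
        [{"role": "user", "content": "Hello, world!"}],
        "vllm-gemma-2",  # placeholder model alias configured on the proxy
    )
    print(result)
```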
Are you a ML Ops Team?
Yes
Twitter / LinkedIn details
No response