The Feature
vLLM (and similar backends like Ollama) provide dedicated routes to retrieve the exact token count for a given prompt, based on the tokenizer.json of the loaded model.
LiteLLM should be able to natively query the tokenizer used by vLLM (instead of defaulting to tiktoken).
Ideally, LiteLLM would extract tokenization logic directly from the vLLM-hosted model (when available), ensuring full compatibility with models like Gemma 2, Mistral, etc.
Alternatively, if vLLM exposes its token counting as an API route, LiteLLM could simply delegate token counting to vLLM when connected (a rough sketch of that delegation is shown below).
The only current workaround is using vLLM's own API instead of LiteLLM for token counting when running locally.
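For illustration, here is a minimal sketch of what that delegation (or the manual workaround) could look like, assuming a local vLLM OpenAI-compatible server that exposes a /tokenize route. The base URL, model name, and exact response fields below are assumptions from my setup and may differ across vLLM versions:

```python
import requests

VLLM_BASE_URL = "http://localhost:8000"  # placeholder: local vLLM OpenAI-compatible server


def count_tokens_via_vllm(prompt: str, model: str) -> int:
    """Delegate token counting to the vLLM server so the count comes from
    the model's own tokenizer rather than a tiktoken fallback."""
    resp = requests.post(
        f"{VLLM_BASE_URL}/tokenize",
        json={"model": model, "prompt": prompt},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    # vLLM typically returns the token ids plus a "count" field; fall back
    # to len(tokens) if "count" is absent in the running version.
    return data.get("count", len(data.get("tokens", [])))


if __name__ == "__main__":
    print(count_tokens_via_vllm("Hello, world!", "google/gemma-2-9b-it"))
```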
Otherwise thank you for your work :)
Motivation, pitch
I'm working with LiteLLM to centralise my model endpoints behind a single endpoint.
Currently, LiteLLM offers a /utils/token_counter route that can count tokens when plugged into a vLLM instance (a usage sketch follows the list below). However, this feature appears to have limited compatibility:
It works well for some models, but falls back to tiktoken when the model's tokenizer isn't recognized, which is inaccurate for many architectures.
When using vLLM-hosted models such as Gemma 2 or Mistral, the token count retrieved by LiteLLM can be inconsistent or incorrect, as these models often rely on specialized tokenization schemes.
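For reference, this is roughly how I call the route today. The proxy URL, model alias, API key, and response field names are placeholders from my setup and may differ across LiteLLM versions and proxy configurations:

```python
import requests

LITELLM_PROXY_URL = "http://localhost:4000"  # placeholder: LiteLLM proxy


def count_tokens_via_litellm(messages: list[dict], model: str) -> dict:
    """Ask the LiteLLM proxy to count tokens for a chat payload.
    The response reports which tokenizer was used, which is where the
    tiktoken fallback shows up for unsupported models."""
    resp = requests.post(
        f"{LITELLM_PROXY_URL}/utils/token_counter",
        json={"model": model, "messages": messages},
        headers={"Authorization": "Bearer sk-placeholder"},  # omit if proxy auth is disabled
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. {"total_tokens": ..., "tokenizer_type": ...}


if __name__ == "__main__":
    result = count_tokens_via_litellm(
        [{"role": "user", "content": "Hello, world!"}],
        "vllm-gemma-2",  # placeholder model alias configured on the proxy
    )
    print(result)
```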
Are you a ML Ops Team?
Yes
Twitter / LinkedIn details
No response