local TGI model gives http request error of a different model #2804

Closed
lifeng-jin opened this issue Jan 30, 2025 · 5 comments · Fixed by #2805
Labels
bug Something isn't working

Comments


Describe the bug

I served a llama-3-8b-instruct model locally with TGI, and it ran with no issues. I created an InferenceClient with the base_url and did a chat completion, which also ran smoothly. I then tried to use text_generation with the same client and got this surprising error:

HfHubHTTPError: 401 Client Error: Unauthorized for url: https://api-inference.huggingface.co/models/mistralai/Mistral-Nemo-Instruct-2407 (Request ID: ce_pSj)

I changed a few models, and the error remained the same. The client was fine with chat_completion, but not with text_generation.

Reproduction

from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://localhost:8082/v1/",
)
output = client.text_generation("The huggingface_hub library is ", max_new_tokens=12, details=True)
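For comparison, the chat_completion call that ran smoothly with this same client looked roughly like this (not the exact call I ran, but equivalent):

from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://localhost:8082/v1/",
)
# Goes through the OpenAI-compatible /v1/chat/completions route and works fine
response = client.chat_completion(
    messages=[{"role": "user", "content": "Say hello"}],
    max_tokens=12,
)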

Logs

System info

- huggingface_hub version: 0.28.0
- Platform: Linux-5.15.0-1058-aws-x86_64-with-glibc2.31
- Python version: 3.10.14
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /mnt/efs/lifengjin/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: lain6631
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.3.1
- Jinja2: 3.1.4
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.4.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: 2.8.2
- aiohttp: 3.9.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /mnt/efs/lifengjin/.cache/huggingface/hub
- HF_ASSETS_CACHE: /mnt/efs/lifengjin/.cache/huggingface/assets
- HF_TOKEN_PATH: /mnt/efs/lifengjin/.cache/huggingface/token
- HF_STORED_TOKENS_PATH: /mnt/efs/lifengjin/.cache/huggingface/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
lifeng-jin added the bug (Something isn't working) label on Jan 30, 2025
Wauplin (Contributor) commented Jan 30, 2025

Hi @lifeng-jin, sorry for the inconvenience. This problem is due to the fact that base_url is only used in InferenceClient.chat_completion, to match the OpenAI-standard API. What you should do here is pass the TGI URL as model:

from huggingface_hub import InferenceClient

client = InferenceClient(
    model="http://localhost:8082/v1/",
)
output = client.text_generation("The huggingface_hub library is ", max_new_tokens=12, details=True)

This will solve your issue.

Currently, self.base_url is not used in text_generation, so the client falls back to the default text-generation model from the Inference API, in this case mistralai/Mistral-Nemo-Instruct-2407. I opened a PR (#2805) to make the behavior more consistent, which would have prevented this issue. It will only be available in the next huggingface_hub release, so in the meantime I advise you to use the snippet above.
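To illustrate the current behavior, here is a rough sketch of where each call ends up when only base_url is set (simplified, not the exact client internals):

from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8082/v1/")

# Honors base_url: resolves to the local TGI OpenAI-compatible route,
# i.e. http://localhost:8082/v1/chat/completions
client.chat_completion(messages=[{"role": "user", "content": "Hi"}], max_tokens=5)

# Ignores base_url in huggingface_hub 0.28.0: falls back to the Inference API's
# default text-generation model (mistralai/Mistral-Nemo-Instruct-2407),
# hence the 401 from api-inference.huggingface.co
client.text_generation("Hi", max_new_tokens=5)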

Wauplin closed this as completed on Jan 30, 2025
lifeng-jin (Author) commented Jan 30, 2025

Thanks @Wauplin, I tried this exact solution.
chat_completion works, but text_generation now gives
HTTPError: 404 Client Error: Not Found for url: http://localhost:8082/v1/

Wauplin (Contributor) commented Jan 30, 2025

Then it's because this route doesn't exist on TGI. Try it without the /v1.
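A minimal sketch of that change, reusing the port from your earlier snippets:

from huggingface_hub import InferenceClient

# text_generation targets TGI's native /generate route, so point `model`
# at the server root without the /v1 suffix
client = InferenceClient(model="http://localhost:8082")
output = client.text_generation(
    "The huggingface_hub library is ",
    max_new_tokens=12,
    details=True,
)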

lifeng-jin (Author) commented

Thanks again @Wauplin, this is super helpful, and now I can get outputs from the model. However, when I do

output = client.text_generation("Today is a ", max_new_tokens=2, do_sample=True, temperature=1.0, details=True, decoder_input_details=True)

I get

TextGenerationOutput(generated_text='5-minute', details=TextGenerationOutputDetails(finish_reason='length', generated_tokens=2, prefill=[], tokens=[TextGenerationOutputToken(id=20, logprob=-2.1425781, special=False, text='5'), TextGenerationOutputToken(id=24401, logprob=-4.4609375, special=False, text='-minute')], best_of_sequences=None, seed=9305067545921572115, top_tokens=None))

The decoder input / prefill is not returned, unlike in the example shown in the docs. Could you please help?

Wauplin (Contributor) commented Jan 30, 2025

This has something to do with TGI, not the InferenceClient. Better to open a separate issue in the TGI repo instead (i.e. I don't have the answer^^)
