local TGI model gives http request error of a different model #2804

Closed
lifeng-jin opened this issue Jan 30, 2025 · 5 comments · Fixed by #2805
Labels
bug Something isn't working

Comments


Describe the bug

I served a llama-3-8b-instruct model locally with TGI, and it ran with no issues. I created an InferenceClient with the base_url and did a chat completion, which also ran smoothly. I then tried to use text_generation with the same client and got this surprising error:

HfHubHTTPError: 401 Client Error: Unauthorized for url: https://api-inference.huggingface.co/models/mistralai/Mistral-Nemo-Instruct-2407 (Request ID: ce_pSj)

I changed a few models, and the error remained the same. The client was fine with chat_completion, but not with text_generation.

Reproduction

from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://localhost:8082/v1/",
)
output = client.text_generation("The huggingface_hub library is ", max_new_tokens=12, details=True)
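For comparison, the chat_completion call that ran smoothly with this same client looked roughly like this (not the exact call I ran, but equivalent):

from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://localhost:8082/v1/",
)
# Goes through the OpenAI-compatible /v1/chat/completions route and works fine
response = client.chat_completion(
    messages=[{"role": "user", "content": "Say hello"}],
    max_tokens=12,
)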

Logs

System info

- huggingface_hub version: 0.28.0
- Platform: Linux-5.15.0-1058-aws-x86_64-with-glibc2.31
- Python version: 3.10.14
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /mnt/efs/lifengjin/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: lain6631
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.3.1
- Jinja2: 3.1.4
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.4.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: 2.8.2
- aiohttp: 3.9.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /mnt/efs/lifengjin/.cache/huggingface/hub
- HF_ASSETS_CACHE: /mnt/efs/lifengjin/.cache/huggingface/assets
- HF_TOKEN_PATH: /mnt/efs/lifengjin/.cache/huggingface/token
- HF_STORED_TOKENS_PATH: /mnt/efs/lifengjin/.cache/huggingface/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
lifeng-jin added the bug (Something isn't working) label on Jan 30, 2025
Wauplin (Contributor) commented Jan 30, 2025

Hi @lifeng-jin, sorry for the inconvenience. This problem is due to the fact that base_url is only used in InferenceClient.chat_completion, to match the OpenAI-standard API. What you should do here is pass the TGI URL as model:

from huggingface_hub import InferenceClient

client = InferenceClient(
    model="http://localhost:8082/v1/",
)
output = client.text_generation("The huggingface_hub library is ", max_new_tokens=12, details=True)

This will solve your issue.

Currently, self.base_url is not used in text_generation, so the client falls back to the default text-generation model from the Inference API, in this case mistralai/Mistral-Nemo-Instruct-2407. I opened a PR (#2805) to make the behavior more consistent, which would have prevented this issue. It will only be available in the next huggingface_hub release, so in the meantime I advise you to use the snippet above.
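To illustrate the current behavior, here is a rough sketch of where each call ends up when only base_url is set (simplified, not the exact client internals):

from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8082/v1/")

# Honors base_url: resolves to the local TGI OpenAI-compatible route,
# i.e. http://localhost:8082/v1/chat/completions
client.chat_completion(messages=[{"role": "user", "content": "Hi"}], max_tokens=5)

# Ignores base_url in huggingface_hub 0.28.0: falls back to the Inference API's
# default text-generation model (mistralai/Mistral-Nemo-Instruct-2407),
# hence the 401 from api-inference.huggingface.co
client.text_generation("Hi", max_new_tokens=5)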

Wauplin closed this as completed on Jan 30, 2025
lifeng-jin (Author) commented Jan 30, 2025

Thanks @Wauplin, I tried this exact solution.
chat_completion works, but text_generation now gives
HTTPError: 404 Client Error: Not Found for url: http://localhost:8082/v1/

Wauplin (Contributor) commented Jan 30, 2025

Then it's because this route doesn't exist on TGI. Try it without the /v1.
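A minimal sketch of that change, reusing the port from your earlier snippets:

from huggingface_hub import InferenceClient

# text_generation targets TGI's native /generate route, so point `model`
# at the server root without the /v1 suffix
client = InferenceClient(model="http://localhost:8082")
output = client.text_generation(
    "The huggingface_hub library is ",
    max_new_tokens=12,
    details=True,
)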

lifeng-jin (Author) commented

Thanks again @Wauplin, this is super helpful, and now I can get outputs from the model. However, when I do

output = client.text_generation("Today is a ", max_new_tokens=2, do_sample=True, temperature=1.0, details=True, decoder_input_details=True)

I get

TextGenerationOutput(generated_text='5-minute', details=TextGenerationOutputDetails(finish_reason='length', generated_tokens=2, prefill=[], tokens=[TextGenerationOutputToken(id=20, logprob=-2.1425781, special=False, text='5'), TextGenerationOutputToken(id=24401, logprob=-4.4609375, special=False, text='-minute')], best_of_sequences=None, seed=9305067545921572115, top_tokens=None))

The decoder input / prefill is not returned, unlike in the example shown in the docs. Could you please help?

Wauplin (Contributor) commented Jan 30, 2025

This has something to do with TGI, not the InferenceClient. Better to open a separate issue in the TGI repo instead (i.e. I don't have the answer^^)
