docs: add curl cmd to docs and remove useless docker section
baptistecolle committed Jan 16, 2025
1 parent c58933c commit bf1c3cc
Showing 2 changed files with 21 additions and 27 deletions.
28 changes: 1 addition & 27 deletions docs/source/howto/advanced-tgi-serving.mdx
@@ -15,32 +15,6 @@ When using Jetstream Pytorch engine, it is possible to enable quantization to re

If you encounter `Backend(NotEnoughMemory(2048))`, here are some solutions that could help with reducing memory usage in TGI:

```bash
docker run -p 8080:80 \
--shm-size 16GB \
--privileged \
--net host \
-e QUANTIZATION=1 \
-e MAX_BATCH_SIZE=2 \
-e LOG_LEVEL=text_generation_router=debug \
-v ~/hf_data:/data \
-e HF_TOKEN=<your_hf_token_here> \
ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
--model-id google/gemma-2b-it \
--max-input-length 512 \
--max-total-tokens 1024 \
--max-batch-prefill-tokens 512 \
--max-batch-total-tokens 1024
```

<Tip warning={true}>
You need to replace `<your_hf_token_here>` with a Hugging Face access token, which you can get [here](https://huggingface.co/settings/tokens).
</Tip>

<Tip warning={true}>
If you have already logged in via `huggingface-cli login`, you can instead set `HF_TOKEN=$(cat ~/.cache/huggingface/token)` for convenience.
</Tip>
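For example, a minimal sketch of that shortcut (assuming the default token cache path):

```bash
# Reuse the token stored by `huggingface-cli login` (default cache path assumed).
export HF_TOKEN=$(cat ~/.cache/huggingface/token)
# Then pass `-e HF_TOKEN=$HF_TOKEN` to `docker run` instead of pasting the token inline.
```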

**Optimum-TPU specific arguments:**
- `-e QUANTIZATION=1`: Enables quantization, which should reduce memory requirements by almost half.
- `-e MAX_BATCH_SIZE=n`: Manually limits the batch size to `n`; a smaller batch size lowers memory usage.
@@ -51,7 +25,7 @@ If you already logged in via `huggingface-cli login`, then you can set HF_TOKEN=
- `--max-batch-prefill-tokens`: Maximum number of tokens in a prefill (input) batch
- `--max-batch-total-tokens`: Maximum total number of tokens (input and generated) in a batch

To reduce memory usage, you can try smaller numbers for `--max-input-length`, `--max-total-tokens`, `--max-batch-prefill-tokens`, and `--max-batch-total-tokens`.
To reduce memory usage, you can try smaller values for `--max-input-length`, `--max-total-tokens`, `--max-batch-prefill-tokens`, and `--max-batch-total-tokens`.

<Tip warning={true}>
`max-batch-prefill-tokens` must be at most `max-input-length * max_batch_size`; otherwise the configuration does not make sense and you will get an error, because with a larger value you would not be able to process any request. For example, with `--max-input-length 512` and `MAX_BATCH_SIZE=2`, `--max-batch-prefill-tokens` should not exceed 1024.
20 changes: 20 additions & 0 deletions docs/source/howto/deploy_instance_on_ie.mdx
@@ -55,6 +55,26 @@ Alternatively, use curl commands to query the endpoint.

![IE playground curl](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/optimum/tpu/ie_playground_curl.png)

```bash
curl "https://{INSTANCE_ID}.{REGION}.gcp.endpoints.huggingface.cloud/v1/chat/completions" \
-X POST \
-H "Authorization: Bearer hf_XXXXX" \
-H "Content-Type: application/json" \
-d '{
"model": "tgi",
"messages": [
{
"role": "user",
"content": "What is deep learning?"
}
],
"max_tokens": 150,
"stream": true
}'
```

You will need to replace `{INSTANCE_ID}` and `{REGION}` with the values from your own deployment.
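For convenience, here is a sketch that reads the token from the local cache instead of pasting it inline (it assumes you have already run `huggingface-cli login`; the `ENDPOINT_URL` variable is illustrative):

```bash
# ENDPOINT_URL is an illustrative variable; fill in your own instance ID and region.
export ENDPOINT_URL="https://{INSTANCE_ID}.{REGION}.gcp.endpoints.huggingface.cloud"
# Reuse the token stored by `huggingface-cli login` (default cache path assumed).
export HF_TOKEN=$(cat ~/.cache/huggingface/token)

curl "${ENDPOINT_URL}/v1/chat/completions" \
-X POST \
-H "Authorization: Bearer ${HF_TOKEN}" \
-H "Content-Type: application/json" \
-d '{"model": "tgi", "messages": [{"role": "user", "content": "What is deep learning?"}], "max_tokens": 150, "stream": false}'
```

Setting `"stream": false` here returns a single JSON response, which is easier to inspect when testing than the streamed output shown above.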

## Next Steps

- There are numerous ways to interact with your new inference endpoints. Review the inference endpoint documentation to explore different options:
