Add examples/gke/tgi-tpu-deployment/ for TGI on TPU #62
base: main
Conversation
Since `MAX_BATCH_PREFILL_TOKENS` is internally set by Text Generation Inference (TGI) to `MAX_INPUT_TOKENS + 50`, and since the TGI on TPU model warm-up validates that `MAX_BATCH_PREFILL_TOKENS <= MAX_INPUT_TOKENS * BATCH_SIZE`, we set `BATCH_SIZE=2` so that `MAX_INPUT_TOKENS + 50 < MAX_INPUT_TOKENS * 2` and the validation passes. Alternatively, one could set `MAX_BATCH_PREFILL_TOKENS` to a value lower than or equal to `MAX_INPUT_TOKENS` (ideally equal).
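As a quick sanity check, the constraint can be verified with some shell arithmetic (the token values below are hypothetical placeholders, not values from the example):

```bash
# Hypothetical value for illustration; adjust to your deployment.
MAX_INPUT_TOKENS=1024
BATCH_SIZE=2

# TGI internally sets MAX_BATCH_PREFILL_TOKENS to MAX_INPUT_TOKENS + 50
MAX_BATCH_PREFILL_TOKENS=$((MAX_INPUT_TOKENS + 50))

# Warm-up validation: MAX_BATCH_PREFILL_TOKENS <= MAX_INPUT_TOKENS * BATCH_SIZE
if [ "$MAX_BATCH_PREFILL_TOKENS" -le $((MAX_INPUT_TOKENS * BATCH_SIZE)) ]; then
  echo "warm-up validation passes: $MAX_BATCH_PREFILL_TOKENS <= $((MAX_INPUT_TOKENS * BATCH_SIZE))"
else
  echo "warm-up validation fails"
fi
```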
👏 Great work!
> The `gke-gcloud-auth-plugin` does not need to be installed via `gcloud` specifically; to read more about the alternative installation methods, please visit [Install `kubectl` and configure cluster access](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl).
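For instance, per that page, the plugin can be installed either through `gcloud` or through a system package manager (the apt package name below follows Google Cloud's docs, but may vary by distribution):

```bash
# Via the gcloud CLI components manager
gcloud components install gke-gcloud-auth-plugin

# Or via apt on Debian/Ubuntu, assuming the Google Cloud CLI repository is configured
sudo apt-get install google-cloud-cli-gke-gcloud-auth-plugin
```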
Finally, we also need to ensure that we have enough quota or capacity to create the GKE Cluster with the necessary resources, which can be checked in the GCP Console at <https://console.cloud.google.com/iam-admin/quotas>. In this case, in order to use the TPU v5e we need to check the quota with the following filter: `Service: Compute Engine API`, `Type: Quota`, and `Name: TPU v5 Lite PodSlice chips`; and then ensure that we have enough capacity in the selected location, taking into account that a topology such as `2x4` means that we need `8` chips available (the product of the topology dimensions).
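In other words, the required chip count is just the product of the topology dimensions, e.g.:

```bash
# Compute the number of TPU chips required for a given topology string (illustrative only)
TOPOLOGY="2x4"
CHIPS=$(( ${TOPOLOGY%x*} * ${TOPOLOGY#*x} ))
echo "Topology ${TOPOLOGY} requires ${CHIPS} chips"  # -> 8
```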
Consider mentioning that, for now, the largest supported configuration is `2x4`. Larger multi-host environments are not yet supported by TGI (coming soon).
Fair thanks, I'll include this!
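For context, creating a single-host TPU v5e node pool with that maximum `2x4` topology could look roughly as follows (a sketch with placeholder cluster, zone, and pool names, assuming `ct5lp-hightpu-8t` as the 8-chip single-host TPU v5e machine type):

```bash
# Sketch only: placeholder names, adjust to your project and region.
# ct5lp-hightpu-8t is the single-host TPU v5e machine type with 8 chips (2x4).
gcloud container node-pools create tpu-nodepool \
    --cluster=my-gke-cluster \
    --zone=us-west4-a \
    --machine-type=ct5lp-hightpu-8t \
    --num-nodes=1
```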
```bash
ChatCompletion(id='', choices=[Choice(finish_reason='eos_token', index=0, message=ChatCompletionMessage(content='Sure, the answer is 4.\n\n2 + 2 = 4<eos>', role='assistant', function_call=None, tool_calls=None), logprobs=None)], created=1722329005, model='google/gemma-7b-it', object='text_completion', system_fingerprint='2.0.2-native', usage=CompletionUsage(completion_tokens=17, prompt_tokens=0, total_tokens=17))
```
I don't know if it's worth mentioning, but it is possible to just use a standard Python library to make the request.
Standard as in via `requests` or via `huggingface_hub.InferenceClient`? Or none of those?
As in via `requests`.
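For reference, the equivalent plain-HTTP request (shown here with `curl`; a Python `requests` call would POST the same JSON body) could look roughly like this, assuming a placeholder address for the Service and that the deployed TGI version exposes the OpenAI-compatible `/v1/chat/completions` route:

```bash
# Placeholder host/port; replace with the external IP of the deployed Service.
curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        "max_tokens": 32
    }'
```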
Description
This PR adds an example of how to use the recently created TGI container for TPU inference from #57 in Google Kubernetes Engine (GKE) using TPU v5e chips. In this case, the model served is `google/gemma-7b-it`, which is among the supported models within `optimum-tpu`. For more information on `optimum-tpu`, please check https://github.com/huggingface/optimum-tpu

What's missing?
We still need to ping Google Cloud about the recent release of the TPU container, wait for it to be released, and then update the `CONTAINER_URI` accordingly.