Add examples/gke/tgi-tpu-deployment/ for TGI on TPU #62

Draft · wants to merge 7 commits into base: `main`
Conversation

alvarobartt (Member)

Description

This PR adds an example of how to use the TGI container for TPU inference (recently added in #57) on Google Kubernetes Engine (GKE) with TPU v5e chips. In this case, the model served is google/gemma-7b-it, which is among the models supported by optimum-tpu.

For more information on optimum-tpu, please check https://github.com/huggingface/optimum-tpu

What's missing?

We still need to ping Google Cloud about the recently built TPU container, wait for it to be released, and then update the `CONTAINER_URI` accordingly.

Since `MAX_BATCH_PREFILL_TOKENS` is internally set by Text Generation Inference (TGI) to `MAX_INPUT_TOKENS + 50`, and the TGI-on-TPU model warm-up validates that `MAX_BATCH_PREFILL_TOKENS <= MAX_INPUT_TOKENS * BATCH_SIZE`, we set `BATCH_SIZE=2` so that `MAX_INPUT_TOKENS + 50 <= MAX_INPUT_TOKENS * 2` and the validation passes. Alternatively, one could set `MAX_BATCH_PREFILL_TOKENS` to a value lower than or equal to `MAX_INPUT_TOKENS` (ideally equal).
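To make the constraint concrete, here is a small arithmetic sketch of the warm-up check (the `MAX_INPUT_TOKENS` value is an illustrative assumption, not taken from the example):

```python
# Sketch of the TGI-on-TPU warm-up validation described above.
MAX_INPUT_TOKENS = 1024  # illustrative assumption
MAX_BATCH_PREFILL_TOKENS = MAX_INPUT_TOKENS + 50  # set internally by TGI

for BATCH_SIZE in (1, 2):
    passes = MAX_BATCH_PREFILL_TOKENS <= MAX_INPUT_TOKENS * BATCH_SIZE
    print(f"BATCH_SIZE={BATCH_SIZE}: warm-up {'passes' if passes else 'fails'}")
# BATCH_SIZE=1 fails (1074 > 1024); BATCH_SIZE=2 passes (1074 <= 2048)
```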
@tengomucho (Collaborator) left a comment:
👏 Great work!

> The `gke-gcloud-auth-plugin` does not need to be installed via `gcloud` specifically; to read more about the alternative installation methods, please visit [Install `kubectl` and configure cluster access](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl).

Finally, we also need to ensure that we have enough quota or capacity to create the GKE cluster with the necessary resources, which can be checked in the GCP Console at <https://console.cloud.google.com/iam-admin/quotas>. In this case, to use TPU v5e we need to check the quota with the filter `Service: Compute Engine API`, `Type: Quota`, and `Name: TPU v5 Lite PodSlice chips`, and then ensure that we have enough capacity in the selected location, bearing in mind that a topology such as `2x4` means that `8` chips need to be available.

Collaborator:
Consider mentioning that, for now, the largest supported configuration is `2x4`; larger multi-host environments are not yet supported by TGI (coming soon).

alvarobartt (Member, Author):
Fair, thanks, I'll include this!
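
For reference, below is a minimal sketch of the kind of client call that produces this output, using `huggingface_hub.InferenceClient` (mentioned later in this thread); the endpoint URL and parameters are illustrative assumptions, not from the PR:

```python
# Hedged sketch: query the deployed TGI service through InferenceClient.
# The URL assumes the Kubernetes service has been port-forwarded locally.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")
completion = client.chat_completion(
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=32,
)
print(completion)
```

which prints something like: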

```bash
ChatCompletion(id='', choices=[Choice(finish_reason='eos_token', index=0, message=ChatCompletionMessage(content='Sure, the answer is 4.\n\n2 + 2 = 4<eos>', role='assistant', function_call=None, tool_calls=None), logprobs=None)], created=1722329005, model='google/gemma-7b-it', object='text_completion', system_fingerprint='2.0.2-native', usage=CompletionUsage(completion_tokens=17, prompt_tokens=0, total_tokens=17))
```

Collaborator:

I don't know if it's worth mentioning, but it is possible to just use a standard Python library to do the request.

alvarobartt (Member, Author):

Standard as in via `requests`, or via `huggingface_hub.InferenceClient`? Or neither?

Collaborator:

As in via `requests`.
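
For illustration, a minimal sketch of that approach using only `requests`, assuming the service is reachable at `localhost:8080` and exposes TGI's OpenAI-compatible Messages API (the URL, model name, and parameters are illustrative assumptions):

```python
# Hedged sketch: same chat request via `requests` instead of InferenceClient.
# Assumes the Kubernetes service has been port-forwarded to localhost:8080.
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",  # placeholder model name accepted by TGI's Messages API
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        "max_tokens": 32,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```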
