docs: address PR feedback of tengomucho
baptistecolle committed Jan 16, 2025
1 parent c16495a commit 3aebf6c
Showing 18 changed files with 215 additions and 139 deletions.
20 changes: 10 additions & 10 deletions docs/source/_toctree.yml
@@ -6,30 +6,30 @@
- local: installation
title: Installation
- local: optimum_container
title: Optimum Container
title: Optimum TPU Containers
- sections:
- local: tutorials/tpu_setup
title: TPU Setup
title: First TPU Setup on Google Cloud
- local: tutorials/inference_on_tpu
title: Inference on TPU
title: First TPU Inference on Google Cloud
- local: tutorials/training_on_tpu
title: Training on TPU
title: First TPU Training on Google Cloud
title: Tutorials
- sections:
- local: howto/gcloud_cli
title: Using the GCloud CLI for TPU deployment and SSH connection
title: Deploying and Connecting to Google TPU Instances via GCloud CLI
- local: howto/serving
title: Deploying a TGI server on a Google Cloud TPU instance
- local: howto/training
title: Training on a Google Cloud TPU instance
- local: howto/deploy_instance_on_ie
title: How to Deploy an TGI server on IE
title: How to Deploy a Model on Inference Endpoint for Serving using TPUs
- local: howto/advanced-tgi-serving
title: Advanced TGI Server Configuration
- local: howto/installation_inside_a_container
title: Installation of optimum-tpuinside a container
title: Installing Optimum-TPU inside a Docker Container
- local: howto/more_examples
title: Find More Examples
title: Find More Examples on the Optimum-TPU GitHub Repository
title: How-To Guides
- sections:
- local: conceptual_guides/tpu_hardware_support
@@ -41,11 +41,11 @@
- local: reference/fsdp_v2
title: FSDPv2
- local: reference/tgi_advanced_options
title: TGI Advanced Options
title: TGI Configuration Reference Guide
title: Reference
- sections:
- local: contributing
title: Contributing
title: Contributing to Optimum TPU
title: Contributing
title: Optimum-TPU
isExpanded: true
@@ -1,6 +1,10 @@
# Differences between JetStream and PyTorch XLA
# Differences between Jetstream Pytorch and PyTorch XLA

| Feature | JetStream | PyTorch XLA |
This guide explains to optimum-tpu users the difference between Jetstream Pytorch and PyTorch XLA, as these are the two backends available in TGI.

Jetstream Pytorch is a high-performance inference engine built on top of PyTorch XLA. It is optimized for throughput and memory efficiency when running Large Language Models (LLMs) on TPUs.

| Feature | Jetstream Pytorch | PyTorch XLA |
|---------|-----------|-------------|
| Training |||
| Serving |||
@@ -10,8 +14,10 @@
| Integration | Optimized for deployment | Standard PyTorch workflow |

**Notes:**
By default, optimum-tpu is using PyTorch XLA for training and JetStream for serving.
By default, optimum-tpu uses PyTorch XLA for training and Jetstream Pytorch for serving.

You can configure optimum-tpu to use either backend for serving with TGI. To use the PyTorch XLA backend in TGI, set `-e JETSTREAM_PT_DISABLE=1` in your docker run arguments.
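
For example, a docker run invocation that forces the PyTorch XLA backend might look like the sketch below (the image tag, model, and other arguments are illustrative values reused from the serving examples; adjust them to your setup):

```bash
# Sketch: serve with the PyTorch XLA backend by disabling Jetstream Pytorch.
docker run -p 8080:80 \
    --shm-size 16GB \
    --privileged \
    --net host \
    -e JETSTREAM_PT_DISABLE=1 \
    -e HF_TOKEN=<your_hf_token_here> \
    ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
    --model-id google/gemma-2b-it \
    --max-input-length 512 \
    --max-total-tokens 1024
```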

You can find more information about:
- PyTorch XLA: https://pytorch.org/xla/ and https://github.com/pytorch/xla
- JetStream: https://github.com/google/jaxon/tree/main/jetstream
- Jetstream Pytorch: https://github.com/AI-Hypercomputer/jetstream-pytorch
24 changes: 9 additions & 15 deletions docs/source/conceptual_guides/tpu_hardware_support.mdx
@@ -1,33 +1,27 @@
# TPU hardware support
Optimum-TPU support and is optimized for V5e, V5p, and V6e TPUs.

## When to use TPU
TPUs excel at large-scale machine learning workloads with matrix computations, extended training periods, and large batch sizes. In contrast, GPUs offer more flexibility for models with custom operations or mixed CPU/GPU workloads. TPUs aren't ideal for workloads needing frequent branching, high-precision arithmetic, or custom training loop operations. More information can be found at https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus
Optimum-TPU supports and is optimized for v5e and v6e TPUs.

## TPU naming convention
The TPU naming follows this format: `<tpu_version>-<number_of_tpus>`

TPU versions available:
TPU versions:
- v5litepod (v5e)
- v5p
- v6e.
- v6e

For example, a v5litepod-8 is a v5e TPU with 8 TPU chips.

## Memory on TPU
The HBM (High Bandwidth Memory) capacity per chip is 16gb for V5e, V5p and 32gb for V6e. So a v5e-8 (v5litepod-8), has 16gb*8=128gb of HBM memory
The HBM (High Bandwidth Memory) capacity per chip is 16GB for v5e and v5p, and 32GB for v6e. So a v5e-8 (v5litepod-8) has 16GB*8=128GB of HBM memory.
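
As a quick back-of-the-envelope check of that arithmetic:

```bash
# Sketch: per-VM HBM = number of chips * HBM per chip (16GB on v5e/v5p, 32GB on v6e).
CHIPS=8; HBM_PER_CHIP_GB=16
echo "$((CHIPS * HBM_PER_CHIP_GB))GB of HBM"   # prints: 128GB of HBM
```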

## Performance on TPU
There are several key metrics to consider when evaluating TPU performance:
- Peak compute per chip (bf16/int8): Measures the maximum theoretical computing power in floating point or integer operations per second. Higher values indicate faster processing capability for machine learning workloads.
HBM (High Bandwidth Memory) metrics:
- Capacity: Amount of available high-speed memory per chip
- Bandwidth: Speed at which data can be read from or written to memory
These affect how much data can be processed and how quickly it can be accessed.
- Capacity: Amount of available high-speed memory per chip.
- Bandwidth: Speed at which data can be read from or written to memory. These affect how much data can be processed and how quickly it can be accessed.
- Inter-chip interconnect (ICI) bandwidth: Determines how fast TPU chips can communicate with each other, which is crucial for distributed training across multiple chips.
Pod-level metrics:
- Peak compute per Pod: Total computing power when multiple chips work together
These indicate performance at scale for large training or serving jobs.
- Peak compute per Pod: Total computing power when multiple chips work together. These indicate performance at scale for large training or serving jobs.

The actual performance you achieve will depend on your specific workload characteristics and how well it matches these hardware capabilities.

@@ -39,7 +33,7 @@ During the TPU VM creation use the following TPU VM base images for optimum-tpu:
- v2-alpha-tpuv5-lite (TPU v5e) (recommended)
- tpu-ubuntu2204-base (default)

For installation instructions, refer to our [TPU setup tutorial](./tutorials/tpu_setup). We recommend you use the *alpha* version with optimum-tpu, as optimum-tpu is tested and optimized for those.
For installation instructions, refer to our [TPU setup tutorial](../tutorials/tpu_setup). We recommend you use the *alpha* version with optimum-tpu, as optimum-tpu is tested and optimized for those.

More information at https://cloud.google.com/tpu/docs/runtimes#pytorch_and_jax

@@ -51,4 +45,4 @@ https://cloud.google.com/tpu/docs/v5e

Pricing information can be found here: https://cloud.google.com/tpu/pricing

Tpu availability can be found https://cloud.google.com/tpu/docs/regions-zones
TPU availability can be found at https://cloud.google.com/tpu/docs/regions-zones
17 changes: 4 additions & 13 deletions docs/source/contributing.mdx
@@ -4,7 +4,7 @@ We're excited that you're interested in contributing to Optimum TPU! Whether you

## Getting Started

1. Fork and clone the repository:
1. [Fork](https://github.com/huggingface/optimum-tpu/fork) and clone the repository:
```bash
git clone https://github.com/YOUR_USERNAME/optimum-tpu.git
cd optimum-tpu
@@ -15,11 +15,6 @@ cd optimum-tpu
python -m venv .venv
source .venv/bin/activate
python -m pip install . -f https://storage.googleapis.com/libtpu-releases/index.html

```
3. Install testing dependencies:
```bash
make test_installs
```

## Development Tools
@@ -28,7 +23,7 @@ The project includes a comprehensive Makefile with commands for various developm

### Testing
```bash
make tests # Run all tests
make tests # Run all the non-TGI-related tests
make tgi_test # Run TGI tests with PyTorch/XLA
make tgi_test_jetstream # Run TGI tests with Jetstream backend
make tgi_docker_test # Run TGI integration tests in Docker
@@ -53,7 +48,7 @@ make tpu-tgi-gcp # Build TGI Google Cloud image
```

### TGI Development
When working on Text Generation Inference (`/text-generation-inference` folder). You will also want to build TGI image from scratch, as discussed in the manual image building section of the [serving how to guide](./howto/serving)
When working on Text Generation Inference (`/text-generation-inference` folder), you might also want to build a TGI image from scratch. To do this, refer to the manual image building section of the [serving how-to guide](./howto/serving).

1. Build the standalone server:
```bash
@@ -81,14 +76,10 @@ make style_check
- Test results
- Documentation updates if needed

5. Check that the CI tests are correct:
- Verify all CI workflows have passed
- Address any CI failures

## Need Help?

- Open an issue for bugs or feature requests
- Check the [documentation](https://huggingface.co/docs/optimum/tpu/overview)
- Open an issue for bugs or feature requests

## License

43 changes: 29 additions & 14 deletions docs/source/howto/advanced-tgi-serving.mdx
@@ -1,4 +1,4 @@
# Advanced Option for TGI server
# Advanced TGI Server Configuration

## Jetstream Pytorch and Pytorch XLA backends

@@ -13,21 +13,18 @@ When using Jetstream Pytorch engine, it is possible to enable quantization to re

## How to solve memory requirements

If you encounter `Backend(NotEnoughMemory(2048))`.
Here are some solutions that could help with reducing memory usage in TGI:
If you encounter `Backend(NotEnoughMemory(2048))`, here are some solutions that could help with reducing memory usage in TGI:

bash
```
```bash
docker run -p 8080:80 \
--shm-size 16G \
--shm-size 16GB \
--privileged \
--net host \
-e QUANTIZATION=1 \
-e MAX_BATCH_SIZE=2 \
-e LOG_LEVEL=text_generation_router=debug \
-v ~/hf_data:/data \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
-e SKIP_WARMUP=1 \
-e HF_TOKEN=<your_hf_token_here> \
ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
--model-id google/gemma-2b-it \
--max-input-length 512 \
@@ -36,14 +33,25 @@ docker run -p 8080:80 \
--max-batch-total-tokens 1024
```

<Tip warning={true}>
You need to replace `<your_hf_token_here>` with a Hugging Face access token, which you can get [here](https://huggingface.co/settings/tokens).
</Tip>

<Tip warning={true}>
If you are already logged in via `huggingface-cli login`, you can set `-e HF_TOKEN=$(cat ~/.cache/huggingface/token)` for convenience.
</Tip>

**Optimum-TPU specific arguments:**
- `-e QUANTIZATION=1`: Enables quantization, which should reduce memory requirements by almost half.
- `-e MAX_BATCH_SIZE=n`: Manually limits the batch size.

**TGI specific arguments:**
- `--max-input-length`: Maximum input sequence length
- `--max-total-tokens`: Maximum combined input and output tokens
- `--max-batch-prefill-tokens`: Maximum tokens for batch processing
- `--max-batch-total-tokens`: Maximum total tokens in a batch

To reduce memory usage, you want to try a smaller number for `--max-input-length`, `--max-total-tokens`, `--max-batch-prefill-tokens`, and `--max-batch-total-tokens`.
To reduce memory usage, you can try smaller numbers for `--max-input-length`, `--max-total-tokens`, `--max-batch-prefill-tokens`, and `--max-batch-total-tokens`.
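
For instance, a more conservative configuration might look like the sketch below (the exact values are illustrative; the right numbers depend on your model and TPU):

```bash
# Sketch: smaller sequence and batch limits to reduce memory pressure.
# Keep max-batch-prefill-tokens <= max-input-length * MAX_BATCH_SIZE (here 256 * 2 = 512).
docker run -p 8080:80 \
    --shm-size 16GB \
    --privileged \
    --net host \
    -e QUANTIZATION=1 \
    -e MAX_BATCH_SIZE=2 \
    -e HF_TOKEN=<your_hf_token_here> \
    ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
    --model-id google/gemma-2b-it \
    --max-input-length 256 \
    --max-total-tokens 512 \
    --max-batch-prefill-tokens 512 \
    --max-batch-total-tokens 1024
```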

<Tip warning={true}>
`max-batch-prefill-tokens ≤ max-input-length * max_batch_size`. Otherwise, you will have an error as the configuration does not make sense. If the max-batch-prefill-tokens were bigger, then you would not be able to process any request
@@ -52,19 +60,26 @@ To reduce memory usage, you want to try a smaller number for `--max-input-lengt
## Sharding
Sharding is done automatically by the TGI server, so your model uses all the TPUs that are available. We do tensor parallelism, so the layers are automatically split in all available TPUs. However, the TGI router will only see one shard.

More information on tensor parralelsim can be found here https://huggingface.co/docs/text-generation-inference/conceptual/tensor_parallelism
More information on tensor parallelism can be found here: https://huggingface.co/docs/text-generation-inference/conceptual/tensor_parallelism.

## Understanding the configuration

Key parameters explained:
- `--shm-size 16G`: Shared memory allocation

**Required parameters**
- `--shm-size 16GB`: Shared memory allocation
- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode
- `-v ~/hf_data:/data`: Volume mount for model storage
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production)
These are needed to run a TPU container so that it can properly access the TPU hardware.

**Optional parameters**
- `-v ~/hf_data:/data`: Volume mount for model storage. This lets you avoid re-downloading the model weights on each startup. You can use any folder you like, as long as it maps back to `/data`
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production).
These are parameters used by TGI and Optimum-TPU to configure the server behavior.


<Tip warning={true}>
`--privileged --shm-size 16gb --net host` is required as specify in https://github.com/pytorch/xla
`--privileged --shm-size 16GB --net host` is required, as specified in https://github.com/pytorch/xla
</Tip>

## Next steps
8 changes: 5 additions & 3 deletions docs/source/howto/deploy_instance_on_ie.mdx
@@ -1,6 +1,8 @@
# How to Deploy a Model on Inference Endpoint
# How to Deploy a Model on Inference Endpoint (IE) for Serving using TPUs

You can deploy any of our supported models on Inference Endpoint (IE) (see list of supported models).
Inference Endpoints (IE) is a solution for serving text generation with supported models on TPUs. It does not require setting up a separate GCP account, and it offers pre-configured settings for serving models with Optimum TPU's TGI.

You can deploy any of our supported models on Inference Endpoint (see list of supported models).
Inference Endpoints offer secure production environments by setting up a TGI server that can auto-scale based on demand.

We have optimized Inference Endpoints on TPU to ensure each model achieves optimal performance.
@@ -30,7 +32,7 @@ Once you've completed the configuration, click the "Create Endpoint" button.

## 3. Using Your Endpoint

The endpoint requires initialization, during which you can monitor the logs. In the logs section, you'll observe the model undergoing warmup to compile for optimal performance. Endpoint startup typically takes between 5 to 30 minutes, depending on the model size. This warmup period triggers multiple compilations to ensure peak serving performance.
The endpoint requires initialization, during which you can monitor the logs. In the logs section, you will observe the model undergoing warmup to compile for optimal performance. Endpoint startup typically takes between 5 to 30 minutes, depending on the model size. This warmup period triggers multiple compilations to ensure peak serving performance.

![IE init](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/optimum/tpu/ie_endpoint_initalizing.png)
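
Once the endpoint is ready, you can send it requests. A minimal sketch with `curl` (the endpoint URL and token below are placeholders for your own values; the payload follows TGI's generate API):

```bash
# Sketch: query the deployed endpoint; replace the URL and token with your own.
curl https://<your-endpoint-url> \
    -X POST \
    -H "Authorization: Bearer <your_hf_token_here>" \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 64}}'
```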

12 changes: 11 additions & 1 deletion docs/source/howto/gcloud_cli.mdx
@@ -1,4 +1,4 @@
# Deploying a Google TPU instance on Google Cloud Platform (GCP) via the gcloud CLI
# Deploying and Connecting to Google TPU Instances via GCloud CLI

## Context

@@ -57,3 +57,13 @@ gcloud compute tpus tpu-vm ssh optimum-tpu-get-started --zone=us-west4-a
$ >
```

## Other useful commands

This command returns information about the TPU VM, for example its external IP:
```bash
gcloud compute tpus tpu-vm describe --zone=<tpu_zone> <tpu_name>
```
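
For example, to print only the external IP (a sketch; the `--format` filter assumes the standard `networkEndpoints` layout of the describe output):

```bash
# Sketch: extract only the external IP of the first network endpoint.
gcloud compute tpus tpu-vm describe <tpu_name> \
    --zone=<tpu_zone> \
    --format='get(networkEndpoints[0].accessConfig.externalIp)'
```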

## Next steps
- If you wish to train your own model, you can now [install optimum-tpu](../installation)
- If you wish to do serving, you can look at our [serving tutorial](../tutorials/inference_on_tpu)
8 changes: 4 additions & 4 deletions docs/source/howto/installation_inside_a_container.mdx
@@ -1,4 +1,4 @@
# Running Optimum-TPU in a Docker Container
# Installing Optimum-TPU inside a Docker Container

This guide explains how to run Optimum-TPU within a Docker container using the official PyTorch/XLA image.

@@ -17,7 +17,7 @@ First, set the environment variables for the image URL and version:

```bash
export TPUVM_IMAGE_URL=us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla
export TPUVM_IMAGE_VERSION=8f1dcd5b03f993e4da5c20d17c77aff6a5f22d5455f8eb042d2e4b16ac460526
export TPUVM_IMAGE_VERSION=v2.5.1

# Pull the image
docker pull ${TPUVM_IMAGE_URL}:${TPUVM_IMAGE_VERSION}
@@ -31,14 +31,14 @@ Launch the container with the necessary flags for TPU access:
docker run -ti \
--rm \
--privileged \
--network=host \
--net=host \
${TPUVM_IMAGE_URL}:${TPUVM_IMAGE_VERSION} \
bash
```

Key flags explained:
- `--privileged`: Required for TPU access
- `--network=host`: Uses host network mode for optimal performance
- `--net=host`: Required for TPU access
- `--rm`: Automatically removes the container when it exits
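
Once inside the container, a quick sanity check (a sketch relying on the `torch_xla` package bundled in the image) is to ask PyTorch/XLA for the TPU device:

```bash
# Sketch: confirm the TPU is visible from inside the container.
PJRT_DEVICE=TPU python -c "import torch_xla.core.xla_model as xm; print(xm.xla_device())"
```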

### 3. Install Optimum-TPU
11 changes: 8 additions & 3 deletions docs/source/howto/more_examples.mdx
@@ -1,4 +1,4 @@
# Available Examples
# Find More Examples on the Optimum-TPU GitHub Repository

To find the latest examples, visit the [examples folder in the optimum-tpu repo on github](https://github.com/huggingface/optimum-tpu/tree/main/examples)

@@ -15,19 +15,24 @@ Learn how to perform efficient inference for text generation tasks:
## Language Model Fine-tuning
Explore how to fine-tune language models on TPU infrastructure:

1. **Interactive Gemma Tutorial** ([examples/language-modeling/gemma_tuning.ipynb](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/gemma_tuning.ipynb))
1. **Interactive Gemma Tutorial** ([view in the docs](../howto/gemma_tuning))
- Complete notebook showing Gemma fine-tuning process
- Covers environment setup and TPU configuration
- Demonstrates FSDPv2 integration for efficient model sharding
- Includes dataset preparation and PEFT/LoRA implementation
- Provides step-by-step training workflow

2. **LLaMA Fine-tuning Guide** ([examples/language-modeling/llama_tuning.ipynb](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/llama_tuning.ipynb))
The full notebook is available at [examples/language-modeling/gemma_tuning.ipynb](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/gemma_tuning.ipynb)


2. **LLaMA Fine-tuning Guide** ([view in the docs](../howto/llama_tuning))
- Detailed guide for fine-tuning LLaMA-2 and LLaMA-3 models
- Explains SPMD and FSDP concepts
- Shows how to implement efficient data parallel training
- Includes practical code examples and prerequisites

The full notebook is available at [examples/language-modeling/llama_tuning.ipynb](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/llama_tuning.ipynb)

# Additional Resources

- Visit the [Optimum-TPU GitHub repository](https://github.com/huggingface/optimum-tpu) for more details
