docs: address PR feedback of tengomucho
baptistecolle committed Jan 16, 2025
1 parent c16495a commit 3aebf6c
Showing 18 changed files with 215 additions and 139 deletions.
20 changes: 10 additions & 10 deletions docs/source/_toctree.yml
@@ -6,30 +6,30 @@
- local: installation
title: Installation
- local: optimum_container
title: Optimum Container
title: Optimum TPU Containers
- sections:
- local: tutorials/tpu_setup
title: TPU Setup
title: First TPU Setup on Google Cloud
- local: tutorials/inference_on_tpu
title: Inference on TPU
title: First TPU Inference on Google Cloud
- local: tutorials/training_on_tpu
title: Training on TPU
title: First TPU Training on Google Cloud
title: Tutorials
- sections:
- local: howto/gcloud_cli
title: Using the GCloud CLI for TPU deployment and SSH connection
title: Deploying and Connecting to Google TPU Instances via GCloud CLI
- local: howto/serving
title: Deploying a TGI server on a Google Cloud TPU instance
- local: howto/training
title: Training on a Google Cloud TPU instance
- local: howto/deploy_instance_on_ie
title: How to Deploy an TGI server on IE
title: How to Deploy a Model on Inference Endpoint for Serving using TPUs
- local: howto/advanced-tgi-serving
title: Advanced TGI Server Configuration
- local: howto/installation_inside_a_container
title: Installation of optimum-tpuinside a container
title: Installing Optimum-TPU inside a Docker Container
- local: howto/more_examples
title: Find More Examples
title: Find More Examples on the Optimum-TPU GitHub Repository
title: How-To Guides
- sections:
- local: conceptual_guides/tpu_hardware_support
@@ -41,11 +41,11 @@
- local: reference/fsdp_v2
title: FSDPv2
- local: reference/tgi_advanced_options
title: TGI Advanced Options
title: TGI Configuration Reference Guide
title: Reference
- sections:
- local: contributing
title: Contributing
title: Contributing to Optimum TPU
title: Contributing
title: Optimum-TPU
isExpanded: true
@@ -1,6 +1,10 @@
# Differences between JetStream and PyTorch XLA
# Differences between Jetstream Pytorch and PyTorch XLA

| Feature | JetStream | PyTorch XLA |
This guide explains to optimum-tpu users the difference between Jetstream Pytorch and PyTorch XLA, as these are the two backends available in TGI.

Jetstream Pytorch is a high-performance inference engine built on top of PyTorch XLA. It is optimized for throughput and memory efficiency when running Large Language Models (LLMs) on TPUs.

| Feature | Jetstream Pytorch | PyTorch XLA |
|---------|-----------|-------------|
| Training |||
| Serving |||
@@ -10,8 +14,10 @@
| Integration | Optimized for deployment | Standard PyTorch workflow |

**Notes:**
By default, optimum-tpu is using PyTorch XLA for training and JetStream for serving.
By default, optimum-tpu uses PyTorch XLA for training and Jetstream Pytorch for serving.

You can configure optimum-tpu to use either backend for serving with TGI. To use the PyTorch XLA backend in TGI, set `-e JETSTREAM_PT_DISABLE=1` in your docker run arguments.
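
For example, a docker run invocation that forces the PyTorch XLA backend might look like the sketch below (the image tag, model, and other arguments are illustrative values reused from the serving examples; adjust them to your setup):

```bash
# Sketch: serve with the PyTorch XLA backend by disabling Jetstream Pytorch.
docker run -p 8080:80 \
    --shm-size 16GB \
    --privileged \
    --net host \
    -e JETSTREAM_PT_DISABLE=1 \
    -e HF_TOKEN=<your_hf_token_here> \
    ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
    --model-id google/gemma-2b-it \
    --max-input-length 512 \
    --max-total-tokens 1024
```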

You can find more information about:
- PyTorch XLA: https://pytorch.org/xla/ and https://github.com/pytorch/xla
- JetStream: https://github.com/google/jaxon/tree/main/jetstream
- Jetstream Pytorch: https://github.com/AI-Hypercomputer/jetstream-pytorch
24 changes: 9 additions & 15 deletions docs/source/conceptual_guides/tpu_hardware_support.mdx
@@ -1,33 +1,27 @@
# TPU hardware support
Optimum-TPU support and is optimized for V5e, V5p, and V6e TPUs.

## When to use TPU
TPUs excel at large-scale machine learning workloads with matrix computations, extended training periods, and large batch sizes. In contrast, GPUs offer more flexibility for models with custom operations or mixed CPU/GPU workloads. TPUs aren't ideal for workloads needing frequent branching, high-precision arithmetic, or custom training loop operations. More information can be found at https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus
Optimum-TPU supports and is optimized for v5e and v6e TPUs.

## TPU naming convention
The TPU naming follows this format: `<tpu_version>-<number_of_tpus>`

TPU versions available:
TPU versions:
- v5litepod (v5e)
- v5p
- v6e.
- v6e

For example, a v5litepod-8 is a v5e TPU with 8 TPU chips.

## Memory on TPU
The HBM (High Bandwidth Memory) capacity per chip is 16gb for V5e, V5p and 32gb for V6e. So a v5e-8 (v5litepod-8), has 16gb*8=128gb of HBM memory
The HBM (High Bandwidth Memory) capacity per chip is 16GB for v5e and v5p, and 32GB for v6e. So a v5e-8 (v5litepod-8) has 16GB*8=128GB of HBM memory.
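
As a quick back-of-the-envelope check of that arithmetic:

```bash
# Sketch: per-VM HBM = number of chips * HBM per chip (16GB on v5e/v5p, 32GB on v6e).
CHIPS=8; HBM_PER_CHIP_GB=16
echo "$((CHIPS * HBM_PER_CHIP_GB))GB of HBM"   # prints: 128GB of HBM
```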

## Performance on TPU
There are several key metrics to consider when evaluating TPU performance:
- Peak compute per chip (bf16/int8): Measures the maximum theoretical computing power in floating point or integer operations per second. Higher values indicate faster processing capability for machine learning workloads.
HBM (High Bandwidth Memory) metrics:
- Capacity: Amount of available high-speed memory per chip
- Bandwidth: Speed at which data can be read from or written to memory
These affect how much data can be processed and how quickly it can be accessed.
- Capacity: Amount of available high-speed memory per chip.
- Bandwidth: Speed at which data can be read from or written to memory. These affect how much data can be processed and how quickly it can be accessed.
- Inter-chip interconnect (ICI) bandwidth: Determines how fast TPU chips can communicate with each other, which is crucial for distributed training across multiple chips.
Pod-level metrics:
- Peak compute per Pod: Total computing power when multiple chips work together
These indicate performance at scale for large training or serving jobs.
- Peak compute per Pod: Total computing power when multiple chips work together. These indicate performance at scale for large training or serving jobs.

The actual performance you achieve will depend on your specific workload characteristics and how well it matches these hardware capabilities.

@@ -39,7 +33,7 @@ During the TPU VM creation use the following TPU VM base images for optimum-tpu:
- v2-alpha-tpuv5-lite (TPU v5e) (recommended)
- tpu-ubuntu2204-base (default)

For installation instructions, refer to our [TPU setup tutorial](./tutorials/tpu_setup). We recommend you use the *alpha* version with optimum-tpu, as optimum-tpu is tested and optimized for those.
For installation instructions, refer to our [TPU setup tutorial](../tutorials/tpu_setup). We recommend you use the *alpha* version with optimum-tpu, as optimum-tpu is tested and optimized for those.

More information at https://cloud.google.com/tpu/docs/runtimes#pytorch_and_jax

@@ -51,4 +45,4 @@ https://cloud.google.com/tpu/docs/v5e

Pricing information can be found here: https://cloud.google.com/tpu/pricing

Tpu availability can be found https://cloud.google.com/tpu/docs/regions-zones
TPU availability can be found at https://cloud.google.com/tpu/docs/regions-zones
17 changes: 4 additions & 13 deletions docs/source/contributing.mdx
@@ -4,7 +4,7 @@ We're excited that you're interested in contributing to Optimum TPU! Whether you

## Getting Started

1. Fork and clone the repository:
1. [Fork](https://github.com/huggingface/optimum-tpu/fork) and clone the repository:
```bash
git clone https://github.com/YOUR_USERNAME/optimum-tpu.git
cd optimum-tpu
@@ -15,11 +15,6 @@ cd optimum-tpu
python -m venv .venv
source .venv/bin/activate
python -m pip install . -f https://storage.googleapis.com/libtpu-releases/index.html

```
3. Install testing dependencies:
```bash
make test_installs
```

## Development Tools
@@ -28,7 +23,7 @@ The project includes a comprehensive Makefile with commands for various developm

### Testing
```bash
make tests # Run all tests
make tests # Run all the non-TGI-related tests
make tgi_test # Run TGI tests with PyTorch/XLA
make tgi_test_jetstream # Run TGI tests with Jetstream backend
make tgi_docker_test # Run TGI integration tests in Docker
@@ -53,7 +48,7 @@ make tpu-tgi-gcp # Build TGI Google Cloud image
```

### TGI Development
When working on Text Generation Inference (`/text-generation-inference` folder). You will also want to build TGI image from scratch, as discussed in the manual image building section of the [serving how to guide](./howto/serving)
When working on Text Generation Inference (`/text-generation-inference` folder), you might also want to build a TGI image from scratch. To do this, refer to the manual image building section of the [serving how-to guide](./howto/serving).

1. Build the standalone server:
```bash
@@ -81,14 +76,10 @@ make style_check
- Test results
- Documentation updates if needed

5. Check that the CI tests are correct:
- Verify all CI workflows have passed
- Address any CI failures

## Need Help?

- Open an issue for bugs or feature requests
- Check the [documentation](https://huggingface.co/docs/optimum/tpu/overview)
- Open an issue for bugs or feature requests

## License

43 changes: 29 additions & 14 deletions docs/source/howto/advanced-tgi-serving.mdx
@@ -1,4 +1,4 @@
# Advanced Option for TGI server
# Advanced TGI Server Configuration

## Jetstream Pytorch and Pytorch XLA backends

@@ -13,21 +13,18 @@ When using Jetstream Pytorch engine, it is possible to enable quantization to re

## How to solve memory requirements

If you encounter `Backend(NotEnoughMemory(2048))`.
Here are some solutions that could help with reducing memory usage in TGI:
If you encounter `Backend(NotEnoughMemory(2048))`, here are some solutions that could help with reducing memory usage in TGI:

bash
```
```bash
docker run -p 8080:80 \
--shm-size 16G \
--shm-size 16GB \
--privileged \
--net host \
-e QUANTIZATION=1 \
-e MAX_BATCH_SIZE=2 \
-e LOG_LEVEL=text_generation_router=debug \
-v ~/hf_data:/data \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
-e SKIP_WARMUP=1 \
-e HF_TOKEN=<your_hf_token_here> \
ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
--model-id google/gemma-2b-it \
--max-input-length 512 \
@@ -36,14 +33,25 @@ docker run -p 8080:80 \
--max-batch-total-tokens 1024
```

<Tip warning={true}>
You need to replace `<your_hf_token_here>` with a Hugging Face access token, which you can get [here](https://huggingface.co/settings/tokens).
</Tip>

<Tip warning={true}>
If you are already logged in via `huggingface-cli login`, you can set `-e HF_TOKEN=$(cat ~/.cache/huggingface/token)` for convenience.
</Tip>

**Optimum-TPU specific arguments:**
- `-e QUANTIZATION=1`: Enables quantization, which should reduce memory requirements by almost half.
- `-e MAX_BATCH_SIZE=n`: Manually limits the batch size.

**TGI specific arguments:**
- `--max-input-length`: Maximum input sequence length
- `--max-total-tokens`: Maximum combined input and output tokens
- `--max-batch-prefill-tokens`: Maximum tokens for batch processing
- `--max-batch-total-tokens`: Maximum total tokens in a batch

To reduce memory usage, you want to try a smaller number for `--max-input-length`, `--max-total-tokens`, `--max-batch-prefill-tokens`, and `--max-batch-total-tokens`.
To reduce memory usage, you can try smaller numbers for `--max-input-length`, `--max-total-tokens`, `--max-batch-prefill-tokens`, and `--max-batch-total-tokens`.
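
For instance, a more conservative configuration might look like the sketch below (the exact values are illustrative; the right numbers depend on your model and TPU):

```bash
# Sketch: smaller sequence and batch limits to reduce memory pressure.
# Keep max-batch-prefill-tokens <= max-input-length * MAX_BATCH_SIZE (here 256 * 2 = 512).
docker run -p 8080:80 \
    --shm-size 16GB \
    --privileged \
    --net host \
    -e QUANTIZATION=1 \
    -e MAX_BATCH_SIZE=2 \
    -e HF_TOKEN=<your_hf_token_here> \
    ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
    --model-id google/gemma-2b-it \
    --max-input-length 256 \
    --max-total-tokens 512 \
    --max-batch-prefill-tokens 512 \
    --max-batch-total-tokens 1024
```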

<Tip warning={true}>
`max-batch-prefill-tokens ≤ max-input-length * max_batch_size`. Otherwise, you will have an error as the configuration does not make sense. If the max-batch-prefill-tokens were bigger, then you would not be able to process any request
@@ -52,19 +60,26 @@ To reduce memory usage, you want to try a smaller number for `--max-input-lengt
## Sharding
Sharding is done automatically by the TGI server, so your model uses all the TPUs that are available. We do tensor parallelism, so the layers are automatically split in all available TPUs. However, the TGI router will only see one shard.

More information on tensor parralelsim can be found here https://huggingface.co/docs/text-generation-inference/conceptual/tensor_parallelism
More information on tensor parallelism can be found here: https://huggingface.co/docs/text-generation-inference/conceptual/tensor_parallelism.

## Understanding the configuration

Key parameters explained:
- `--shm-size 16G`: Shared memory allocation

**Required parameters**
- `--shm-size 16GB`: Shared memory allocation
- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode
- `-v ~/hf_data:/data`: Volume mount for model storage
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production)
These are needed to run a TPU container so that it can properly access the TPU hardware.

**Optional parameters**
- `-v ~/hf_data:/data`: Volume mount for model storage. This lets you avoid re-downloading the model weights on each startup. You can use any folder you like, as long as it maps back to `/data`
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production).
These are parameters used by TGI and Optimum-TPU to configure the server behavior.


<Tip warning={true}>
`--privileged --shm-size 16gb --net host` is required as specify in https://github.com/pytorch/xla
`--privileged --shm-size 16GB --net host` is required, as specified in https://github.com/pytorch/xla
</Tip>

## Next steps
8 changes: 5 additions & 3 deletions docs/source/howto/deploy_instance_on_ie.mdx
@@ -1,6 +1,8 @@
# How to Deploy a Model on Inference Endpoint
# How to Deploy a Model on Inference Endpoint (IE) for Serving using TPUs

You can deploy any of our supported models on Inference Endpoint (IE) (see list of supported models).
Inference Endpoints (IE) is a solution for serving text generation with supported models on TPUs. It does not require setting up a separate GCP account, and it offers pre-configured settings for serving models with Optimum TPU's TGI.

You can deploy any of our supported models on Inference Endpoint (see list of supported models).
Inference Endpoints offer secure production environments by setting up a TGI server that can auto-scale based on demand.

We have optimized Inference Endpoints on TPU to ensure each model achieves optimal performance.
@@ -30,7 +32,7 @@ Once you've completed the configuration, click the "Create Endpoint" button.

## 3. Using Your Endpoint

The endpoint requires initialization, during which you can monitor the logs. In the logs section, you'll observe the model undergoing warmup to compile for optimal performance. Endpoint startup typically takes between 5 to 30 minutes, depending on the model size. This warmup period triggers multiple compilations to ensure peak serving performance.
The endpoint requires initialization, during which you can monitor the logs. In the logs section, you will observe the model undergoing warmup to compile for optimal performance. Endpoint startup typically takes between 5 to 30 minutes, depending on the model size. This warmup period triggers multiple compilations to ensure peak serving performance.

![IE init](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/optimum/tpu/ie_endpoint_initalizing.png)
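
Once the endpoint is ready, you can send it requests. A minimal sketch with `curl` (the endpoint URL and token below are placeholders for your own values; the payload follows TGI's generate API):

```bash
# Sketch: query the deployed endpoint; replace the URL and token with your own.
curl https://<your-endpoint-url> \
    -X POST \
    -H "Authorization: Bearer <your_hf_token_here>" \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 64}}'
```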

12 changes: 11 additions & 1 deletion docs/source/howto/gcloud_cli.mdx
@@ -1,4 +1,4 @@
# Deploying a Google TPU instance on Google Cloud Platform (GCP) via the gcloud CLI
# Deploying and Connecting to Google TPU Instances via GCloud CLI

## Context

@@ -57,3 +57,13 @@ gcloud compute tpus tpu-vm ssh optimum-tpu-get-started --zone=us-west4-a
$ >
```

## Other useful commands

This command returns information about the TPU VM, for example its external IP:
```bash
gcloud compute tpus tpu-vm describe --zone=<tpu_zone> <tpu_name>
```
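
For example, to print only the external IP (a sketch; the `--format` filter assumes the standard `networkEndpoints` layout of the describe output):

```bash
# Sketch: extract only the external IP of the first network endpoint.
gcloud compute tpus tpu-vm describe <tpu_name> \
    --zone=<tpu_zone> \
    --format='get(networkEndpoints[0].accessConfig.externalIp)'
```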

## Next steps
- If you wish to train your own model, you can now [install optimum-tpu](../installation)
- If you wish to do serving, you can look at our [serving tutorial](../tutorials/inference_on_tpu)
8 changes: 4 additions & 4 deletions docs/source/howto/installation_inside_a_container.mdx
@@ -1,4 +1,4 @@
# Running Optimum-TPU in a Docker Container
# Installing Optimum-TPU inside a Docker Container

This guide explains how to run Optimum-TPU within a Docker container using the official PyTorch/XLA image.

@@ -17,7 +17,7 @@ First, set the environment variables for the image URL and version:

```bash
export TPUVM_IMAGE_URL=us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla
export TPUVM_IMAGE_VERSION=8f1dcd5b03f993e4da5c20d17c77aff6a5f22d5455f8eb042d2e4b16ac460526
export TPUVM_IMAGE_VERSION=v2.5.1

# Pull the image
docker pull ${TPUVM_IMAGE_URL}:${TPUVM_IMAGE_VERSION}
@@ -31,14 +31,14 @@ Launch the container with the necessary flags for TPU access:
docker run -ti \
--rm \
--privileged \
--network=host \
--net=host \
${TPUVM_IMAGE_URL}:${TPUVM_IMAGE_VERSION} \
bash
```

Key flags explained:
- `--privileged`: Required for TPU access
- `--network=host`: Uses host network mode for optimal performance
- `--net=host`: Required for TPU access
- `--rm`: Automatically removes the container when it exits
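
Once inside the container, a quick sanity check (a sketch relying on the `torch_xla` package bundled in the image) is to ask PyTorch/XLA for the TPU device:

```bash
# Sketch: confirm the TPU is visible from inside the container.
PJRT_DEVICE=TPU python -c "import torch_xla.core.xla_model as xm; print(xm.xla_device())"
```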

### 3. Install Optimum-TPU
11 changes: 8 additions & 3 deletions docs/source/howto/more_examples.mdx
@@ -1,4 +1,4 @@
# Available Examples
# Find More Examples on the Optimum-TPU GitHub Repository

To find the latest examples, visit the [examples folder in the optimum-tpu repo on github](https://github.com/huggingface/optimum-tpu/tree/main/examples)

@@ -15,19 +15,24 @@ Learn how to perform efficient inference for text generation tasks:
## Language Model Fine-tuning
Explore how to fine-tune language models on TPU infrastructure:

1. **Interactive Gemma Tutorial** ([examples/language-modeling/gemma_tuning.ipynb](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/gemma_tuning.ipynb))
1. **Interactive Gemma Tutorial** ([view in the docs](../howto/gemma_tuning))
- Complete notebook showing Gemma fine-tuning process
- Covers environment setup and TPU configuration
- Demonstrates FSDPv2 integration for efficient model sharding
- Includes dataset preparation and PEFT/LoRA implementation
- Provides step-by-step training workflow

2. **LLaMA Fine-tuning Guide** ([examples/language-modeling/llama_tuning.ipynb](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/llama_tuning.ipynb))
The full notebook is available at [examples/language-modeling/gemma_tuning.ipynb](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/gemma_tuning.ipynb)


2. **LLaMA Fine-tuning Guide** ([view in the docs](../howto/llama_tuning))
- Detailed guide for fine-tuning LLaMA-2 and LLaMA-3 models
- Explains SPMD and FSDP concepts
- Shows how to implement efficient data parallel training
- Includes practical code examples and prerequisites

The full notebook is available at [examples/language-modeling/llama_tuning.ipynb](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/llama_tuning.ipynb)

# Additional Resources

- Visit the [Optimum-TPU GitHub repository](https://github.com/huggingface/optimum-tpu) for more details
