From 5983ab14a6eb52a91c02390cc27e11880ac953a5 Mon Sep 17 00:00:00 2001
From: Dustin Franklin
Date: Wed, 13 Nov 2024 13:50:32 -0500
Subject: [PATCH] added TensorRT-LLM page

---
 docs/tensorrt_llm.md | 81 ++++++++++++++++++++++++++++++++++++++++++++
 mkdocs.yml           |  3 +-
 2 files changed, 83 insertions(+), 1 deletion(-)
 create mode 100644 docs/tensorrt_llm.md

diff --git a/docs/tensorrt_llm.md b/docs/tensorrt_llm.md
new file mode 100644
index 0000000..963c83f
--- /dev/null
+++ b/docs/tensorrt_llm.md
@@ -0,0 +1,81 @@

# TensorRT-LLM for Jetson

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is a high-performance LLM inference library with advanced quantization, attention kernels, and paged KV caching. Initial support for building TensorRT-LLM from source on JetPack 6.1 is included in the [`v0.12.0-jetson`](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.12.0-jetson) branch of the [TensorRT-LLM repo](https://github.com/NVIDIA/TensorRT-LLM) for Jetson AGX Orin.

Alongside this guide, we've provided pre-compiled TensorRT-LLM [wheels](http://jetson.webredirect.org/jp6/cu126/tensorrt-llm/0.12.0) and containers; see [`TensorRT-LLM Deployment on Jetson Orin`](https://github.com/NVIDIA/TensorRT-LLM/blob/v0.12.0-jetson/README4Jetson.md) for details.

!!! abstract "What you need"

    1. One of the following Jetson devices:

        Jetson AGX Orin
        *Support for other Orin devices is currently undergoing testing.

    2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack):

        JetPack 6.1 (L4T r36.4)

    3. Sufficient storage space (preferably with NVMe SSD):

        - `18.5GB` for the `tensorrt_llm` container image
        - Space for models (`>10GB`)

    4. Clone and set up [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:

        ```bash
        git clone https://github.com/dusty-nv/jetson-containers
        bash jetson-containers/install.sh
        ```

## Building the TensorRT-LLM Engine for Llama

You can find the steps for converting Llama to TensorRT-LLM under [`examples/llama`](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.12.0-jetson/examples/llama) in the repo, and also in the [documentation](https://nvidia.github.io/TensorRT-LLM/). This script automates the process for Llama-7B with INT4 quantization applied, and runs some generation and performance checks on the model:

```bash
jetson-containers run \
    -e HUGGINGFACE_TOKEN=<YOUR_HF_TOKEN> \
    -e FORCE_BUILD=on \
    cu126/tensorrt_llm:0.12-r36.4.0 \
    /opt/TensorRT-LLM/llama.sh
```

There are many such conversion procedures outlined in the TensorRT-LLM examples for different model architectures.

## OpenAI API Endpoint

TensorRT-LLM has programming APIs for Python and C++ available, but it also includes an example [server endpoint](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.12.0-jetson/examples/apps) for the [OpenAI protocol](https://github.com/openai/openai-python) that makes it easy to substitute for other local or cloud model backends.

This will start the TensorRT-LLM container with the server and the model that you built above:

```bash
jetson-containers run \
    cu126/tensorrt_llm:0.12-r36.4.0 \
    python3 /opt/TensorRT-LLM/examples/apps/openai_server.py \
        /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq
```

Then you can make chat completion requests against it in practically any language and from any connected device, as sketched below.
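For instance, here is a minimal first check from another terminal. This is a sketch rather than an example from the repo: it assumes the server above is listening on its default port `8000` and implements the standard OpenAI `/v1/chat/completions` route, and `<model_name>` is a placeholder for whatever model name the server reports:

```bash
# Minimal sketch (not from the repo): assumes the server above is listening on
# port 8000 and exposes the standard OpenAI chat-completions route.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<model_name>",
        "messages": [{"role": "user", "content": "Where is New York?"}],
        "max_tokens": 16,
        "temperature": 0
    }'
```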
The repo's [example](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.12.0-jetson/examples/apps#v1completions) shows a simple way of testing the raw completions route from another terminal with curl:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<model_name>",
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'
```

Alternatively, the code included with [openai_client.py](https://github.com/NVIDIA/TensorRT-LLM/blob/v0.12.0-jetson/examples/apps/openai_client.py) handles these requests using the standard [`openai-python`](https://github.com/openai/openai-python) package, which can be installed outside of the container or on another machine:

```bash
jetson-containers run \
    --workdir /opt/TensorRT-LLM/examples/apps \
    cu126/tensorrt_llm:0.12-r36.4.0 \
    python3 openai_client.py --prompt "Where is New York?" --api chat
```

The patches in the branch above for TensorRT-LLM 0.12 are a preview release for Jetson AGX Orin, and we'll continue validating and testing the various settings in TensorRT-LLM. If you need any support, please post to the [Jetson Developer Forums](https://forums.developer.nvidia.com/c/agx-autonomous-machines/jetson-embedded-systems/jetson-agx-orin/486).

diff --git a/mkdocs.yml b/mkdocs.yml
index 7f63261..a0a5921 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -90,6 +90,7 @@ nav:
     - ollama: tutorial_ollama.md
     - llamaspeak: tutorial_llamaspeak.md
     - NanoLLM: tutorial_nano-llm.md
+    - TensorRT-LLM 🆕: tensorrt_llm.md
     - Small LLM (SLM): tutorial_slm.md
     - API Examples: tutorial_api-examples.md
     - Text + Vision (VLM):
@@ -105,7 +106,7 @@ nav:
     - SAM: vit/tutorial_sam.md
     - TAM: vit/tutorial_tam.md
   - Robotics & Embodiment:
-    - LeRobot 🆕: lerobot.md
+    - LeRobot: lerobot.md
     - ROS2 Nodes: ros.md
     - OpenVLA: openvla.md
   - Image Generation:
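To preview how the new page lands in the site navigation, here is a minimal sketch that assumes this repo builds with the standard MkDocs workflow and the Material theme (the package list is an assumption, not taken from the repo):

```bash
# Hedged sketch: serve the docs locally to check the new TensorRT-LLM nav entry.
# Assumes the site builds with mkdocs + mkdocs-material (an assumption here).
pip3 install mkdocs mkdocs-material
mkdocs serve    # serves on http://127.0.0.1:8000 by default; note this is the
                # same port the example LLM server uses, so run one at a time
```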