From ae4a2921b0843f392b466ee5ac6678aa6a7ea700 Mon Sep 17 00:00:00 2001
From: J Wyman
Date: Wed, 12 Jun 2024 12:08:45 -0400
Subject: [PATCH] Update
 Deployment/Kubernetes/TensorRT-LLM_Autoscaling_and_Load_Balancing/README.md

Co-authored-by: Neelay Shah
---
 .../TensorRT-LLM_Autoscaling_and_Load_Balancing/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Deployment/Kubernetes/TensorRT-LLM_Autoscaling_and_Load_Balancing/README.md b/Deployment/Kubernetes/TensorRT-LLM_Autoscaling_and_Load_Balancing/README.md
index e2f274c1..617af652 100644
--- a/Deployment/Kubernetes/TensorRT-LLM_Autoscaling_and_Load_Balancing/README.md
+++ b/Deployment/Kubernetes/TensorRT-LLM_Autoscaling_and_Load_Balancing/README.md
@@ -16,7 +16,7 @@
 # Autoscaling and Load Balancing Generative AI w/ Triton Server and TensorRT-LLM

-Setting up autoscaling and load balancing using Triton Inference Server, TensorRT-LLM or vLLM, and Kubernetes is not difficult,
+Setting up autoscaling and load balancing for large language models served by Triton Inference Server is not difficult,
 but it does require preparation. This guide aims to help you automated acquisition of models from Hugging Face, minimize time spent optimizing models for