What's the difference between vllm and triton-inference-server? #240
-
Can vLLM achieve performance like FasterTransformer on the inference side? Just curious about the detailed optimizations you've done and the goal you want to achieve.
Replies: 5 comments
-
Thanks for your interest! vLLM is an inference and serving engine/backend like FasterTransformer, but is highly optimized for serving throughput. We provide FastAPI and OpenAI API-compatible servers for convenience, but plan to add an integration layer with serving systems such as NVIDIA Triton and Ray Serve for those who want to scale out the system.
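For example, a minimal sketch of querying the OpenAI-compatible completions endpoint over plain HTTP. The host, port, and model name below are assumptions and depend on how you launch the server:

```python
# Minimal sketch: query vLLM's OpenAI-compatible completions endpoint.
# Assumes a server is already running at localhost:8000 and serving the
# model named below -- adjust both to match your own deployment.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",  # assumed model name
        "prompt": "The difference between vLLM and Triton is",
        "max_tokens": 64,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["text"])
```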
-
Thanks for your response. So can I assume vLLM will serve as a backend in NVIDIA Triton? I'm wondering whether the serving part overlaps with NVIDIA Triton's capabilities.
-
PagedAttention requires batching multiple requests together to achieve high throughput, so we need to keep the batching logic within vLLM as well. This is typically not included in an NVIDIA Triton backend, which usually only handles inference on a single batch. From this perspective, vLLM is more than a typical NVIDIA Triton backend. However, we will mostly focus on building a core LLM serving engine and leave most frontend functionality (e.g., fault tolerance, auto-scaling, multiple frontends including gRPC, ...) to other serving frontends (e.g., NVIDIA Triton, Ray Serve, ...). We will focus on making LLM inference and serving lightning fast and cheap.
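To make the iteration-level batching point concrete, here is a deliberately simplified toy loop. This is not vLLM's actual scheduler, and `engine.step` is a hypothetical placeholder; it only illustrates why the batching logic has to live inside the engine rather than in a request-level batcher in front of it:

```python
# Toy illustration of iteration-level ("continuous") batching.
# NOT vLLM's scheduler: the running batch is re-formed at every
# generation step instead of once per incoming request batch.
from collections import deque

def continuous_batching_loop(engine, waiting: deque, max_batch_size: int = 8):
    running = []  # sequences currently being decoded
    while waiting or running:
        # Admit new requests whenever slots free up.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One decoding step: generate one token per running sequence.
        finished = engine.step(running)  # hypothetical engine API

        # Completed sequences leave immediately, freeing their slots for
        # the next iteration instead of waiting for the whole batch.
        running = [seq for seq in running if seq not in finished]
```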
-
There is dynamic batching in NVIDIA Triton. Is this somehow different from what vLLM does?
-
This blog post from Anyscale explains in detail the difference between "dynamic batching" in Triton and "continuous batching" in vLLM. In a nutshell, "dynamic batching" is designed mainly for traditional NNs (e.g., CNNs), where the NN receives fixed-size inputs and the system decides how many inputs to batch for each iteration. "Continuous batching", in contrast, is specifically designed for LLMs and language sequences: it batches individual tokens of different sequences at each iteration.
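As a rough, made-up illustration of why this matters for throughput: with request-level dynamic batching, a batch is held together until its longest sequence finishes, so slots freed by short sequences sit idle.

```python
# Made-up example: 4 requests whose outputs are 10, 20, 30, and 100 tokens.
output_lens = [10, 20, 30, 100]
batch_size = len(output_lens)

# Dynamic (request-level) batching: the batch runs until the longest
# sequence is done, so each slot is occupied for max(output_lens) steps.
steps = max(output_lens)                       # 100
useful = sum(output_lens)                      # 160 useful token slots
utilization = useful / (batch_size * steps)    # 160 / 400
print(f"dynamic batching slot utilization: {utilization:.0%}")  # 40%

# Continuous (iteration-level) batching can refill each slot as soon as
# its sequence finishes (after steps 10, 20, and 30 here), keeping
# utilization close to 100% under load.
```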