What's the difference between vllm and triton-inference-server? #240
-
Can vLLM achieve performance like FasterTransformer on the inference side? Just curious about the detailed optimizations you've done and the goal you want to achieve.
Replies: 5 comments
-
Thanks for your interest! vLLM is an inference and serving engine/backend like FasterTransformer, but is highly optimized for serving throughput. We provide FastAPI and OpenAI API-compatible servers for convenience, but plan to add an integration layer with serving systems such as NVIDIA Triton and Ray Serve for those who want to scale out the system.
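For example, a minimal sketch of querying the OpenAI-compatible completions endpoint over plain HTTP. The host, port, and model name below are assumptions and depend on how you launch the server:

```python
# Minimal sketch: query vLLM's OpenAI-compatible completions endpoint.
# Assumes a server is already running at localhost:8000 and serving the
# model named below -- adjust both to match your own deployment.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",  # assumed model name
        "prompt": "The difference between vLLM and Triton is",
        "max_tokens": 64,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["text"])
```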
-
Thanks for your response. So can I assume vLLM will serve as a backend in NVIDIA Triton? I'm wondering whether the serving part overlaps with NVIDIA Triton's capabilities.
-
PagedAttention requires batching multiple requests together to achieve high throughput, so we need to keep the batching logic within vLLM as well. This is typically not included in an NVIDIA Triton backend, which usually only handles inference on a single batch. From this perspective, vLLM is more than a typical NVIDIA Triton backend. However, we will mostly focus on building a core LLM serving engine and leave most frontend functionality (e.g., fault tolerance, auto-scaling, multiple frontends including gRPC, ...) to other serving frontends (e.g., NVIDIA Triton, Ray Serve, ...). We will focus on making LLM inference and serving lightning fast and cheap.
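To make the iteration-level batching point concrete, here is a deliberately simplified toy loop. This is not vLLM's actual scheduler, and `engine.step` is a hypothetical placeholder; it only illustrates why the batching logic has to live inside the engine rather than in a request-level batcher in front of it:

```python
# Toy illustration of iteration-level ("continuous") batching.
# NOT vLLM's scheduler: the running batch is re-formed at every
# generation step instead of once per incoming request batch.
from collections import deque

def continuous_batching_loop(engine, waiting: deque, max_batch_size: int = 8):
    running = []  # sequences currently being decoded
    while waiting or running:
        # Admit new requests whenever slots free up.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One decoding step: generate one token per running sequence.
        finished = engine.step(running)  # hypothetical engine API

        # Completed sequences leave immediately, freeing their slots for
        # the next iteration instead of waiting for the whole batch.
        running = [seq for seq in running if seq not in finished]
```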
-
There is dynamic batching in NVIDIA Triton. Is this somehow different from what vLLM does?
-
This blog post from Anyscale explains in detail the difference between "dynamic batching" in Triton and "continuous batching" in vLLM. In a nutshell, "dynamic batching" is designed mainly for traditional NNs (e.g., CNNs), where the NN receives fixed-size inputs and the system decides how many inputs to batch for each iteration. "Continuous batching", in contrast, is specifically designed for LLMs and language sequences: it batches individual tokens of different sequences at each iteration.
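As a rough, made-up illustration of why this matters for throughput: with request-level dynamic batching, a batch is held together until its longest sequence finishes, so slots freed by short sequences sit idle.

```python
# Made-up example: 4 requests whose outputs are 10, 20, 30, and 100 tokens.
output_lens = [10, 20, 30, 100]
batch_size = len(output_lens)

# Dynamic (request-level) batching: the batch runs until the longest
# sequence is done, so each slot is occupied for max(output_lens) steps.
steps = max(output_lens)                       # 100
useful = sum(output_lens)                      # 160 useful token slots
utilization = useful / (batch_size * steps)    # 160 / 400
print(f"dynamic batching slot utilization: {utilization:.0%}")  # 40%

# Continuous (iteration-level) batching can refill each slot as soon as
# its sequence finishes (after steps 10, 20, and 30 here), keeping
# utilization close to 100% under load.
```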