
What's the difference between vllm and triton-inference-server? #240

Answered by WoosukKwon
gesanqiu asked this question in Q&A

Thanks for your interest! vLLM is an inference and serving engine/backend like FasterTransformer, but is highly optimized for serving throughput. We provide FastAPI and OpenAI API-compatible servers for convenience, but plan to add an integration layer with serving systems such as NVIDIA Triton and Ray Serve for those who want to scale out the system.
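
To make the note about the OpenAI API-compatible server concrete, here is a minimal sketch of querying it over HTTP. The launch command, the model name (facebook/opt-125m), and the default local port 8000 are assumptions for illustration, not details from the original answer.

```python
# Hedged sketch: querying vLLM's OpenAI-compatible completions endpoint.
# Assumes the server was started with something like
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# and is listening on the default local port 8000.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",  # assumed model; match whatever the server was launched with
        "prompt": "The difference between vLLM and Triton Inference Server is",
        "max_tokens": 64,
        "temperature": 0.7,
    },
)
# The response follows the OpenAI completions schema, so the generated text
# lives under choices[0].text.
print(response.json()["choices"][0]["text"])
```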

Answer selected by zhuohan123
This discussion was converted from issue #178 on June 25, 2023 16:43.