Overview of popular open-source large language model inference engines. An inference engine is the program that loads a model's weights and generates text responses for given inputs.
Feel free to create a PR or issue if you want to add a new engine column or feature row, or to update a status.
- vLLM: Designed to provide SOTA throughput (see the usage sketch after this list).
- TensorRT-LLM: NVIDIA's high-performance, extensible, PyTorch-like API, designed for use with the NVIDIA Triton Inference Server.
- llama.cpp: Pure C++ without any dependencies; Apple Silicon is prioritized.
- TGI: Hugging Face's fast and flexible engine, designed for high throughput.
- LightLLM: A lightweight, fast, and flexible framework targeting performance, written purely in Python / Triton.
- DeepSpeed-MII / DeepSpeed-FastGen: Microsoft's high-performance implementation, including the SOTA Dynamic SplitFuse scheduler.
- ExLlamaV2: Efficiently runs language models on modern consumer GPUs; implements the SOTA EXL2 quantization format.
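To make the comparison concrete, here is a minimal sketch of what using an inference engine looks like, based on vLLM's offline Python API. The model name is only an example, and the snippet assumes vLLM is installed locally (`pip install vllm`):

```python
# Minimal sketch of offline batch generation with vLLM's Python API.
# The model name is an example; any supported Hugging Face model ID works.
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one sentence.",
    "What does an inference engine do?",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Loads the model weights once, then generates completions for the whole batch.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```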
β Included | π Inferior Alternative | π©οΈ Exists but has Issues | π¨ PR | ποΈ Planned |β Unclear / Unofficial | β Not Implemented
| | vLLM | TensorRT-LLM | llama.cpp | TGI | LightLLM | FastGen | ExLlamaV2 |
|---|---|---|---|---|---|---|---|
Optimizations | |||||||
FlashAttention2 | β 1 | β 2 | π 3 | β 4 | β | β | β |
PagedAttention | β 4 | β 2 | β 5 | β | π *** 6 | β | β 7 |
Speculative Decoding | π¨ 8 | β 9 | β 10 | β 11 | β | β 12 | β |
Tensor Parallel | β | β 13 | π ** 14 | β 15 | β | β 16 | β |
Pipeline Parallel | β 17 | β 18 | β 19 | β 15 | β | β 20 | β |
Optim. / Scheduler | |||||||
Dyn. SplitFuse (SOTA21) | ποΈ 21 | ποΈ 22 | β | β | β | β 21 | β |
Efficient Router (better) | β | β | β | β | β 23 | β | β |
Cont. Batching | β 21 | β 24 | β | β | β | β 16 | β 25 |
Optim. / Quant | |||||||
EXL2 (SOTA26) | π¨ 27 | β | β | β 28 | β | β | β |
AWQ | π©οΈ 29 | β | β | β | β | β | β |
Other Quants | (yes) 30 | GPTQ | GGUF 31 | (yes) 32 | ? | ? | ? |
Features | |||||||
OpenAI-Style API (example below) | β | β 33 | β [^13] | β 34 | β 35 | β | β |
Feat. / Sampling | |||||||
Beam Search | β | β 2 | β 36 | π **** 37 | β | β 38 | β 39 |
JSON / Grammars via Outlines | β | ποΈ | β | β | ? | ? | β |
Models | |||||||
Llama 2 / 3 | β | β | β | β | β | β | β |
Mistral | β | β | β | β | β 40 | β | β |
Mixtral | β | β | β | β | β | β | β |
Implementation | |||||||
Core Language | Python | C++ | C++ | Py / Rust | Python | Python | Python |
GPU API | CUDA* | CUDA* | Metal / CUDA | CUDA* | Triton / CUDA | CUDA* | CUDA |
Repo | |||||||
License | Apache 2 | Apache 2 | MIT | Apache 2 41 | Apache 2 | Apache 2 | MIT |
GitHub Stars | 17K | 6K | 54K | 8K | 2K | 2K | 3K |
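Most of the engines above can serve an OpenAI-style HTTP API (see the "OpenAI-Style API" row). As a rough sketch, such a server can be queried with the official `openai` Python client; the base URL, API key, and model name below are assumptions about a locally running server, not fixed values:

```python
# Minimal sketch of querying an OpenAI-compatible endpoint exposed by one of the
# engines above (e.g. vLLM's OpenAI-compatible server). The base URL, API key,
# and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local-key-not-checked")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Summarize continuous batching in two sentences."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```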
- BentoML benchmark (June 5th, 2024): compares LMDeploy, MLC-LLM, TGI, TRT-LLM, and vLLM.
*Supports Triton for one-off kernels such as FlashAttention (FusedAttention) or quantization, or allows Triton plugins; however, the project doesn't use Triton otherwise.
**Sequentially processed tensor split
****TGI maintainers suggest using `best_of` instead of beam search (`best_of` creates `n` sampled generations and selects the one with the highest overall log probability). Anecdotally, beam search is much better at finding the best generation for "non-creative" tasks.
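For illustration, a minimal sketch of requesting `best_of` sampling from a running TGI server over its `/generate` REST endpoint; the server address, port, and parameter values are assumptions, and `best_of` only applies when sampling is enabled:

```python
# Minimal sketch: best_of sampling against a running TGI server's /generate endpoint.
# The server address and parameter values are illustrative assumptions.
import requests

payload = {
    "inputs": "Write a one-line docstring for a function that reverses a string.",
    "parameters": {
        "best_of": 4,        # sample 4 candidate generations, return the best-scoring one
        "do_sample": True,   # best_of requires sampling
        "max_new_tokens": 64,
    },
}

resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```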
Footnotes
- https://github.com/vllm-project/vllm/issues/485#issuecomment-1693009046
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_attention.md
- https://github.com/ggerganov/llama.cpp/pull/5021 (FlashAttention, but not FlashAttention2)
- https://github.com/huggingface/text-generation-inference/issues/753#issuecomment-1663525606
- https://github.com/ModelTC/lightllm/blob/main/docs/TokenAttention.md
- https://github.com/turboderp/exllamav2/commit/affc3508c1d18e4294a5062f794f44112a8b07c5
- https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html
- https://github.com/ggerganov/llama.cpp/blob/fe680e3d1080a765e5d3150ffd7bab189742898d/examples/speculative/README.md
- https://github.com/huggingface/text-generation-inference/pull/1308
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp#L184
- https://github.com/ggerganov/llama.cpp/issues/4014#issuecomment-1804925896
- https://github.com/huggingface/text-generation-inference/issues/1031#issuecomment-1727976990
- https://github.com/NVIDIA/TensorRT-LLM/blob/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/tensorrt_llm/auto_parallel/config.py#L35
- "without specific architecture tricks, you will only be using one GPU at a time, and your performance will suffer compared to a single GPU due to communication and synchronization overhead." https://github.com/ggerganov/llama.cpp/issues/4238#issuecomment-1832768597
- https://github.com/microsoft/DeepSpeed-MII/issues/329#issuecomment-1830317364
- https://blog.vllm.ai/2023/11/14/notes-vllm-vs-deepspeed.html, https://github.com/vllm-project/vllm/issues/1562
- https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752
- https://github.com/ModelTC/lightllm/blob/a9cf0152ad84beb663cddaf93a784092a47d1515/docs/LightLLM.md#efficient-router
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md
- https://github.com/turboderp/exllamav2/discussions/19#discussioncomment-6989460
- https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/#pareto-frontiers
- https://github.com/huggingface/text-generation-inference/pull/1211
- https://github.com/vllm-project/vllm/blob/main/docs/source/quantization/auto_awq.rst
- https://github.com/vllm-project/vllm/blob/1f24755bf802a2061bd46f3dd1191b7898f13f45/vllm/model_executor/quantization_utils/squeezellm.py#L8
- https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md
- https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/cli.py#L15-L21
- https://huggingface.co/docs/text-generation-inference/messages_api
- https://github.com/ModelTC/lightllm/blob/main/lightllm/server/api_models.py#L9
- https://github.com/ggerganov/llama.cpp/tree/master/examples/beam-search
- https://github.com/huggingface/text-generation-inference/issues/722#issuecomment-1658823644
- https://github.com/microsoft/DeepSpeed-MII/issues/286#issuecomment-1808510043
- https://github.com/ModelTC/lightllm/issues/224#issuecomment-1827365514
- https://raw.githubusercontent.com/huggingface/text-generation-inference/main/LICENSE, https://twitter.com/julien_c/status/1777328456709062848