An efficient, easy-to-use platform for inference and serving of local LLMs, including an OpenAI-compatible API server.
candle-vllm is under active development with breaking changes and is currently unstable.
- OpenAI-compatible API server for serving LLMs.
- Highly extensible trait-based system that allows rapid implementation of new model pipelines.
- Streaming support during generation.
- Efficient key-value cache management with PagedAttention (see the sketch after this list).
- Continuous batching.
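To give an idea of what PagedAttention does, the toy Python sketch below allocates the key-value cache in fixed-size blocks that are handed out on demand and tracked through a per-sequence block table, instead of reserving contiguous memory for each sequence's maximum length. This follows the idea described in the vLLM paper and is not candle-vllm's actual implementation; names such as `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are purely illustrative.

```python
# Toy illustration of block-based KV-cache allocation (the PagedAttention idea).
# Not candle-vllm's actual code; all names here are hypothetical.

BLOCK_SIZE = 16  # tokens stored per KV-cache block


class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a shared pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise RuntimeError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)


class Sequence:
    """Tracks which physical blocks hold this sequence's KV cache."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical -> physical block mapping
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is only needed when the previous one is full, so memory
        # grows on demand rather than being reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Freed blocks immediately become available to other sequences.
        for block in self.block_table:
            self.allocator.free(block)
        self.block_table.clear()


allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):          # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(seq.block_table)       # three physical block indices
seq.release()
```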
Currently supported models:
- Llama
  - 7b
  - 13b
  - 70b
- Mistral
  - 7b
See this folder for some examples.
In your terminal, install the `openai` Python package by running `pip install openai` (this example uses version `1.3.5`).
Then, create a new Python file and write the following code:
import openai

# Point the client at the local candle-vllm server; the API key is a
# placeholder, since no real OpenAI key is needed for a local server.
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:2000/v1/"

completion = openai.chat.completions.create(
    model="llama7b",
    messages=[
        {
            "role": "user",
            "content": "Explain how to best learn Rust.",
        },
    ],
    max_tokens=64,
)
print(completion.choices[0].message.content)
Next, launch a candle-vllm instance by running `HF_TOKEN=... cargo run --release -- --hf-token HF_TOKEN --port 2000 llama7b --repeat-last-n 64`.
After the candle-vllm instance is running, run the Python script and enjoy efficient inference with an OpenAI-compatible API server!
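Because the server supports streaming generation (see the feature list above), the same request can also be consumed token by token. The sketch below assumes candle-vllm follows the standard OpenAI streaming response format; it is not taken from the project's own examples.

```python
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:2000/v1/"

# Request a streamed response and print tokens as they arrive.
stream = openai.chat.completions.create(
    model="llama7b",
    messages=[{"role": "user", "content": "Explain how to best learn Rust."}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```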
Installing candle-vllm is as simple as the following steps. If you have any problems, please create an issue.
- Be sure to install Rust here: https://www.rust-lang.org/tools/install
- Run `sudo apt install libssl-dev` or the equivalent install command
- Run `sudo apt install pkg-config` or the equivalent install command
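Collected into one place, a typical setup might look like the sketch below. It assumes a Debian-based system and a local checkout of the repository; the GitHub URL shown is an assumption, so substitute the repository you actually use.

```bash
# Prerequisites (Debian/Ubuntu shown; use your platform's equivalent).
sudo apt install libssl-dev pkg-config

# Clone and build candle-vllm in release mode.
# NOTE: the repository URL below is an assumption; adjust as needed.
git clone https://github.com/EricLBuehler/candle-vllm.git
cd candle-vllm
cargo build --release
```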
The following features are planned to be implemented, but contributions are especially welcome:
- Sampling methods:
  - Beam search (huggingface/candle#1319)
- More pipelines (from `candle-transformers`)
References:
- Python implementation: vllm-project/vllm
- The vLLM paper