Efficient, easy-to-use platform for inference and serving of local LLMs, including an OpenAI compatible API server.
candle-vllm is in active development and not currently stable.
- OpenAI compatible API server provided for serving LLMs.
- Highly extensible trait-based system to allow rapid implementation of new module pipelines.
- Streaming support in generation.
- Efficient management of key-value cache with PagedAttention.
- Llama
  - 7b
  - 13b
  - 70b
- Mistral
  - 7b
See this folder for some examples.
In your terminal, install the `openai` Python package by running `pip install openai`. I use version `1.3.5`.
Then, create a new Python file and write the following code:
```python
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:2000/v1/"

completion = openai.chat.completions.create(
    model="llama7b",
    messages=[
        {
            "role": "user",
            "content": "Explain how to best learn Rust.",
        },
    ],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```
Next, launch a `candle-vllm` instance by running `HF_TOKEN=... cargo run --release -- --hf-token HF_TOKEN --port 2000 llama7b --repeat-last-n 64`.

After the `candle-vllm` instance is running, run the Python script and enjoy efficient inference with an OpenAI compatible API server!
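Because streaming generation is supported, the same endpoint can also be consumed incrementally. Below is a minimal sketch, assuming the server implements the standard OpenAI streaming protocol (`stream=True`) and is running with the same model and port as above:

```python
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:2000/v1/"

# Request a streaming response; tokens are printed as they arrive.
stream = openai.chat.completions.create(
    model="llama7b",
    messages=[{"role": "user", "content": "Explain how to best learn Rust."}],
    max_tokens=64,
    stream=True,  # assumes the server supports OpenAI-style streaming
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```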
Installing `candle-vllm` is as simple as the following steps. If you have any problems, please create an issue.
- Be sure to install Rust here: https://www.rust-lang.org/tools/install
- Run `sudo apt install libssl-dev` or equivalent install command
- Run `sudo apt install pkg-config` or equivalent install command
- See the "Compiling PagedAttention CUDA kernels" section.
Go to either the "Install with Pytorch" or "Install with libtorch" section to continue.
- Install `setuptools >= 49.4.0`: `pip install setuptools==49.4.0`
- Run `python3 setup.py build` to compile the PagedAttention CUDA headers. todo!()
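If `setup.py build` fails, it is worth confirming that the `setuptools` requirement is met and which CUDA toolkit the build will use. A small sketch, assuming `nvcc` is on your `PATH`:

```python
# Verify the setuptools requirement (>= 49.4.0) and report the CUDA toolkit
# that `python3 setup.py build` will compile the PagedAttention kernels against.
import subprocess
import setuptools

print("setuptools:", setuptools.__version__)
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```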
- Run `sudo find / -name libtorch_cpu.so`, taking note of the paths returned.
- Install Pytorch 2.1.0 from https://pytorch.org/get-started/previous-versions/. Be sure that the correct CUDA version is used (`nvcc --version`).
- Run `sudo find / -name libtorch_cpu.so` again. Take note of the new path (not including the filename).
- Add the following to `.bashrc` or equivalent:

  ```
  # candle-vllm
  export LD_LIBRARY_PATH=/the/new/path/:$LD_LIBRARY_PATH
  export LIBTORCH_USE_PYTORCH=1
  ```

- Either run `source .bashrc` (or equivalent) or reload the terminal.
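As a cross-check (and as an alternative to searching the whole filesystem with `find`), the installed Pytorch package can report its own version, CUDA build, and the directory that contains `libtorch_cpu.so`. A sketch, assuming Pytorch 2.1.0 is installed in the active Python environment:

```python
import os
import torch

print(torch.__version__)   # expect 2.1.0
print(torch.version.cuda)  # should match the toolkit reported by `nvcc --version`

# The `lib` directory of the installed package contains libtorch_cpu.so;
# this is the path to export in LD_LIBRARY_PATH above.
lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
print(lib_dir, os.path.exists(os.path.join(lib_dir, "libtorch_cpu.so")))
```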
- Download libtorch, the Pytorch C++ library, from https://pytorch.org/get-started/locally/. Before executing the `wget` command, ensure the following:
  - Be sure that you are downloading Pytorch 2.1.0 instead of Pytorch 2.1.1 (change the link, the number is near the end).
  - If on Linux, use the link corresponding to the CXX11 ABI.
  - The correct CUDA version is used (`nvcc --version`).
- Unzip the directory.
- Add the following line to your `.bashrc` or equivalent:

  ```
  # candle-vllm
  export LIBTORCH=/path/to/libtorch
  ```

- Either run `source .bashrc` (or equivalent) or reload your terminal.
If you get this error: `error while loading shared libraries: libtorch_cpu.so: cannot open shared object file: No such file or directory`, add the following to your `.bashrc` or equivalent:

```
# For Linux
export LD_LIBRARY_PATH=/path/to/libtorch/lib:$LD_LIBRARY_PATH
# For macOS
export DYLD_LIBRARY_PATH=/path/to/libtorch/lib:$DYLD_LIBRARY_PATH
```

Then, either run `source .bashrc` (or equivalent) or reload the terminal.
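To confirm that the dynamic loader can now resolve the library by name (roughly what the `candle-vllm` binary needs at startup), you can try loading it from Python. A minimal sketch, assuming the environment variables above have been applied to the current shell:

```python
# Attempt to resolve libtorch_cpu.so by name via the dynamic loader; raises
# OSError if LD_LIBRARY_PATH (or DYLD_LIBRARY_PATH on macOS) is still not picked up.
import ctypes

ctypes.CDLL("libtorch_cpu.so")
print("libtorch_cpu.so found by the dynamic loader")
```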
The following features are planned to be implemented, but contributions are especially welcome:
- Sampling methods:
  - Beam search (huggingface/candle#1319)
- Pipeline batching (#3)
- More pipelines (from `candle-transformers`)
- Python implementation: vllm-project
- vllm paper