update vLLM integration page (#182)
* update vLLM integration page

* address feedback
anakin87 authored Feb 16, 2024
1 parent 669eb08 commit 998b6e1
Showing 1 changed file with 59 additions and 6 deletions: integrations/vllm.md
@@ -1,7 +1,7 @@
---
layout: integration
name: vLLM Invocation Layer
description: Use a vLLM server or locally hosted instance in your Prompt Node
description: Use the vLLM inference engine with Haystack
authors:
- name: Lukas Kreussel
socials:
@@ -11,6 +11,7 @@ repo: https://github.com/LLukas22/vLLM-haystack-adapter
type: Model Provider
report_issue: https://github.com/LLukas22/vLLM-haystack-adapter/issues
logo: /logos/vllm.png
version: Haystack 2.0
toc: true
---
[![PyPI - Version](https://img.shields.io/pypi/v/vllm-haystack.svg)](https://pypi.org/project/vllm-haystack)
@@ -25,15 +26,67 @@ Simply use [vLLM](https://github.com/vllm-project/vllm) in your haystack pipelines
</a>
</p>

## Installation
### Table of Contents

- [Overview](#overview)
- [Haystack 2.x](#haystack-2x)
- [Installation](#installation)
- [Usage](#usage)
- [Haystack 1.x](#haystack-1x)
- [Installation (1.x)](#installation-1x)
- [Usage (1.x)](#usage-1x)

## Overview

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
It is an open-source project that lets you serve open models in production when you have GPU resources available.

For Haystack 1.x, the integration is available as a separate package, while for Haystack 2.x it comes out of the box.

## Haystack 2.x

vLLM can be deployed as a server that implements the OpenAI API protocol.
This allows vLLM to be used with the [`OpenAIGenerator`](https://docs.haystack.deepset.ai/v2.0/docs/openaigenerator) and [`OpenAIChatGenerator`](https://docs.haystack.deepset.ai/v2.0/docs/openaichatgenerator) components in Haystack.

For an end-to-end example of vLLM with Haystack 2.x, see [this notebook](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/vllm_inference_engine.ipynb).


### Installation
First, install vLLM:
- with `pip`: `pip install vllm` (more information in the [vLLM documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html))
- for production use cases, there are many other options, including Docker ([docs](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html))

### Usage
You first need to run a vLLM OpenAI-compatible server. You can do that using [Python](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server) or [Docker](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html); for example, with something like `python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1`.

Then, you can use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.

```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

generator = OpenAIChatGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # for compatibility with the OpenAI API, a placeholder api_key is needed
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base_url="http://localhost:8000/v1",
    generation_kwargs={"max_tokens": 512},
)

response = generator.run(messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")])
```
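
Both components are queried in the same way against the server; for non-chat use, a minimal sketch with `OpenAIGenerator` might look like the following (the model name, URL, and prompt simply mirror the example above and are assumptions about your deployment):

```python
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

generator = OpenAIGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # placeholder key, only needed for OpenAI API compatibility
    model="mistralai/Mistral-7B-Instruct-v0.1",  # must match a model served by your vLLM instance
    api_base_url="http://localhost:8000/v1",
    generation_kwargs={"max_tokens": 512},
)

result = generator.run(prompt="Briefly explain what vLLM is.")
print(result["replies"][0])
```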

## Haystack 1.x

### Installation (1.x)
Install the wrapper via pip: `pip install vllm-haystack`

## Usage
### Usage (1.x)
This integration provides two invocation layers:
- `vLLMInvocationLayer`: To use models hosted on a vLLM server
- `vLLMLocalInvocationLayer`: To use locally hosted vLLM models

### Use a Model Hosted on a vLLM Server
#### Use a Model Hosted on a vLLM Server
To use a model hosted on a vLLM server, use the wrapper's `vLLMInvocationLayer`.

Here is a simple example of how a `PromptNode` can be created with the wrapper.
@@ -52,12 +105,12 @@ prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
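
As a rough sketch, assuming the adapter exposes `vLLMInvocationLayer` as its README describes and that an OpenAI-compatible server is reachable at the URL you pass (the `model_kwargs` values below are illustrative assumptions, not the repository's exact example):

```python
from haystack.nodes import PromptNode, PromptModel
from vllm_haystack import vLLMInvocationLayer

API = "http://localhost:8000/v1"  # assumption: URL of your running vLLM OpenAI-compatible server

model = PromptModel(
    model_name_or_path="",  # left empty: the model is taken from what the vLLM server serves
    invocation_layer_class=vLLMInvocationLayer,
    max_length=256,
    api_key="EMPTY",  # vLLM does not require a real key by default
    model_kwargs={"api_base": API, "maximum_context_length": 2048},  # assumed adapter kwargs
)

prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
```
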
The model is inferred automatically from the model served on the vLLM server.
For more configuration examples, take a look at the unit tests.

#### Hosting a vLLM Server
##### Hosting a vLLM Server

To create an *OpenAI-Compatible Server* via vLLM you can follow the steps in the
Quickstart section of their [documentation](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html#openai-compatible-server).

### Use a Model Hosted Locally
#### Use a Model Hosted Locally
⚠️ To run `vLLM` locally, you need to have `vllm` installed and a supported GPU.

If you don't want to use an API server, this wrapper also provides a `vLLMLocalInvocationLayer`, which executes vLLM on the same node that Haystack is running on.
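
A minimal sketch of that local setup, assuming `vLLMLocalInvocationLayer` accepts a Hugging Face model name as the adapter's examples suggest (the model chosen here is an assumption):

```python
from haystack.nodes import PromptNode, PromptModel
from vllm_haystack import vLLMLocalInvocationLayer

MODEL = "mistralai/Mistral-7B-Instruct-v0.1"  # assumption: any model supported by vLLM

model = PromptModel(
    model_name_or_path=MODEL,
    invocation_layer_class=vLLMLocalInvocationLayer,
    max_length=256,
)

prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
```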