diff --git a/README.md b/README.md
new file mode 100644
index 00000000..abdba2cf
--- /dev/null
+++ b/README.md
@@ -0,0 +1,166 @@
+
+
+[![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)
+
+# vLLM Backend
+
+The Triton backend for [vLLM](https://github.com/vllm-project/vllm)
+is designed to run
+[supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
+on a
+[vLLM engine](https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py).
+You can learn more about Triton backends in the [backend
+repo](https://github.com/triton-inference-server/backend).
+
+
+This is a Python-based backend. When using this backend, all requests are placed on the
+vLLM AsyncEngine as soon as they are received. Inflight batching and paged attention are handled
+by the vLLM engine.
+
+Where can I ask general questions about Triton and Triton backends?
+Be sure to read all the information below as well as the [general
+Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server)
+available in the main [server](https://github.com/triton-inference-server/server)
+repo. If you don't find your answer there, you can ask questions on the
+main Triton [issues page](https://github.com/triton-inference-server/server/issues).
+
+## Building the vLLM Backend
+
+There are several ways to install and deploy the vLLM backend.
+
+### Option 1. Use the Pre-Built Docker Container
+
+Pull the container with the vLLM backend from the [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) registry. This container has everything you need to run your vLLM model.
+
+### Option 2. Build a Custom Container From Source
+You can follow the steps described in the
+[Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker)
+guide and use the
+[build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
+script.
+
+A sample command to build a Triton Server container with all options enabled is shown below. Feel free to customize flags according to your needs.
+
+```
+./build.py -v --enable-logging
+              --enable-stats
+              --enable-tracing
+              --enable-metrics
+              --enable-gpu-metrics
+              --enable-cpu-metrics
+              --enable-gpu
+              --filesystem=gcs
+              --filesystem=s3
+              --filesystem=azure_storage
+              --endpoint=http
+              --endpoint=grpc
+              --endpoint=sagemaker
+              --endpoint=vertex-ai
+              --upstream-container-version=23.10
+              --backend=python:r23.10
+              --backend=vllm:r23.10
+```
+
+### Option 3. Add the vLLM Backend to the Default Triton Container
+
+You can install the vLLM backend directly into the NGC Triton container.
+In this case, please install vLLM first. You can do so by running
+`pip install vllm==<vLLM_version>`. Then, set up the vLLM backend in the
+container with the following commands:
+
+```
+mkdir -p /opt/tritonserver/backends/vllm
+wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/src/model.py
+```
+
+## Using the vLLM Backend
+
+You can see an example
+[model_repository](samples/model_repository)
+in the [samples](samples) folder.
+You can use this as is and change the model by changing the `model` value in `model.json`.
+`model.json` represents a key-value dictionary that is fed to vLLM's AsyncLLMEngine when initializing the model.
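As a concrete reference, the sample `model.json` added later in this change is exactly such a dictionary; it is reproduced here for convenience, and any other supported engine argument (for example `tensor_parallel_size` for multi-GPU serving, as noted below) would simply be another key in the same file:

```
{
    "model": "facebook/opt-125m",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5
}
```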
+You can see supported arguments in vLLM's
+[arg_utils.py](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py).
+Specifically,
+[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11)
+and
+[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201).
+
+For multi-GPU support, EngineArgs like tensor_parallel_size can be specified in
+[model.json](samples/model_repository/vllm_model/1/model.json).
+
+Note: vLLM greedily consumes up to 90% of the GPU's memory under default settings.
+The sample model updates this behavior by setting gpu_memory_utilization to 50%.
+You can tweak this behavior using fields like gpu_memory_utilization and other settings in
+[model.json](samples/model_repository/vllm_model/1/model.json).
+
+In the [samples](samples) folder, you can also find a sample client,
+[client.py](samples/client.py).
+
+## Running the Latest vLLM Version
+
+To see the version of vLLM shipped in the container, check the
+[version_map](https://github.com/triton-inference-server/server/blob/85487a1e15438ccb9592b58e308a3f78724fa483/build.py#L83)
+in [build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
+for the Triton version you are using.
+
+If you would like to use a specific vLLM commit or the latest version of vLLM, you
+will need to use a
+[custom execution environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments).
+
+
+## Sending Your First Inference
+
+After you
+[start Triton](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html)
+with the
+[sample model_repository](samples/model_repository),
+you can quickly run your first inference request with the
+[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).
+
+Try out the command below.
+
+```
+$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
+```
+
+## Running Multiple Instances of Triton Server
+
+If you are running multiple instances of Triton Server with a Python-based backend,
+you need to specify a different `shm-region-prefix-name` for each server. See
+[here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server)
+for more information.
+
+## Referencing the Tutorial
+
+You can read further in the
+[vLLM Quick Deploy guide](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM)
+in the
+[tutorials](https://github.com/triton-inference-server/tutorials/) repository.
\ No newline at end of file
diff --git a/samples/client.py b/samples/client.py
new file mode 100755
index 00000000..06bf0c3e
--- /dev/null
+++ b/samples/client.py
@@ -0,0 +1,236 @@
+#!/usr/bin/env python3
+
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import argparse
+import asyncio
+import json
+import sys
+
+import numpy as np
+import tritonclient.grpc.aio as grpcclient
+from tritonclient.utils import InferenceServerException
+
+
+class LLMClient:
+    def __init__(self, flags: argparse.Namespace):
+        self._client = grpcclient.InferenceServerClient(
+            url=flags.url, verbose=flags.verbose
+        )
+        self._flags = flags
+        self._loop = asyncio.get_event_loop()
+        self._results_dict = {}
+
+    async def async_request_iterator(self, prompts, sampling_parameters):
+        try:
+            for iter in range(self._flags.iterations):
+                for i, prompt in enumerate(prompts):
+                    prompt_id = self._flags.offset + (len(prompts) * iter) + i
+                    self._results_dict[str(prompt_id)] = []
+                    yield self.create_request(
+                        prompt,
+                        self._flags.streaming_mode,
+                        prompt_id,
+                        sampling_parameters,
+                    )
+        except Exception as error:
+            print(f"Caught an error in the request iterator: {error}")
+
+    async def stream_infer(self, prompts, sampling_parameters):
+        try:
+            # Start streaming
+            response_iterator = self._client.stream_infer(
+                inputs_iterator=self.async_request_iterator(
+                    prompts, sampling_parameters
+                ),
+                stream_timeout=self._flags.stream_timeout,
+            )
+            async for response in response_iterator:
+                yield response
+        except InferenceServerException as error:
+            print(error)
+            sys.exit(1)
+
+    async def process_stream(self, prompts, sampling_parameters):
+        # Clear results in between process_stream calls
+        self._results_dict = {}
+
+        # Read response from the stream
+        async for response in self.stream_infer(prompts, sampling_parameters):
+            result, error = response
+            if error:
+                print(f"Encountered error while processing: {error}")
+            else:
+                output = result.as_numpy("text_output")
+                for i in output:
+                    self._results_dict[result.get_response().id].append(i)
+
+    async def run(self):
+        sampling_parameters = {"temperature": "0.1", "top_p": "0.95"}
+        with open(self._flags.input_prompts, "r") as file:
+            print(f"Loading inputs from `{self._flags.input_prompts}`...")
+            prompts = file.readlines()
+
+        await self.process_stream(prompts, sampling_parameters)
+
+        with open(self._flags.results_file, "w") as file:
+            for id in self._results_dict.keys():
+                for result in self._results_dict[id]:
+                    file.write(result.decode("utf-8"))
+                    file.write("\n")
+                file.write("\n=========\n\n")
+            print(f"Storing results into `{self._flags.results_file}`...")
+
+        if self._flags.verbose:
+            with open(self._flags.results_file, "r") as file:
+                print(f"\nContents of `{self._flags.results_file}` ===>")
+                print(file.read())
+
+        print("PASS: vLLM example")
+
+    def run_async(self):
+        self._loop.run_until_complete(self.run())
+
+    def create_request(
+        self,
+        prompt,
+        stream,
+        request_id,
+        sampling_parameters,
+        send_parameters_as_tensor=True,
+    ):
+        inputs = []
+        prompt_data = np.array([prompt.encode("utf-8")], dtype=np.object_)
+        try:
+            inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
+            inputs[-1].set_data_from_numpy(prompt_data)
+        except Exception as error:
+            print(f"Encountered an error during request creation: {error}")
+
+        stream_data = np.array([stream], dtype=bool)
+        inputs.append(grpcclient.InferInput("stream", [1], "BOOL"))
+        inputs[-1].set_data_from_numpy(stream_data)
+
+        # Request parameters are not yet supported via BLS. Provide an
+        # optional mechanism to send serialized parameters as an input
+        # tensor until support is added.
+        if send_parameters_as_tensor:
+            sampling_parameters_data = np.array(
+                [json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_
+            )
+            inputs.append(grpcclient.InferInput("sampling_parameters", [1], "BYTES"))
+            inputs[-1].set_data_from_numpy(sampling_parameters_data)
+
+        # Add requested outputs
+        outputs = []
+        outputs.append(grpcclient.InferRequestedOutput("text_output"))
+
+        # Return the request arguments; stream_infer issues the
+        # asynchronous streaming inference with them.
+        return {
+            "model_name": self._flags.model,
+            "inputs": inputs,
+            "outputs": outputs,
+            "request_id": str(request_id),
+            "parameters": sampling_parameters,
+        }
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-m",
+        "--model",
+        type=str,
+        required=False,
+        default="vllm_model",
+        help="Model name",
+    )
+    parser.add_argument(
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
+        type=str,
+        required=False,
+        default="localhost:8001",
+        help="Inference server URL and its gRPC port. Default is localhost:8001.",
+    )
+    parser.add_argument(
+        "-t",
+        "--stream-timeout",
+        type=float,
+        required=False,
+        default=None,
+        help="Stream timeout in seconds. Default is None.",
+    )
+    parser.add_argument(
+        "--offset",
+        type=int,
+        required=False,
+        default=0,
+        help="Offset to add to the request IDs used",
+    )
+    parser.add_argument(
+        "--input-prompts",
+        type=str,
+        required=False,
+        default="prompts.txt",
+        help="Text file with input prompts",
+    )
+    parser.add_argument(
+        "--results-file",
+        type=str,
+        required=False,
+        default="results.txt",
+        help="The file to write output results to",
+    )
+    parser.add_argument(
+        "--iterations",
+        type=int,
+        required=False,
+        default=1,
+        help="Number of iterations through the prompts file",
+    )
+    parser.add_argument(
+        "-s",
+        "--streaming-mode",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable streaming mode",
+    )
+    FLAGS = parser.parse_args()
+
+    client = LLMClient(FLAGS)
+    client.run_async()
diff --git a/samples/model_repository/vllm_model/1/model.json b/samples/model_repository/vllm_model/1/model.json
new file mode 100644
index 00000000..e610c3cb
--- /dev/null
+++ b/samples/model_repository/vllm_model/1/model.json
@@ -0,0 +1,5 @@
+{
+    "model":"facebook/opt-125m",
+    "disable_log_requests": "true",
+    "gpu_memory_utilization": 0.5
+}
diff --git a/samples/model_repository/vllm_model/config.pbtxt b/samples/model_repository/vllm_model/config.pbtxt
new file mode 100644
index 00000000..169f3815
--- /dev/null
+++ b/samples/model_repository/vllm_model/config.pbtxt
@@ -0,0 +1,77 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+# Note: You do not need to change any fields in this configuration.
+
+backend: "vllm"
+
+# Disable batching in Triton; let vLLM handle batching on its own.
+max_batch_size: 0
+
+# We need to use the decoupled transaction policy to saturate the
+# vLLM engine for maximum throughput.
+# TODO [DLIS:5233]: Allow asynchronous execution to lift this
+# restriction for cases where there is exactly a single response to
+# a single request.
+model_transaction_policy {
+  decoupled: True
+}
+# Note: The vLLM backend uses the following input and output names.
+# Any change here needs to also be made in model.py
+input [
+  {
+    name: "text_input"
+    data_type: TYPE_STRING
+    dims: [ 1 ]
+  },
+  {
+    name: "stream"
+    data_type: TYPE_BOOL
+    dims: [ 1 ]
+  },
+  {
+    name: "sampling_parameters"
+    data_type: TYPE_STRING
+    dims: [ 1 ]
+    optional: true
+  }
+]
+
+output [
+  {
+    name: "text_output"
+    data_type: TYPE_STRING
+    dims: [ -1 ]
+  }
+]
+
+# Device selection is deferred to the vLLM engine
+instance_group [
+  {
+    count: 1
+    kind: KIND_MODEL
+  }
+]
diff --git a/samples/prompts.txt b/samples/prompts.txt
new file mode 100644
index 00000000..133800ec
--- /dev/null
+++ b/samples/prompts.txt
@@ -0,0 +1,4 @@
+Hello, my name is
+The most dangerous animal is
+The capital of France is
+The future of AI is
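As a usage sketch (not part of the files added above), assuming Triton is already serving the sample model repository and the command is run from the repository root, the sample client can consume `samples/prompts.txt` using the flags defined in its argument parser; the gRPC endpoint defaults to `localhost:8001`:

```
$ python3 samples/client.py \
      --model vllm_model \
      --input-prompts samples/prompts.txt \
      --results-file results.txt \
      --iterations 1 \
      --streaming-mode
```

The generated text for each prompt is written to the file given by `--results-file`; drop `--streaming-mode` to receive each response as a single, non-streamed result.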