diff --git a/README.md b/README.md
new file mode 100644
index 00000000..abdba2cf
--- /dev/null
+++ b/README.md
@@ -0,0 +1,166 @@
+
+
+[![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)
+
+# vLLM Backend
+
+The Triton backend for [vLLM](https://github.com/vllm-project/vllm)
+is designed to run
+[supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
+on a
+[vLLM engine](https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py).
+You can learn more about Triton backends in the [backend
+repo](https://github.com/triton-inference-server/backend).
+
+
+This is a Python-based backend. When using this backend, all requests are placed on the
+vLLM AsyncEngine as soon as they are received. Inflight batching and paged attention are handled
+by the vLLM engine.
+
+Where can I ask general questions about Triton and Triton backends?
+Be sure to read all the information below as well as the [general
+Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server)
+available in the main [server](https://github.com/triton-inference-server/server)
+repo. If you don't find your answer there, you can ask questions on the
+main Triton [issues page](https://github.com/triton-inference-server/server/issues).
+
+## Building the vLLM Backend
+
+There are several ways to install and deploy the vLLM backend.
+
+### Option 1. Use the Pre-Built Docker Container
+
+Pull the container with the vLLM backend from the [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) registry. This container has everything you need to run your vLLM model.
+
+### Option 2. Build a Custom Container From Source
+You can follow the steps described in the
+[Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker)
+guide and use the
+[build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
+script.
+
+A sample command to build a Triton Server container with all options enabled is shown below. Feel free to customize flags according to your needs.
+
+```
+./build.py -v --enable-logging
+              --enable-stats
+              --enable-tracing
+              --enable-metrics
+              --enable-gpu-metrics
+              --enable-cpu-metrics
+              --enable-gpu
+              --filesystem=gcs
+              --filesystem=s3
+              --filesystem=azure_storage
+              --endpoint=http
+              --endpoint=grpc
+              --endpoint=sagemaker
+              --endpoint=vertex-ai
+              --upstream-container-version=23.10
+              --backend=python:r23.10
+              --backend=vllm:r23.10
+```
+
+### Option 3. Add the vLLM Backend to the Default Triton Container
+
+You can install the vLLM backend directly into the NGC Triton container.
+In this case, please install vLLM first. You can do so by running
+`pip install vllm==<vLLM_version>`. Then, set up the vLLM backend in the
+container with the following commands:
+
+```
+mkdir -p /opt/tritonserver/backends/vllm
+wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/src/model.py
+```
+
+## Using the vLLM Backend
+
+You can see an example
+[model_repository](samples/model_repository)
+in the [samples](samples) folder.
+You can use this as is and change the model by changing the `model` value in `model.json`.
+`model.json` represents a key-value dictionary that is fed to vLLM's AsyncLLMEngine when initializing the model.
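As a concrete reference, the sample `model.json` added later in this change is exactly such a dictionary; it is reproduced here for convenience, and any other supported engine argument (for example `tensor_parallel_size` for multi-GPU serving, as noted below) would simply be another key in the same file:

```
{
    "model": "facebook/opt-125m",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5
}
```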
+You can see supported arguments in vLLM's
+[arg_utils.py](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py).
+Specifically,
+[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11)
+and
+[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201).
+
+For multi-GPU support, EngineArgs like tensor_parallel_size can be specified in
+[model.json](samples/model_repository/vllm_model/1/model.json).
+
+Note: vLLM greedily consumes up to 90% of the GPU's memory under default settings.
+The sample model updates this behavior by setting gpu_memory_utilization to 50%.
+You can tweak this behavior using fields like gpu_memory_utilization and other settings in
+[model.json](samples/model_repository/vllm_model/1/model.json).
+
+In the [samples](samples) folder, you can also find a sample client,
+[client.py](samples/client.py).
+
+## Running the Latest vLLM Version
+
+To see the version of vLLM shipped in the container, check the
+[version_map](https://github.com/triton-inference-server/server/blob/85487a1e15438ccb9592b58e308a3f78724fa483/build.py#L83)
+in [build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
+for the Triton version you are using.
+
+If you would like to use a specific vLLM commit or the latest version of vLLM, you
+will need to use a
+[custom execution environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments).
+
+
+## Sending Your First Inference
+
+After you
+[start Triton](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html)
+with the
+[sample model_repository](samples/model_repository),
+you can quickly run your first inference request with the
+[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).
+
+Try out the command below.
+
+```
+$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
+```
+
+## Running Multiple Instances of Triton Server
+
+If you are running multiple instances of Triton Server with a Python-based backend,
+you need to specify a different `shm-region-prefix-name` for each server. See
+[here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server)
+for more information.
+
+## Referencing the Tutorial
+
+You can read further in the
+[vLLM Quick Deploy guide](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM)
+in the
+[tutorials](https://github.com/triton-inference-server/tutorials/) repository.
\ No newline at end of file
diff --git a/samples/client.py b/samples/client.py
new file mode 100755
index 00000000..06bf0c3e
--- /dev/null
+++ b/samples/client.py
@@ -0,0 +1,236 @@
+#!/usr/bin/env python3
+
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import argparse
+import asyncio
+import json
+import sys
+
+import numpy as np
+import tritonclient.grpc.aio as grpcclient
+from tritonclient.utils import InferenceServerException
+
+
+class LLMClient:
+    def __init__(self, flags: argparse.Namespace):
+        self._client = grpcclient.InferenceServerClient(
+            url=flags.url, verbose=flags.verbose
+        )
+        self._flags = flags
+        self._loop = asyncio.get_event_loop()
+        self._results_dict = {}
+
+    async def async_request_iterator(self, prompts, sampling_parameters):
+        try:
+            for iter in range(self._flags.iterations):
+                for i, prompt in enumerate(prompts):
+                    prompt_id = self._flags.offset + (len(prompts) * iter) + i
+                    self._results_dict[str(prompt_id)] = []
+                    yield self.create_request(
+                        prompt,
+                        self._flags.streaming_mode,
+                        prompt_id,
+                        sampling_parameters,
+                    )
+        except Exception as error:
+            print(f"Caught an error in the request iterator: {error}")
+
+    async def stream_infer(self, prompts, sampling_parameters):
+        try:
+            # Start streaming
+            response_iterator = self._client.stream_infer(
+                inputs_iterator=self.async_request_iterator(
+                    prompts, sampling_parameters
+                ),
+                stream_timeout=self._flags.stream_timeout,
+            )
+            async for response in response_iterator:
+                yield response
+        except InferenceServerException as error:
+            print(error)
+            sys.exit(1)
+
+    async def process_stream(self, prompts, sampling_parameters):
+        # Clear results in between process_stream calls
+        self._results_dict = {}
+
+        # Read response from the stream
+        async for response in self.stream_infer(prompts, sampling_parameters):
+            result, error = response
+            if error:
+                print(f"Encountered error while processing: {error}")
+            else:
+                output = result.as_numpy("text_output")
+                for i in output:
+                    self._results_dict[result.get_response().id].append(i)
+
+    async def run(self):
+        sampling_parameters = {"temperature": "0.1", "top_p": "0.95"}
+        with open(self._flags.input_prompts, "r") as file:
+            print(f"Loading inputs from `{self._flags.input_prompts}`...")
+            prompts = file.readlines()
+
+        await self.process_stream(prompts, sampling_parameters)
+
+        with open(self._flags.results_file, "w") as file:
+            for id in self._results_dict.keys():
+                for result in self._results_dict[id]:
+                    file.write(result.decode("utf-8"))
+                    file.write("\n")
+                file.write("\n=========\n\n")
+            print(f"Storing results into `{self._flags.results_file}`...")
+
+        if self._flags.verbose:
+            with open(self._flags.results_file, "r") as file:
+                print(f"\nContents of `{self._flags.results_file}` ===>")
+                print(file.read())
+
+        print("PASS: vLLM example")
+
+    def run_async(self):
+        self._loop.run_until_complete(self.run())
+
+    def create_request(
+        self,
+        prompt,
+        stream,
+        request_id,
+        sampling_parameters,
+        send_parameters_as_tensor=True,
+    ):
+        inputs = []
+        prompt_data = np.array([prompt.encode("utf-8")], dtype=np.object_)
+        try:
+            inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
+            inputs[-1].set_data_from_numpy(prompt_data)
+        except Exception as error:
+            print(f"Encountered an error during request creation: {error}")
+
+        stream_data = np.array([stream], dtype=bool)
+        inputs.append(grpcclient.InferInput("stream", [1], "BOOL"))
+        inputs[-1].set_data_from_numpy(stream_data)
+
+        # Request parameters are not yet supported via BLS. Provide an
+        # optional mechanism to send serialized parameters as an input
+        # tensor until support is added.
+        if send_parameters_as_tensor:
+            sampling_parameters_data = np.array(
+                [json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_
+            )
+            inputs.append(grpcclient.InferInput("sampling_parameters", [1], "BYTES"))
+            inputs[-1].set_data_from_numpy(sampling_parameters_data)
+
+        # Add requested outputs
+        outputs = []
+        outputs.append(grpcclient.InferRequestedOutput("text_output"))
+
+        # Return the request arguments; stream_infer issues the
+        # asynchronous streaming inference with them.
+        return {
+            "model_name": self._flags.model,
+            "inputs": inputs,
+            "outputs": outputs,
+            "request_id": str(request_id),
+            "parameters": sampling_parameters,
+        }
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-m",
+        "--model",
+        type=str,
+        required=False,
+        default="vllm_model",
+        help="Model name",
+    )
+    parser.add_argument(
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
+        type=str,
+        required=False,
+        default="localhost:8001",
+        help="Inference server URL and its gRPC port. Default is localhost:8001.",
+    )
+    parser.add_argument(
+        "-t",
+        "--stream-timeout",
+        type=float,
+        required=False,
+        default=None,
+        help="Stream timeout in seconds. Default is None.",
+    )
+    parser.add_argument(
+        "--offset",
+        type=int,
+        required=False,
+        default=0,
+        help="Offset to add to the request IDs used",
+    )
+    parser.add_argument(
+        "--input-prompts",
+        type=str,
+        required=False,
+        default="prompts.txt",
+        help="Text file with input prompts",
+    )
+    parser.add_argument(
+        "--results-file",
+        type=str,
+        required=False,
+        default="results.txt",
+        help="The file to write output results to",
+    )
+    parser.add_argument(
+        "--iterations",
+        type=int,
+        required=False,
+        default=1,
+        help="Number of iterations through the prompts file",
+    )
+    parser.add_argument(
+        "-s",
+        "--streaming-mode",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable streaming mode",
+    )
+    FLAGS = parser.parse_args()
+
+    client = LLMClient(FLAGS)
+    client.run_async()
diff --git a/samples/model_repository/vllm_model/1/model.json b/samples/model_repository/vllm_model/1/model.json
new file mode 100644
index 00000000..e610c3cb
--- /dev/null
+++ b/samples/model_repository/vllm_model/1/model.json
@@ -0,0 +1,5 @@
+{
+    "model":"facebook/opt-125m",
+    "disable_log_requests": "true",
+    "gpu_memory_utilization": 0.5
+}
diff --git a/samples/model_repository/vllm_model/config.pbtxt b/samples/model_repository/vllm_model/config.pbtxt
new file mode 100644
index 00000000..169f3815
--- /dev/null
+++ b/samples/model_repository/vllm_model/config.pbtxt
@@ -0,0 +1,77 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+# Note: You do not need to change any fields in this configuration.
+
+backend: "vllm"
+
+# Disable batching in Triton; let vLLM handle batching on its own.
+max_batch_size: 0
+
+# We need to use the decoupled transaction policy to saturate the
+# vLLM engine for maximum throughput.
+# TODO [DLIS:5233]: Allow asynchronous execution to lift this
+# restriction for cases where there is exactly a single response to
+# a single request.
+model_transaction_policy {
+  decoupled: True
+}
+# Note: The vLLM backend uses the following input and output names.
+# Any change here needs to also be made in model.py
+input [
+  {
+    name: "text_input"
+    data_type: TYPE_STRING
+    dims: [ 1 ]
+  },
+  {
+    name: "stream"
+    data_type: TYPE_BOOL
+    dims: [ 1 ]
+  },
+  {
+    name: "sampling_parameters"
+    data_type: TYPE_STRING
+    dims: [ 1 ]
+    optional: true
+  }
+]
+
+output [
+  {
+    name: "text_output"
+    data_type: TYPE_STRING
+    dims: [ -1 ]
+  }
+]
+
+# Device selection is deferred to the vLLM engine
+instance_group [
+  {
+    count: 1
+    kind: KIND_MODEL
+  }
+]
diff --git a/samples/prompts.txt b/samples/prompts.txt
new file mode 100644
index 00000000..133800ec
--- /dev/null
+++ b/samples/prompts.txt
@@ -0,0 +1,4 @@
+Hello, my name is
+The most dangerous animal is
+The capital of France is
+The future of AI is
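As a usage sketch (not part of the files added above), assuming Triton is already serving the sample model repository and the command is run from the repository root, the sample client can consume `samples/prompts.txt` using the flags defined in its argument parser; the gRPC endpoint defaults to `localhost:8001`:

```
$ python3 samples/client.py \
      --model vllm_model \
      --input-prompts samples/prompts.txt \
      --results-file results.txt \
      --iterations 1 \
      --streaming-mode
```

The generated text for each prompt is written to the file given by `--results-file`; drop `--streaming-mode` to receive each response as a single, non-streamed result.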