From 1688a33bf9e377f3934fa714e957935a2202142f Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Mon, 9 Oct 2023 20:34:22 -0700 Subject: [PATCH 01/49] Draft README and samples --- README.md | 93 ++++++++ samples/client.py | 202 ++++++++++++++++++ .../model_repository/vllm_opt/1/model.json | 5 + .../model_repository/vllm_opt/config.pbtxt | 75 +++++++ samples/prompts.txt | 4 + 5 files changed, 379 insertions(+) create mode 100644 README.md create mode 100644 samples/client.py create mode 100644 samples/model_repository/vllm_opt/1/model.json create mode 100644 samples/model_repository/vllm_opt/config.pbtxt create mode 100644 samples/prompts.txt diff --git a/README.md b/README.md new file mode 100644 index 00000000..6dc41495 --- /dev/null +++ b/README.md @@ -0,0 +1,93 @@ + + +[![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause) + +# vLLM Backend + +The Triton backend for [vLLM](https://github.com/vllm-project/vllm). +You can learn more about Triton backends in the [backend +repo](https://github.com/triton-inference-server/backend). Ask +questions or report problems on the [issues +page](https://github.com/triton-inference-server/server/issues). +This backend is designed to run vLLM's +[supported HuggingFace models](https://vllm.readthedocs.io/en/latest/models/supported_models.html). + +Where can I ask general questions about Triton and Triton backends? +Be sure to read all the information below as well as the [general +Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server) +available in the main [server](https://github.com/triton-inference-server/server) +repo. If you don't find your answer there you can ask questions on the +main Triton [issues page](https://github.com/triton-inference-server/server/issues). + +## Build the vLLM Backend + +As a Python-based backend, your Triton server just needs to have the (Python backend)[https://github.com/triton-inference-server/python_backend] +built under `/opt/tritonserver/backends/python`. After that, you can save this in the backends folder as `/opt/tritonserver/backends/vllm`. The `model.py` file in the `src` directory should be in the vllm folder and will function as your Python-based backend. + +In other words, there are no build steps. You only need to copy this to your Triton backends repository. If you use the official Triton vLLM container, this is already set up for you. + +The backend repository should look like this: +``` +/opt/tritonserver/backends/ +`-- vllm + |-- model.py + -- python + |-- libtriton_python.so + |-- triton_python_backend_stub + |-- triton_python_backend_utils.py +``` + + +## Using the vLLM Backend + +You can see an example model_repository in the `samples` folder. +You can use this as is and change the model by changing the `model` value in `model.json`. +You can change the GPU utilization and logging in that file as well. + +In the `samples` folder, you can also find a sample client, `client.py`. +This client is meant to function similarly to the Triton +(vLLM example)[https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM]. +By default, this will test `prompts.txt`, which we have included in the samples folder. + + +## Important Notes + +* At present, Triton only supports one Python-based backend per server. If you try to start multiple vLLM models, you will get an error. 
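For reference, a minimal end-to-end run of the single sample model described above might look like the sketch below. The container image tag, the mount path, and the assumption that vLLM and the gRPC client libraries are already installed are placeholders rather than tested values; adjust them to your environment.

```
# Launch Triton with the sample model repository (image tag and paths are illustrative)
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v ${PWD}/samples/model_repository:/models \
    nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
    tritonserver --model-repository=/models

# From a second shell, stream the prompts in prompts.txt to the model over gRPC
# (the client needs: pip install tritonclient[grpc] numpy)
cd samples
python3 client.py --streaming-mode
```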
+ +### Running Multiple Instances of Triton Server + +Python-based backends use shared memory to transfer requests to the stub process. When running multiple instances of Triton Server on the same machine that use Python-based backend models, there would be shared memory region name conflicts that can result in segmentation faults or hangs. In order to avoid this issue, you need to specify different shm-region-prefix-name using the --backend-config flag. +``` +# Triton instance 1 +tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix1 + +# Triton instance 2 +tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix2 +``` +Note that the hangs would only occur if the /dev/shm is shared between the two instances of the server. If you run the servers in different containers that don't share this location, you don't need to specify shm-region-prefix-name. \ No newline at end of file diff --git a/samples/client.py b/samples/client.py new file mode 100644 index 00000000..d53968d4 --- /dev/null +++ b/samples/client.py @@ -0,0 +1,202 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import argparse +import asyncio +import queue +import sys +from os import system +import json + +import numpy as np +import tritonclient.grpc.aio as grpcclient +from tritonclient.utils import * + + + +def create_request(prompt, stream, request_id, sampling_parameters, model_name, send_parameters_as_tensor=True): + inputs = [] + prompt_data = np.array([prompt.encode("utf-8")], dtype=np.object_) + try: + inputs.append(grpcclient.InferInput("PROMPT", [1], "BYTES")) + inputs[-1].set_data_from_numpy(prompt_data) + except Exception as e: + print(f"Encountered an error {e}") + + stream_data = np.array([stream], dtype=bool) + inputs.append(grpcclient.InferInput("STREAM", [1], "BOOL")) + inputs[-1].set_data_from_numpy(stream_data) + + # Request parameters are not yet supported via BLS. 
Provide an + # optional mechanism to send serialized parameters as an input + # tensor until support is added + + if send_parameters_as_tensor: + sampling_parameters_data = np.array( + [json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_ + ) + inputs.append(grpcclient.InferInput("SAMPLING_PARAMETERS", [1], "BYTES")) + inputs[-1].set_data_from_numpy(sampling_parameters_data) + + # Add requested outputs + outputs = [] + outputs.append(grpcclient.InferRequestedOutput("TEXT")) + + # Issue the asynchronous sequence inference. + return { + "model_name": model_name, + "inputs": inputs, + "outputs": outputs, + "request_id": str(request_id), + "parameters": sampling_parameters + } + + +async def main(FLAGS): + model_name = "vllm_opt" + sampling_parameters = {"temperature": "0.1", "top_p": "0.95"} + stream = FLAGS.streaming_mode + with open(FLAGS.input_prompts, "r") as file: + print(f"Loading inputs from `{FLAGS.input_prompts}`...") + prompts = file.readlines() + + results_dict = {} + + async with grpcclient.InferenceServerClient( + url=FLAGS.url, verbose=FLAGS.verbose + ) as triton_client: + # Request iterator that yields the next request + async def async_request_iterator(): + try: + for iter in range(FLAGS.iterations): + for i, prompt in enumerate(prompts): + prompt_id = FLAGS.offset + (len(prompts) * iter) + i + results_dict[str(prompt_id)] = [] + yield create_request( + prompt, stream, prompt_id, sampling_parameters, model_name + ) + except Exception as error: + print(f"caught error in request iterator: {error}") + + try: + # Start streaming + response_iterator = triton_client.stream_infer( + inputs_iterator=async_request_iterator(), + stream_timeout=FLAGS.stream_timeout, + ) + # Read response from the stream + async for response in response_iterator: + result, error = response + if error: + print(f"Encountered error while processing: {error}") + else: + output = result.as_numpy("TEXT") + for i in output: + results_dict[result.get_response().id].append(i) + + except InferenceServerException as error: + print(error) + sys.exit(1) + + with open(FLAGS.results_file, "w") as file: + for id in results_dict.keys(): + for result in results_dict[id]: + file.write(result.decode("utf-8")) + file.write("\n") + file.write("\n=========\n\n") + print(f"Storing results into `{FLAGS.results_file}`...") + + if FLAGS.verbose: + print(f"\nContents of `{FLAGS.results_file}` ===>") + system(f"cat {FLAGS.results_file}") + + print("PASS: vLLM example") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-u", + "--url", + type=str, + required=False, + default="localhost:8001", + help="Inference server URL and it gRPC port. Default is localhost:8001.", + ) + parser.add_argument( + "-t", + "--stream-timeout", + type=float, + required=False, + default=None, + help="Stream timeout in seconds. 
Default is None.", + ) + parser.add_argument( + "--offset", + type=int, + required=False, + default=0, + help="Add offset to request IDs used", + ) + parser.add_argument( + "--input-prompts", + type=str, + required=False, + default="prompts.txt", + help="Text file with input prompts", + ) + parser.add_argument( + "--results-file", + type=str, + required=False, + default="results.txt", + help="The file with output results", + ) + parser.add_argument( + "--iterations", + type=int, + required=False, + default=1, + help="Number of iterations through the prompts file", + ) + parser.add_argument( + "-s", + "--streaming-mode", + action="store_true", + required=False, + default=False, + help="Enable streaming mode", + ) + FLAGS = parser.parse_args() + asyncio.run(main(FLAGS)) diff --git a/samples/model_repository/vllm_opt/1/model.json b/samples/model_repository/vllm_opt/1/model.json new file mode 100644 index 00000000..e610c3cb --- /dev/null +++ b/samples/model_repository/vllm_opt/1/model.json @@ -0,0 +1,5 @@ +{ + "model":"facebook/opt-125m", + "disable_log_requests": "true", + "gpu_memory_utilization": 0.5 +} diff --git a/samples/model_repository/vllm_opt/config.pbtxt b/samples/model_repository/vllm_opt/config.pbtxt new file mode 100644 index 00000000..83b5ed70 --- /dev/null +++ b/samples/model_repository/vllm_opt/config.pbtxt @@ -0,0 +1,75 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "vllm_opt" +backend: "vllm" + +# Disabling batching in Triton, let vLLM handle the batching on its own. +max_batch_size: 0 + +# We need to use decoupled transaction policy for saturating +# vLLM engine for max throughtput. +# TODO [DLIS:5233]: Allow asychronous execution to lift this +# restriction for cases there is exactly a single response to +# a single request. 
+model_transaction_policy { + decoupled: True +} + +input [ + { + name: "PROMPT" + data_type: TYPE_STRING + dims: [ 1 ] + }, + { + name: "STREAM" + data_type: TYPE_BOOL + dims: [ 1 ] + }, + { + name: "SAMPLING_PARAMETERS" + data_type: TYPE_STRING + dims: [ 1 ] + optional: true + } +] + +output [ + { + name: "TEXT" + data_type: TYPE_STRING + dims: [ -1 ] + } +] + +# The usage of device is deferred to the vLLM engine +instance_group [ + { + count: 1 + kind: KIND_MODEL + } +] diff --git a/samples/prompts.txt b/samples/prompts.txt new file mode 100644 index 00000000..133800ec --- /dev/null +++ b/samples/prompts.txt @@ -0,0 +1,4 @@ +Hello, my name is +The most dangerous animal is +The capital of France is +The future of AI is From 0ba6200a3eafa50e2578619070e5d309f35087d2 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Mon, 9 Oct 2023 20:39:05 -0700 Subject: [PATCH 02/49] Run pre-commit --- README.md | 2 +- samples/client.py | 16 +++++++++++----- samples/model_repository/vllm_opt/config.pbtxt | 2 +- 3 files changed, 13 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 6dc41495..01eaa09f 100644 --- a/README.md +++ b/README.md @@ -48,7 +48,7 @@ main Triton [issues page](https://github.com/triton-inference-server/server/issu ## Build the vLLM Backend As a Python-based backend, your Triton server just needs to have the (Python backend)[https://github.com/triton-inference-server/python_backend] -built under `/opt/tritonserver/backends/python`. After that, you can save this in the backends folder as `/opt/tritonserver/backends/vllm`. The `model.py` file in the `src` directory should be in the vllm folder and will function as your Python-based backend. +built under `/opt/tritonserver/backends/python`. After that, you can save this in the backends folder as `/opt/tritonserver/backends/vllm`. The `model.py` file in the `src` directory should be in the vllm folder and will function as your Python-based backend. In other words, there are no build steps. You only need to copy this to your Triton backends repository. If you use the official Triton vLLM container, this is already set up for you. diff --git a/samples/client.py b/samples/client.py index d53968d4..dc9f16d1 100644 --- a/samples/client.py +++ b/samples/client.py @@ -26,18 +26,24 @@ import argparse import asyncio +import json import queue import sys from os import system -import json import numpy as np import tritonclient.grpc.aio as grpcclient from tritonclient.utils import * - -def create_request(prompt, stream, request_id, sampling_parameters, model_name, send_parameters_as_tensor=True): +def create_request( + prompt, + stream, + request_id, + sampling_parameters, + model_name, + send_parameters_as_tensor=True, +): inputs = [] prompt_data = np.array([prompt.encode("utf-8")], dtype=np.object_) try: @@ -53,7 +59,7 @@ def create_request(prompt, stream, request_id, sampling_parameters, model_name, # Request parameters are not yet supported via BLS. 
Provide an # optional mechanism to send serialized parameters as an input # tensor until support is added - + if send_parameters_as_tensor: sampling_parameters_data = np.array( [json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_ @@ -71,7 +77,7 @@ def create_request(prompt, stream, request_id, sampling_parameters, model_name, "inputs": inputs, "outputs": outputs, "request_id": str(request_id), - "parameters": sampling_parameters + "parameters": sampling_parameters, } diff --git a/samples/model_repository/vllm_opt/config.pbtxt b/samples/model_repository/vllm_opt/config.pbtxt index 83b5ed70..d1d21b72 100644 --- a/samples/model_repository/vllm_opt/config.pbtxt +++ b/samples/model_repository/vllm_opt/config.pbtxt @@ -32,7 +32,7 @@ max_batch_size: 0 # We need to use decoupled transaction policy for saturating # vLLM engine for max throughtput. -# TODO [DLIS:5233]: Allow asychronous execution to lift this +# TODO [DLIS:5233]: Allow asynchronous execution to lift this # restriction for cases there is exactly a single response to # a single request. model_transaction_policy { From a4921c11da75bbd47c2eae4e4b46d0aa90b36adf Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Mon, 9 Oct 2023 20:39:52 -0700 Subject: [PATCH 03/49] Remove unused queue. --- samples/client.py | 1 - 1 file changed, 1 deletion(-) diff --git a/samples/client.py b/samples/client.py index dc9f16d1..bd93bfe2 100644 --- a/samples/client.py +++ b/samples/client.py @@ -27,7 +27,6 @@ import argparse import asyncio import json -import queue import sys from os import system From 92124bff3081b7320c30353bc8ebd23f291fb57d Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Mon, 9 Oct 2023 20:44:22 -0700 Subject: [PATCH 04/49] Fixes for README --- README.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 01eaa09f..186de7c6 100644 --- a/README.md +++ b/README.md @@ -35,8 +35,10 @@ You can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend). Ask questions or report problems on the [issues page](https://github.com/triton-inference-server/server/issues). -This backend is designed to run vLLM's -[supported HuggingFace models](https://vllm.readthedocs.io/en/latest/models/supported_models.html). +This backend is designed to run [vLLM](https://github.com/vllm-project/vllm) +with +[one of the HuggingFace models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) +it supports. Where can I ask general questions about Triton and Triton backends? Be sure to read all the information below as well as the [general @@ -47,8 +49,8 @@ main Triton [issues page](https://github.com/triton-inference-server/server/issu ## Build the vLLM Backend -As a Python-based backend, your Triton server just needs to have the (Python backend)[https://github.com/triton-inference-server/python_backend] -built under `/opt/tritonserver/backends/python`. After that, you can save this in the backends folder as `/opt/tritonserver/backends/vllm`. The `model.py` file in the `src` directory should be in the vllm folder and will function as your Python-based backend. +As a Python-based backend, your Triton server just needs to have the [Python backend](https://github.com/triton-inference-server/python_backend) +located in the backends directory: `/opt/tritonserver/backends/python`. After that, you can save the vLLM backend in the backends folder as `/opt/tritonserver/backends/vllm`. 
The `model.py` file in the `src` directory should be in the vllm folder and will function as your Python-based backend. In other words, there are no build steps. You only need to copy this to your Triton backends repository. If you use the official Triton vLLM container, this is already set up for you. @@ -68,11 +70,11 @@ The backend repository should look like this: You can see an example model_repository in the `samples` folder. You can use this as is and change the model by changing the `model` value in `model.json`. -You can change the GPU utilization and logging in that file as well. +You can change the GPU utilization and logging parameters in that file as well. In the `samples` folder, you can also find a sample client, `client.py`. This client is meant to function similarly to the Triton -(vLLM example)[https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM]. +[vLLM example](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM). By default, this will test `prompts.txt`, which we have included in the samples folder. @@ -90,4 +92,4 @@ tritonserver --model-repository=/models --backend-config=python,shm-region-prefi # Triton instance 2 tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix2 ``` -Note that the hangs would only occur if the /dev/shm is shared between the two instances of the server. If you run the servers in different containers that don't share this location, you don't need to specify shm-region-prefix-name. \ No newline at end of file +Note that the hangs would only occur if the /dev/shm is shared between the two instances of the server. If you run the servers in different containers that do not share this location, you do not need to specify shm-region-prefix-name. \ No newline at end of file From aa8a1051a31de57f7677c1d7583b4b10aa047424 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Mon, 9 Oct 2023 20:48:11 -0700 Subject: [PATCH 05/49] Add client.py shebang --- samples/client.py | 2 ++ 1 file changed, 2 insertions(+) mode change 100644 => 100755 samples/client.py diff --git a/samples/client.py b/samples/client.py old mode 100644 new mode 100755 index bd93bfe2..394f4248 --- a/samples/client.py +++ b/samples/client.py @@ -1,3 +1,5 @@ +#!/usr/bin/env python3 + # Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without From ed108d0ae27f7f28b48883a3a5cc51e666e73a4c Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Tue, 10 Oct 2023 07:28:31 -0700 Subject: [PATCH 06/49] Add Conda instructions. --- README.md | 11 +++- samples/conda/README.md | 72 ++++++++++++++++++++++ samples/conda/gen_vllm_env.ssh | 105 +++++++++++++++++++++++++++++++++ 3 files changed, 187 insertions(+), 1 deletion(-) create mode 100644 samples/conda/README.md create mode 100755 samples/conda/gen_vllm_env.ssh diff --git a/README.md b/README.md index 186de7c6..e4b7f1d5 100644 --- a/README.md +++ b/README.md @@ -65,7 +65,6 @@ The backend repository should look like this: |-- triton_python_backend_utils.py ``` - ## Using the vLLM Backend You can see an example model_repository in the `samples` folder. @@ -78,6 +77,16 @@ This client is meant to function similarly to the Triton By default, this will test `prompts.txt`, which we have included in the samples folder. +## Running the Latest vLLM Version + +By default, the vLLM backend uses the version of vLLM that is available via Pip. 
+These are compatible with the newer versions of CUDA running in Triton. +If you would like to use a specific vLLM commit or the latest version of vLLM, you +will need to use a +[custom execution environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments). +Please see the +[conda](samples/conda) subdirectory of the `samples` folder for information on how to do so. + ## Important Notes * At present, Triton only supports one Python-based backend per server. If you try to start multiple vLLM models, you will get an error. diff --git a/samples/conda/README.md b/samples/conda/README.md new file mode 100644 index 00000000..8bd8ee5e --- /dev/null +++ b/samples/conda/README.md @@ -0,0 +1,72 @@ + + +If you would like to run conda with the latest version of vLLM, you will need to create a +a [custom execution environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments). +This is because vLLM currently does not support the latest versions of CUDA in the Triton environment. +Instructions for creating a custom execution environment with the latest vLLM version are below. + +## Step 1: Build a Custom Execution Environment With vLLM and Other Dependencies + +The provided script should build the package environment +for you which will be used to load the model in Triton. + +Run the following command from this directory. You can use any version of Triton. +``` +docker run --gpus all -it --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 --shm-size=8G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:23.09-py3 bash +./gen_vllm_env.sh +``` + +This step might take a while to build the environment packages. Once complete, the current folder will be populated with +`triton_python_backend_stub` and `vllm_env`. + +## Step 2: Update Your Model Repository + +You want to place the stub and environment in your model directory. +The model directory should look something like this: +``` +model_repository/ +`-- vllm_model + |-- 1 + | `-- model.json + |-- config.pbtxt + |-- triton_python_backend_stub + `-- vllm_env +``` + +You also want to add this section to the config.pbtxt of your model: +``` +parameters: { + key: "EXECUTION_ENV_PATH", + value: {string_value: "$$TRITON_MODEL_DIRECTORY/vllm_env"} +} +``` + +## Step 3: Run Your Model + +You can now start Triton server with your model! \ No newline at end of file diff --git a/samples/conda/gen_vllm_env.ssh b/samples/conda/gen_vllm_env.ssh new file mode 100755 index 00000000..afa99e87 --- /dev/null +++ b/samples/conda/gen_vllm_env.ssh @@ -0,0 +1,105 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# +# This script creates a conda environment for Triton with vllm +# dependencies. +# + +# Pick the release tag from the container environment variable +RELEASE_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" + +# Save target directories for conda environment and Python backend stubs +ENV_DIR="./model_repository/vllm/vllm_env/" +STUB_FILE="./model_repository/vllm/triton_python_backend_stub" + +# If targets already exist, print a message and exit. +if [ -d "$ENV_DIR" ] && [ -f "$STUB_FILE" ]; then + echo "The conda environment directory and Python backend stubs already exist." + echo "Exiting environment set-up." + exit 0 +fi + +# If this script runs, clean up previous targets. +rm -rf $ENV_DIR $STUB_FILE + +# Install and setup conda environment +FILE_NAME="Miniconda3-latest-Linux-x86_64.sh" +rm -rf ./miniconda $FILE_NAME +wget https://repo.anaconda.com/miniconda/$FILE_NAME + +# Install miniconda in silent mode +bash $FILE_NAME -p ./miniconda -b + +# Activate conda +eval "$(./miniconda/bin/conda shell.bash hook)" + +# Installing cmake and dependencies +apt update && apt install software-properties-common rapidjson-dev libarchive-dev zlib1g-dev -y +# Using CMAKE installation instruction from:: https://apt.kitware.com/ +apt install -y gpg wget && \ + wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ + gpg --dearmor - | \ + tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null && \ + . 
/etc/os-release && \ + echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | \ + tee /etc/apt/sources.list.d/kitware.list >/dev/null && \ + apt-get update && \ + apt-get install -y --no-install-recommends cmake cmake-data + +conda create -n vllm_env python=3.10 -y +conda activate vllm_env +export PYTHONNOUSERSITE=True +conda install -c conda-forge libstdcxx-ng=12 -y +conda install -c conda-forge conda-pack -y + +# vLLM needs cuda 11.8 to run properly +conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit -y + +pip install numpy +pip install git+https://github.com/huggingface/transformers.git +pip install git+https://github.com/vllm-project/vllm.git + + +rm -rf python_backend +git clone https://github.com/triton-inference-server/python_backend -b $RELEASE_TAG +(cd python_backend/ && mkdir builddir && cd builddir && \ +cmake -DTRITON_ENABLE_GPU=ON -DTRITON_BACKEND_REPO_TAG=$RELEASE_TAG -DTRITON_COMMON_REPO_TAG=$RELEASE_TAG -DTRITON_CORE_REPO_TAG=$RELEASE_TAG ../ && \ +make -j18 triton-python-backend-stub) + +mv python_backend/builddir/triton_python_backend_stub ./model_repository/vllm/ + +# Prepare and copy the conda environment +cp -r $CONDA_PREFIX/lib/python3.10/site-packages/conda_pack/scripts/posix/activate $CONDA_PREFIX/bin/ +rm -r $CONDA_PREFIX/nsight* +cp -r $CONDA_PREFIX ./model_repository/vllm/ + +conda deactivate + +# Clean-up +rm -rf ./miniconda $FILE_NAME +rm -rf python_backend From c5213f6d914272eabff1f8c6bea0fc498e30d985 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Tue, 10 Oct 2023 07:31:55 -0700 Subject: [PATCH 07/49] Spacing, title --- README.md | 1 - samples/conda/README.md | 2 ++ 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index e4b7f1d5..7e80f783 100644 --- a/README.md +++ b/README.md @@ -76,7 +76,6 @@ This client is meant to function similarly to the Triton [vLLM example](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM). By default, this will test `prompts.txt`, which we have included in the samples folder. - ## Running the Latest vLLM Version By default, the vLLM backend uses the version of vLLM that is available via Pip. diff --git a/samples/conda/README.md b/samples/conda/README.md index 8bd8ee5e..6bbdaf55 100644 --- a/samples/conda/README.md +++ b/samples/conda/README.md @@ -26,6 +26,8 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. --> +# How to Create a Custom Conda Environment + If you would like to run conda with the latest version of vLLM, you will need to create a a [custom execution environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments). This is because vLLM currently does not support the latest versions of CUDA in the Triton environment. 
From 2c6881c2f328df6e898ffcecae232d081eb5cb87 Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Tue, 10 Oct 2023 13:49:49 -0700 Subject: [PATCH 08/49] Switch i/o to lowercase Co-authored-by: Neelay Shah --- samples/model_repository/vllm_opt/config.pbtxt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/samples/model_repository/vllm_opt/config.pbtxt b/samples/model_repository/vllm_opt/config.pbtxt index d1d21b72..5c368c0d 100644 --- a/samples/model_repository/vllm_opt/config.pbtxt +++ b/samples/model_repository/vllm_opt/config.pbtxt @@ -60,7 +60,7 @@ input [ output [ { - name: "TEXT" + name: "text_output" data_type: TYPE_STRING dims: [ -1 ] } From ac3340723d196b090706b07f64dc2c64d0b9d6db Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Tue, 10 Oct 2023 13:49:56 -0700 Subject: [PATCH 09/49] Switch i/o to lowercase Co-authored-by: Neelay Shah --- samples/model_repository/vllm_opt/config.pbtxt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/samples/model_repository/vllm_opt/config.pbtxt b/samples/model_repository/vllm_opt/config.pbtxt index 5c368c0d..646c3bb4 100644 --- a/samples/model_repository/vllm_opt/config.pbtxt +++ b/samples/model_repository/vllm_opt/config.pbtxt @@ -46,7 +46,7 @@ input [ dims: [ 1 ] }, { - name: "STREAM" + name: "stream" data_type: TYPE_BOOL dims: [ 1 ] }, From d2fdb3f5d935587b93ca532baac59cde9b1c6742 Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Tue, 10 Oct 2023 13:50:03 -0700 Subject: [PATCH 10/49] Switch i/o to lowercase Co-authored-by: Neelay Shah --- samples/model_repository/vllm_opt/config.pbtxt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/samples/model_repository/vllm_opt/config.pbtxt b/samples/model_repository/vllm_opt/config.pbtxt index 646c3bb4..b516b43a 100644 --- a/samples/model_repository/vllm_opt/config.pbtxt +++ b/samples/model_repository/vllm_opt/config.pbtxt @@ -41,7 +41,7 @@ model_transaction_policy { input [ { - name: "PROMPT" + name: "text_input" data_type: TYPE_STRING dims: [ 1 ] }, From 02c116772ca94027d5fa29954e4c6cfee7dae3ef Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Tue, 10 Oct 2023 13:50:12 -0700 Subject: [PATCH 11/49] Switch i/o to lowercase Co-authored-by: Neelay Shah --- samples/model_repository/vllm_opt/config.pbtxt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/samples/model_repository/vllm_opt/config.pbtxt b/samples/model_repository/vllm_opt/config.pbtxt index b516b43a..83e8091b 100644 --- a/samples/model_repository/vllm_opt/config.pbtxt +++ b/samples/model_repository/vllm_opt/config.pbtxt @@ -51,7 +51,7 @@ input [ dims: [ 1 ] }, { - name: "SAMPLING_PARAMETERS" + name: "sampling_parameters" data_type: TYPE_STRING dims: [ 1 ] optional: true From d164dab138fad68d37a32b05d1970f48cb3dc480 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Tue, 10 Oct 2023 13:53:50 -0700 Subject: [PATCH 12/49] Change client code to use lowercase inputs/outputs --- samples/client.py | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/samples/client.py b/samples/client.py index 394f4248..d29ad1de 100755 --- a/samples/client.py +++ b/samples/client.py @@ -48,13 +48,13 @@ def create_request( inputs = [] prompt_data = np.array([prompt.encode("utf-8")], dtype=np.object_) try: - inputs.append(grpcclient.InferInput("PROMPT", [1], "BYTES")) + 
inputs.append(grpcclient.InferInput("text_input", [1], "BYTES")) inputs[-1].set_data_from_numpy(prompt_data) except Exception as e: print(f"Encountered an error {e}") stream_data = np.array([stream], dtype=bool) - inputs.append(grpcclient.InferInput("STREAM", [1], "BOOL")) + inputs.append(grpcclient.InferInput("stream", [1], "BOOL")) inputs[-1].set_data_from_numpy(stream_data) # Request parameters are not yet supported via BLS. Provide an @@ -65,12 +65,12 @@ def create_request( sampling_parameters_data = np.array( [json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_ ) - inputs.append(grpcclient.InferInput("SAMPLING_PARAMETERS", [1], "BYTES")) + inputs.append(grpcclient.InferInput("sampling_parameters", [1], "BYTES")) inputs[-1].set_data_from_numpy(sampling_parameters_data) # Add requested outputs outputs = [] - outputs.append(grpcclient.InferRequestedOutput("TEXT")) + outputs.append(grpcclient.InferRequestedOutput("text_output")) # Issue the asynchronous sequence inference. return { @@ -120,7 +120,7 @@ async def async_request_iterator(): if error: print(f"Encountered error while processing: {error}") else: - output = result.as_numpy("TEXT") + output = result.as_numpy("text_output") for i in output: results_dict[result.get_response().id].append(i) From 45a531fc0efab13fc28e560ec19bd50e02c62bdd Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Tue, 10 Oct 2023 19:32:14 -0700 Subject: [PATCH 13/49] Update client to use iterable client class --- samples/client.py | 218 ++++++++++++++++++++++++++-------------------- 1 file changed, 124 insertions(+), 94 deletions(-) diff --git a/samples/client.py b/samples/client.py index d29ad1de..93786a07 100755 --- a/samples/client.py +++ b/samples/client.py @@ -29,6 +29,7 @@ import argparse import asyncio import json +import queue import sys from os import system @@ -37,114 +38,141 @@ from tritonclient.utils import * -def create_request( - prompt, - stream, - request_id, - sampling_parameters, - model_name, - send_parameters_as_tensor=True, -): - inputs = [] - prompt_data = np.array([prompt.encode("utf-8")], dtype=np.object_) - try: - inputs.append(grpcclient.InferInput("text_input", [1], "BYTES")) - inputs[-1].set_data_from_numpy(prompt_data) - except Exception as e: - print(f"Encountered an error {e}") - - stream_data = np.array([stream], dtype=bool) - inputs.append(grpcclient.InferInput("stream", [1], "BOOL")) - inputs[-1].set_data_from_numpy(stream_data) - - # Request parameters are not yet supported via BLS. Provide an - # optional mechanism to send serialized parameters as an input - # tensor until support is added - - if send_parameters_as_tensor: - sampling_parameters_data = np.array( - [json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_ +class LLMClient: + def __init__(self, flags: argparse.Namespace): + self._client = grpcclient.InferenceServerClient( + url=flags.url, verbose=flags.verbose ) - inputs.append(grpcclient.InferInput("sampling_parameters", [1], "BYTES")) - inputs[-1].set_data_from_numpy(sampling_parameters_data) - - # Add requested outputs - outputs = [] - outputs.append(grpcclient.InferRequestedOutput("text_output")) - - # Issue the asynchronous sequence inference. 
- return { - "model_name": model_name, - "inputs": inputs, - "outputs": outputs, - "request_id": str(request_id), - "parameters": sampling_parameters, - } - - -async def main(FLAGS): - model_name = "vllm_opt" - sampling_parameters = {"temperature": "0.1", "top_p": "0.95"} - stream = FLAGS.streaming_mode - with open(FLAGS.input_prompts, "r") as file: - print(f"Loading inputs from `{FLAGS.input_prompts}`...") - prompts = file.readlines() - - results_dict = {} - - async with grpcclient.InferenceServerClient( - url=FLAGS.url, verbose=FLAGS.verbose - ) as triton_client: - # Request iterator that yields the next request - async def async_request_iterator(): - try: - for iter in range(FLAGS.iterations): - for i, prompt in enumerate(prompts): - prompt_id = FLAGS.offset + (len(prompts) * iter) + i - results_dict[str(prompt_id)] = [] - yield create_request( - prompt, stream, prompt_id, sampling_parameters, model_name - ) - except Exception as error: - print(f"caught error in request iterator: {error}") + self._flags = flags + self._loop = asyncio.get_event_loop() + self._results_dict = {} + async def async_request_iterator(self, prompts, sampling_parameters): + try: + for iter in range(self._flags.iterations): + for i, prompt in enumerate(prompts): + prompt_id = self._flags.offset + (len(prompts) * iter) + i + self._results_dict[str(prompt_id)] = [] + yield self.create_request( + prompt, + self._flags.streaming_mode, + prompt_id, + sampling_parameters, + ) + except Exception as error: + print(f"Caught an error in the request iterator: {error}") + + async def stream_infer(self, prompts, sampling_parameters): try: # Start streaming - response_iterator = triton_client.stream_infer( - inputs_iterator=async_request_iterator(), - stream_timeout=FLAGS.stream_timeout, + response_iterator = self._client.stream_infer( + inputs_iterator=self.async_request_iterator( + prompts, sampling_parameters + ), + stream_timeout=self._flags.stream_timeout, ) - # Read response from the stream async for response in response_iterator: - result, error = response - if error: - print(f"Encountered error while processing: {error}") - else: - output = result.as_numpy("text_output") - for i in output: - results_dict[result.get_response().id].append(i) - + yield response except InferenceServerException as error: print(error) sys.exit(1) - with open(FLAGS.results_file, "w") as file: - for id in results_dict.keys(): - for result in results_dict[id]: - file.write(result.decode("utf-8")) - file.write("\n") - file.write("\n=========\n\n") - print(f"Storing results into `{FLAGS.results_file}`...") + async def process_stream(self, prompts, sampling_parameters): + # Clear results in between process_stream calls + self.results_dict = [] + + # Read response from the stream + async for response in self.stream_infer(prompts, sampling_parameters): + result, error = response + if error: + print(f"Encountered error while processing: {error}") + else: + output = result.as_numpy("TEXT") + for i in output: + self._results_dict[result.get_response().id].append(i) + + async def run(self): + sampling_parameters = {"temperature": "0.1", "top_p": "0.95"} + stream = self._flags.streaming_mode + with open(self._flags.input_prompts, "r") as file: + print(f"Loading inputs from `{self._flags.input_prompts}`...") + prompts = file.readlines() + + await self.process_stream(prompts, sampling_parameters) + + with open(self._flags.results_file, "w") as file: + for id in self._results_dict.keys(): + for result in self._results_dict[id]: + 
file.write(result.decode("utf-8")) + file.write("\n") + file.write("\n=========\n\n") + print(f"Storing results into `{self._flags.results_file}`...") + + if self._flags.verbose: + with open(self._flags.results_file, "r") as file: + print(f"\nContents of `{self._flags.results_file}` ===>") + print(file.read()) + + print("PASS: vLLM example") + + def run_async(self): + self._loop.run_until_complete(self.run()) + + def create_request( + self, + prompt, + stream, + request_id, + sampling_parameters, + send_parameters_as_tensor=True, + ): + inputs = [] + prompt_data = np.array([prompt.encode("utf-8")], dtype=np.object_) + try: + inputs.append(grpcclient.InferInput("PROMPT", [1], "BYTES")) + inputs[-1].set_data_from_numpy(prompt_data) + except Exception as error: + print(f"Encountered an error during request creation: {error}") + + stream_data = np.array([stream], dtype=bool) + inputs.append(grpcclient.InferInput("STREAM", [1], "BOOL")) + inputs[-1].set_data_from_numpy(stream_data) + + # Request parameters are not yet supported via BLS. Provide an + # optional mechanism to send serialized parameters as an input + # tensor until support is added + + if send_parameters_as_tensor: + sampling_parameters_data = np.array( + [json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_ + ) + inputs.append(grpcclient.InferInput("SAMPLING_PARAMETERS", [1], "BYTES")) + inputs[-1].set_data_from_numpy(sampling_parameters_data) - if FLAGS.verbose: - print(f"\nContents of `{FLAGS.results_file}` ===>") - system(f"cat {FLAGS.results_file}") + # Add requested outputs + outputs = [] + outputs.append(grpcclient.InferRequestedOutput("TEXT")) - print("PASS: vLLM example") + # Issue the asynchronous sequence inference. + return { + "model_name": self._flags.model, + "inputs": inputs, + "outputs": outputs, + "request_id": str(request_id), + "parameters": sampling_parameters, + } if __name__ == "__main__": parser = argparse.ArgumentParser() + parser.add_argument( + "-m", + "--model", + type=str, + required=False, + default="vllm", + help="Model name", + ) parser.add_argument( "-v", "--verbose", @@ -159,7 +187,7 @@ async def async_request_iterator(): type=str, required=False, default="localhost:8001", - help="Inference server URL and it gRPC port. Default is localhost:8001.", + help="Inference server URL and its gRPC port. 
Default is localhost:8001.", ) parser.add_argument( "-t", @@ -206,4 +234,6 @@ async def async_request_iterator(): help="Enable streaming mode", ) FLAGS = parser.parse_args() - asyncio.run(main(FLAGS)) + + client = LLMClient(FLAGS) + client.run_async() From 1e27105fc0cd95fe1a0577e11a1d9e8cf66218f8 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Tue, 10 Oct 2023 19:38:53 -0700 Subject: [PATCH 14/49] Rename vLLM model, add note to config --- .../model_repository/{vllm_opt => vllm_model}/1/model.json | 0 .../model_repository/{vllm_opt => vllm_model}/config.pbtxt | 7 ++++++- 2 files changed, 6 insertions(+), 1 deletion(-) rename samples/model_repository/{vllm_opt => vllm_model}/1/model.json (100%) rename samples/model_repository/{vllm_opt => vllm_model}/config.pbtxt (90%) diff --git a/samples/model_repository/vllm_opt/1/model.json b/samples/model_repository/vllm_model/1/model.json similarity index 100% rename from samples/model_repository/vllm_opt/1/model.json rename to samples/model_repository/vllm_model/1/model.json diff --git a/samples/model_repository/vllm_opt/config.pbtxt b/samples/model_repository/vllm_model/config.pbtxt similarity index 90% rename from samples/model_repository/vllm_opt/config.pbtxt rename to samples/model_repository/vllm_model/config.pbtxt index 83e8091b..b34e45ef 100644 --- a/samples/model_repository/vllm_opt/config.pbtxt +++ b/samples/model_repository/vllm_model/config.pbtxt @@ -24,7 +24,12 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -name: "vllm_opt" +# Note: You do not need to change any fields in this configuration. +# If you are using a custom execution environment, there are +# instructions in the samples/conda README on how to add a parameter +# to use a custom execution environment. + +name: "vllm_model" backend: "vllm" # Disabling batching in Triton, let vLLM handle the batching on its own. From 97417c5873a6d22a4670e5c66d13b09e30ca7a5b Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Tue, 10 Oct 2023 20:12:21 -0700 Subject: [PATCH 15/49] Remove unused imports and vars --- samples/client.py | 3 --- 1 file changed, 3 deletions(-) diff --git a/samples/client.py b/samples/client.py index 93786a07..83f1e49c 100755 --- a/samples/client.py +++ b/samples/client.py @@ -29,9 +29,7 @@ import argparse import asyncio import json -import queue import sys -from os import system import numpy as np import tritonclient.grpc.aio as grpcclient @@ -93,7 +91,6 @@ async def process_stream(self, prompts, sampling_parameters): async def run(self): sampling_parameters = {"temperature": "0.1", "top_p": "0.95"} - stream = self._flags.streaming_mode with open(self._flags.input_prompts, "r") as file: print(f"Loading inputs from `{self._flags.input_prompts}`...") prompts = file.readlines() From d943de28fd0d3cea6bfe31c89b6696b0f834a091 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 05:11:23 -0700 Subject: [PATCH 16/49] Clarify whaat Conda parameter is doing. --- samples/conda/README.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/samples/conda/README.md b/samples/conda/README.md index 6bbdaf55..fb0cf759 100644 --- a/samples/conda/README.md +++ b/samples/conda/README.md @@ -61,7 +61,10 @@ model_repository/ `-- vllm_env ``` -You also want to add this section to the config.pbtxt of your model: +You also want to add this section to the config.pbtxt of your model. 
+This will direct Triton to look for a custom execution environment in +the vllm_env subdirectory of your model's directory. + ``` parameters: { key: "EXECUTION_ENV_PATH", From 99943cc14d1b832180c324787c01b067a1c2574c Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Wed, 11 Oct 2023 05:30:21 -0700 Subject: [PATCH 17/49] Add clarifying note to model config Co-authored-by: Neelay Shah --- samples/model_repository/vllm_model/config.pbtxt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/samples/model_repository/vllm_model/config.pbtxt b/samples/model_repository/vllm_model/config.pbtxt index b34e45ef..8ce25120 100644 --- a/samples/model_repository/vllm_model/config.pbtxt +++ b/samples/model_repository/vllm_model/config.pbtxt @@ -43,7 +43,8 @@ max_batch_size: 0 model_transaction_policy { decoupled: True } - +# Note: The vLLM backend uses the following input and output names. +# Any change here needs to also be made in model.py input [ { name: "text_input" From b08f426fa46920ca418000317700999a1777ee38 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 07:19:54 -0700 Subject: [PATCH 18/49] Run pre-commit --- samples/model_repository/vllm_model/config.pbtxt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/samples/model_repository/vllm_model/config.pbtxt b/samples/model_repository/vllm_model/config.pbtxt index 8ce25120..f862dfba 100644 --- a/samples/model_repository/vllm_model/config.pbtxt +++ b/samples/model_repository/vllm_model/config.pbtxt @@ -43,7 +43,7 @@ max_batch_size: 0 model_transaction_policy { decoupled: True } -# Note: The vLLM backend uses the following input and output names. +# Note: The vLLM backend uses the following input and output names. # Any change here needs to also be made in model.py input [ { From 682ad0c46b296cac22badca8a13caa995e53e9b3 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 12:02:17 -0700 Subject: [PATCH 19/49] Remove limitation, model name --- README.md | 6 +----- samples/model_repository/vllm_model/config.pbtxt | 1 - 2 files changed, 1 insertion(+), 6 deletions(-) diff --git a/README.md b/README.md index 7e80f783..ce40871a 100644 --- a/README.md +++ b/README.md @@ -86,11 +86,7 @@ will need to use a Please see the [conda](samples/conda) subdirectory of the `samples` folder for information on how to do so. -## Important Notes - -* At present, Triton only supports one Python-based backend per server. If you try to start multiple vLLM models, you will get an error. - -### Running Multiple Instances of Triton Server +## Running Multiple Instances of Triton Server Python-based backends use shared memory to transfer requests to the stub process. When running multiple instances of Triton Server on the same machine that use Python-based backend models, there would be shared memory region name conflicts that can result in segmentation faults or hangs. In order to avoid this issue, you need to specify different shm-region-prefix-name using the --backend-config flag. ``` diff --git a/samples/model_repository/vllm_model/config.pbtxt b/samples/model_repository/vllm_model/config.pbtxt index f862dfba..c8d47343 100644 --- a/samples/model_repository/vllm_model/config.pbtxt +++ b/samples/model_repository/vllm_model/config.pbtxt @@ -29,7 +29,6 @@ # instructions in the samples/conda README on how to add a parameter # to use a custom execution environment. 
-name: "vllm_model" backend: "vllm" # Disabling batching in Triton, let vLLM handle the batching on its own. From e7578f1a69a50cc0c1a3b4853c751e09c166a48a Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 12:04:11 -0700 Subject: [PATCH 20/49] Fix gen vllm env script name --- samples/conda/{gen_vllm_env.ssh => gen_vllm_env.sh} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename samples/conda/{gen_vllm_env.ssh => gen_vllm_env.sh} (100%) diff --git a/samples/conda/gen_vllm_env.ssh b/samples/conda/gen_vllm_env.sh similarity index 100% rename from samples/conda/gen_vllm_env.ssh rename to samples/conda/gen_vllm_env.sh From 502f4dbbd4f8f63995de2598129410abd4be2e84 Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Wed, 11 Oct 2023 12:36:18 -0700 Subject: [PATCH 21/49] Update wording for supported models Co-authored-by: Neelay Shah --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ce40871a..1c12d4c9 100644 --- a/README.md +++ b/README.md @@ -37,7 +37,7 @@ questions or report problems on the [issues page](https://github.com/triton-inference-server/server/issues). This backend is designed to run [vLLM](https://github.com/vllm-project/vllm) with -[one of the HuggingFace models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) +[vllm supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) it supports. Where can I ask general questions about Triton and Triton backends? From fe064168f906ef56e417d9b2f4d325dd986d18a4 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 12:37:13 -0700 Subject: [PATCH 22/49] Update capitalization --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 1c12d4c9..3b520d0c 100644 --- a/README.md +++ b/README.md @@ -37,7 +37,7 @@ questions or report problems on the [issues page](https://github.com/triton-inference-server/server/issues). This backend is designed to run [vLLM](https://github.com/vllm-project/vllm) with -[vllm supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) +[vLLM supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) it supports. Where can I ask general questions about Triton and Triton backends? From 0144d337c625fbef65e8383c7c29b1f0078bfe5e Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Wed, 11 Oct 2023 14:41:36 -0700 Subject: [PATCH 23/49] Update wording around shared memory across servers Co-authored-by: Neelay Shah --- README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 3b520d0c..f24ff60a 100644 --- a/README.md +++ b/README.md @@ -88,7 +88,9 @@ Please see the ## Running Multiple Instances of Triton Server -Python-based backends use shared memory to transfer requests to the stub process. When running multiple instances of Triton Server on the same machine that use Python-based backend models, there would be shared memory region name conflicts that can result in segmentation faults or hangs. In order to avoid this issue, you need to specify different shm-region-prefix-name using the --backend-config flag. +Python-based backends use shared memory to transfer requests to the stub process. When running multiple instances of Triton Server on the same machine you need to specify different shm-region-prefix-name using the --backend-config flag. 
+ +> **Note** There are known runtime issues If you do not launch with different region-prefix-names which can lead to segmentation faults and hangs. ``` # Triton instance 1 tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix1 From 0f0f96899513328c63a3028c30f84a0076374ae5 Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Wed, 11 Oct 2023 14:41:56 -0700 Subject: [PATCH 24/49] Remove extra note about shared memory hangs across servers Co-authored-by: Neelay Shah --- README.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/README.md b/README.md index f24ff60a..5cb6ab16 100644 --- a/README.md +++ b/README.md @@ -97,5 +97,4 @@ tritonserver --model-repository=/models --backend-config=python,shm-region-prefi # Triton instance 2 tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix2 -``` -Note that the hangs would only occur if the /dev/shm is shared between the two instances of the server. If you run the servers in different containers that do not share this location, you do not need to specify shm-region-prefix-name. \ No newline at end of file +``` \ No newline at end of file From b81574d85ceb65134eccdfe7bdbd3ab89f27b966 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 14:43:36 -0700 Subject: [PATCH 25/49] Fix line lengths and clarify wording. --- README.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 5cb6ab16..5395b001 100644 --- a/README.md +++ b/README.md @@ -88,9 +88,13 @@ Please see the ## Running Multiple Instances of Triton Server -Python-based backends use shared memory to transfer requests to the stub process. When running multiple instances of Triton Server on the same machine you need to specify different shm-region-prefix-name using the --backend-config flag. +Python-based backends use shared memory to transfer requests to the stub process. +When running multiple instances of Triton Server on the same machine, +you need to specify different shm-region-prefix-name using the --backend-config flag. + +> **Note** There are known runtime issues if you do not launch with different region-prefix-names. +This can lead to to segmentation faults and hangs. -> **Note** There are known runtime issues If you do not launch with different region-prefix-names which can lead to segmentation faults and hangs. ``` # Triton instance 1 tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix1 From faa29a6ba7a8e94685d1c538109ac252d9a9ca79 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 17:19:04 -0700 Subject: [PATCH 26/49] Add container steps --- README.md | 62 +++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 44 insertions(+), 18 deletions(-) diff --git a/README.md b/README.md index 5395b001..c1bb127e 100644 --- a/README.md +++ b/README.md @@ -47,38 +47,57 @@ available in the main [server](https://github.com/triton-inference-server/server repo. If you don't find your answer there you can ask questions on the main Triton [issues page](https://github.com/triton-inference-server/server/issues). -## Build the vLLM Backend +## Building the vLLM Backend -As a Python-based backend, your Triton server just needs to have the [Python backend](https://github.com/triton-inference-server/python_backend) -located in the backends directory: `/opt/tritonserver/backends/python`. 
After that, you can save the vLLM backend in the backends folder as `/opt/tritonserver/backends/vllm`. The `model.py` file in the `src` directory should be in the vllm folder and will function as your Python-based backend. +There are several ways to use the vLLM backend. -In other words, there are no build steps. You only need to copy this to your Triton backends repository. If you use the official Triton vLLM container, this is already set up for you. +### Option 1. Run the Docker Container. + +Starting in release 23.10, Triton includes a container with just the vLLM backend. This container has everything you need to run your vLLM model. + +### Option 2. Build via the Build.py Script. +You can follow steps described in the +[Building With Docker] (https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker) +guide and use the +(build.py)[https://github.com/triton-inference-server/server/blob/main/build.py] +script. + +A sample command to build a Triton Server container with all available options enabled is below. -The backend repository should look like this: ``` -/opt/tritonserver/backends/ -`-- vllm - |-- model.py - -- python - |-- libtriton_python.so - |-- triton_python_backend_stub - |-- triton_python_backend_utils.py +./build.py -v --image=base,${BASE_CONTAINER_IMAGE_NAME} + --enable-logging --enable-stats --enable-tracing + --enable-metrics --enable-gpu-metrics --enable-cpu-metrics + --enable-gpu + --filesystem=gcs --filesystem=s3 --filesystem=azure_storage + --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai + --backend=python:r23.10 + --backend=vllm:r23.10 ``` +### Option 3. Add the vLLM Backend to the Triton Container + +You can install the vLLM backend directly into our NGC Triton container. In this case, please install vLLM first. You can do this by running `pip install vllm==`, then set up the vLLM backend in the container as follows: + +mkdir -p /opt/tritonserver/backends/vllm +wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/src/model.py + ## Using the vLLM Backend You can see an example model_repository in the `samples` folder. You can use this as is and change the model by changing the `model` value in `model.json`. You can change the GPU utilization and logging parameters in that file as well. -In the `samples` folder, you can also find a sample client, `client.py`. -This client is meant to function similarly to the Triton -[vLLM example](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM). -By default, this will test `prompts.txt`, which we have included in the samples folder. +In the `[samples](samples)` folder, you can also find a sample client, +`[client.py](samples/client.py)`. ## Running the Latest vLLM Version -By default, the vLLM backend uses the version of vLLM that is available via Pip. +To see the version of vLLM in the container, see the +[version_map](https://github.com/triton-inference-server/server/blob/85487a1e15438ccb9592b58e308a3f78724fa483/build.py#L83) +in [build.py](https://github.com/triton-inference-server/server/blob/main/build.py) +for the Triton version you are using. + These are compatible with the newer versions of CUDA running in Triton. 
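To double-check which vLLM build actually ended up in a given container, one option is to query the installed package directly. This is a minimal sketch, assuming `python3` and the vLLM wheel are on the container's default path:

```
# Prints the version of the vLLM package installed in this environment.
import vllm

print(vllm.__version__)
```

Running `pip show vllm` inside the container gives the same information without importing the package.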
If you would like to use a specific vLLM commit or the latest version of vLLM, you will need to use a @@ -101,4 +120,11 @@ tritonserver --model-repository=/models --backend-config=python,shm-region-prefi # Triton instance 2 tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix2 -``` \ No newline at end of file +``` + +## Referencing the Tutorial + +You can read further in the +[vLLM Quick Deploy guide](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM) +in the +[tutorials](https://github.com/triton-inference-server/tutorials/) repository. \ No newline at end of file From 4259a7ef2302cb8467bc204cc357c4a21a0883c6 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 17:23:57 -0700 Subject: [PATCH 27/49] Add links to engine args, define model.json --- README.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index c1bb127e..b20ab92e 100644 --- a/README.md +++ b/README.md @@ -86,7 +86,13 @@ wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton You can see an example model_repository in the `samples` folder. You can use this as is and change the model by changing the `model` value in `model.json`. -You can change the GPU utilization and logging parameters in that file as well. +`model.json` represents a key-value dictionary that is fed to vLLM's AsyncLLMEngine when initializing the model. +You can see supported arguments in vLLM's +(arg_utils.py)[https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py]. +Specifically, +(here)[https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11] +and +(here)[https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201]. In the `[samples](samples)` folder, you can also find a sample client, `[client.py](samples/client.py)`. From 76c2d89e2be0ab39e732e6f8da98f249c41de784 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 17:26:02 -0700 Subject: [PATCH 28/49] Change verbiage around vLLM engine models --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index b20ab92e..084928ac 100644 --- a/README.md +++ b/README.md @@ -35,10 +35,10 @@ You can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend). Ask questions or report problems on the [issues page](https://github.com/triton-inference-server/server/issues). -This backend is designed to run [vLLM](https://github.com/vllm-project/vllm) -with -[vLLM supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) -it supports. +This backend is designed to run +(supported models)[https://vllm.readthedocs.io/en/latest/models/supported_models.html] +on a +(vLLM engine)[https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py]. Where can I ask general questions about Triton and Triton backends? 
Be sure to read all the information below as well as the [general From 31f1733c0c5c5188ec0d41b948cba382e07d23cf Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 17:43:00 -0700 Subject: [PATCH 29/49] Fix links --- README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 084928ac..1290627e 100644 --- a/README.md +++ b/README.md @@ -36,9 +36,9 @@ repo](https://github.com/triton-inference-server/backend). Ask questions or report problems on the [issues page](https://github.com/triton-inference-server/server/issues). This backend is designed to run -(supported models)[https://vllm.readthedocs.io/en/latest/models/supported_models.html] +[supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) on a -(vLLM engine)[https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py]. +[vLLM engine](https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py). Where can I ask general questions about Triton and Triton backends? Be sure to read all the information below as well as the [general @@ -57,9 +57,9 @@ Starting in release 23.10, Triton includes a container with just the vLLM backen ### Option 2. Build via the Build.py Script. You can follow steps described in the -[Building With Docker] (https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker) +[Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker) guide and use the -(build.py)[https://github.com/triton-inference-server/server/blob/main/build.py] +[build.py](https://github.com/triton-inference-server/server/blob/main/build.py) script. A sample command to build a Triton Server container with all available options enabled is below. @@ -94,8 +94,8 @@ Specifically, and (here)[https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201]. -In the `[samples](samples)` folder, you can also find a sample client, -`[client.py](samples/client.py)`. +In the [samples](samples) folder, you can also find a sample client, +[client.py](samples/client.py). ## Running the Latest vLLM Version From 76d0652a75e3e55d3fd19cc63e1e561833ea1c28 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 17:49:02 -0700 Subject: [PATCH 30/49] Fix links, grammar --- README.md | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 1290627e..bf5a213f 100644 --- a/README.md +++ b/README.md @@ -49,11 +49,11 @@ main Triton [issues page](https://github.com/triton-inference-server/server/issu ## Building the vLLM Backend -There are several ways to use the vLLM backend. +There are several ways to access the vLLM backend. ### Option 1. Run the Docker Container. -Starting in release 23.10, Triton includes a container with just the vLLM backend. This container has everything you need to run your vLLM model. +Starting in release 23.10, Triton includes a container with the vLLM backend. This container has everything you need to run your vLLM model. ### Option 2. Build via the Build.py Script. You can follow steps described in the @@ -62,7 +62,7 @@ guide and use the [build.py](https://github.com/triton-inference-server/server/blob/main/build.py) script. -A sample command to build a Triton Server container with all available options enabled is below. 
+A sample command to build a Triton Server container with all options enabled is shown below. ``` ./build.py -v --image=base,${BASE_CONTAINER_IMAGE_NAME} @@ -77,22 +77,29 @@ A sample command to build a Triton Server container with all available options e ### Option 3. Add the vLLM Backend to the Triton Container -You can install the vLLM backend directly into our NGC Triton container. In this case, please install vLLM first. You can do this by running `pip install vllm==`, then set up the vLLM backend in the container as follows: +You can install the vLLM backend directly into the NGC Triton container. +In this case, please install vLLM first. You can do so by running +`pip install vllm==`. Then, set up the vLLM backend in the +container with the following commands: +``` mkdir -p /opt/tritonserver/backends/vllm wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/src/model.py +``` ## Using the vLLM Backend -You can see an example model_repository in the `samples` folder. +You can see an example +[model_repository](samples/model_repository) +in the [samples](samples) folder. You can use this as is and change the model by changing the `model` value in `model.json`. `model.json` represents a key-value dictionary that is fed to vLLM's AsyncLLMEngine when initializing the model. You can see supported arguments in vLLM's -(arg_utils.py)[https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py]. +[arg_utils.py](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py). Specifically, -(here)[https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11] +[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11) and -(here)[https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201]. +[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201). In the [samples](samples) folder, you can also find a sample client, [client.py](samples/client.py). @@ -102,9 +109,8 @@ In the [samples](samples) folder, you can also find a sample client, To see the version of vLLM in the container, see the [version_map](https://github.com/triton-inference-server/server/blob/85487a1e15438ccb9592b58e308a3f78724fa483/build.py#L83) in [build.py](https://github.com/triton-inference-server/server/blob/main/build.py) -for the Triton version you are using. +for the Triton version you are using. These are compatible with the newer versions of CUDA running in Triton. -These are compatible with the newer versions of CUDA running in Triton. If you would like to use a specific vLLM commit or the latest version of vLLM, you will need to use a [custom execution environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments). From a50ae8d980fe1ded3d4aa7087a5d1ea8ce48763d Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 18:37:53 -0700 Subject: [PATCH 31/49] Remove Conda references. 
--- samples/conda/README.md | 77 ------------- samples/conda/gen_vllm_env.sh | 105 ------------------ .../model_repository/vllm_model/config.pbtxt | 3 - 3 files changed, 185 deletions(-) delete mode 100644 samples/conda/README.md delete mode 100755 samples/conda/gen_vllm_env.sh diff --git a/samples/conda/README.md b/samples/conda/README.md deleted file mode 100644 index fb0cf759..00000000 --- a/samples/conda/README.md +++ /dev/null @@ -1,77 +0,0 @@ - - -# How to Create a Custom Conda Environment - -If you would like to run conda with the latest version of vLLM, you will need to create a -a [custom execution environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments). -This is because vLLM currently does not support the latest versions of CUDA in the Triton environment. -Instructions for creating a custom execution environment with the latest vLLM version are below. - -## Step 1: Build a Custom Execution Environment With vLLM and Other Dependencies - -The provided script should build the package environment -for you which will be used to load the model in Triton. - -Run the following command from this directory. You can use any version of Triton. -``` -docker run --gpus all -it --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 --shm-size=8G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:23.09-py3 bash -./gen_vllm_env.sh -``` - -This step might take a while to build the environment packages. Once complete, the current folder will be populated with -`triton_python_backend_stub` and `vllm_env`. - -## Step 2: Update Your Model Repository - -You want to place the stub and environment in your model directory. -The model directory should look something like this: -``` -model_repository/ -`-- vllm_model - |-- 1 - | `-- model.json - |-- config.pbtxt - |-- triton_python_backend_stub - `-- vllm_env -``` - -You also want to add this section to the config.pbtxt of your model. -This will direct Triton to look for a custom execution environment in -the vllm_env subdirectory of your model's directory. - -``` -parameters: { - key: "EXECUTION_ENV_PATH", - value: {string_value: "$$TRITON_MODEL_DIRECTORY/vllm_env"} -} -``` - -## Step 3: Run Your Model - -You can now start Triton server with your model! \ No newline at end of file diff --git a/samples/conda/gen_vllm_env.sh b/samples/conda/gen_vllm_env.sh deleted file mode 100755 index afa99e87..00000000 --- a/samples/conda/gen_vllm_env.sh +++ /dev/null @@ -1,105 +0,0 @@ -#!/bin/bash -# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions -# are met: -# * Redistributions of source code must retain the above copyright -# notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and/or other materials provided with the distribution. -# * Neither the name of NVIDIA CORPORATION nor the names of its -# contributors may be used to endorse or promote products derived -# from this software without specific prior written permission. 
-# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY -# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR -# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR -# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY -# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE -# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - -# -# This script creates a conda environment for Triton with vllm -# dependencies. -# - -# Pick the release tag from the container environment variable -RELEASE_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" - -# Save target directories for conda environment and Python backend stubs -ENV_DIR="./model_repository/vllm/vllm_env/" -STUB_FILE="./model_repository/vllm/triton_python_backend_stub" - -# If targets already exist, print a message and exit. -if [ -d "$ENV_DIR" ] && [ -f "$STUB_FILE" ]; then - echo "The conda environment directory and Python backend stubs already exist." - echo "Exiting environment set-up." - exit 0 -fi - -# If this script runs, clean up previous targets. -rm -rf $ENV_DIR $STUB_FILE - -# Install and setup conda environment -FILE_NAME="Miniconda3-latest-Linux-x86_64.sh" -rm -rf ./miniconda $FILE_NAME -wget https://repo.anaconda.com/miniconda/$FILE_NAME - -# Install miniconda in silent mode -bash $FILE_NAME -p ./miniconda -b - -# Activate conda -eval "$(./miniconda/bin/conda shell.bash hook)" - -# Installing cmake and dependencies -apt update && apt install software-properties-common rapidjson-dev libarchive-dev zlib1g-dev -y -# Using CMAKE installation instruction from:: https://apt.kitware.com/ -apt install -y gpg wget && \ - wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ - gpg --dearmor - | \ - tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null && \ - . 
/etc/os-release && \ - echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | \ - tee /etc/apt/sources.list.d/kitware.list >/dev/null && \ - apt-get update && \ - apt-get install -y --no-install-recommends cmake cmake-data - -conda create -n vllm_env python=3.10 -y -conda activate vllm_env -export PYTHONNOUSERSITE=True -conda install -c conda-forge libstdcxx-ng=12 -y -conda install -c conda-forge conda-pack -y - -# vLLM needs cuda 11.8 to run properly -conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit -y - -pip install numpy -pip install git+https://github.com/huggingface/transformers.git -pip install git+https://github.com/vllm-project/vllm.git - - -rm -rf python_backend -git clone https://github.com/triton-inference-server/python_backend -b $RELEASE_TAG -(cd python_backend/ && mkdir builddir && cd builddir && \ -cmake -DTRITON_ENABLE_GPU=ON -DTRITON_BACKEND_REPO_TAG=$RELEASE_TAG -DTRITON_COMMON_REPO_TAG=$RELEASE_TAG -DTRITON_CORE_REPO_TAG=$RELEASE_TAG ../ && \ -make -j18 triton-python-backend-stub) - -mv python_backend/builddir/triton_python_backend_stub ./model_repository/vllm/ - -# Prepare and copy the conda environment -cp -r $CONDA_PREFIX/lib/python3.10/site-packages/conda_pack/scripts/posix/activate $CONDA_PREFIX/bin/ -rm -r $CONDA_PREFIX/nsight* -cp -r $CONDA_PREFIX ./model_repository/vllm/ - -conda deactivate - -# Clean-up -rm -rf ./miniconda $FILE_NAME -rm -rf python_backend diff --git a/samples/model_repository/vllm_model/config.pbtxt b/samples/model_repository/vllm_model/config.pbtxt index c8d47343..169f3815 100644 --- a/samples/model_repository/vllm_model/config.pbtxt +++ b/samples/model_repository/vllm_model/config.pbtxt @@ -25,9 +25,6 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. # Note: You do not need to change any fields in this configuration. -# If you are using a custom execution environment, there are -# instructions in the samples/conda README on how to add a parameter -# to use a custom execution environment. backend: "vllm" From edaff542a02e65c0bb9e651065a14f4c34a54cb3 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 11 Oct 2023 19:05:24 -0700 Subject: [PATCH 32/49] Fix client I/O and model names --- samples/client.py | 12 ++++++------ samples/model_repository/vllm_model/config.pbtxt | 1 + 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/samples/client.py b/samples/client.py index 83f1e49c..06bf0c3e 100755 --- a/samples/client.py +++ b/samples/client.py @@ -85,7 +85,7 @@ async def process_stream(self, prompts, sampling_parameters): if error: print(f"Encountered error while processing: {error}") else: - output = result.as_numpy("TEXT") + output = result.as_numpy("text_output") for i in output: self._results_dict[result.get_response().id].append(i) @@ -126,13 +126,13 @@ def create_request( inputs = [] prompt_data = np.array([prompt.encode("utf-8")], dtype=np.object_) try: - inputs.append(grpcclient.InferInput("PROMPT", [1], "BYTES")) + inputs.append(grpcclient.InferInput("text_input", [1], "BYTES")) inputs[-1].set_data_from_numpy(prompt_data) except Exception as error: print(f"Encountered an error during request creation: {error}") stream_data = np.array([stream], dtype=bool) - inputs.append(grpcclient.InferInput("STREAM", [1], "BOOL")) + inputs.append(grpcclient.InferInput("stream", [1], "BOOL")) inputs[-1].set_data_from_numpy(stream_data) # Request parameters are not yet supported via BLS. 
Provide an @@ -143,12 +143,12 @@ def create_request( sampling_parameters_data = np.array( [json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_ ) - inputs.append(grpcclient.InferInput("SAMPLING_PARAMETERS", [1], "BYTES")) + inputs.append(grpcclient.InferInput("sampling_parameters", [1], "BYTES")) inputs[-1].set_data_from_numpy(sampling_parameters_data) # Add requested outputs outputs = [] - outputs.append(grpcclient.InferRequestedOutput("TEXT")) + outputs.append(grpcclient.InferRequestedOutput("text_output")) # Issue the asynchronous sequence inference. return { @@ -167,7 +167,7 @@ def create_request( "--model", type=str, required=False, - default="vllm", + default="vllm_model", help="Model name", ) parser.add_argument( diff --git a/samples/model_repository/vllm_model/config.pbtxt b/samples/model_repository/vllm_model/config.pbtxt index 169f3815..0e20a574 100644 --- a/samples/model_repository/vllm_model/config.pbtxt +++ b/samples/model_repository/vllm_model/config.pbtxt @@ -26,6 +26,7 @@ # Note: You do not need to change any fields in this configuration. +name: "vllm_model" backend: "vllm" # Disabling batching in Triton, let vLLM handle the batching on its own. From 33dbaed79a8e8eb4ba79b46c35a677efaf90a899 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Thu, 12 Oct 2023 06:56:11 -0700 Subject: [PATCH 33/49] Remove model name in config --- samples/model_repository/vllm_model/config.pbtxt | 1 - 1 file changed, 1 deletion(-) diff --git a/samples/model_repository/vllm_model/config.pbtxt b/samples/model_repository/vllm_model/config.pbtxt index 0e20a574..169f3815 100644 --- a/samples/model_repository/vllm_model/config.pbtxt +++ b/samples/model_repository/vllm_model/config.pbtxt @@ -26,7 +26,6 @@ # Note: You do not need to change any fields in this configuration. -name: "vllm_model" backend: "vllm" # Disabling batching in Triton, let vLLM handle the batching on its own. From 657519712d6735d4f95fb96d4d7977dae3dec24f Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Thu, 12 Oct 2023 12:42:31 -0700 Subject: [PATCH 34/49] Add generate endpoint, switch to min container --- README.md | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index bf5a213f..4911e442 100644 --- a/README.md +++ b/README.md @@ -65,7 +65,7 @@ script. A sample command to build a Triton Server container with all options enabled is shown below. ``` -./build.py -v --image=base,${BASE_CONTAINER_IMAGE_NAME} +./build.py -v --image=base,${BASE_CONTAINER_IMAGE_NAME}-min --enable-logging --enable-stats --enable-tracing --enable-metrics --enable-gpu-metrics --enable-cpu-metrics --enable-gpu @@ -117,6 +117,23 @@ will need to use a Please see the [conda](samples/conda) subdirectory of the `samples` folder for information on how to do so. + +## Sending Your First Inference + +After you +[start Triton](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html) +with the +[sample model_repository](samples/model_repository), +you can quickly run your first inference request with the +[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md). + +Try out the command below. +You can replace _client input_ with your input text. 
+ +``` +$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "client input", "parameters": {"stream": false, "temperature": 0}}' +``` + ## Running Multiple Instances of Triton Server Python-based backends use shared memory to transfer requests to the stub process. From 9effb18fb258b486af825bd06617afa5e08a935f Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Thu, 12 Oct 2023 13:08:24 -0700 Subject: [PATCH 35/49] Change to min --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 4911e442..756df4a2 100644 --- a/README.md +++ b/README.md @@ -65,7 +65,7 @@ script. A sample command to build a Triton Server container with all options enabled is shown below. ``` -./build.py -v --image=base,${BASE_CONTAINER_IMAGE_NAME}-min +./build.py -v --image=min,${BASE_CONTAINER_IMAGE_NAME}-min --enable-logging --enable-stats --enable-tracing --enable-metrics --enable-gpu-metrics --enable-cpu-metrics --enable-gpu From 8dc3f510e4a8d9acecaae24f40ff5d363873209f Mon Sep 17 00:00:00 2001 From: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> Date: Fri, 13 Oct 2023 13:22:30 -0700 Subject: [PATCH 36/49] Apply suggestions from code review --- README.md | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 756df4a2..8533269d 100644 --- a/README.md +++ b/README.md @@ -53,20 +53,19 @@ There are several ways to access the vLLM backend. ### Option 1. Run the Docker Container. -Starting in release 23.10, Triton includes a container with the vLLM backend. This container has everything you need to run your vLLM model. +Pull the container with vLLM backend from [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) registry. This container has everything you need to run your vLLM model. -### Option 2. Build via the Build.py Script. +### Option 2. Build a custom container from source You can follow steps described in the [Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker) guide and use the [build.py](https://github.com/triton-inference-server/server/blob/main/build.py) script. -A sample command to build a Triton Server container with all options enabled is shown below. +A sample command to build a Triton Server container with all options enabled is shown below. Feel free to customize flags according to your needs. ``` -./build.py -v --image=min,${BASE_CONTAINER_IMAGE_NAME}-min - --enable-logging --enable-stats --enable-tracing +./build.py -v --enable-logging --enable-stats --enable-tracing --enable-metrics --enable-gpu-metrics --enable-cpu-metrics --enable-gpu --filesystem=gcs --filesystem=s3 --filesystem=azure_storage @@ -114,8 +113,6 @@ for the Triton version you are using. These are compatible with the newer versio If you would like to use a specific vLLM commit or the latest version of vLLM, you will need to use a [custom execution environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments). -Please see the -[conda](samples/conda) subdirectory of the `samples` folder for information on how to do so. 
## Sending Your First Inference From bf0d905fa1b7d3cc4c21503090ae333fdb15ff04 Mon Sep 17 00:00:00 2001 From: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> Date: Fri, 13 Oct 2023 13:25:39 -0700 Subject: [PATCH 37/49] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 8533269d..eaa74acb 100644 --- a/README.md +++ b/README.md @@ -49,7 +49,7 @@ main Triton [issues page](https://github.com/triton-inference-server/server/issu ## Building the vLLM Backend -There are several ways to access the vLLM backend. +There are several ways to take advantage of the vLLM backend. ### Option 1. Run the Docker Container. From 7ec9b5fe3c28d4d49099fdea99c38e19e4a09ffb Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Tue, 17 Oct 2023 11:45:26 -0700 Subject: [PATCH 38/49] Add example model args, link to multi-server behavior --- README.md | 30 ++++++++++++++++-------------- 1 file changed, 16 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index eaa74acb..3e0adb6c 100644 --- a/README.md +++ b/README.md @@ -40,6 +40,10 @@ This backend is designed to run on a [vLLM engine](https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py). +This is a Python-based backend. When using this backend, all requests are placed on the +vLLM AsyncEngine as soon as they are received. Inflight batching and paged attention is handled +by vLLM engine. + Where can I ask general questions about Triton and Triton backends? Be sure to read all the information below as well as the [general Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server) @@ -100,6 +104,14 @@ Specifically, and [here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201). +For multi-GPU support, EngineArgs like tensor_parallel_size can be specified in +[model.json](samples/model_repository/1/model.json). + +Note: vLLM greedily consume up to 90% of the GPU's memory under default settings. +The sample model updates this behavior by setting gpu_memory_utilization to 50%. +You can tweak this behavior using fields like gpu_memory_utilization and other settings in +[model.json](samples/model_repository/1/model.json). + In the [samples](samples) folder, you can also find a sample client, [client.py](samples/client.py). @@ -133,20 +145,10 @@ $ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": " ## Running Multiple Instances of Triton Server -Python-based backends use shared memory to transfer requests to the stub process. -When running multiple instances of Triton Server on the same machine, -you need to specify different shm-region-prefix-name using the --backend-config flag. - -> **Note** There are known runtime issues if you do not launch with different region-prefix-names. -This can lead to to segmentation faults and hangs. - -``` -# Triton instance 1 -tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix1 - -# Triton instance 2 -tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix2 -``` +If you are running multiple instances of Triton server with a Python-based backend, +you need to specify different shm-region-prefix-name for each server. See +[here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server) +for more information. 
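The AsyncEngine hand-off described above can be pictured with a small standalone script. This is only a sketch of the `AsyncLLMEngine` pattern the text refers to, not the backend's `model.py`; the model name and sampling values are placeholders:

```
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def main():
    # Build the engine from key-value arguments, much as the backend builds
    # it from the model.json dictionary described earlier.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="facebook/opt-125m", gpu_memory_utilization=0.5)
    )

    # Each request is handed to the engine as soon as it arrives; vLLM
    # schedules and batches in-flight requests internally.
    results = engine.generate(
        "What is Triton Inference Server?",
        SamplingParams(temperature=0.0, max_tokens=32),
        request_id="0",
    )

    # The async generator streams partial outputs; keep the last one as the
    # final completion.
    async for request_output in results:
        final_output = request_output
    print(final_output.outputs[0].text)


asyncio.run(main())
```

Everything before the `generate` call is one-time setup; the per-request work is only the `generate` call plus consuming its async generator.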
## Referencing the Tutorial From 3b64abc588f7f83c5080a8f1a0a6bc716da6937f Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Tue, 17 Oct 2023 11:49:34 -0700 Subject: [PATCH 39/49] Format client input, add upstream tag. --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 3e0adb6c..2119380a 100644 --- a/README.md +++ b/README.md @@ -74,6 +74,7 @@ A sample command to build a Triton Server container with all options enabled is --enable-gpu --filesystem=gcs --filesystem=s3 --filesystem=azure_storage --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai + --upstream-container-version=23.10 --backend=python:r23.10 --backend=vllm:r23.10 ``` @@ -137,7 +138,7 @@ you can quickly run your first inference request with the [generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md). Try out the command below. -You can replace _client input_ with your input text. +You can replace "client input" with your input text. ``` $ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "client input", "parameters": {"stream": false, "temperature": 0}}' From 3a3b3266c6f62834ee18baf086f0d4823f8c1871 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Tue, 17 Oct 2023 11:53:16 -0700 Subject: [PATCH 40/49] Fix links, grammar --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 2119380a..362d990f 100644 --- a/README.md +++ b/README.md @@ -106,12 +106,12 @@ and [here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201). For multi-GPU support, EngineArgs like tensor_parallel_size can be specified in -[model.json](samples/model_repository/1/model.json). +[model.json](samples/model_repository/vllm_model/1/model.json). Note: vLLM greedily consume up to 90% of the GPU's memory under default settings. The sample model updates this behavior by setting gpu_memory_utilization to 50%. You can tweak this behavior using fields like gpu_memory_utilization and other settings in -[model.json](samples/model_repository/1/model.json). +[model.json](samples/model_repository/vllm_model/1/model.json). In the [samples](samples) folder, you can also find a sample client, [client.py](samples/client.py). @@ -147,7 +147,7 @@ $ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": " ## Running Multiple Instances of Triton Server If you are running multiple instances of Triton server with a Python-based backend, -you need to specify different shm-region-prefix-name for each server. See +you need to specify a different shm-region-prefix-name for each server. See [here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server) for more information. 
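For a concrete picture of the settings referenced above: `model.json` is plain JSON whose keys are passed through to vLLM's engine arguments. The snippet below is an illustrative sketch — the field names come from vLLM's `arg_utils.py`, but the model name and values are placeholders rather than the shipped sample:

```
{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.5,
    "tensor_parallel_size": 2
}
```

Here `gpu_memory_utilization` caps vLLM's default behavior of claiming up to 90% of GPU memory, and `tensor_parallel_size` is only meaningful on a multi-GPU machine; any other supported engine argument can be supplied the same way.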
From 48e08e75d44db9ceb22927f6d07ec4d4b6d91e60 Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Tue, 17 Oct 2023 13:07:55 -0700 Subject: [PATCH 41/49] Add quotes to shm-region-prefix-name Co-authored-by: Ryan McCormick --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 362d990f..147f05ab 100644 --- a/README.md +++ b/README.md @@ -147,7 +147,7 @@ $ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": " ## Running Multiple Instances of Triton Server If you are running multiple instances of Triton server with a Python-based backend, -you need to specify a different shm-region-prefix-name for each server. See +you need to specify a different `shm-region-prefix-name` for each server. See [here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server) for more information. From 9b4a1939887964d2ea4a5e86c6529691ace07bab Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Tue, 17 Oct 2023 13:08:41 -0700 Subject: [PATCH 42/49] Update sentence ordering, remove extra issues link Co-authored-by: Ryan McCormick --- README.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 147f05ab..faef4358 100644 --- a/README.md +++ b/README.md @@ -30,15 +30,14 @@ # vLLM Backend -The Triton backend for [vLLM](https://github.com/vllm-project/vllm). -You can learn more about Triton backends in the [backend -repo](https://github.com/triton-inference-server/backend). Ask -questions or report problems on the [issues -page](https://github.com/triton-inference-server/server/issues). -This backend is designed to run +The Triton backend for [vLLM](https://github.com/vllm-project/vllm) +is designed to run [supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) on a [vLLM engine](https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py). +You can learn more about Triton backends in the [backend +repo](https://github.com/triton-inference-server/backend). + This is a Python-based backend. When using this backend, all requests are placed on the vLLM AsyncEngine as soon as they are received. Inflight batching and paged attention is handled From 45be0f6712dce42de2af1440e075cea40c60b984 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Tue, 17 Oct 2023 13:10:58 -0700 Subject: [PATCH 43/49] Modify input text example, one arg per line --- README.md | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index faef4358..633eafdb 100644 --- a/README.md +++ b/README.md @@ -36,7 +36,7 @@ is designed to run on a [vLLM engine](https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py). You can learn more about Triton backends in the [backend -repo](https://github.com/triton-inference-server/backend). +repo](https://github.com/triton-inference-server/backend). This is a Python-based backend. When using this backend, all requests are placed on the @@ -68,11 +68,20 @@ script. A sample command to build a Triton Server container with all options enabled is shown below. Feel free to customize flags according to your needs. 
``` -./build.py -v --enable-logging --enable-stats --enable-tracing - --enable-metrics --enable-gpu-metrics --enable-cpu-metrics +./build.py -v --enable-logging + --enable-stats + --enable-tracing + --enable-metrics + --enable-gpu-metrics + --enable-cpu-metrics --enable-gpu - --filesystem=gcs --filesystem=s3 --filesystem=azure_storage - --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai + --filesystem=gcs + --filesystem=s3 + --filesystem=azure_storage + --endpoint=http + --endpoint=grpc + --endpoint=sagemaker + --endpoint=vertex-ai --upstream-container-version=23.10 --backend=python:r23.10 --backend=vllm:r23.10 @@ -137,10 +146,9 @@ you can quickly run your first inference request with the [generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md). Try out the command below. -You can replace "client input" with your input text. ``` -$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "client input", "parameters": {"stream": false, "temperature": 0}}' +$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}' ``` ## Running Multiple Instances of Triton Server From 204ce5ab1f44bfa6268ed668e6bdec71657b8453 Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Tue, 17 Oct 2023 19:47:03 -0700 Subject: [PATCH 44/49] Remove line about CUDA version compatibility. Co-authored-by: Tanmay Verma --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 633eafdb..297b8808 100644 --- a/README.md +++ b/README.md @@ -129,7 +129,7 @@ In the [samples](samples) folder, you can also find a sample client, To see the version of vLLM in the container, see the [version_map](https://github.com/triton-inference-server/server/blob/85487a1e15438ccb9592b58e308a3f78724fa483/build.py#L83) in [build.py](https://github.com/triton-inference-server/server/blob/main/build.py) -for the Triton version you are using. These are compatible with the newer versions of CUDA running in Triton. +for the Triton version you are using. If you would like to use a specific vLLM commit or the latest version of vLLM, you will need to use a From 8c9c4e71c854b1028eeee39e67e5f42f55a5a166 Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Wed, 18 Oct 2023 07:54:06 -0700 Subject: [PATCH 45/49] Wording of Triton container option Co-authored-by: Neelay Shah --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 297b8808..931c4930 100644 --- a/README.md +++ b/README.md @@ -87,7 +87,7 @@ A sample command to build a Triton Server container with all options enabled is --backend=vllm:r23.10 ``` -### Option 3. Add the vLLM Backend to the Triton Container +### Option 3. Add the vLLM Backend to the default Triton Container You can install the vLLM backend directly into the NGC Triton container. In this case, please install vLLM first. 
You can do so by running From 3ab47745f1d0ed9347e2db1edae7d1ed3a9ca68d Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Wed, 18 Oct 2023 07:54:29 -0700 Subject: [PATCH 46/49] Update wording of pre-built Docker container option Co-authored-by: Neelay Shah --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 931c4930..2fc987cb 100644 --- a/README.md +++ b/README.md @@ -54,7 +54,7 @@ main Triton [issues page](https://github.com/triton-inference-server/server/issu There are several ways to take advantage of the vLLM backend. -### Option 1. Run the Docker Container. +### Option 1. Pre-built Docker Container. Pull the container with vLLM backend from [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) registry. This container has everything you need to run your vLLM model. From 757e2b2bf49353373bd55b43946623f407f7061d Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Wed, 18 Oct 2023 07:54:44 -0700 Subject: [PATCH 47/49] Update README.md wording Co-authored-by: Neelay Shah --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 2fc987cb..b17d31a5 100644 --- a/README.md +++ b/README.md @@ -52,7 +52,7 @@ main Triton [issues page](https://github.com/triton-inference-server/server/issu ## Building the vLLM Backend -There are several ways to take advantage of the vLLM backend. +There are several ways to install and deploy the vLLM backend. ### Option 1. Pre-built Docker Container. From aa9ec6568e71cc0b8058d1cc09947130c0c8332f Mon Sep 17 00:00:00 2001 From: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Date: Wed, 18 Oct 2023 07:55:08 -0700 Subject: [PATCH 48/49] Update wording - add "the" Co-authored-by: Neelay Shah --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b17d31a5..d99a8b4a 100644 --- a/README.md +++ b/README.md @@ -41,7 +41,7 @@ repo](https://github.com/triton-inference-server/backend). This is a Python-based backend. When using this backend, all requests are placed on the vLLM AsyncEngine as soon as they are received. Inflight batching and paged attention is handled -by vLLM engine. +by the vLLM engine. Where can I ask general questions about Triton and Triton backends? Be sure to read all the information below as well as the [general From e0161f4c7678df9615d9691963b38c2c40f4b314 Mon Sep 17 00:00:00 2001 From: David Yastremsky Date: Wed, 18 Oct 2023 16:07:56 -0700 Subject: [PATCH 49/49] Standarize capitalization, headings --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index d99a8b4a..abdba2cf 100644 --- a/README.md +++ b/README.md @@ -54,11 +54,11 @@ main Triton [issues page](https://github.com/triton-inference-server/server/issu There are several ways to install and deploy the vLLM backend. -### Option 1. Pre-built Docker Container. +### Option 1. Use the Pre-Built Docker Container. Pull the container with vLLM backend from [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) registry. This container has everything you need to run your vLLM model. -### Option 2. Build a custom container from source +### Option 2. 
Build a Custom Container From Source You can follow steps described in the [Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker) guide and use the @@ -87,7 +87,7 @@ A sample command to build a Triton Server container with all options enabled is --backend=vllm:r23.10 ``` -### Option 3. Add the vLLM Backend to the default Triton Container +### Option 3. Add the vLLM Backend to the Default Triton Container You can install the vLLM backend directly into the NGC Triton container. In this case, please install vLLM first. You can do so by running