= Deploying a Model with vLLM

In this section we will deploy a https://huggingface.co/ibm-granite/granite-3.0-8b-instruct[Granite 3.0 8B Instruct] model using vLLM.

For our model server, we will deploy a vLLM instance using a model packaged into an OCI container with ModelCar.

[NOTE]
====
ModelCar is Tech Preview as of OpenShift AI 2.14.
ModelCar is a great option for smaller models like our 8B model. While the resulting container is still relatively large (roughly 15GB), it is small enough to pull into a cluster without much trouble.
Treating the model as an OCI artifact allows us to promote the model between environments using a customer's existing image promotion processes, for example by copying the image between registries as sketched below. By contrast, promoting models between S3 instances in different environments may create new challenges.
====
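
As an illustration of that promotion flow (not a step in this lab), a ModelCar image could be copied from one registry to another with a standard tool such as `skopeo`. The registry hosts, repository, and tag below are hypothetical placeholders.

[source,sh]
----
# Hypothetical example: promote the ModelCar image from a dev registry to a
# prod registry. Replace the registry hosts, repository, and tag with your own.
skopeo copy \
  docker://dev-registry.example.com/models/granite-3.0-8b-instruct:1.0 \
  docker://prod-registry.example.com/models/granite-3.0-8b-instruct:1.0
----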

== Creating the vLLM Instance

Since we are using a ModelCar container to deploy our model instead of S3, we will need to create the resources without using the OpenShift AI Dashboard.

. To start, with the `redhat-ods-applications` namespace selected, navigate to the Developer perspective in the OpenShift Web Console. From the `+Add` page, select `All Services`.

image::01-add-catalog.png[Add Catalog]

. Search for `vLLM` and select the `vLLM ServingRuntime for KServe` template.

image::01-select-template.png[Select Template]

. Choose `Instantiate Template`, select the `composer-ai-apps` project, and click `Create`.

image::01-instantiate-template.png[Instantiate Template]

The `vLLM ServingRuntime for KServe` template is the same template that the OpenShift AI Dashboard uses when deploying a new instance. Unlike the Dashboard, however, instantiating the template only creates the `ServingRuntime` object and not the `InferenceService`.
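
If you want to verify this, a quick check with the `oc` CLI (assuming you are logged in to the cluster from a terminal) should show a `ServingRuntime` in the project but no `InferenceService` yet:

[source,sh]
----
# Optional check: the template created a ServingRuntime, but no InferenceService exists yet
oc get servingruntime -n composer-ai-apps
oc get inferenceservice -n composer-ai-apps
----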

. Next, we will need to create the `InferenceService`. Click the `+` button in the top right-hand corner of the console, paste in the following object, and click `Create`.

[source,yaml]
----
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: vllm
  namespace: composer-ai-apps
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    annotations:
      serving.knative.dev/progress-deadline: 45m
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '4'
          memory: 20Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '2'
          memory: 16Gi
          nvidia.com/gpu: '1'
      runtime: vllm-runtime
      storageUri: 'oci://quay.io/redhat-ai-services/modelcar-catalog:granite-3.0-8b-instruct'
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Equal
----
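
Creating the `InferenceService` kicks off the actual model deployment, which can take several minutes while the ModelCar image is pulled. If you prefer to watch from a terminal, one way (assuming the `oc` CLI and cluster access) is to wait for the service to report ready:

[source,sh]
----
# Watch the InferenceService until the READY column shows True; the first
# deployment can take a while because the ModelCar image must be pulled.
oc get inferenceservice vllm -n composer-ai-apps -w
----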

== Testing vLLM Endpoints
