= Deploying a Model with vLLM

In this section we will deploy a https://huggingface.co/ibm-granite/granite-3.0-8b-instruct[Granite 3.0 8B Instruct] model using vLLM.

For our model server, we will deploy a vLLM instance using a model packaged into an OCI container with ModelCar.

[NOTE]
====
ModelCar is Tech Preview as of OpenShift AI 2.14.
ModelCar is a great option for smaller models like our 8B model. While the resulting container is still relatively large (roughly 15GB), it is small enough to pull into a cluster without much trouble.
Treating the model as an OCI artifact allows us to promote the model between environments using a customer's existing image promotion processes, for example by copying the image between registries as sketched below. By contrast, promoting models between S3 instances in different environments may create new challenges.
====
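
As an illustration of that promotion flow (not a step in this lab), a ModelCar image could be copied from one registry to another with a standard tool such as `skopeo`. The registry hosts, repository, and tag below are hypothetical placeholders.

[source,sh]
----
# Hypothetical example: promote the ModelCar image from a dev registry to a
# prod registry. Replace the registry hosts, repository, and tag with your own.
skopeo copy \
  docker://dev-registry.example.com/models/granite-3.0-8b-instruct:1.0 \
  docker://prod-registry.example.com/models/granite-3.0-8b-instruct:1.0
----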

== Creating the vLLM Instance

Since we are using a ModelCar container to deploy our model instead of S3, we will need to create the resources without using the OpenShift AI Dashboard.

. To start, with the `redhat-ods-applications` namespace selected, navigate to the Developer perspective in the OpenShift Web Console. From the `+Add` page, select `All Services`.

image::01-add-catalog.png[Add Catalog]

. Search for `vLLM` and select the `vLLM ServingRuntime for KServe` template.

image::01-select-template.png[Select Template]

. Choose `Instantiate Template`, select the `composer-ai-apps` project, and click `Create`.

image::01-instantiate-template.png[Instantiate Template]

The `vLLM ServingRuntime for KServe` template is the same template that the OpenShift AI Dashboard uses when deploying a new instance. Unlike the Dashboard, however, instantiating the template only creates the `ServingRuntime` object and not the `InferenceService`.
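
If you want to verify this, a quick check with the `oc` CLI (assuming you are logged in to the cluster from a terminal) should show a `ServingRuntime` in the project but no `InferenceService` yet:

[source,sh]
----
# Optional check: the template created a ServingRuntime, but no InferenceService exists yet
oc get servingruntime -n composer-ai-apps
oc get inferenceservice -n composer-ai-apps
----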

. Next, we will need to create the `InferenceService`. Click the `+` button in the top right-hand corner of the console, paste in the following object, and click `Create`.

[source,yaml]
----
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: vllm
  namespace: composer-ai-apps
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    annotations:
      serving.knative.dev/progress-deadline: 45m
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '4'
          memory: 20Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '2'
          memory: 16Gi
          nvidia.com/gpu: '1'
      runtime: vllm-runtime
      storageUri: 'oci://quay.io/redhat-ai-services/modelcar-catalog:granite-3.0-8b-instruct'
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Equal
----
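
Creating the `InferenceService` kicks off the actual model deployment, which can take several minutes while the ModelCar image is pulled. If you prefer to watch from a terminal, one way (assuming the `oc` CLI and cluster access) is to wait for the service to report ready:

[source,sh]
----
# Watch the InferenceService until the READY column shows True; the first
# deployment can take a while because the ModelCar image must be pulled.
oc get inferenceservice vllm -n composer-ai-apps -w
----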

== Testing vLLM Endpoints
