From c588614de81cc1076a6d020849f23b7ca295c27a Mon Sep 17 00:00:00 2001
From: Vitaliy Emporopulo
Date: Thu, 7 Mar 2024 15:05:17 +0200
Subject: [PATCH] Document deploying DRA to OpenShift

* Document the differences on OpenShift
* Include useful setup scripts

Signed-off-by: Vitaliy Emporopulo
---
 README.md                                     |   2 +-
 demo/clusters/openshift/README.md             | 142 ++++++++++++++++++
 .../openshift/add-certified-catalog-source.sh |  21 +++
 demo/clusters/openshift/enable-dra-profile.sh |   6 +
 .../openshift/extend-kube-scheduler-rbac.sh   |  30 ++++
 5 files changed, 200 insertions(+), 1 deletion(-)
 create mode 100644 demo/clusters/openshift/README.md
 create mode 100755 demo/clusters/openshift/add-certified-catalog-source.sh
 create mode 100755 demo/clusters/openshift/enable-dra-profile.sh
 create mode 100755 demo/clusters/openshift/extend-kube-scheduler-rbac.sh

diff --git a/README.md b/README.md
index 9a7b497b..75861ca4 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ A document and demo of the DRA support for GPUs provided by this repo can be fou
 
 ## Demo
 
-This section describes using `kind` to demo the functionality of the NVIDIA GPU DRA Driver.
+This section describes using `kind` to demo the functionality of the NVIDIA GPU DRA Driver. For Red Hat OpenShift, refer to [running the NVIDIA DRA driver on OpenShift](demo/clusters/openshift/README.md).
 
 First since we'll launch kind with GPU support, ensure that the following prerequisites are met:
 1. `kind` is installed. See the official documentation [here](https://kind.sigs.k8s.io/docs/user/quick-start/#installation).

diff --git a/demo/clusters/openshift/README.md b/demo/clusters/openshift/README.md
new file mode 100644
index 00000000..6910a203
--- /dev/null
+++ b/demo/clusters/openshift/README.md
@@ -0,0 +1,142 @@
# Running the NVIDIA DRA Driver on Red Hat OpenShift

This document explains the differences between deploying the NVIDIA DRA driver on Red Hat OpenShift and on upstream Kubernetes or its derivatives.

## Prerequisites

Install a recent build of OpenShift 4.16 (e.g. 4.16.0-ec.3). You can obtain an IPI installer binary (`openshift-install`) from the [Release Status](https://amd64.ocp.releases.ci.openshift.org/) page, or use the Assisted Installer to install on bare metal. Refer to the [OpenShift documentation](https://docs.openshift.com/container-platform/4.15/installing/index.html) for other installation methods.

## Enabling DRA on OpenShift

Enable the `TechPreviewNoUpgrade` feature set as explained in [Enabling features using FeatureGates](https://docs.openshift.com/container-platform/4.15/nodes/clusters/nodes-cluster-enabling-features.html), either during installation or post-install. The feature set includes the `DynamicResourceAllocation` feature gate.

Update the cluster scheduler to enable the DRA scheduling plugin:

```console
$ oc patch --type merge -p '{"spec":{"profile": "HighNodeUtilization", "profileCustomizations": {"dynamicResourceAllocation": "Enabled"}}}' scheduler cluster
```
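You can confirm that both settings took effect before proceeding (a quick sanity check; it assumes the default cluster-scoped `featuregate` and `scheduler` objects):

```console
$ oc get featuregate cluster -o jsonpath='{.spec.featureSet}{"\n"}'
TechPreviewNoUpgrade
$ oc get scheduler cluster -o jsonpath='{.spec.profileCustomizations.dynamicResourceAllocation}{"\n"}'
Enabled
```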
## NVIDIA GPU Drivers

The easiest way to install the NVIDIA GPU drivers on OpenShift nodes is via the NVIDIA GPU Operator.

**Be careful to disable the device plugin so that it does not conflict with the DRA plugin.** It is recommended to enable only the NVIDIA GPU driver and the driver toolkit, and to disable everything else in the `ClusterPolicy`:

```yaml
  <...>
  devicePlugin:
    enabled: false
  <...>
  driver:
    enabled: true
  <...>
  toolkit:
    enabled: true
  <...>
```

The NVIDIA GPU Operator might not be available through the OperatorHub in a pre-production version of OpenShift. In this case, deploy the operator from a bundle, or add a certified catalog index from an earlier version of OpenShift, e.g.:

```yaml
kind: CatalogSource
apiVersion: operators.coreos.com/v1alpha1
metadata:
  name: certified-operators-v415
  namespace: openshift-marketplace
spec:
  displayName: Certified Operators v4.15
  image: registry.redhat.io/redhat/certified-operator-index:v4.15
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
```

Then follow the installation steps in [NVIDIA GPU Operator on Red Hat OpenShift Container Platform](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html).

## NVIDIA Binaries on RHCOS

The location of some NVIDIA binaries on an OpenShift node differs from the defaults. Make sure to pass the following values when installing the Helm chart:

```yaml
nvidiaDriverRoot: /run/nvidia/driver
nvidiaCtkPath: /var/usrlocal/nvidia/toolkit/nvidia-ctk
```

## OpenShift Security

OpenShift generally requires more stringent security settings than Kubernetes. If you see a warning about security context constraints when deploying the DRA plugin, pass the following to the Helm chart, either via an in-line variable or a values file:

```yaml
kubeletPlugin:
  containers:
    plugin:
      securityContext:
        privileged: true
        seccompProfile:
          type: Unconfined
```
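Putting these values together, a Helm invocation could look like the following sketch. The release name, namespace, chart path, and the `openshift-values.yaml` file (holding the `kubeletPlugin` security snippet above) are illustrative placeholders, not the canonical install command:

```console
$ helm upgrade --install nvidia-dra-driver deployments/helm/k8s-dra-driver \
    --namespace nvidia-dra-driver --create-namespace \
    --set nvidiaDriverRoot=/run/nvidia/driver \
    --set nvidiaCtkPath=/var/usrlocal/nvidia/toolkit/nvidia-ctk \
    --values openshift-values.yaml
```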
If you see security context constraints errors or warnings when deploying a sample workload, make sure to update the workload's security settings according to the [OpenShift documentation](https://docs.openshift.com/container-platform/4.15/operators/operator_sdk/osdk-complying-with-psa.html). Applying the following `securityContext` definition at the pod or container level usually works for non-privileged workloads:

```yaml
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
```

If you see the following error when trying to deploy a workload:

```console
Warning  FailedScheduling  21m  default-scheduler  running Reserve plugin "DynamicResources": podschedulingcontexts.resource.k8s.io "gpu-example" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: ,
```

apply the following RBAC configuration (this should be fixed in newer OpenShift builds):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:kube-scheduler:podfinalizers
rules:
- apiGroups:
  - ""
  resources:
  - pods/finalizers
  verbs:
  - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:kube-scheduler:podfinalizers:crbinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kube-scheduler:podfinalizers
subjects:
- kind: User
  name: system:kube-scheduler
```

## Using Multi-Instance GPU (MIG)

Workloads that use the Multi-Instance GPU (MIG) feature require MIG to be [enabled](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#enable-mig-mode) on worker nodes with [MIG-supported GPUs](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus), e.g. A100.

You can enable MIG mode via the driver daemon set pod running on a GPU node as follows (here, the GPU ID is 0, i.e. `-i 0`):

```console
$ oc exec -ti nvidia-driver-daemonset-416.94.202402160025-0-g45bd -n nvidia-gpu-operator -- nvidia-smi -i 0 -mig 1
Enabled MIG Mode for GPU 00000000:0A:00.0
All done.
```

Make sure to stop everything that may hold the GPU before enabling MIG. Otherwise, you will see a warning, and the MIG status will have an asterisk (i.e. `Enabled*`), meaning that the setting could not be applied.
\ No newline at end of file
diff --git a/demo/clusters/openshift/add-certified-catalog-source.sh b/demo/clusters/openshift/add-certified-catalog-source.sh
new file mode 100755
index 00000000..12fe1495
--- /dev/null
+++ b/demo/clusters/openshift/add-certified-catalog-source.sh
@@ -0,0 +1,21 @@
#!/usr/bin/env bash

set -ex
set -o pipefail

oc create -f - <<EOF
kind: CatalogSource
apiVersion: operators.coreos.com/v1alpha1
metadata:
  name: certified-operators-v415
  namespace: openshift-marketplace
spec:
  displayName: Certified Operators v4.15
  image: registry.redhat.io/redhat/certified-operator-index:v4.15
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
EOF
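A possible end-to-end flow with the helper scripts added by this patch (an illustrative sketch: it assumes you run from the repository root as a cluster administrator, and the `oc get` output shown is approximate):

```console
$ ./demo/clusters/openshift/add-certified-catalog-source.sh
$ ./demo/clusters/openshift/enable-dra-profile.sh
$ ./demo/clusters/openshift/extend-kube-scheduler-rbac.sh
$ oc get catalogsource certified-operators-v415 -n openshift-marketplace
NAME                       DISPLAY                     TYPE   PUBLISHER   AGE
certified-operators-v415   Certified Operators v4.15   grpc   Red Hat     30s
```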