
nvidia arm64 & GPU operator test #583

Open

jepio wants to merge 14 commits into flatcar-master from kola-nvidia-arm64-test

Conversation

@jepio jepio commented Feb 27, 2025

  • Add SkipFunc implementation for skipping test on unsupported instance types
  • Add GPU operator test (includes nvidia-runtime sysext test)
  • Add Arm64 support to both tests
  • Add AWS support

@jepio jepio requested a review from Copilot February 27, 2025 18:41

PR Overview

This pull request adds support for NVIDIA GPU testing by introducing a SkipFunc for unsupported instance types, adding a GPU operator test (including an NVIDIA runtime sysext test), and extending support to the ARM64 architecture and AWS platform.

  • Introduces skipOnNonGpu to conditionally skip tests on unsupported instance types (a minimal sketch follows this list).
  • Adds a new test (cl.misc.nvidia.operator) with a complete GPU operator installation and validation workflow.
  • Updates existing NVIDIA installation test to incorporate ARM64 support via template configuration.
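
The skip logic itself is not shown in this excerpt. A minimal sketch of such a check, assuming the test registers a SkipFunc-style hook that consults the instance type; the helper name, prefix list, and function body below are illustrative assumptions, not code from the PR:

package misc

import "strings"

// Illustrative sketch only: these prefixes are example GPU instance families,
// not the list used by the actual skipOnNonGpu implementation.
var gpuInstancePrefixes = []string{"g4dn.", "g5g.", "p3."}

// hasGpuInstanceType reports whether an instance type is expected to expose an
// NVIDIA GPU; a SkipFunc-style hook could call this and skip the test when it
// returns false.
func hasGpuInstanceType(instanceType string) bool {
    for _, prefix := range gpuInstancePrefixes {
        if strings.HasPrefix(instanceType, prefix) {
            return true
        }
    }
    return false
}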

Reviewed Changes

File: kola/tests/misc/nvidia.go
Description: Added new constants, skip logic, the GPU operator test implementation, and expanded platform/architecture support

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

kola/tests/misc/nvidia.go:162

  • The multi-line helm installation command uses backticks, which preserve literal newlines. Verify that the shell execution handles these newlines as intended, or consider converting it to a single-line command.
_ = c.MustSSH(m, `curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
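
A minimal illustration of the single-line alternative the comment suggests; the tail of the original command is truncated above, so the chmod/run steps here are assumed placeholders rather than the PR's actual commands:

// Illustrative only: joining the steps with && keeps the whole command on one
// line, so no literal newlines reach the remote shell.
cmd := "curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3" +
    " && chmod 700 get_helm.sh && ./get_helm.sh"
_ = c.MustSSH(m, cmd)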

kola/tests/misc/nvidia.go:101

  • The SSH check in waitForNvidiaDriver only checks for the substring 'active (exited)', which may be too specific if the nvidia service enters other valid states. Consider broadening the check or adding a comment to clarify the expected state.
out, err := c.SSH(*m, "systemctl status nvidia.service")
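
One way to broaden it, sketched under the assumption that this runs inside waitForNvidiaDriver's retry closure (which returns an error to trigger another attempt); instead of grepping the status output, ask systemd for the unit state directly:

// Sketch: `systemctl is-active` prints the unit state; treat "active" as done,
// "activating" as still in progress (the driver build/install can take a
// while), and anything else as unexpected. Assumes fmt and strings are imported.
out, _ := c.SSH(*m, "systemctl is-active nvidia.service")
switch state := strings.TrimSpace(string(out)); state {
case "active":
    return nil // driver is ready
case "activating":
    return fmt.Errorf("nvidia.service still activating")
default:
    return fmt.Errorf("unexpected nvidia.service state: %q", state)
}
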
@jepio jepio force-pushed the kola-nvidia-arm64-test branch 2 times, most recently from 9e7301d to dbb49cb Compare March 5, 2025 18:20
@jepio jepio marked this pull request as ready for review March 5, 2025 18:22
@jepio jepio requested a review from a team March 5, 2025 18:22
      contents:
        inline: |
          NVIDIA_DRIVER_VERSION=570.86.15
    - path: /opt/extensions/kubernetes-v1.30.4-{{ .ARCH_SUFFIX }}.raw
Member

Is there a reason for picking exactly this version, as opposed to, say, 1.30.8 or any newer major version? Asking because I think we will eventually need to bump those versions, so maybe in the future we will want to change the code to always download the latest version of k8s or the nvidia runtime.

Member Author

Any version is good enough; I've bumped it to 1.30.8.

I think we'll need to update this eventually, but I don't expect it to always have to be the latest: the GPU operator itself has a matrix of supported k8s versions.

I think this test is going to be most helpful for testing the nvidia-runtime sysext and any changes to Flatcar's nvidia.service that may break the GPU operator.

    - path: /etc/flatcar/nvidia-metadata
      contents:
        inline: |
          NVIDIA_DRIVER_VERSION=570.86.15
Member

Should we upgrade the nvidia drivers in the coreos-overlay instead of hardcoding this version?

Member Author

R570 seems to be the first release that compiles correctly for aarch64 with kernel 6.6 and our gcc version; older ones have a build error. But according to the nvidia docs (https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-and-drivers-table), R535 is actually the LTS branch, so it makes for a better default for Flatcar. Nvidia drivers on amd64 are also much more widely used.

How about I upgrade only arm64 to 570 in coreos-overlay?

Member

> R570 seems to be the first release that compiles correctly for aarch64 with kernel 6.6 and our gcc version; older ones have a build error. But according to the nvidia docs (https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-and-drivers-table), R535 is actually the LTS branch, so it makes for a better default for Flatcar. Nvidia drivers on amd64 are also much more widely used.
>
> How about I upgrade only arm64 to 570 in coreos-overlay?

Sounds good. amd64 will eventually catch up to arm64.

Another idea that came to mind is to add a patch for R535 to fix the build issue, but I have no idea whether that makes sense or how much effort it would be.

Comment on lines 114 to 119
// Earlier driver versions have issue building on arm64 with kernel 6.6
if kola.QEMUOptions.Board == "arm64-usr" {
params["NVIDIA_DRIVER_VERSION_LINE"] = "NVIDIA_DRIVER_VERSION=570.86.15"
} else {
params["NVIDIA_DRIVER_VERSION_LINE"] = ""
}
Member

Would this still be necessary if we updated the nvidia-drivers package in overlay?

@jepio jepio force-pushed the kola-nvidia-arm64-test branch from dbb49cb to 119cd04 Compare March 6, 2025 19:05
jepio added 14 commits March 7, 2025 12:48
This relies on the nvidia-runtime sysext from the bakery.

Signed-off-by: Jeremi Piotrowski <[email protected]>
So that it doesn't look like a subtest which messes with the retry logic in
scripts.

Signed-off-by: Jeremi Piotrowski <[email protected]>
Instead of a particular output, which only matches a single GPU type.

Signed-off-by: Jeremi Piotrowski <[email protected]>
Signed-off-by: Jeremi Piotrowski <[email protected]>
The driver version for arm64 has been changed in Flatcar, so we can rely on the
default now.

Signed-off-by: Jeremi Piotrowski <[email protected]>
@jepio jepio force-pushed the kola-nvidia-arm64-test branch from 119cd04 to 2480322 Compare March 7, 2025 12:37
Member

@krnowak krnowak left a comment

I think that the PR is fine as it is. I have some ideas below about moving the version numbers to constants to make it easier to bump them when the need arises. This could very well be done in a follow-up PR, which could probably also add some automation. Up to you.

Comment on lines +30 to +41
    - path: /opt/extensions/kubernetes-v1.30.8-{{ .ARCH_SUFFIX }}.raw
      contents:
        source: https://github.com/flatcar/sysext-bakery/releases/download/latest/kubernetes-v1.30.8-{{ .ARCH_SUFFIX }}.raw
    - path: /opt/extensions/nvidia_runtime-v1.16.2-{{ .ARCH_SUFFIX }}.raw
      contents:
        source: https://github.com/flatcar/sysext-bakery/releases/download/latest/nvidia_runtime-v1.16.2-{{ .ARCH_SUFFIX }}.raw
  links:
    - path: /etc/extensions/kubernetes.raw
      target: /opt/extensions/kubernetes-v1.30.8-{{ .ARCH_SUFFIX }}.raw
      hard: false
    - path: /etc/extensions/nvidia_runtime.raw
      target: /opt/extensions/nvidia_runtime-v1.16.2-{{ .ARCH_SUFFIX }}.raw
Member

Can I ask you to make the kubernetes version and nvidia runtime version template parameters? The actual versions could be constants defined next to the CmdTimeout constant above. This would make it easier to update the versions in the future, either automatically or manually.
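
A minimal sketch of that suggestion, with illustrative constant and parameter names (the PR excerpt does not define them): pin the versions next to CmdTimeout and inject them into the template.

// Illustrative names; the values are the ones used in the config above.
const (
    kubernetesSysextVersion    = "v1.30.8"
    nvidiaRuntimeSysextVersion = "v1.16.2"
)

// When rendering the Butane/Ignition template, pass the versions alongside
// ARCH_SUFFIX so paths and download URLs are derived from one place:
//   params["KUBERNETES_VERSION"] = kubernetesSysextVersion
//   params["NVIDIA_RUNTIME_VERSION"] = nvidiaRuntimeSysextVersion
// and reference {{ .KUBERNETES_VERSION }} / {{ .NVIDIA_RUNTIME_VERSION }} in
// the snippet above instead of hardcoding v1.30.8 and v1.16.2.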

_ = c.MustSSH(m, "/opt/bin/helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && /opt/bin/helm repo update")
_ = c.MustSSH(m, `/opt/bin/helm install --wait --generate-name \
-n gpu-operator --create-namespace \
--version v24.9.2 \
Member

I think this version could also be defined as a constant, next to kubernetes and nvidia runtime version constants. Thanks.
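
Following the same pattern, a sketch with an illustrative constant name; the chart reference at the end of the command is assumed here, since the rest of the helm invocation is not shown in the excerpt:

// Hypothetical constant name; value taken from the command above.
const gpuOperatorChartVersion = "v24.9.2"

// Interpolate the constant instead of hardcoding the version in the command.
cmd := fmt.Sprintf(`/opt/bin/helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    --version %s \
    nvidia/gpu-operator`, gpuOperatorChartVersion)
_ = c.MustSSH(m, cmd)

The cuda-sample image tag used by the validation pod further down could be lifted into a constant the same way.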

  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
Member

Maybe this docker tag could also be a constant?
