
nvidia arm64 & GPU operator test #583

Open

jepio wants to merge 14 commits into flatcar-master from kola-nvidia-arm64-test

Conversation

@jepio jepio commented Feb 27, 2025

  • Add SkipFunc implementation for skipping test on unsupported instance types
  • Add GPU operator test (includes nvidia-runtime sysext test)
  • Add Arm64 support to both tests
  • Add AWS support

@jepio jepio requested a review from Copilot February 27, 2025 18:41

PR Overview

This pull request adds support for NVIDIA GPU testing by introducing a SkipFunc for unsupported instance types, adding a GPU operator test (including an NVIDIA runtime sysext test), and extending support to the ARM64 architecture and AWS platform.

  • Introduces skipOnNonGpu to conditionally skip tests on unsupported instance types (a minimal sketch follows this list).
  • Adds a new test (cl.misc.nvidia.operator) with a complete GPU operator installation and validation workflow.
  • Updates existing NVIDIA installation test to incorporate ARM64 support via template configuration.
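
The skip logic itself is not shown in this excerpt. A minimal sketch of such a check, assuming the test registers a SkipFunc-style hook that consults the instance type; the helper name, prefix list, and function body below are illustrative assumptions, not code from the PR:

package misc

import "strings"

// Illustrative sketch only: these prefixes are example GPU instance families,
// not the list used by the actual skipOnNonGpu implementation.
var gpuInstancePrefixes = []string{"g4dn.", "g5g.", "p3."}

// hasGpuInstanceType reports whether an instance type is expected to expose an
// NVIDIA GPU; a SkipFunc-style hook could call this and skip the test when it
// returns false.
func hasGpuInstanceType(instanceType string) bool {
    for _, prefix := range gpuInstancePrefixes {
        if strings.HasPrefix(instanceType, prefix) {
            return true
        }
    }
    return false
}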

Reviewed Changes

File: kola/tests/misc/nvidia.go
Description: Added new constants, skip logic, the GPU operator test implementation, and expanded platform/architecture support

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

kola/tests/misc/nvidia.go:162

  • The multi-line helm installation command uses backticks, which preserve literal newlines. Verify that the shell execution handles these newlines as intended, or consider converting it to a single-line command.
_ = c.MustSSH(m, `curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
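
A minimal illustration of the single-line alternative the comment suggests; the tail of the original command is truncated above, so the chmod/run steps here are assumed placeholders rather than the PR's actual commands:

// Illustrative only: joining the steps with && keeps the whole command on one
// line, so no literal newlines reach the remote shell.
cmd := "curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3" +
    " && chmod 700 get_helm.sh && ./get_helm.sh"
_ = c.MustSSH(m, cmd)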

kola/tests/misc/nvidia.go:101

  • The SSH check in waitForNvidiaDriver only checks for the substring 'active (exited)', which may be too specific if the nvidia service enters other valid states. Consider broadening the check or adding a comment to clarify the expected state.
out, err := c.SSH(*m, "systemctl status nvidia.service")
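
One way to broaden it, sketched under the assumption that this runs inside waitForNvidiaDriver's retry closure (which returns an error to trigger another attempt); instead of grepping the status output, ask systemd for the unit state directly:

// Sketch: `systemctl is-active` prints the unit state; treat "active" as done,
// "activating" as still in progress (the driver build/install can take a
// while), and anything else as unexpected. Assumes fmt and strings are imported.
out, _ := c.SSH(*m, "systemctl is-active nvidia.service")
switch state := strings.TrimSpace(string(out)); state {
case "active":
    return nil // driver is ready
case "activating":
    return fmt.Errorf("nvidia.service still activating")
default:
    return fmt.Errorf("unexpected nvidia.service state: %q", state)
}
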
@jepio jepio force-pushed the kola-nvidia-arm64-test branch 2 times, most recently from 9e7301d to dbb49cb Compare March 5, 2025 18:20
@jepio jepio marked this pull request as ready for review March 5, 2025 18:22
@jepio jepio requested a review from a team March 5, 2025 18:22
      contents:
        inline: |
          NVIDIA_DRIVER_VERSION=570.86.15
    - path: /opt/extensions/kubernetes-v1.30.4-{{ .ARCH_SUFFIX }}.raw
Member

Is there a reason for picking exactly this version, as opposed to, say, 1.30.8 or any newer major version? Asking because I think we will eventually need to bump those versions, so maybe in the future we will want to change the code to always download the latest version of k8s or the nvidia runtime.

Member Author

Any version is good enough; I've bumped it to 1.30.8.

I think we'll need to update this eventually, but I don't expect it to always have to be the latest: the GPU operator itself has a matrix of supported k8s versions.

I think this test is going to be most helpful for testing the nvidia-runtime sysext and any changes to Flatcar's nvidia.service that may break the GPU operator.

    - path: /etc/flatcar/nvidia-metadata
      contents:
        inline: |
          NVIDIA_DRIVER_VERSION=570.86.15
Member

Should we upgrade the nvidia drivers in the coreos-overlay instead of hardcoding this version?

Member Author

R570 seems to be the first release that compiles correctly for aarch64 with kernel 6.6 and our gcc version; older ones have a build error. But according to the nvidia docs (https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-and-drivers-table), R535 is actually the LTS branch, so it makes for a better default for Flatcar. Nvidia drivers on amd64 are also much more widely used.

How about I upgrade only arm64 to 570 in coreos-overlay?

Member

> R570 seems to be the first release that compiles correctly for aarch64 with kernel 6.6 and our gcc version; older ones have a build error. But according to the nvidia docs (https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-and-drivers-table), R535 is actually the LTS branch, so it makes for a better default for Flatcar. Nvidia drivers on amd64 are also much more widely used.
>
> How about I upgrade only arm64 to 570 in coreos-overlay?

Sounds good. amd64 will eventually catch up to arm64.

Another idea that came to mind is to add a patch for R535 to fix the build issue, but I have no idea whether that makes sense or how much effort it would be.

Comment on lines 114 to 119
// Earlier driver versions have issue building on arm64 with kernel 6.6
if kola.QEMUOptions.Board == "arm64-usr" {
params["NVIDIA_DRIVER_VERSION_LINE"] = "NVIDIA_DRIVER_VERSION=570.86.15"
} else {
params["NVIDIA_DRIVER_VERSION_LINE"] = ""
}
Member

Would this still be necessary if we updated the nvidia-drivers package in overlay?

@jepio jepio force-pushed the kola-nvidia-arm64-test branch from dbb49cb to 119cd04 Compare March 6, 2025 19:05
jepio added 14 commits March 7, 2025 12:48
This relies on the nvidia-runtime sysext from the bakery.

Signed-off-by: Jeremi Piotrowski <[email protected]>
So that it doesn't look like a subtest which messes with the retry logic in
scripts.

Signed-off-by: Jeremi Piotrowski <[email protected]>
Instead of a particular output, which only matches a single GPU type.

Signed-off-by: Jeremi Piotrowski <[email protected]>
Signed-off-by: Jeremi Piotrowski <[email protected]>
The driver version for arm64 has been changed in Flatcar, so we can rely on the
default now.

Signed-off-by: Jeremi Piotrowski <[email protected]>
@jepio jepio force-pushed the kola-nvidia-arm64-test branch from 119cd04 to 2480322 Compare March 7, 2025 12:37
Member

@krnowak krnowak left a comment

I think that the PR is fine as it is. I have some ideas below about moving the version numbers to constants to make it easier to bump them when the need arises. This could very well be done in a follow-up PR, which could probably also add some automation. Up to you.

Comment on lines +30 to +41
    - path: /opt/extensions/kubernetes-v1.30.8-{{ .ARCH_SUFFIX }}.raw
      contents:
        source: https://github.com/flatcar/sysext-bakery/releases/download/latest/kubernetes-v1.30.8-{{ .ARCH_SUFFIX }}.raw
    - path: /opt/extensions/nvidia_runtime-v1.16.2-{{ .ARCH_SUFFIX }}.raw
      contents:
        source: https://github.com/flatcar/sysext-bakery/releases/download/latest/nvidia_runtime-v1.16.2-{{ .ARCH_SUFFIX }}.raw
  links:
    - path: /etc/extensions/kubernetes.raw
      target: /opt/extensions/kubernetes-v1.30.8-{{ .ARCH_SUFFIX }}.raw
      hard: false
    - path: /etc/extensions/nvidia_runtime.raw
      target: /opt/extensions/nvidia_runtime-v1.16.2-{{ .ARCH_SUFFIX }}.raw
Member

Can I ask you to make the kubernetes version and nvidia runtime version template parameters? The actual versions could be constants defined next to the CmdTimeout constant above. This would make it easier to update the versions in the future, either automatically or manually.
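
A minimal sketch of that suggestion, with illustrative constant and parameter names (the PR excerpt does not define them): pin the versions next to CmdTimeout and inject them into the template.

// Illustrative names; the values are the ones used in the config above.
const (
    kubernetesSysextVersion    = "v1.30.8"
    nvidiaRuntimeSysextVersion = "v1.16.2"
)

// When rendering the Butane/Ignition template, pass the versions alongside
// ARCH_SUFFIX so paths and download URLs are derived from one place:
//   params["KUBERNETES_VERSION"] = kubernetesSysextVersion
//   params["NVIDIA_RUNTIME_VERSION"] = nvidiaRuntimeSysextVersion
// and reference {{ .KUBERNETES_VERSION }} / {{ .NVIDIA_RUNTIME_VERSION }} in
// the snippet above instead of hardcoding v1.30.8 and v1.16.2.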

_ = c.MustSSH(m, "/opt/bin/helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && /opt/bin/helm repo update")
_ = c.MustSSH(m, `/opt/bin/helm install --wait --generate-name \
-n gpu-operator --create-namespace \
--version v24.9.2 \
Member

I think this version could also be defined as a constant, next to kubernetes and nvidia runtime version constants. Thanks.
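
Following the same pattern, a sketch with an illustrative constant name; the chart reference at the end of the command is assumed here, since the rest of the helm invocation is not shown in the excerpt:

// Hypothetical constant name; value taken from the command above.
const gpuOperatorChartVersion = "v24.9.2"

// Interpolate the constant instead of hardcoding the version in the command.
cmd := fmt.Sprintf(`/opt/bin/helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    --version %s \
    nvidia/gpu-operator`, gpuOperatorChartVersion)
_ = c.MustSSH(m, cmd)

The cuda-sample image tag used by the validation pod further down could be lifted into a constant the same way.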

  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
Member

Maybe this docker tag could also be a constant?
