nvidia arm64 & GPU operator test #583
base: flatcar-master
Conversation
jepio commented on Feb 27, 2025
- Add SkipFunc implementation for skipping test on unsupported instance types
- Add GPU operator test (includes nvidia-runtime sysext test)
- Add Arm64 support to both tests
- Add AWS support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Overview
This pull request adds support for NVIDIA GPU testing by introducing a SkipFunc for unsupported instance types, adding a GPU operator test (including an NVIDIA runtime sysext test), and extending support to the ARM64 architecture and AWS platform.
- Introduces skipOnNonGpu to conditionally skip tests on unsupported instances.
- Adds a new test (cl.misc.nvidia.operator) with a complete GPU operator installation and validation workflow.
- Updates existing NVIDIA installation test to incorporate ARM64 support via template configuration.
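The skip logic described above can be sketched as a small predicate. This is a minimal illustration, not the PR's actual implementation: the instance-type prefixes and the function shape are assumptions, since kola's real `SkipFunc` hook and the exact GPU instance list are not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// skipOnNonGpu returns true (skip the test) unless the instance type
// belongs to a known GPU family. The prefixes below are illustrative
// examples of AWS GPU instance families, not the PR's actual list.
func skipOnNonGpu(instanceType string) bool {
	gpuFamilies := []string{"g4dn.", "g5.", "g5g.", "p3.", "p4d."}
	for _, prefix := range gpuFamilies {
		if strings.HasPrefix(instanceType, prefix) {
			return false // GPU instance: run the test
		}
	}
	return true // no GPU: skip
}

func main() {
	fmt.Println(skipOnNonGpu("g4dn.xlarge")) // GPU instance
	fmt.Println(skipOnNonGpu("t3.medium"))   // non-GPU instance
}
```

In the real test, a function like this would be wired into the test registration so unsupported instance types are skipped rather than failed.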
Reviewed Changes
| File | Description |
| --- | --- |
| kola/tests/misc/nvidia.go | Added new constants, skip logic, GPU operator test implementation, and expanded platform/architecture support |
Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (2)
kola/tests/misc/nvidia.go:162
- The multi-line helm installation command uses backticks, which preserve literal newlines. Verify that the shell execution handles these newlines as intended, or consider converting it to a single-line command.
_ = c.MustSSH(m, `curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
kola/tests/misc/nvidia.go:101
- The SSH check in waitForNvidiaDriver only verifies for the substring 'active (exited)', which may be too specific if the nvidia service enters other valid states. Consider broadening the check or adding comments to clarify the expected state.
out, err := c.SSH(*m, "systemctl status nvidia.service")
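One way to broaden the check, as the comment suggests, is to accept any `active` state rather than the exact substring `active (exited)`. A sketch over the `systemctl status` output; parsing the `Active:` line this way is an assumption about the output format, and `systemctl is-active nvidia.service` would be an even simpler alternative:

```go
package main

import (
	"fmt"
	"strings"
)

// nvidiaServiceReady reports whether the systemctl status output shows
// the unit in any "active" state (running, exited, ...), instead of
// matching only the "active (exited)" substring.
func nvidiaServiceReady(statusOutput string) bool {
	for _, line := range strings.Split(statusOutput, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "Active:") {
			return strings.HasPrefix(line, "Active: active")
		}
	}
	return false
}

func main() {
	fmt.Println(nvidiaServiceReady("   Active: active (exited) since Mon ..."))
	fmt.Println(nvidiaServiceReady("   Active: failed (Result: exit-code)"))
}
```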
Force-pushed from 9e7301d to dbb49cb
kola/tests/misc/nvidia.go
Outdated
    contents:
      inline: |
        NVIDIA_DRIVER_VERSION=570.86.15
- path: /opt/extensions/kubernetes-v1.30.4-{{ .ARCH_SUFFIX }}.raw
Is there a reason for picking exactly this version? As opposed to, say, 1.30.8 or any newer major version. Asking, because I think that eventually we will need to be bumping those versions, so maybe in future we will want to change the code to always download the latest version of k8s or nvidia runtime.
Any version is good enough; I've bumped it to 1.30.8.
I think we'll need to update this eventually, but I don't expect it to always have to be the latest: the GPU operator itself has a matrix of supported k8s versions.
I think this test is going to be most helpful for testing the nvidia-runtime sysext and any changes to Flatcar's nvidia.service that may break the GPU operator.
kola/tests/misc/nvidia.go
Outdated
- path: /etc/flatcar/nvidia-metadata
  contents:
    inline: |
      NVIDIA_DRIVER_VERSION=570.86.15
Should we upgrade the nvidia drivers in the coreos-overlay instead of hardcoding this version?
R570 seems to be the first one that compiles correctly for aarch64 with kernel 6.6 and our gcc version; older ones have a build error. But according to NVIDIA docs (https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-and-drivers-table), R535 is actually LTS, so it makes for a better default for Flatcar. NVIDIA drivers on amd64 are also much more widely used.
How about I upgrade only arm64 to 570 in coreos-overlay?
> R570 seems to be the first one that compiles correctly for aarch64 with kernel 6.6 and our gcc version; older ones have a build error. But according to NVIDIA docs (https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-and-drivers-table), R535 is actually LTS, so it makes for a better default for Flatcar. NVIDIA drivers on amd64 are also much more widely used.
> How about I upgrade only arm64 to 570 in coreos-overlay?
Sounds good. amd64 will eventually catch up to arm64.
Another idea that came to my mind is to have a patch for R535 to fix the build issue, but I have no clue if it makes sense and how much effort that would be.
kola/tests/misc/nvidia.go
Outdated
// Earlier driver versions have issue building on arm64 with kernel 6.6
if kola.QEMUOptions.Board == "arm64-usr" {
	params["NVIDIA_DRIVER_VERSION_LINE"] = "NVIDIA_DRIVER_VERSION=570.86.15"
} else {
	params["NVIDIA_DRIVER_VERSION_LINE"] = ""
}
Would this still be necessary if we updated the nvidia-drivers package in overlay?
Force-pushed from dbb49cb to 119cd04
Signed-off-by: Jeremi Piotrowski <[email protected]>
This relies on the nvidia-runtime sysext from the bakery. Signed-off-by: Jeremi Piotrowski <[email protected]>
Signed-off-by: Jeremi Piotrowski <[email protected]>
Signed-off-by: Jeremi Piotrowski <[email protected]>
Signed-off-by: Jeremi Piotrowski <[email protected]>
Signed-off-by: Jeremi Piotrowski <[email protected]>
Signed-off-by: Jeremi Piotrowski <[email protected]>
So that it doesn't look like a subtest which messes with the retry logic in scripts. Signed-off-by: Jeremi Piotrowski <[email protected]>
Instead of a particular output, which only matches a single GPU type. Signed-off-by: Jeremi Piotrowski <[email protected]>
Signed-off-by: Jeremi Piotrowski <[email protected]>
Signed-off-by: Jeremi Piotrowski <[email protected]>
The driver version for arm64 has been changed in Flatcar, so we can rely on the default now. Signed-off-by: Jeremi Piotrowski <[email protected]>
Force-pushed from 119cd04 to 2480322
I think that the PR is fine as it is. I have some ideas below about moving the version numbers to constants to make it easier to bump them when a need appears. This could very well be done in a follow-up PR, which could probably also add some automation. Up to you.
  - path: /opt/extensions/kubernetes-v1.30.8-{{ .ARCH_SUFFIX }}.raw
    contents:
      source: https://github.com/flatcar/sysext-bakery/releases/download/latest/kubernetes-v1.30.8-{{ .ARCH_SUFFIX }}.raw
  - path: /opt/extensions/nvidia_runtime-v1.16.2-{{ .ARCH_SUFFIX }}.raw
    contents:
      source: https://github.com/flatcar/sysext-bakery/releases/download/latest/nvidia_runtime-v1.16.2-{{ .ARCH_SUFFIX }}.raw
links:
  - path: /etc/extensions/kubernetes.raw
    target: /opt/extensions/kubernetes-v1.30.8-{{ .ARCH_SUFFIX }}.raw
    hard: false
  - path: /etc/extensions/nvidia_runtime.raw
    target: /opt/extensions/nvidia_runtime-v1.16.2-{{ .ARCH_SUFFIX }}.raw
Can I ask you to make the kubernetes and nvidia runtime versions template parameters? The actual versions could be constants defined next to the CmdTimeout constant above. This would make it easier to update the versions in the future, either automatically or manually.
_ = c.MustSSH(m, "/opt/bin/helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && /opt/bin/helm repo update")
_ = c.MustSSH(m, `/opt/bin/helm install --wait --generate-name \
	-n gpu-operator --create-namespace \
	--version v24.9.2 \
I think this version could also be defined as a constant, next to kubernetes and nvidia runtime version constants. Thanks.
restartPolicy: OnFailure
containers:
- name: cuda-vectoradd
  image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
Maybe this docker tag could also be a constant?