Install AMD GPU Kernel drivers if required #5875

r2k1 · 2025-02-19T01:42:06Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add support for AMD GPUs. Install the AMD GPU kernel drivers unless the user specified --skip-gpu-driver-install.

AMD operates differently from NVIDIA. A kernel driver is required on the host VM. Additionally, containers should have the ROCm suite installed (which blows container image sizes up to 60GB+, but that's on AMD).

With this change, the .deb packages are cached on the VM but aren't installed until required (the cache adds about 30MB of overhead).
Once installed, they take about 600-1000MB of additional space.
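
For readers following along, the "cache now, install on demand" flow could be sketched roughly as below. This is a minimal sketch, not the code in this PR: the cache path is the one used in the diff, while the function names `cached_debs` and `install_cached_debs` and the `DRY_RUN` switch are assumptions added for illustration.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the cached .deb files, one per line.
cached_debs() {
    local cache_dir="${1:-/var/cache/amdgpu-apt}"
    local deb
    for deb in "${cache_dir}"/*.deb; do
        # With no matches the glob stays literal, so skip paths that don't exist.
        [ -e "${deb}" ] && printf '%s\n' "${deb}"
    done
    return 0
}

# Hypothetical helper: install everything from the cache, or fail if it is empty.
install_cached_debs() {
    local cache_dir="${1:-/var/cache/amdgpu-apt}"
    local debs
    debs="$(cached_debs "${cache_dir}")"
    if [ -z "${debs}" ]; then
        echo "no cached packages in ${cache_dir}" >&2
        return 1
    fi
    if [ "${DRY_RUN:-false}" = "true" ]; then
        # DRY_RUN=true only prints the command instead of running dpkg.
        echo "dpkg -i ${debs//$'\n'/ }"
    else
        # Unquoted on purpose so each path becomes its own argument
        # (assumes no spaces in the cached file names).
        sudo dpkg -i ${debs}
    fi
}
```

Running it with `DRY_RUN=true` prints the `dpkg` command without touching the system, which may help shorten the feedback loop when iterating on the script.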

We don't have reliable quota for the VM SKUs in the test subscription, and VM allocation isn't reliable either.
So automatic testing is disabled until we manage to sort that out.

vhdbuilder/packer/install-dependencies.sh

djsly · 2025-02-19T02:00:42Z

parts/linux/cloud-init/artifacts/cse_main.sh

@@ -206,6 +206,16 @@ EOF
    fi
 fi

+if [[ "${AMD_GPU_NODE}" = true ]] && [[ "${skip_gpu_driver_install}" != "true" ]]; then
+    logs_to_events "AKS.CSE.ensureAMDGPUDrivers" ensureAMDGPUDrivers
+else


Should we have those in a reusable function? I would expect that we run cleanup logic after we install the GPU drivers.

Here I deleted the AMD-related logic for non-AMD users.

We can probably clean up for everyone.
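
One possible shape for the reusable function discussed above, as a sketch only. `ensureAMDGPUDrivers` and `logs_to_events` come from the diff; `amd_gpu_action` and `cleanupAMDGPUCache` are hypothetical names, and the cleanup step itself is not shown in this hunk.

```bash
#!/usr/bin/env bash

# Hypothetical helper: decide what to do with the cached AMD packages.
# Prints "install" for AMD GPU nodes that did not opt out, "cleanup" otherwise.
amd_gpu_action() {
    local amd_gpu_node="$1" skip_install="$2"
    if [[ "${amd_gpu_node}" = true ]] && [[ "${skip_install}" != "true" ]]; then
        echo install
    else
        echo cleanup
    fi
}

# Usage in cse_main.sh might then look like this (cleanupAMDGPUCache is hypothetical):
#   case "$(amd_gpu_action "${AMD_GPU_NODE}" "${skip_gpu_driver_install}")" in
#       install) logs_to_events "AKS.CSE.ensureAMDGPUDrivers" ensureAMDGPUDrivers ;;
#       cleanup) cleanupAMDGPUCache ;;
#   esac
```

Keeping the decision separate from the side effects makes the branch easy to test without a GPU VM. Running the cleanup after a successful install as well would be a small change to the `install` arm.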

djsly · 2025-02-19T02:03:13Z

parts/linux/cloud-init/artifacts/cse_config.sh

+
+    pushd /var/cache/amdgpu-apt
+    ls -l
+    sudo dpkg -i *.deb


Why do we need autoconf, automake, and autotools-dev? Those three packages are purely for compiling the drivers from scratch. Are we doing that?

We shouldn't need to keep those... (not even sure why we need them)

They're dependencies of amdgpu, so I'm not sure what I can do about it.

I can run another set of tests without them (the feedback loop is frustrating).
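
To check which cached package actually declares those build tools, one option is to inspect the `Depends` field of each .deb. The sketch below is an assumption-laden illustration: `build_tool_deps` is a hypothetical helper, and the sample `Depends` string in the usage comment is made up rather than taken from the real amdgpu packages.

```bash
#!/usr/bin/env bash

# Hypothetical helper: given a Depends string (as printed by
# `dpkg-deb -f some.deb Depends`), print any of the three build tools
# questioned in this thread that it declares.
build_tool_deps() {
    local depends="$1" entry name
    local -a entries
    IFS=',' read -ra entries <<< "${depends}"
    for entry in "${entries[@]}"; do
        # Strip leading whitespace.
        entry="${entry#"${entry%%[![:space:]]*}"}"
        # Keep only the package name: drop version constraints and alternatives.
        name="${entry%%[[:space:](|]*}"
        case "${name}" in
            autoconf|automake|autotools-dev) printf '%s\n' "${name}" ;;
        esac
    done
}

# Example over the cache directory from the diff:
#   for deb in /var/cache/amdgpu-apt/*.deb; do
#       echo "${deb}: $(build_tool_deps "$(dpkg-deb -f "${deb}" Depends)" | tr '\n' ' ')"
#   done
```

If only one package pulls them in, that narrows down whether they can be dropped or removed again after installation.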

djsly · 2025-02-19T02:03:45Z

parts/linux/cloud-init/artifacts/cse_config.sh

+    sudo dpkg -i *.deb
+    popd
+
+    REBOOTREQUIRED=true


This is causing us customer headaches; we need to ensure the node doesn't get to a ready state before the reboot.

I thought setting the variable does it for me automatically. Any guidance on what needs to change?

@ganeshkumarashok what are we planning to do for the MIG use case?

djsly · 2025-02-19T02:04:53Z

parts/linux/cloud-init/artifacts/cse_config.sh

+    echo "Installing AMD GPU drivers"
+
+     # delete amdgpu module from blacklist
+    sudo sed -i '/blacklist amdgpu/d' /etc/modprobe.d/blacklist-radeon-instinct.conf


Can we get a better explanation of why we have that? My understanding was that this was required on a VM to enable access to the physical hardware on the bare VM.

I'd like to know as well why we need it. It took me a lot of time to figure out why the drivers didn't work.

I think it's blacklisted in the image provided by Canonical, but nothing explains why it's there.
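
For reference, the effect of the `sed` line in the diff can be reproduced on a scratch file. This is a sketch under assumptions: `remove_amdgpu_blacklist` is a hypothetical wrapper, and the sample file contents in the usage comment are invented, not copied from the Canonical image.

```bash
#!/usr/bin/env bash

# Hypothetical wrapper around the same sed expression used in the diff.
# Deletes every line containing "blacklist amdgpu" and keeps the rest.
remove_amdgpu_blacklist() {
    local conf="$1" tmp
    tmp="$(mktemp)"
    # Write through a temp file instead of `sed -i` so the sketch also
    # works with sed implementations that treat -i differently.
    sed '/blacklist amdgpu/d' "${conf}" > "${tmp}" && cat "${tmp}" > "${conf}"
    rm -f "${tmp}"
}

# Example on a scratch copy (made-up contents):
#   printf 'blacklist radeon\nblacklist amdgpu\n' > /tmp/blacklist-test.conf
#   remove_amdgpu_blacklist /tmp/blacklist-test.conf
#   cat /tmp/blacklist-test.conf
```

Note that the pattern matches any line containing the text, so a commented-out `# blacklist amdgpu` line would be removed too.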

djsly · 2025-02-19T02:10:37Z

e2e/kube.go

+			Containers: []corev1.Container{
+				{
+					Name:  "amdgpu-device-plugin-container",
+					Image: "rocm/k8s-device-plugin",


Should pull from MCR; Docker Hub pulls aren't stable (assuming this is from docker.io).

I'll need to add it to MCR at some stage. I don't think it's available there yet.

SFI looks for that. We also lost the contract that prevented us from getting throttled on docker.io, so expect throttling.

We have an ACR for E2E; we should set up a remote cache for this image and use our own ACR.
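
As a rough illustration of the ACR suggestion, the image reference would change from a Docker Hub name to one served by the cache registry. The helper name `cached_image_ref` and the registry `example.azurecr.io` are placeholders, not the real E2E registry, and how the cache itself gets configured is left out.

```bash
#!/usr/bin/env bash

# Hypothetical helper: map a Docker Hub image reference to the same
# repository path under a cache registry.
cached_image_ref() {
    local image="$1" registry="$2"
    # Drop an explicit docker.io/ prefix if one is present.
    image="${image#docker.io/}"
    printf '%s/%s\n' "${registry}" "${image}"
}

# Example with a placeholder registry name:
#   cached_image_ref rocm/k8s-device-plugin example.azurecr.io
```

The test manifest in e2e/kube.go would then use the rewritten reference, so pulls no longer depend on docker.io rate limits.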

djsly · 2025-02-19T02:11:30Z

e2e/scenario_test.go

+}
+
+func Test_Ubuntu2204Gen2Containerd_AMDGPU_V710(t *testing.T) {
+	// the SKU isn't available in subscriptrion/region we run tests


Can we add a TODO?

djsly · 2025-02-19T02:11:43Z

e2e/scenario_test.go

@@ -1664,3 +1664,69 @@ func Test_Ubuntu2404ARM(t *testing.T) {
 		},
 	})
 }
+
+func Test_Ubuntu2404Gen2Containerd_AMDGPU_MI300(t *testing.T) {
+	t.Skip("Provisioning of Standard_ND96isr_MI300X_v5 isn't reliable yet")


Can we add a TODO?

djsly · 2025-02-19T02:13:26Z

e2e/config/azure.go

-		return "", err
-	}
-	return *identity.Properties.ClientID, nil
+	// HACK: temporary disable to allow running test in different subscription, without enough permissions


Should we uncomment all this?

I'll remove it before merging the PR, once I stop testing it.

Currently it's required only for a single scriptless test that is disabled by default.
Not much harm if it accidentally leaks through.

djsly · 2025-02-19T02:14:21Z

vhdbuilder/packer/install-dependencies.sh

@@ -619,3 +619,40 @@ rm -f ./azcopy # cleanup immediately after usage will return in two downloads
 echo "install-dependencies step completed successfully"
 capture_benchmark "${SCRIPT_NAME}_overall" true
 process_benchmarks
+
+
+download_amdgpu_drivers() {


Do we have a device plugin to install as well, to ensure the GPU count shows up in the node labels?

It exists, but probably not in MCR.

AFAIK we don't do it yet for NVIDIA. Should it be consistent for both of them?

We can probably address it separately.

r2k1 added 15 commits February 17, 2025 14:14

set GPUNODE variable

8cd5f51

add placeholder to install AMD GPU drivers

af0c6a9

add script to install AMD drivers

034e069

add amd gpu test

7c9e2c2

cache dependencies

12cb3f9

install amdgpu dependencies

b71f5a9

install amdgpu dependencies

da9ce36

log ssh instructions earlier

8e3efd2

update installation step

09c051a

fix script

75a2722

simplify ssh key upload

511feb3

reduce noise

2a1b1d1

use default VMSKU

c749b3e

improve logging

adfed04

update e2e tests

af1b03f

r2k1 requested review from Devinwong, lilypan26, timmy-wright, juan-lee, cameronmeissner, UtheMan, ganeshkumarashok, anujmaheshwari1, AlisonB319, AbelHu, junjiezhang1997, jason1028kr, djsly, phealy and zachary-bailey as code owners February 19, 2025 01:42

r2k1 requested review from bravebeaver and smith1511 as code owners February 19, 2025 01:42


djsly reviewed Feb 19, 2025
