Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Install AMD GPU Kernel drivers if required #5875

Open
wants to merge 15 commits into
base: master
Choose a base branch
from
Open

Install AMD GPU Kernel drivers if required #5875

wants to merge 15 commits into from

Conversation

r2k1
Copy link
Contributor

@r2k1 r2k1 commented Feb 19, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add support for AMD GPUs. Install AMD GPU Kernel drivers if user didn't specify --skip-gpu-driver install.

AMD Operates differently to NVidia. A kernel driver is required on the host VM. Additionally containers should have ROCm suite installed (which is blowing container size images to 60GB+, but it's on AMD).

With this change .deb packages are cached on the VM, but doesn't install until required (adds about 30MB overhead).
Once installed it takes about 600-1000MB of additional space.

We don't have a reliable quota for the VM SKUs in the test subscription. And VM allocation isn't reliable either.
So automatic testing is disabled until we manage to sort it out somehow.

@@ -206,6 +206,16 @@ EOF
fi
fi

if [[ "${AMD_GPU_NODE}" = true ]] && [[ "${skip_gpu_driver_install}" != "true" ]]; then
logs_to_events "AKS.CSE.ensureAMDGPUDrivers" ensureAMDGPUDrivers
else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have those in a reusable function ? I would expect that we run a cleanup logic after we install the GPU drivers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I deleted AMD-related logic for non-AMD users.

Probably can cleanup for everyone.


pushd /var/cache/amdgpu-apt
ls -l
sudo dpkg -i *.deb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need autoconf automake autotools-dev those three packages are purely to compile the drivers from scratch. are we doing that ?

We shouldnt need to keep those... (not even sure why we need them)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's dependency of amdgpu. So I'm not sure what can I do with it.

I can run another set of tests without it (the feedback loop is frutrasting).

sudo dpkg -i *.deb
popd

REBOOTREQUIRED=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is causing us customer headache, we need to ensure the node doesn't get to a ready state before the reboot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I though setting the variable automaticall does it for me. Any guidance on what needs to change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ganeshkumarashok what are we planning to do for the MIG use case ?

echo "Installing AMD GPU drivers"

# delete amdgpu module from blacklist
sudo sed -i '/blacklist amdgpu/d' /etc/modprobe.d/blacklist-radeon-instinct.conf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we get a better explanation why we have that ? why understanding was this was required on a VM to enable the access of the physical hardware on the bareVM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to know as well why we need it. It took me a lot of time to figure out why drivers didn't work.

I think in the image provided by Canonical it's blacklisted, but nothing explains why it's there.

Containers: []corev1.Container{
{
Name: "amdgpu-device-plugin-container",
Image: "rocm/k8s-device-plugin",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should pull from MCR, dockerhub pull isn't stable. (assuming this is from docker.io)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to add it to MCR at some stage. I don't think it's available there yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are SFI to looks for that, we lost the contract preventing us from getting throttled on docker.io so expect throttling.

We have an ACR for E2E, we should setup a remote/cache for this image and use our own ACR

}

func Test_Ubuntu2204Gen2Containerd_AMDGPU_V710(t *testing.T) {
// the SKU isn't available in subscriptrion/region we run tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a TODO

@@ -1664,3 +1664,69 @@ func Test_Ubuntu2404ARM(t *testing.T) {
},
})
}

func Test_Ubuntu2404Gen2Containerd_AMDGPU_MI300(t *testing.T) {
t.Skip("Provisioning of Standard_ND96isr_MI300X_v5 isn't reliable yet")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a TODO

return "", err
}
return *identity.Properties.ClientID, nil
// HACK: temporary disable to allow running test in different subscription, without enough permissions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we uncomment all this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove it before merging the PR. Once I stop testing it.

Currently it's required only for a single scriptless test that is disabled by default.
Not much harm if it accidentially leak through.

@@ -619,3 +619,40 @@ rm -f ./azcopy # cleanup immediately after usage will return in two downloads
echo "install-dependencies step completed successfully"
capture_benchmark "${SCRIPT_NAME}_overall" true
process_benchmarks


download_amdgpu_drivers() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a device-plugin to install as well to ensure the GPU count shows up on the NOde labels ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It exists, but probably not in MCR.

AFAIK we doesn't do it yet for Nvidia. Should be consistent for both of them?

We probably can addess it separatelly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants