-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Install AMD GPU Kernel drivers if required #5875
base: master
Are you sure you want to change the base?
Conversation
@@ -206,6 +206,16 @@ EOF | |||
fi | |||
fi | |||
|
|||
if [[ "${AMD_GPU_NODE}" = true ]] && [[ "${skip_gpu_driver_install}" != "true" ]]; then | |||
logs_to_events "AKS.CSE.ensureAMDGPUDrivers" ensureAMDGPUDrivers | |||
else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have those in a reusable function ? I would expect that we run a cleanup logic after we install the GPU drivers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I deleted AMD-related logic for non-AMD users.
Probably can cleanup for everyone.
|
||
pushd /var/cache/amdgpu-apt | ||
ls -l | ||
sudo dpkg -i *.deb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need autoconf automake autotools-dev
those three packages are purely to compile the drivers from scratch. are we doing that ?
We shouldnt need to keep those... (not even sure why we need them)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's dependency of amdgpu. So I'm not sure what can I do with it.
I can run another set of tests without it (the feedback loop is frutrasting).
sudo dpkg -i *.deb | ||
popd | ||
|
||
REBOOTREQUIRED=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is causing us customer headache, we need to ensure the node doesn't get to a ready state before the reboot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I though setting the variable automaticall does it for me. Any guidance on what needs to change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ganeshkumarashok what are we planning to do for the MIG use case ?
echo "Installing AMD GPU drivers" | ||
|
||
# delete amdgpu module from blacklist | ||
sudo sed -i '/blacklist amdgpu/d' /etc/modprobe.d/blacklist-radeon-instinct.conf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get a better explanation why we have that ? why understanding was this was required on a VM to enable the access of the physical hardware on the bareVM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to know as well why we need it. It took me a lot of time to figure out why drivers didn't work.
I think in the image provided by Canonical it's blacklisted, but nothing explains why it's there.
Containers: []corev1.Container{ | ||
{ | ||
Name: "amdgpu-device-plugin-container", | ||
Image: "rocm/k8s-device-plugin", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should pull from MCR, dockerhub pull isn't stable. (assuming this is from docker.io)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll need to add it to MCR at some stage. I don't think it's available there yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are SFI to looks for that, we lost the contract preventing us from getting throttled on docker.io so expect throttling.
We have an ACR for E2E, we should setup a remote/cache for this image and use our own ACR
} | ||
|
||
func Test_Ubuntu2204Gen2Containerd_AMDGPU_V710(t *testing.T) { | ||
// the SKU isn't available in subscriptrion/region we run tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a TODO
@@ -1664,3 +1664,69 @@ func Test_Ubuntu2404ARM(t *testing.T) { | |||
}, | |||
}) | |||
} | |||
|
|||
func Test_Ubuntu2404Gen2Containerd_AMDGPU_MI300(t *testing.T) { | |||
t.Skip("Provisioning of Standard_ND96isr_MI300X_v5 isn't reliable yet") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a TODO
return "", err | ||
} | ||
return *identity.Properties.ClientID, nil | ||
// HACK: temporary disable to allow running test in different subscription, without enough permissions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we uncomment all this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove it before merging the PR. Once I stop testing it.
Currently it's required only for a single scriptless test that is disabled by default.
Not much harm if it accidentially leak through.
@@ -619,3 +619,40 @@ rm -f ./azcopy # cleanup immediately after usage will return in two downloads | |||
echo "install-dependencies step completed successfully" | |||
capture_benchmark "${SCRIPT_NAME}_overall" true | |||
process_benchmarks | |||
|
|||
|
|||
download_amdgpu_drivers() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have a device-plugin to install as well to ensure the GPU count shows up on the NOde labels ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It exists, but probably not in MCR.
AFAIK we doesn't do it yet for Nvidia. Should be consistent for both of them?
We probably can addess it separatelly.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Add support for AMD GPUs. Install AMD GPU Kernel drivers if user didn't specify
--skip-gpu-driver
install.AMD Operates differently to NVidia. A kernel driver is required on the host VM. Additionally containers should have ROCm suite installed (which is blowing container size images to 60GB+, but it's on AMD).
With this change .deb packages are cached on the VM, but doesn't install until required (adds about 30MB overhead).
Once installed it takes about 600-1000MB of additional space.
We don't have a reliable quota for the VM SKUs in the test subscription. And VM allocation isn't reliable either.
So automatic testing is disabled until we manage to sort it out somehow.