feat(notebooks): add AMD ROCm example notebook images #7656
/kind feature
Motivation
Following a discussion at the previous notebooks-wg meeting, adding support for PyTorch images targeting AMD GPUs was raised.
What does this PR do?
This PR adds ROCm PyTorch images to Kubeflow, with one caveat: a udev change is required on the hosts where these containers/pods run so that users other than `root` can access and use the devices. This is because the group id of the `render` Linux group is dynamic. This was raised with AMD upstream, and their guidance was to apply the host-side change below, since we cannot bake the `render` group into the image. For the relevant issues see:
ROCm/ROCm-docker#90 and ROCm/k8s-device-plugin#39
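For context, the GID assigned to the `render` group can be inspected on any host; because it is allocated dynamically it typically differs between machines, which is why it cannot be hard-coded into the image. A minimal check using standard tooling (the GID shown is only illustrative):

```bash
# The render group GID is allocated dynamically, so it varies from host to host
getent group render
# e.g. render:x:109:   (109 is only illustrative)

# The renderD* device nodes are group-owned by render by default
ls -l /dev/dri/renderD*
```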
Doing the following on the host:

1. Create a new file (if it doesn't exist) `/etc/udev/rules.d/70-amdgpu.rules` with the following content:

```
KERNEL=="kfd", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD*", MODE="0666"
```

2. Reload the udev rules with:

```bash
sudo udevadm control --reload-rules && sudo udevadm trigger
```
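After reloading, the `kfd` and `renderD*` device nodes targeted by the rules above should be world-accessible; a quick way to confirm (sketch):

```bash
# Both nodes should now show mode 0666 (crw-rw-rw-)
ls -l /dev/kfd /dev/dri/renderD*
```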
With these changes the `jovyan` user has access to the devices and can run workloads on AMD GPUs.

Test Plan
These images were tested on a node with 8 AMD MI300X GPUs, both on k3s (with the AMD k8s device plugin) and on Docker.
On both setups the MNIST GPU example was run (https://github.com/pytorch/examples/tree/main/mnist).
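For reference, a rough sketch of how such a run can be reproduced from inside one of these notebook containers; the clone path and flags are assumptions based on the linked example, not something this PR ships:

```bash
# As the jovyan user inside the ROCm notebook container:
ls -l /dev/kfd /dev/dri/renderD*   # should be accessible after the host-side udev change

# ROCm builds of PyTorch expose the GPUs through the torch.cuda API
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

# Run the MNIST example referenced above
git clone https://github.com/pytorch/examples.git
python examples/mnist/main.py --epochs 1
```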