feat(notebooks): add AMD ROCm example notebook images #7656

Open: gigabyte132 wants to merge 1 commit into master

gigabyte132 (Contributor) commented Oct 23, 2024

/kind feature

Motivation

At the previous notebooks-wg meeting, the idea of adding PyTorch images for AMD GPUs was raised.

What does this PR do?

This PR adds ROCm PyTorch images to Kubeflow, with one caveat: a udev change is required on the hosts where these containers/pods run so that users other than root can access and use the devices. The reason is that the group ID of the Linux render group is dynamic, so the render group cannot be baked into the image. This was raised with AMD upstream, and their recommendation was the host-side change described below.

For the issues see:
ROCm/ROCm-docker#90 and ROCm/k8s-device-plugin#39
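
To see the dynamic group ID problem on a given host, it can help to look at which group owns the device nodes and whether that group ID is known inside the container. The following is a minimal sketch and not part of this PR: the helper name device_group_info is hypothetical, and the device paths /dev/kfd and /dev/dri/renderD* are assumed from the udev rule in this description.

```python
import glob
import grp
import os


def device_group_info(path):
    """Return (gid, group_name) for `path`; group_name is None if the GID is unknown here."""
    gid = os.stat(path).st_gid
    try:
        name = grp.getgrgid(gid).gr_name
    except KeyError:
        # The owning GID has no entry in this environment's group database,
        # which is what happens when the host's render GID is not in the image.
        name = None
    return gid, name


if __name__ == "__main__":
    # Assumed device paths; adjust if the host exposes them elsewhere.
    for node in ["/dev/kfd", *sorted(glob.glob("/dev/dri/renderD*"))]:
        if os.path.exists(node):
            print(node, device_group_info(node))
        else:
            print(node, "not present")
```

If the reported group name is None inside the container, the image has no group matching the host's render GID, which is the situation the udev change works around.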

Do the following on the host:

Create a new file (if it doesn't exist) /etc/udev/rules.d/70-amdgpu.rules with the following content:

KERNEL=="kfd", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD*", MODE="0666"

Reload the udev rules with:

sudo udevadm control --reload-rules && sudo udevadm trigger

With these changes the jovyan user has access to the devices and can run workloads on AMD GPUs.
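
After reloading the rules, one way to confirm they took effect is to check that the device nodes are readable and writable by all users. This is a hedged sketch rather than anything shipped in this PR: the function names is_world_rw and check_amd_nodes are hypothetical, and the default paths are assumed from the rule above.

```python
import glob
import os
import stat


def is_world_rw(path):
    """True if 'other' users have both read and write permission on `path`."""
    mode = os.stat(path).st_mode
    needed = stat.S_IROTH | stat.S_IWOTH
    return mode & needed == needed


def check_amd_nodes(paths=None):
    """Map each node to True/False (world read/write) or None if it is missing."""
    if paths is None:
        # Assumed device paths, taken from the udev rule in this description.
        paths = ["/dev/kfd", *sorted(glob.glob("/dev/dri/renderD*"))]
    return {p: (is_world_rw(p) if os.path.exists(p) else None) for p in paths}


if __name__ == "__main__":
    labels = {True: "world read/write", False: "NOT world read/write", None: "missing"}
    for node, ok in check_amd_nodes().items():
        print(node, labels[ok])
```

Running this as the jovyan user inside the pod should report every node as world read/write once the rule is active on the host.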

Test Plan

These images were tested on a node with 8 AMD MI300X GPUs, both on k3s (with the AMD k8s device plugin) and on Docker.
On both setups, the MNIST GPU example was run (https://github.com/pytorch/examples/tree/main/mnist).
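
Before running the full MNIST example, a quick check from a notebook cell can confirm that PyTorch sees the GPUs as the non-root user. This is a sketch, not the test used for this PR, and the function name rocm_summary is hypothetical. It relies on ROCm builds of PyTorch exposing AMD GPUs through the torch.cuda API and reporting a HIP version in torch.version.hip.

```python
def rocm_summary():
    """Summarize GPU visibility, or return None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return {
        # None on builds of PyTorch that were not compiled against ROCm.
        "hip_version": getattr(torch.version, "hip", None),
        "available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    print(rocm_summary())
```

If available is False for the jovyan user while the same check succeeds as root, the device permissions described above are a likely cause.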


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign thesuperzapper for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
