Give Developer user full root privileges #172

Closed
harkgill-amd wants to merge 2 commits


harkgill-amd
Contributor

Fix for #171.

Even after rocm-smi is installed, GPUs are not visible inside the Docker container because the user lacks the necessary permissions. This can be checked by running rocminfo, which fails, and sudo rocminfo, which executes correctly.

This PR makes developer a root user, which is the status quo on other ROCm Docker images such as rocm/pytorch, rocm/tensorflow and the base 24.04 ROCm image. Confirmed that rocm-examples build and run successfully after this change.

@harkgill-amd harkgill-amd changed the title Docker fix Give Developer user full root privileges Oct 2, 2024
@dgaliffiAMD dgaliffiAMD requested a review from Beanavil October 2, 2024 20:51
@dgaliffiAMD
Collaborator

Also, while the examples build successfully, they do not run without sudo.
Let's look into adding a CI job that validates that the examples build and run within the container. Unfortunately, this won't be something that we can do with the basic runners.

@Beanavil
Collaborator

Beanavil commented Oct 3, 2024

It's weird because I'm able to run both rocminfo and the examples with the current setup, and I checked with freshly built images from the dockerfiles in develop. Maybe some other factor is the root cause of the issue?

@dgaliffiAMD
Collaborator

That is strange. Can you share how you're building and running the image? Maybe it's a documentation issue? We'll try again.

@Beanavil
Collaborator

Beanavil commented Oct 4, 2024

For building:

# From the rocm-examples/Dockerfiles folder
cp hip-libraries-rocm-ubuntu.Dockerfile Dockerfile
docker build -t hip-libraries-rocm-ubuntu .

For running:

docker run -d -it --device /dev/kfd --device /dev/dri --name test-hip-libraries-rocm-ubuntu hip-libraries-rocm-ubuntu  

@harkgill-amd
Contributor Author

@Beanavil, followed those steps but am still seeing the HSA_STATUS_ERROR_OUT_OF_RESOURCES error. Below are the steps I tried, could you please let me know if there's any difference in how you're executing on your end?

  1. git clone https://github.com/ROCm/rocm-examples.git
  2. cd rocm-examples/Dockerfiles
  3. cp hip-libraries-rocm-ubuntu.Dockerfile Dockerfile
  4. docker build -t hip-libraries-rocm-ubuntu .
  5. docker run -d -it --device /dev/kfd --device /dev/dri --name test-hip-libraries-rocm-ubuntu hip-libraries-rocm-ubuntu
  6. docker exec -it <container_id>
  7. rocminfo #Inside Container

I've tried this with sudo in front of the docker commands, and without sudo after adding my user to the docker group.

@Beanavil
Collaborator

Beanavil commented Oct 7, 2024

@harkgill-amd hm and does rocminfo work correctly on your local machine? Maybe the issue is that you are not in the render group, which can cause the HSA_STATUS_ERROR_OUT_OF_RESOURCES error, although I'm not sure if this may be affecting the capabilities of the container.

Edit: in case you are in the render group, perhaps check the video one as well, as it's specified in the ROCm installation steps that the user should be added to those two

@harkgill-amd
Contributor Author

ROCm is installed and rocminfo works correctly locally. Local user is in both video and render groups as well. I think it's worth mentioning that Umesh (reporter of issue) and I see the HSA_STATUS_ERROR_OUT_OF_RESOURCES without sudo whereas @dgaliffiAMD and @Mustaballer see a generic Unable to open /dev/kfd read-write: Permission denied error. In both cases, root access resolves the errors.
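
The two failure modes described here can be told apart mechanically. The following is a minimal sketch, assuming the error strings appear verbatim in the rocminfo output as quoted in this thread; the function name `classify_rocminfo_output` is hypothetical and not part of ROCm.

```shell
#!/bin/sh
# Hypothetical helper: classify rocminfo output by the two error strings
# quoted in this thread. Any other output is reported as "other".
classify_rocminfo_output() {
  case "$1" in
    *"HSA_STATUS_ERROR_OUT_OF_RESOURCES"*)
      echo "out-of-resources" ;;
    *"Unable to open /dev/kfd read-write: Permission denied"*)
      echo "kfd-permission-denied" ;;
    *)
      echo "other" ;;
  esac
}

# Only query the real tool when it is installed.
if command -v rocminfo >/dev/null 2>&1; then
  classify_rocminfo_output "$(rocminfo 2>&1)"
fi
```

Both classified cases were reported in this thread to go away with root access, so either result points at a permissions problem rather than a missing driver.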

@Beanavil
Collaborator

Beanavil commented Oct 7, 2024

Perhaps the ownership/permissions of /dev/kfd are different in our setups? What do those look like on your end (for both of the cases you described)?

@harkgill-amd
Contributor Author

Locally on my system

rocm@rocm-System-Product-Name:~/test1/rocm-examples/Dockerfiles$ ls -ld /dev/kfd
crw-rw---- 1 root video 235, 0 Oct  7 10:14 /dev/kfd

Within container that encounters HSA_STATUS_ERROR_OUT_OF_RESOURCES (My system)

developer@d1872fce4007:/workspaces$ ls -ld /dev/kfd
crw-rw---- 1 root video 235, 0 Oct  7 14:18 /dev/kfd

Within container that encounters Unable to open /dev/kfd read-write: Permission denied (Mustafa's system)

developer@ac6f706bc771:/workspaces$ ls -ld /dev/kfd
crw-rw---- 1 root 992 235, 0 Oct  7 16:21 /dev/kfd
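
Listings like these suggest a quick check: compare the numeric group that owns each GPU device node with the groups the current user belongs to. The following is a minimal sketch, assuming GNU stat and assuming that /dev/kfd and /dev/dri/renderD* are the relevant nodes; the helper name `gid_in_list` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the GID in $1 is among the remaining args.
gid_in_list() {
  wanted=$1
  shift
  for g in "$@"; do
    [ "$g" = "$wanted" ] && return 0
  done
  return 1
}

# Report, for each GPU device node that exists, whether the current user
# is a member of the owning group. Nodes that are absent are skipped.
for dev in /dev/kfd /dev/dri/renderD*; do
  [ -e "$dev" ] || continue
  owner_gid=$(stat -c '%g' "$dev")   # numeric owning group (GNU stat)
  # $(id -G) is left unquoted on purpose so each GID becomes one argument.
  if gid_in_list "$owner_gid" $(id -G); then
    echo "$dev: group $owner_gid OK"
  else
    echo "$dev: user is not in group $owner_gid"
  fi
done
```

Group membership only helps when the node's mode grants group read-write, as crw-rw---- does in the listings here. A group shown as a bare number, such as 992, means the container has no group entry with that GID.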

@harkgill-amd
Contributor Author

@Beanavil, did some more digging on this and can confirm it's related to how render group IDs are generated. On the host system, amdgpu-install will randomly assign an ID to the render group. If the render gid differs between the host and the container, permission errors will be thrown when trying to access GPU resources.

The fix for this would be to add the below flag in the docker run command

--group-add $(getent group render | cut -d':' -f 3)

By doing this the container gains access to the same render group permissions as the host. This is identifiable by running id in the container after running with group add.

developer@d1815cf30320:/workspaces$ id
uid=1000(developer) gid=1000(developer) groups=1000(developer),27(sudo),44(video),109(render),110

*Note the 109 GID from the base container and the 110 GID passed from the host
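
Combined with the run command given earlier in the thread, the flag could be wired up roughly as follows. This is a sketch rather than a tested recipe: the helper names `gid_from_group_entry` and `run_with_host_render_gid` are hypothetical, and the image and container names are the ones used earlier in this conversation.

```shell
#!/bin/sh
# Hypothetical helper: extract the numeric GID (third colon-separated field)
# from a group database entry such as "render:x:110:rocm".
gid_from_group_entry() {
  printf '%s\n' "$1" | cut -d: -f3
}

# Hypothetical wrapper: start the container with the host's render GID added
# as a supplementary group. Defined here but not invoked automatically.
run_with_host_render_gid() {
  entry=$(getent group render) || {
    echo "no render group on this host" >&2
    return 1
  }
  docker run -d -it \
    --device /dev/kfd --device /dev/dri \
    --group-add "$(gid_from_group_entry "$entry")" \
    --name test-hip-libraries-rocm-ubuntu \
    hip-libraries-rocm-ubuntu
}
```

Calling `run_with_host_render_gid` on the host would then be expected to produce a container whose `id` output lists the host's render GID as an extra, unnamed group, as in the output shown above.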

This same issue was discussed for the PyTorch ROCm docker image and it was decided to just enable root privileges, see https://github.com/ROCm/rocAutomation/pull/194.

@Beanavil
Collaborator

@harkgill-amd I see. Well if there is an alternative solution for this issue I'd like to do that instead of having a root user, as AFAIK this is usually not recommended (although we do have a non-root user with sudo privileges, which is also not ideal). What do you think? I'd like to read the discussion on the repo you mentioned but I get a 404 (perhaps I don't have permission to access it?)

@harkgill-amd
Contributor Author

harkgill-amd commented Oct 16, 2024

Ah yes, it is a permissions issue. Here is the issue that prompted the fix on the PyTorch side, ROCm/ROCm-docker#90. I'd be open to either solution though I'm leaning more towards the root privileges as it maintains consistency with other images and a more concise run command.

@Beanavil
Collaborator

@harkgill-amd actually I'm just noticing now, we had added this GID as a parameter of the Dockerfile, so actually it's not necessary to modify the container to get the right render group id (just adding --build-arg GID="$(getent group render | cut -d':' -f 3)" when building the image)
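
For reference, the build-time variant could look roughly like this. It is a sketch that assumes the Dockerfile accepts a GID build argument as described in this comment, and that the Dockerfile has already been copied into place as in the build steps earlier in the thread. The function names `render_gid_build_arg` and `build_with_host_render_gid` are hypothetical, and falling back to the Dockerfile default when the host has no render group is an illustrative choice.

```shell
#!/bin/sh
# Hypothetical helper: print the docker build flag for a given render GID,
# or nothing when no GID is known (the Dockerfile default then applies).
render_gid_build_arg() {
  if [ -n "$1" ]; then
    printf '%s\n' "--build-arg GID=$1"
  fi
}

# Hypothetical wrapper, meant to be run from the rocm-examples/Dockerfiles
# folder. Defined here but not invoked automatically.
build_with_host_render_gid() {
  host_gid=$(getent group render | cut -d: -f3)
  # The flag is left unquoted so it splits into "--build-arg" and "GID=<n>".
  docker build $(render_gid_build_arg "$host_gid") \
    -t hip-libraries-rocm-ubuntu .
}
```

Because the GID is baked into the image at build time, an image built this way would match only hosts whose render group uses the same GID.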

@harkgill-amd
Contributor Author

Yep, just confirmed on my side as well: the value can be passed in as a build argument to access GPU resources. This sets the container's render GID to match the host's:

developer@f09768918a93:/workspaces$ id
uid=1000(developer) gid=1000(developer) groups=1000(developer),27(sudo),44(video),110(render)

Instead of adding the host's GID alongside the container's:

uid=1000(developer) gid=1000(developer) groups=1000(developer),27(sudo),44(video),109(render),110

Both methods work; it's just a matter of which command the argument is added to. I think we can close out this PR and update the documentation if we have decided to go with rootless + render group access.

@Beanavil
Collaborator

@harkgill-amd Sounds good to me. Will you open the PR for updating the docs, or should we?
