Give Developer user full root privileges #172
Also, while the examples build successfully, they do not run without a |
It's weird because I'm able to run both rocminfo and the examples with the current setup, and I checked with freshly built images from the dockerfiles in develop. Maybe some other factor is the root cause of the issue? |
That is strange. Can you share how you're building and running the image? Maybe it's a documentation issue? We'll try again. |
For building:
# From the rocm-examples/Dockerfiles folder
cp hip-libraries-rocm-ubuntu.Dockerfile Dockerfile
docker build -t hip-libraries-rocm-ubuntu .
For running:
docker run -d -it --device /dev/kfd --device /dev/dri --name test-hip-libraries-rocm-ubuntu hip-libraries-rocm-ubuntu |
@Beanavil, followed those steps but am still seeing the
I've tried this both with sudo in front of the docker commands and without sudo, after adding my user to the docker group. |
@harkgill-amd hm and does rocminfo work correctly on your local machine? Maybe the issue is that you are not in the Edit: in case you are in the |
ROCm is installed and rocminfo works correctly locally. Local user is in both |
Perhaps the ownership/permissions of the |
Locally on my system
Within container that encounters
Within container that encounters Unable to open /dev/kfd read-write: Permission denied (Mustafa's system)
|
@Beanavil, did some more digging on this and can confirm it's related to how render group IDs are generated. On the host system, amdgpu-install will randomly assign an ID to the render group. If the render gid differs between the host and the container, permission errors will be thrown when trying to access GPU resources. The fix for this would be to add the below flag in the docker run command
By doing this the container gains access to the same render group permissions as the host. This is identifiable by running
*Note the 109 GID from the base container and the 110 GID passed from the host. This same issue was discussed for the PyTorch ROCm docker image, and it was decided to just enable root privileges; see https://github.com/ROCm/rocAutomation/pull/194. |
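For anyone following along, here is a minimal sketch of the run-time approach described in this comment. It assumes the flag in question is Docker's --group-add and that the host group is named render; both are assumptions, since the exact flag is not reproduced in this thread. The helper names render_gid and docker_run_cmd are hypothetical. The second function only prints the command so it can be inspected before running.

```shell
# Sketch only: look up the host's render GID and pass it to the container.
# Assumes a host group named "render" exists (check with: getent group render).
render_gid() {
  # getent prints "name:passwd:gid:members"; field 3 is the numeric GID.
  getent group render | cut -d: -f3
}

# Print (do not execute) the docker run command with the extra group added.
# $1 is the numeric GID to add inside the container.
docker_run_cmd() {
  gid="$1"
  echo "docker run -d -it --device /dev/kfd --device /dev/dri --group-add $gid --name test-hip-libraries-rocm-ubuntu hip-libraries-rocm-ubuntu"
}
```

Calling docker_run_cmd "$(render_gid)" on the host prints the full command with the host's GID filled in.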
@harkgill-amd I see. Well if there is an alternative solution for this issue I'd like to do that instead of having a root user, as AFAIK this is usually not recommended (although we do have a non-root user with sudo privileges, which is also not ideal). What do you think? I'd like to read the discussion on the repo you mentioned but I get a 404 (perhaps I don't have permission to access it?) |
Ah yes, it is a permissions issue. Here is the issue that prompted the fix on the PyTorch side, ROCm/ROCm-docker#90. I'd be open to either solution though I'm leaning more towards the root privileges as it maintains consistency with other images and a more concise run command. |
@harkgill-amd actually I'm just noticing now, we had added this GID as a parameter of the Dockerfile, so actually it's not necessary to modify the container to get the right render group id (just adding |
Yep, just confirmed on my side as well: the value can be passed in as a build argument to access GPU resources. This sets the container's GID to match the host's
instead of adding the host's GID alongside the container's.
Both methods work; it's just a matter of which command to add the argument to. I think we can close out this PR and update the documentation if we have decided to go with rootless + render group access. |
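A rough sketch of the build-time variant, for reference. The name of the Dockerfile build argument is not shown in this thread, so it is left as a parameter here; check the ARG lines in the Dockerfile for the real name. The helper docker_build_cmd is hypothetical and only prints the command.

```shell
# Sketch only: print a docker build command that sets the render GID at build time.
# $1 is the build-argument name (an assumption; read it from the Dockerfile's ARG line).
# $2 is the host's numeric render GID.
docker_build_cmd() {
  arg_name="$1"
  gid="$2"
  echo "docker build --build-arg $arg_name=$gid -t hip-libraries-rocm-ubuntu ."
}
```

The host's GID can be read with getent group render | cut -d: -f3, assuming the host group is named render.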
@harkgill-amd Sounds good to me. Will you open the PR to update the docs, or should we? |
Fix for #171.
Even after the installation of rocm-smi, GPUs are not visible due to a lack of permissions in the Docker container. Permissions can be checked by running rocminfo, which fails, and sudo rocminfo, which executes correctly.
This PR sets developer to a root user, which is the status quo on other ROCm Docker images such as rocm/pytorch, rocm/tensorflow and the base 24.04 ROCm image. Confirmed that rocm-examples are successfully built and run after this change.