Get Device Capabilities for AMD GPUs #417

Open · wants to merge 4 commits into main

Conversation

BatSmacker84
Contributor

AMD GPUs already function using the tinygrad inference engine on Linux. However, the capabilities of the specific GPU are never queried or displayed, despite the info for AMD GPUs being in the project already.

The changes are fairly simple. A new dependency, "pyamdgpuinfo", has been added to query info about the AMD GPU. If Device.DEFAULT reports an AMD GPU, the library is imported and the first GPU in the system is queried. The raw info is then parsed and passed to the DeviceCapabilities class.
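Roughly, the flow looks like this (a simplified sketch, not the exact diff; the DeviceCapabilities import path and fields, and the memory_info key, are approximations of what exo and pyamdgpuinfo expose):

from tinygrad import Device
from exo.topology.device_capabilities import DeviceCapabilities  # import path assumed

if Device.DEFAULT == "AMD":
    import pyamdgpuinfo  # only imported when tinygrad selects an AMD backend

    gpu_raw_info = pyamdgpuinfo.get_gpu(0)            # first GPU in the system
    gpu_name = gpu_raw_info.name                      # marketing name, e.g. "RX 7800 XT"
    gpu_vram = gpu_raw_info.memory_info["vram_size"]  # total VRAM in bytes

    device_capabilities = DeviceCapabilities(
        model=gpu_name,
        chip=gpu_name,
        memory=gpu_vram // 2**20,  # MB
        flops=...,  # looked up from the AMD flops table already in the project
    )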

I tried to keep the code similar to what is done for detecting and querying NVIDIA GPUs, including the debugging prints. I have an RX 7800 XT and have verified that detection and the info query work as intended. The list of known AMD GPUs is relatively small, though, so expanding it in the future would be beneficial.

@AlexCheema (Contributor) left a comment


Great idea, just need to make sure this doesn't break other platforms.

setup.py Outdated
@@ -18,6 +18,7 @@
"prometheus-client==0.20.0",
"protobuf==5.27.1",
"psutil==6.0.0",
"pyamdgpuinfo==2.1.6",
Contributor

The problem is that this fails when running on Mac; we will need a special case for this.

Contributor Author

I believe changing the line to "pyamdgpuinfo==2.1.6;platform_system=='Linux'" will prevent the dependency from being requested on macOS. The library only supports Linux anyway, so x86 Macs and Windows PCs with AMD GPUs (if Windows support arrives in the future) would need a different solution regardless.
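For reference, the conditional requirement would sit in setup.py's install_requires roughly like this (the marker syntax is standard PEP 508; the neighbouring pins are copied from the hunk above):

install_requires = [
    # ...
    "prometheus-client==0.20.0",
    "protobuf==5.27.1",
    "psutil==6.0.0",
    # environment marker: only request the dependency on Linux, where pyamdgpuinfo builds
    "pyamdgpuinfo==2.1.6; platform_system == 'Linux'",
    # ...
]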

@FlorianHeigl

From what I saw, this doesn't yet work with GPU=1, which could be needed when tinygrad gives you "Unsupported architecture: gfx906"...
(Though it totally went off the rails when I tried to get it working with a workaround.)

@BatSmacker84
Contributor Author

from what I saw this doesn't yet work with GPU=1

I don't think this is an issue created by my PR. Running GPU=1 exo using the main branch breaks in the same way on my Linux PC with an AMD GPU. My M1 Pro MBP runs fine even running with GPU=1 exo --inference-engine tinygrad.

The error seems to be caused by the NVML shared library not being found at runtime when attempting to get device capabilities. Specifically, the tinygrad Device.DEFAULT field is being set to GPU rather than AMD. When Device.DEFAULT='GPU', the program tries to get device capabilities as if it were an NVIDIA GPU. So if the system instead has an AMD GPU, the capability query fails and the program errors out.
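For illustration, the failure boils down to a dispatch like this (a hypothetical sketch; the helper functions are stand-ins, not exo's actual code):

from tinygrad import Device

def query_amd_capabilities():
    ...  # pyamdgpuinfo path added by this PR

def query_nvidia_capabilities():
    ...  # NVML path; raises if libnvidia-ml.so is missing

if Device.DEFAULT == "AMD":
    capabilities = query_amd_capabilities()
elif Device.DEFAULT == "GPU":
    # Forcing GPU=1 on an AMD-only box lands here, so the NVML lookup fails.
    capabilities = query_nvidia_capabilities()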

I think I know how to fix the issue, but it isn't relevant to this PR, imo. I'll work on another PR that addresses some of the issues with AMD GPUs; this one is just for getting the GPU info, not better functional support.

import pyamdgpuinfo

gpu_raw_info = pyamdgpuinfo.get_gpu(0)
gpu_name = gpu_raw_info.name
Contributor

On my NixOS box there's no /usr/share/libdrm/amdgpu.ids, so this value is assigned None.
You might want to consider using this instead (for cases where that happens, or where there's no matching card):

gpu_name = gpu_raw_info.name if gpu_raw_info.name is not None else "Unknown AMD"
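Expanding that slightly into a guarded lookup (a sketch; detect_gpus() is pyamdgpuinfo's card-count call, and "Unknown AMD" is just a placeholder string):

import pyamdgpuinfo

n_gpus = pyamdgpuinfo.detect_gpus()  # number of AMD cards visible to the amdgpu driver
gpu_raw_info = pyamdgpuinfo.get_gpu(0) if n_gpus > 0 else None
gpu_name = gpu_raw_info.name if gpu_raw_info and gpu_raw_info.name else "Unknown AMD"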

@deftdawg
Contributor

Please merge @BatSmacker84's PR...

Without it my RX 6900 XT GPU is not detected by exo at all when starting AMD=1 exo
[screenshot]

With it, it is finally detected!
[screenshot]

Thanks to this I'm very close to having exo running on my system... It currently errors out after downloading the model because NixOS's tinygrad doesn't have the proper ROCm dependencies (RuntimeError("library amd_comgr not found")), which I'm going to try to fix next.

@BatSmacker84
Contributor Author

I will once again note that this PR does not enable exo to utilize an AMD GPU, as that was already possible. This PR enables AMD GPUs to have their capabilities properly detected by exo, allowing for better partitioning of work among connected nodes and higher user confidence that their GPU is being used properly.

Because of this (and that AMD GPUs are unpopular in AI inference) I perfectly understand why this PR is not a priority. Much more important things to attend to!

@deftdawg
Contributor

Because of this (and that AMD GPUs are unpopular in AI inference) I perfectly understand why this PR is not a priority.

Not popular for good reason: even though I was able to fix NixOS's tinygrad ROCm (PR) and get exo to operate on my AMD GPU, their drivers are such 💩 that my desktop crashed after running an 8B model 😅.

If I shut off the firewall (sudo systemctl stop firewall), I'm able to cluster it with my M3, but it doesn't seem like inference works across both (the M3 seems to be using Qwen on MLX, which I'm not 100% sure is supported on Linux/tinygrad).
[screenshot]
