Get Device Capabilities for AMD GPUs #417

Open · wants to merge 4 commits into main

Conversation

BatSmacker84
Contributor

AMD GPUs already function using the tinygrad inference engine on Linux. However, the capabilities of the specific GPU are never queried or displayed, despite the info for AMD GPUs being in the project already.

The changes are fairly simple. A new dependency, "pyamdgpuinfo", has been added to query info about the AMD GPU. If Device.DEFAULT reports an AMD GPU, the library is imported and the first GPU in the system is queried. The raw info is then parsed and passed to the DeviceCapabilities class.
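Roughly, the flow looks like this (a simplified sketch, not the exact diff; the DeviceCapabilities import path and fields, and the memory_info key, are approximations of what exo and pyamdgpuinfo expose):

from tinygrad import Device
from exo.topology.device_capabilities import DeviceCapabilities  # import path assumed

if Device.DEFAULT == "AMD":
    import pyamdgpuinfo  # only imported when tinygrad selects an AMD backend

    gpu_raw_info = pyamdgpuinfo.get_gpu(0)            # first GPU in the system
    gpu_name = gpu_raw_info.name                      # marketing name, e.g. "RX 7800 XT"
    gpu_vram = gpu_raw_info.memory_info["vram_size"]  # total VRAM in bytes

    device_capabilities = DeviceCapabilities(
        model=gpu_name,
        chip=gpu_name,
        memory=gpu_vram // 2**20,  # MB
        flops=...,  # looked up from the AMD flops table already in the project
    )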

I tried to keep the code similar to what is done for detecting and querying NVIDIA GPUs, including the debugging prints. I have an RX 7800 XT and have verified that detection and the info query work as intended. The list of known AMD GPUs is relatively small, though, so expanding it in the future would be beneficial.

@AlexCheema (Contributor) left a comment


Great idea, just need to make sure this doesn't break other platforms.

setup.py Outdated
@@ -18,6 +18,7 @@
"prometheus-client==0.20.0",
"protobuf==5.27.1",
"psutil==6.0.0",
"pyamdgpuinfo==2.1.6",
Contributor

The problem is that this fails when running on Mac; we will need a special case for this.

Contributor Author

I believe changing the line to "pyamdgpuinfo==2.1.6;platform_system=='Linux'" will prevent the dependency from being requested on macOS. The library only supports Linux anyway, so x86 Macs and Windows PCs with AMD GPUs (if Windows support arrives in the future) would need a different solution regardless.
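For reference, the conditional requirement would sit in setup.py's install_requires roughly like this (the marker syntax is standard PEP 508; the neighbouring pins are copied from the hunk above):

install_requires = [
    # ...
    "prometheus-client==0.20.0",
    "protobuf==5.27.1",
    "psutil==6.0.0",
    # environment marker: only request the dependency on Linux, where pyamdgpuinfo builds
    "pyamdgpuinfo==2.1.6; platform_system == 'Linux'",
    # ...
]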

@FlorianHeigl

From what I saw, this doesn't yet work with GPU=1, which could be needed when tinygrad gives you "Unsupported architecture: gfx906"...
(Though it totally went off the rails when I tried to get it working with a workaround.)

@BatSmacker84
Contributor Author

from what I saw this doesn't yet work with GPU=1

I don't think this is an issue created by my PR. Running GPU=1 exo using the main branch breaks in the same way on my Linux PC with an AMD GPU. My M1 Pro MBP runs fine even running with GPU=1 exo --inference-engine tinygrad.

The error seems to be caused by the NVML shared library not being found at runtime when attempting to get device capabilities. Specifically, the tinygrad Device.DEFAULT field is being set to GPU rather than AMD. When Device.DEFAULT='GPU', the program tries to get device capabilities as if it were an NVIDIA GPU. So if the system instead has an AMD GPU, the capability query fails and the program errors out.
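For illustration, the failure boils down to a dispatch like this (a hypothetical sketch; the helper functions are stand-ins, not exo's actual code):

from tinygrad import Device

def query_amd_capabilities():
    ...  # pyamdgpuinfo path added by this PR

def query_nvidia_capabilities():
    ...  # NVML path; raises if libnvidia-ml.so is missing

if Device.DEFAULT == "AMD":
    capabilities = query_amd_capabilities()
elif Device.DEFAULT == "GPU":
    # Forcing GPU=1 on an AMD-only box lands here, so the NVML lookup fails.
    capabilities = query_nvidia_capabilities()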

I think I know how to fix the issue, but it isn't relevant to this PR, imo. I'll work on another PR that addresses some of the issues with AMD GPUs; this one is just for getting the GPU info, not better functional support.

import pyamdgpuinfo

gpu_raw_info = pyamdgpuinfo.get_gpu(0)
gpu_name = gpu_raw_info.name
Contributor

On my NixOS box there's no /usr/share/libdrm/amdgpu.ids, so this value is assigned None.
You might want to consider using this instead (for cases where that happens, or where there's no matching card):

gpu_name = gpu_raw_info.name if gpu_raw_info.name is not None else "Unknown AMD"
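Expanding that slightly into a guarded lookup (a sketch; detect_gpus() is pyamdgpuinfo's card-count call, and "Unknown AMD" is just a placeholder string):

import pyamdgpuinfo

n_gpus = pyamdgpuinfo.detect_gpus()  # number of AMD cards visible to the amdgpu driver
gpu_raw_info = pyamdgpuinfo.get_gpu(0) if n_gpus > 0 else None
gpu_name = gpu_raw_info.name if gpu_raw_info and gpu_raw_info.name else "Unknown AMD"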

@deftdawg
Contributor

Please merge @BatSmacker84's PR...

Without it my RX 6900 XT GPU is not detected by exo at all when starting AMD=1 exo
[screenshot]

With it, it is finally detected!
[screenshot]

Thanks to this I'm very close to having exo running on my system... It currently errors out after downloading the model because NixOS's tinygrad doesn't have the proper ROCm dependencies (RuntimeError("library amd_comgr not found")), which I'm going to try to fix next.

@BatSmacker84
Contributor Author

I will once again note that this PR does not enable exo to utilize an AMD GPU, as that was already possible. This PR enables AMD GPUs to have their capabilities properly detected by exo, allowing for better partitioning of work among connected nodes and higher user confidence that their GPU is being used properly.

Because of this (and that AMD GPUs are unpopular in AI inference) I perfectly understand why this PR is not a priority. Much more important things to attend to!

@deftdawg
Contributor

Because of this (and that AMD GPUs are unpopular in AI inference) I perfectly understand why this PR is not a priority.

Not popular for good reason: even though I was able to fix NixOS's tinygrad ROCm (PR) and get exo to operate on my AMD GPU, their drivers are such 💩 that my desktop crashed after running an 8B model 😅.

If I shut off the firewall (sudo systemctl stop firewall), I'm able to cluster it with my M3, but it doesn't seem like inference works across both (the M3 seems to be using Qwen on MLX, which I'm not 100% sure is supported on Linux/tinygrad).
[screenshot]
