Get Device Capabilities for AMD GPUs #417
Great idea, just need to make sure this doesn't break other platforms.
setup.py
Outdated
@@ -18,6 +18,7 @@
"prometheus-client==0.20.0",
"protobuf==5.27.1",
"psutil==6.0.0",
"pyamdgpuinfo==2.1.6",
The problem is that this fails when running on Mac.
We will need a special case for this.
I believe changing the line to "pyamdgpuinfo==2.1.6;platform_system=='Linux'" will prevent the dependency from being requested on macOS. The library only supports Linux anyway, so x86 Macs and Windows PCs with AMD GPUs would need a different solution regardless, even if the library gains support for those platforms in the future.
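For reference, here is a minimal sketch of how the marker suggested above might look in the dependency list. The variable name `install_requires` is an assumption about how setup.py is organised; the three neighbouring pins are copied from the diff. The `platform_system` marker is compared against the value of `platform.system()`, which is "Linux" on Linux and "Darwin" on macOS.

```python
# Sketch only: `install_requires` is an assumed name for the list in setup.py.
install_requires = [
    "prometheus-client==0.20.0",
    "protobuf==5.27.1",
    "psutil==6.0.0",
    # Environment marker: pip skips this requirement unless
    # platform.system() == "Linux" on the installing machine.
    "pyamdgpuinfo==2.1.6; platform_system == 'Linux'",
]
```

With the marker in place, the same list can be installed on macOS and Windows without pip trying to build the Linux-only package.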
from what I saw this doesn't yet work with |
I don't think this is an issue created by my PR. Running The error seems to be caused by the NVML Shared Library not being found at runtime when attempting to get device capabilities. Specifically, the I think I know how to fix the issue, but it is not relevant to this PR, imo. I'll work on another PR that helps to fix some of the issues with AMD GPUs, but this one is just for getting the GPU info and not better functional support. |
import pyamdgpuinfo

gpu_raw_info = pyamdgpuinfo.get_gpu(0)
gpu_name = gpu_raw_info.name
On my NixOS box there's no /usr/share/libdrm/amdgpu.ids, so this value is assigned None.
You might want to consider using this instead (for cases where that happens or there's no matching card):
gpu_name = gpu_raw_info.name if gpu_raw_info.name is not None else "Unknown AMD"
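The same fallback could also be factored into a small helper so it is easy to test without an AMD GPU present. This is only a sketch: the function name `amd_gpu_name` and its `fallback` parameter are hypothetical and not part of the PR.

```python
from typing import Optional


def amd_gpu_name(raw_name: Optional[str], fallback: str = "Unknown AMD") -> str:
    """Return the reported GPU name, or a placeholder when it is None.

    Hypothetical helper: the name is None when amdgpu.ids is missing
    or has no entry matching the card.
    """
    return raw_name if raw_name is not None else fallback
```

It would be called as `gpu_name = amd_gpu_name(gpu_raw_info.name)` in place of the inline conditional.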
Please merge @BatSmacker84's PR... Without it my RX 6900 XT GPU is not detected by exo at all when starting With it, it is finally detected! Thanks to this I'm very close to having exo running on my system... It currently errors out after downloading the model because NixOS Tinygrad doesn't have the proper ROCm dependencies ( |
I will once again note that this PR does not enable exo to utilize an AMD GPU, as that was already possible. This PR enables AMD GPUs to have their capabilities properly detected by exo, allowing for better partitioning of work among connected nodes and higher user confidence that their GPU is being used properly. Because of this (and because AMD GPUs are unpopular in AI inference), I fully understand why this PR is not a priority. Much more important things to attend to!
Not popular for good reason: even though I was able to fix NixOS's Tinygrad ROCm (PR) and get exo to operate on my AMD GPU, their drivers are such 💩 that my desktop crashed after running an 8B model 😅 . If I shut off the firewall ( |
AMD GPUs already function using the tinygrad inference engine on Linux. However, the capabilities of the specific GPU are never queried or displayed, despite the info for AMD GPUs being in the project already.
The changes I made are fairly simple. A new dependency, "pyamdgpuinfo", has been added to get the info about an AMD GPU. If Device.DEFAULT detects an AMD GPU, then the library is imported and the first GPU in the system is queried. The raw info is then parsed before being passed to the DeviceCapabilities class.
I tried to keep the code similar to what is being done for detecting and querying NVIDIA GPUs, including debugging prints. I have an RX 7800XT and I have verified that the detection and info query works as intended. The list of AMD GPUs is relatively small, though, so expanding it in the future to include more AMD GPUs would be beneficial.
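The flow described above can be sketched roughly as follows. This is not the PR's actual code. `Capabilities` is a simplified, hypothetical stand-in for exo's DeviceCapabilities class. Comparing the default device against the string "AMD" is an assumption about what tinygrad reports, as is reading the VRAM size in bytes from `memory_info["vram_size"]` on the object returned by `pyamdgpuinfo.get_gpu(0)`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Capabilities:
    # Hypothetical stand-in for exo's DeviceCapabilities; real fields may differ.
    chip: str
    memory_mb: int


def parse_amd_info(gpu_raw_info) -> Capabilities:
    # Assumption: the object exposes .name and .memory_info["vram_size"] (bytes).
    vram_bytes = gpu_raw_info.memory_info["vram_size"]
    return Capabilities(chip=str(gpu_raw_info.name), memory_mb=vram_bytes // 2**20)


def detect_amd(default_device: str) -> Optional[Capabilities]:
    # Assumption: the default device is reported as the string "AMD".
    if default_device != "AMD":
        return None
    try:
        import pyamdgpuinfo  # Linux-only, so it is imported lazily
    except ImportError:
        return None
    # Query the first GPU in the system, as in the PR.
    return parse_amd_info(pyamdgpuinfo.get_gpu(0))
```

Importing the library inside the AMD branch keeps the module importable on platforms where pyamdgpuinfo is not installed.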