exporter endpoint svc to check for gpu health #100

spraveenio · 2025-01-28T02:29:04Z

No description provided.

y2kenny-amd

GPU health is a core part of the plugin, so it should not require an external dependency on the exporter in order to function (a standalone install of the plugin must continue to work). Since the plugin has direct access to the GPU, like the labeller, health status should be read directly from the GPU (or via a relevant library such as libsmi) instead of via RPC.

cmd/k8s-node-labeller/main.go

sajmera-pensando · 2025-01-30T16:36:42Z

> GPU health is a core part of the plugin, so it should not require an external dependency on the exporter in order to function (a standalone install of the plugin must continue to work). Since the plugin has direct access to the GPU, like the labeller, health status should be read directly from the GPU (or via a relevant library such as libsmi) instead of via RPC.

Thanks @y2kenny-amd for the review. Even with this PR, a standalone install of the plugin keeps working. If health information from device-metrics-exporter is not available, the device plugin falls back to its existing basic health check.
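To make the fallback concrete, here is a minimal sketch of the intended behavior. It is an illustration only, not the code in this PR: `HealthSource`, `resolveHealth`, and `errExporterUnavailable` are hypothetical names, and the real exporter query is replaced by a plain function value.

```go
package main

import (
	"errors"
	"fmt"
)

// HealthSource is a hypothetical stand-in for "something that can report
// per-GPU health" (GPU ID -> healthy). It is not a type from the plugin
// or from device-metrics-exporter.
type HealthSource func() (map[string]bool, error)

// errExporterUnavailable is an illustrative error for a node where the
// exporter is not running or cannot be reached.
var errExporterUnavailable = errors.New("exporter not reachable")

// resolveHealth prefers the exporter's per-GPU view and falls back to the
// plugin's basic check when the exporter is missing or returns an error.
// The second return value names the source that was used.
func resolveHealth(exporter, basic HealthSource) (map[string]bool, string) {
	if exporter != nil {
		if health, err := exporter(); err == nil {
			return health, "exporter"
		}
	}
	health, err := basic()
	if err != nil {
		// Neither source is usable: report nothing rather than guess.
		return map[string]bool{}, "none"
	}
	return health, "basic"
}

func main() {
	// Simulate a standalone install: the exporter is not running.
	exporterDown := func() (map[string]bool, error) {
		return nil, errExporterUnavailable
	}
	basicCheck := func() (map[string]bool, error) {
		return map[string]bool{"gpu0": true, "gpu1": true}, nil
	}

	health, source := resolveHealth(exporterDown, basicCheck)
	fmt.Println("source:", source, "gpus:", len(health))
}
```

With the exporter unavailable, the result comes from the basic check, so a standalone install behaves as it does today.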

The Kubernetes device plugin documentation does not say that the device plugin itself has to compute health. It only provides the API to mark a device as unhealthy. The health calculation can happen in any external component that has access to the data needed to decide whether a GPU is healthy. We plan to implement the health calculation in device-metrics-exporter because it has access to all GPU metrics, but in principle it could also happen in an external analytics system, such as a Prometheus time-series database, that has access to all GPU metrics and can predict GPU health from them.
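As a rough sketch of that split, the plugin's side reduces to translating externally computed verdicts into the health string it advertises for each device. The `Device` type and the `Healthy`/`Unhealthy` constants below are local stand-ins that mirror the corresponding names in the kubelet device plugin API (`k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1`) so the example compiles on its own. `buildDeviceList` is a hypothetical helper, and treating a GPU with no verdict as healthy is an assumption made for this illustration.

```go
package main

import "fmt"

// Local stand-ins mirroring the device plugin API's health strings.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a reduced stand-in for the device plugin API's Device message.
type Device struct {
	ID     string
	Health string
}

// buildDeviceList turns per-GPU verdicts from any external component
// (exporter, analytics system, ...) into the list the plugin advertises.
// Assumption: a GPU without a verdict is reported as Healthy.
func buildDeviceList(ids []string, verdicts map[string]bool) []Device {
	devs := make([]Device, 0, len(ids))
	for _, id := range ids {
		health := Healthy
		if ok, known := verdicts[id]; known && !ok {
			health = Unhealthy
		}
		devs = append(devs, Device{ID: id, Health: health})
	}
	return devs
}

func main() {
	ids := []string{"gpu0", "gpu1", "gpu2"}
	// Hypothetical verdicts: gpu1 was flagged by the external component,
	// gpu2 has no verdict at all.
	verdicts := map[string]bool{"gpu0": true, "gpu1": false}

	for _, d := range buildDeviceList(ids, verdicts) {
		fmt.Printf("%s: %s\n", d.ID, d.Health)
	}
}
```

In a real plugin this list would be sent to the kubelet through the `ListAndWatch` stream whenever a verdict changes.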

spraveenio · 2025-01-30T17:27:26Z

> GPU health is a core part of the plugin, so it should not require an external dependency on the exporter in order to function (a standalone install of the plugin must continue to work). Since the plugin has direct access to the GPU, like the labeller, health status should be read directly from the GPU (or via a relevant library such as libsmi) instead of via RPC.

Yes, the standalone install keeps working without any change in behavior. This PR only extends the health check when the metrics exporter is running on the node, so that each GPU can be marked healthy or unhealthy individually.

spraveenio requested review from y2kenny-amd and yansun1996 January 28, 2025 02:29

y2kenny-amd requested changes Jan 30, 2025

exporter endpoint svc to check for gpu health

90c93c7

spraveenio force-pushed the feature/pergpuhealthcheck branch from 8b0c012 to 90c93c7 Compare January 30, 2025 19:54
