exporter endpoint svc to check for gpu health #100

Open
wants to merge 1 commit into
base: master

spraveenio
Contributor

No description provided.

Collaborator

@y2kenny-amd left a comment

GPU health is a core part of the plugin, so it should not require an external dependency on the exporter in order to function (a standalone install of the plugin must continue to work). Since the plugin has direct access to the GPU, like the labeller, health status should be read directly from the GPU (or via a relevant library such as libsmi) instead of via RPC.

cmd/k8s-node-labeller/main.go (outdated, resolved)
@sajmera-pensando
Collaborator

sajmera-pensando commented Jan 30, 2025

> GPU health is a core part of the plugin, so it should not require an external dependency on the exporter in order to function (a standalone install of the plugin must continue to work). Since the plugin has direct access to the GPU, like the labeller, health status should be read directly from the GPU (or via a relevant library such as libsmi) instead of via RPC.

Thanks @y2kenny-amd for the review. Even with this PR, a standalone install of the plugin keeps working. If health information from device-metrics-exporter is not available, the device plugin falls back to the existing basic health-checking functionality.

The Kubernetes device-plugin documentation does not say that the device plugin itself has to calculate health. It only provides the API to mark a device as unhealthy. The health calculation could happen in any external component that has access to the data needed to determine whether a GPU is healthy. We are planning to implement the health calculation in device-metrics-exporter because it has access to all of the GPU metrics, but in principle it could also happen in an external analytics system, such as a Prometheus time-series database, that has access to all the GPU metrics and can predict GPU health from current metrics.

@spraveenio
Contributor Author

> GPU health is a core part of the plugin, so it should not require an external dependency on the exporter in order to function (a standalone install of the plugin must continue to work). Since the plugin has direct access to the GPU, like the labeller, health status should be read directly from the GPU (or via a relevant library such as libsmi) instead of via RPC.

Yes, a standalone install keeps working without any change in behavior. This PR extends the behavior when the metrics exporter is running on the node, so that each GPU can be marked healthy or unhealthy individually.
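
Per-GPU marking could look roughly like the following Go sketch. It is an illustration under assumptions, not the PR's implementation: `Device` is a local stand-in for the device plugin API's device message (only the `ID` and `Health` fields are modelled), `applyHealth` is a hypothetical helper, and the report is assumed to be a map from device ID to a healthy flag. The `"Healthy"` and `"Unhealthy"` strings match the values the Kubernetes device plugin API (v1beta1) uses for device health.

```go
package main

import "fmt"

// Health strings as used by the Kubernetes device plugin API (v1beta1).
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a local stand-in for the device plugin API's Device message,
// defined here so the sketch has no external dependencies.
type Device struct {
	ID     string
	Health string
}

// applyHealth returns a copy of devs with each device's Health set from the
// per-GPU report. Devices that are missing from the report keep their
// current Health, so a partial report never changes unrelated GPUs.
func applyHealth(devs []Device, report map[string]bool) []Device {
	out := make([]Device, len(devs))
	copy(out, devs)
	for i := range out {
		ok, found := report[out[i].ID]
		if !found {
			continue
		}
		if ok {
			out[i].Health = Healthy
		} else {
			out[i].Health = Unhealthy
		}
	}
	return out
}

func main() {
	devs := []Device{
		{ID: "gpu0", Health: Healthy},
		{ID: "gpu1", Health: Healthy},
	}
	// Hypothetical report from the metrics exporter: only gpu1 is flagged.
	report := map[string]bool{"gpu1": false}

	for _, d := range applyHealth(devs, report) {
		fmt.Printf("%s: %s\n", d.ID, d.Health)
	}
}
```

Returning a copy rather than mutating the input keeps the previously advertised device list intact until the caller decides to publish the update.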

@spraveenio force-pushed the feature/pergpuhealthcheck branch from 8b0c012 to 90c93c7 on January 30, 2025 19:54
3 participants