
GPU device plugin interval health check #97

Open
nadav213000 opened this issue Mar 11, 2024 · 4 comments

nadav213000 commented Mar 11, 2024

Hey,

We use the NVIDIA GPU Operator on OpenShift to expose passthrough GPUs with KubeVirt.

Issue

We experienced an issue where one of the GPUs on the Node became unavailable, but the Node's reported GPU Capacity and Allocatable resources didn't change. The GPU itself wasn't available, and when I tried to create a new VM it went into a CrashLoopBackOff state until the GPU became available again.

Only after I restarted the nvidia-sandbox-device-plugin-daemonset pod on that specific Node did the Allocatable and Capacity GPU counts change to the correct number.

I checked the pods on this Node:

  • nvidia-sandbox-device-plugin
  • nvidia-sandbox-validator
  • nvidia-vfio-manager

There were no errors in their logs, and I couldn't see any new log entries from these pods.

It looks like the pods run an initial health check and then never run it again. Is there a way to make the Operator pods validate the health of the GPUs on an interval, so that the resources available on the Node are reflected correctly?
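
In the meantime, a crude out-of-band workaround is possible: periodically compare the NVIDIA PCI devices the host actually sees with the GPU capacity the Node advertises, and restart the sandbox device plugin pod when they diverge. A rough sketch of what I mean (the node name, namespace, resource name, and pod label below are placeholders/assumptions, not values taken from the Operator):

#!/bin/bash
# Hypothetical watchdog, run on (or with access to) the GPU node; not part of the Operator.
NODE="<node-name>"                      # placeholder
NS="nvidia-gpu-operator"                # assumption: default GPU Operator namespace
RESOURCE="<nvidia.com/GPU_RESOURCE>"    # placeholder: passthrough GPU resource name

while true; do
  # GPUs physically visible on the host (PCI vendor ID 10de = NVIDIA); this counts
  # every NVIDIA PCI function, so the filter may need refining for your hardware.
  host_gpus=$(lspci -n -d 10de: | wc -l)
  # GPU capacity currently advertised by the Node
  node_gpus=$(oc get node "$NODE" -o json | jq -r --arg r "$RESOURCE" '.status.capacity[$r] // "0"')
  if [ "$host_gpus" -ne "$node_gpus" ]; then
    echo "host sees $host_gpus GPUs, node advertises $node_gpus; restarting device plugin"
    # assumption: the sandbox device plugin pods carry this app label
    oc delete pod -n "$NS" -l app=nvidia-sandbox-device-plugin-daemonset \
      --field-selector spec.nodeName="$NODE"
  fi
  sleep 60
done

Something like this would clearly be better handled inside the device plugin itself, which is why I'm asking about a built-in interval health check.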

How to reproduce

I reproduced the issue by logically removing one of the GPU PCI devices from the node using the command:

echo "1" > /sys/bus/pci/devices/<gpu_pci_id>/remove

and validated that the GPU was no longer visible from the host using lspci.

Then, the output of oc describe node <node> showed that the number of exposed GPUs didn't change. After restarting the sandbox pod, the GPU count was updated to the correct number.

To re-add the GPU you can run the command:

echo "1" > /sys/bus/pci/rescan

and restart the sandbox pod again.
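
For convenience, the whole reproduction can be scripted in one place; the PCI address, node name, namespace, and pod label below are placeholders/assumptions rather than values from the Operator:

# Reproduction sketch, run on the GPU node.
GPU="0000:3b:00.0"        # placeholder <gpu_pci_id>

# 1. Logically remove the GPU from the host and confirm it is gone
echo "1" > /sys/bus/pci/devices/$GPU/remove
lspci -s "$GPU"           # prints nothing once the device is removed

# 2. The Node still advertises the old GPU count at this point
oc describe node <node> | grep -A 5 "Capacity:"

# 3. Re-add the GPU and restart the sandbox device plugin pod to resync
echo "1" > /sys/bus/pci/rescan
oc delete pod -n nvidia-gpu-operator -l app=nvidia-sandbox-device-plugin-daemonset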

Versions

  • NVIDIA GPU Operator - 23.6.0
  • NVIDIA KubeVirt GPU Device Plugin - v1.2.2
  • OpenShift - 4.12.35
  • NVIDIA sandbox device plugin image - nvcr.io/nvidia/kubevirt-gpu-device-plugin@sha256:9484110986c80ab83bc404066ca4b7be115124ec04ca16bce775403e92bfd890
@rthallisey
Collaborator

Thanks for the feature request.

GPU health checks are an important feature that we'd love to have yesterday. However, the challenge is finding the most effective way to solve the problem, so that we correctly detect failures and remediate them. The areas we're investigating are fault-tolerant scheduling, so that we avoid problematic GPUs, and proper remediation steps, so that users aren't impacted.

I'll follow up on this issue when we've aligned on a solution.

cc @cdesiniotis


doronkg commented Jun 27, 2024

Hey @rthallisey, I'm writing here on behalf of my colleague @nadav213000.
It seems that the resolution to this issue was introduced in #105 and released in v1.2.8, correct?

@visheshtanksale
Contributor

> Hey @rthallisey, I'm writing here on behalf of my colleague @nadav213000. It seems that the resolution to this issue was introduced in #105 and released in v1.2.8, correct?

Yes, #105 resolves the scenarios that you have mentioned here and is released with v1.2.8.

@cdesiniotis
Contributor

@nadav213000 GPU Operator 24.6.0 has been released and contains the fix for this issue.
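
For anyone verifying the rollout: assuming a default install in the nvidia-gpu-operator namespace, you can check which kubevirt-gpu-device-plugin image the sandbox device plugin daemonset is running; after upgrading to GPU Operator 24.6.0 it should be v1.2.8 or newer.

# Assumes the default GPU Operator namespace; adjust -n if installed elsewhere.
oc get daemonset nvidia-sandbox-device-plugin-daemonset -n nvidia-gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'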
