Unable to turn on advanced upgrade controller #688

Closed
age9990 opened this issue Mar 27, 2024 · 2 comments

age9990 commented Mar 27, 2024

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04
  • Kernel Version: 5.15.0-69
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: v23.9.2 with the NVIDIADriver CRD enabled

2. Issue or feature description

In our cluster, one GPU node has a disk issue, so its status is NotReady. When I turn on the advanced upgrade controller by setting driver.upgradePolicy.autoUpgrade to true, the controller is not enabled and the GPU Operator logs the errors below.
I tried setting nvidia.com/gpu-driver-upgrade.skip=true on the broken node, but the same error occurred.
The advanced upgrade controller works as expected in another k8s cluster where every node is Ready. However, since a node may be down only temporarily, would it be reasonable to bypass broken nodes rather than fail straight away?

GPU Operator error logs:
{"level":"error","ts":"2024-03-27T06:00:03.292Z","logger":"controllers.Upgrade","msg":"Failed to build node upgrade state for pod","pod":{"namespace":"gpu-operator","name":"nvidia-gpu-driver-ubuntu20.04-797bd4457c-x4czx"},"error":"unable to get node : resource name may not be empty"}
{"level":"error","ts":"2024-03-27T06:00:03.292Z","logger":"controllers.Upgrade","msg":"Failed to build cluster upgrade state","error":"unable to get node : resource name may not be empty"}
{"level":"error","ts":"2024-03-27T06:00:03.292Z","msg":"Reconciler error","controller":"upgrade-controller","object":{"name":"cluster-policy"},"namespace":"","name":"cluster-policy","reconcileID":"474846e5-07f9-445a-9107-a452581f1a69","error":"unable to get node : resource name may not be empty"}

cdesiniotis (Contributor) commented

would it be reasonable to bypass broken nodes rather than fail straight away?

Yes, the controller should be able to handle such a scenario and skip unhealthy nodes. A change was introduced to our upgrade library to address this. In particular, this conditional should resolve the issue you are encountering: https://github.com/NVIDIA/k8s-operator-libs/blob/main/pkg/upgrade/upgrade_state.go#L263-267
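For reference, here is a minimal, hypothetical sketch of the kind of guard that conditional adds: when a driver pod has not been scheduled (its spec.nodeName is empty, e.g. because its node is NotReady), the state builder skips the pod instead of attempting a node lookup with an empty name. Function and variable names below are illustrative and are not the library's actual identifiers.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// buildState is an illustrative stand-in for the upgrade library's cluster
// state builder. Pods without a spec.nodeName (still Pending, e.g. because
// their node is NotReady) are skipped rather than triggering a node Get()
// with an empty name, which is what produced
// "unable to get node : resource name may not be empty".
func buildState(driverPods []corev1.Pod) []string {
	var nodes []string
	for _, pod := range driverPods {
		if pod.Spec.NodeName == "" {
			// No node to resolve yet; skip this pod instead of failing
			// the whole reconcile.
			fmt.Printf("skipping pod %s/%s: not scheduled to a node yet\n",
				pod.Namespace, pod.Name)
			continue
		}
		nodes = append(nodes, pod.Spec.NodeName)
	}
	return nodes
}

func main() {
	pods := []corev1.Pod{
		{Spec: corev1.PodSpec{NodeName: "gpu-node-1"}},
		{}, // an unscheduled driver pod, as on the NotReady node
	}
	fmt.Println(buildState(pods)) // [gpu-node-1]
}
```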

We bumped the version of the upgrade library included in the gpu-operator recently, so this issue should be addressed in our next release: 8a7a442

@cdesiniotis cdesiniotis self-assigned this Apr 10, 2024
@cdesiniotis cdesiniotis added the bug Issue/PR to expose/discuss/fix a bug label Apr 10, 2024
mikemckiernan added a commit to NVIDIA/cloud-native-docs that referenced this issue Apr 30, 2024
cdesiniotis (Contributor) commented

Hi @age9990, GPU Operator 24.3.0 has been released and contains a fix for this issue.
https://github.com/NVIDIA/gpu-operator/releases/tag/v24.3.0

I am closing this issue, but please re-open it if you are still encountering this with 24.3.0.
