Unable to turn on advanced upgrade controller #688

Closed
age9990 opened this issue Mar 27, 2024 · 2 comments

age9990 commented Mar 27, 2024

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04
  • Kernel Version: 5.15.0-69
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: v23.9.2 with the NVIDIADriver CRD enabled

2. Issue or feature description

In our cluster, one GPU node has a disk issue, so its status is NotReady. When I turn on the advanced upgrade controller by setting driver.upgradePolicy.autoUpgrade to true, the controller is not enabled and the GPU Operator logs the errors below.
I tried setting nvidia.com/gpu-driver-upgrade.skip=true on the broken node, but the same error occurred.
The advanced upgrade controller works as expected in another k8s cluster where every node is Ready. However, since a node may be down only temporarily, would it be reasonable to bypass broken nodes rather than fail straight away?

GPU Operator error logs:
{"level":"error","ts":"2024-03-27T06:00:03.292Z","logger":"controllers.Upgrade","msg":"Failed to build node upgrade state for pod","pod":{"namespace":"gpu-operator","name":"nvidia-gpu-driver-ubuntu20.04-797bd4457c-x4czx"},"error":"unable to get node : resource name may not be empty"}
{"level":"error","ts":"2024-03-27T06:00:03.292Z","logger":"controllers.Upgrade","msg":"Failed to build cluster upgrade state","error":"unable to get node : resource name may not be empty"}
{"level":"error","ts":"2024-03-27T06:00:03.292Z","msg":"Reconciler error","controller":"upgrade-controller","object":{"name":"cluster-policy"},"namespace":"","name":"cluster-policy","reconcileID":"474846e5-07f9-445a-9107-a452581f1a69","error":"unable to get node : resource name may not be empty"}

cdesiniotis (Contributor) commented

would it be reasonable to bypass broken nodes rather than fail straight away?

Yes, the controller should be able to handle such a scenario and skip unhealthy nodes. A change was introduced to our upgrade library to address this. In particular, this conditional should resolve the issue you are encountering: https://github.com/NVIDIA/k8s-operator-libs/blob/main/pkg/upgrade/upgrade_state.go#L263-267
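For reference, here is a minimal, hypothetical sketch of the kind of guard that conditional adds: when a driver pod has not been scheduled (its spec.nodeName is empty, e.g. because its node is NotReady), the state builder skips the pod instead of attempting a node lookup with an empty name. Function and variable names below are illustrative and are not the library's actual identifiers.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// buildState is an illustrative stand-in for the upgrade library's cluster
// state builder. Pods without a spec.nodeName (still Pending, e.g. because
// their node is NotReady) are skipped rather than triggering a node Get()
// with an empty name, which is what produced
// "unable to get node : resource name may not be empty".
func buildState(driverPods []corev1.Pod) []string {
	var nodes []string
	for _, pod := range driverPods {
		if pod.Spec.NodeName == "" {
			// No node to resolve yet; skip this pod instead of failing
			// the whole reconcile.
			fmt.Printf("skipping pod %s/%s: not scheduled to a node yet\n",
				pod.Namespace, pod.Name)
			continue
		}
		nodes = append(nodes, pod.Spec.NodeName)
	}
	return nodes
}

func main() {
	pods := []corev1.Pod{
		{Spec: corev1.PodSpec{NodeName: "gpu-node-1"}},
		{}, // an unscheduled driver pod, as on the NotReady node
	}
	fmt.Println(buildState(pods)) // [gpu-node-1]
}
```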

We bumped the version of the upgrade library included in the gpu-operator recently, so this issue should be addressed in our next release: 8a7a442

@cdesiniotis cdesiniotis self-assigned this Apr 10, 2024
@cdesiniotis cdesiniotis added the bug Issue/PR to expose/discuss/fix a bug label Apr 10, 2024
mikemckiernan added a commit to NVIDIA/cloud-native-docs that referenced this issue Apr 30, 2024
cdesiniotis (Contributor) commented

Hi @age9990, GPU Operator 24.3.0 has been released and contains a fix for this issue.
https://github.com/NVIDIA/gpu-operator/releases/tag/v24.3.0

I am closing this issue, but please re-open it if you are still encountering this with 24.3.0.
