Commit b0c8d2e

enhancements/update/do-not-block-on-degraded: New enhancement proposal
The cluster-version operator (CVO) uses an update-mode when transitioning between releases, where the manifest operands are sorted into a task-node graph, and the CVO walks the graph reconciling. Since 4.1, the cluster-version operator has blocked during update and reconcile modes (but not during install mode) on Degraded=True ClusterOperator. This enhancement proposes ignoring Degraded when deciding whether to block on a ClusterOperator manifest. The goal of blocking on manifests with sad resources is to avoid further destabilization. For example, if we have not reconciled a namespace manifest or ServiceAccount RoleBinding, there's no point in trying to update the consuming operator Deployment. Or if we are unable to update the Kube-API-server operator, we don't want to inject unsupported kubelet skew by asking the machine-config operator to update nodes. However, blocking the update on a sad resource has the downside that later manifest-graph task-nodes are not reconciled, while the CVO waits for the sad resource to return to happiness. We maximize safety by blocking when progress would be risky, while continuing when progress would be safe, and possibly helpful. Our experience with Degraded=True blocks turns up cases where blocking is not helpful, so this enhancement proposes no longer blocking on that condition. We will continue to block on Available=False ClusterOperator, or when the ClusterOperator versions have not yet reached the values requested by the ClusterOperator's release manifest.
1 parent 038cbd1 commit b0c8d2e

3 files changed

+184
-2
lines changed

dev-guide/cluster-version-operator/dev/clusteroperator.md

+1-1
@@ -233,7 +233,7 @@ Conditions determine when the CVO considers certain actions complete, the follow
233233
| Begin upgrade(patch) | any | any | any | any | any
234234
| Begin upgrade(minor) | any | any | any | any | not false
235235
| Begin upgrade (w/ force) | any | any | any | any | any
236-
| Upgrade completion[2]| newVersion(target version for the upgrade) | true | false | any | any
236+
| Upgrade completion[2]| newVersion(target version for the upgrade) | true | any | any | any
237237

238238
[1] Install works on all components in parallel, it does not wait for any component to complete before starting another one.
239239

dev-guide/cluster-version-operator/user/reconciliation.md

-1
@@ -157,7 +157,6 @@ The ClusterOperator builder only monitors the in-cluster object and blocks until
157157
```
158158
159159
would block until the in-cluster ClusterOperator reported `operator` at version 4.1.0.
160-
* Not degraded (except during initialization, where we ignore the degraded status)
161160

162161
### CustomResourceDefinition
163162

@@ -0,0 +1,183 @@
1+
---
2+
title: do-not-block-on-degraded-true-clusteroperators
3+
authors:
4+
- "@wking"
5+
reviewers:
6+
- "@PratikMahajan, update team lead"
7+
- "@sdodson, update staff engineer"
8+
approvers:
9+
- "@PratikMahajan, update team lead"
10+
api-approvers:
11+
- None
12+
creation-date: 2024-11-25
13+
last-updated: 2024-11-25
14+
tracking-link:
15+
- https://issues.redhat.com/browse/OTA-540
16+
---
17+
18+
# Do not block on Degraded=True ClusterOperators
19+
20+
## Summary
21+
22+
The cluster-version operator (CVO) uses an update-mode when transitioning between releases, where the manifest operands are [sorted into a task-node graph](/dev-guide/cluster-version-operator/user/reconciliation.md#manifest-graph), and the CVO walks the graph reconciling.
23+
Since 4.1, the cluster-version operator has blocked during update and reconcile modes (but not during install mode) on `Degraded=True` ClusterOperator.
24+
This enhancement proposes ignoring `Degraded` when deciding whether to block on a ClusterOperator manifest.
25+
26+
## Motivation
27+
28+
The goal of blocking on manifests with sad resources is to avoid further destabilization.
29+
For example, if we have not reconciled a namespace manifest or ServiceAccount RoleBinding, there's no point in trying to update the consuming operator Deployment.
30+
Or if we are unable to update the Kube-API-server operator, we don't want to inject [unsupported kubelet skew][kubelet-skew] by asking the machine-config operator to update nodes.
31+
32+
However, blocking the update on a sad resource has the downside that later manifest-graph task-nodes are not reconciled, while the CVO waits for the sad resource to return to happiness.
33+
We maximize safety by blocking when progress would be risky, while continuing when progress would be safe, and possibly helpful.
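
The cost of blocking can be illustrated with a minimal sketch. It assumes a simplified dependency map in place of the CVO's real task-node graph; the `walk` function, the `reconcile` callback, and the node names are hypothetical and are not taken from the cluster-version operator's code.

```python
from typing import Callable, Dict, List


def walk(graph: Dict[str, List[str]], reconcile: Callable[[str], bool]) -> Dict[str, str]:
    """Walk a dependency map and report what happened to each node.

    graph maps a node name to the names of its prerequisites (an assumed,
    simplified stand-in for the manifest task-node graph).
    reconcile returns True when the node reconciled, False when it is sad.
    The result maps each node to "done", "blocked", or "skipped".
    """
    state: Dict[str, str] = {}
    remaining = set(graph)
    progressed = True
    while remaining and progressed:
        progressed = False
        for node in sorted(remaining):
            prereqs = graph[node]
            if any(p not in state for p in prereqs):
                continue  # a prerequisite has not been visited yet
            if all(state[p] == "done" for p in prereqs):
                state[node] = "done" if reconcile(node) else "blocked"
            else:
                # A prerequisite is blocked or skipped, so this node waits.
                state[node] = "skipped"
            remaining.discard(node)
            progressed = True
    for node in remaining:
        state[node] = "skipped"  # unknown prerequisites or cycles
    return state


graph = {
    "namespace": [],
    "kube-apiserver-operator": ["namespace"],
    "machine-config-operator": ["kube-apiserver-operator"],
}
result = walk(graph, lambda node: node != "kube-apiserver-operator")
print(result)
```

In this sketch a single sad node leaves every later node unreconciled, which is the downside that has to be weighed against the safety benefit of blocking.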
34+
35+
Our experience with `Degraded=True` blocks turns up cases like:
36+
37+
* 4.6 `Degraded=True` on an unreachable, user-provided node, with monitoring reporting `UpdatingnodeExporterFailed`, network reporting `RolloutHung`, and machine-config reporting `MachineConfigDaemonFailed`.
38+
But those ClusterOperators were all still `Available=True`, and in 4.10 and later, monitoring workloads are guarded by PodDisruptionBudgets (PDBs).
39+
40+
### User Stories
41+
42+
43+
> "As a _role_, I want to _take some action_ so that I can _accomplish a
44+
goal_."
45+
46+
Make the change feel real for users, without getting bogged down in
47+
implementation details.
48+
49+
Here are some example user stories to show what they might look like:
50+
51+
* As a cluster administrator, I want the ability to defer recovering `Degraded=True` ClusterOperators without slowing ClusterVersion updates.
52+
53+
### Goals
54+
55+
ClusterVersion updates will no longer block on ClusterOperators solely based on `Degraded=True`.
56+
57+
Summarize the specific goals of the proposal. How will we know that
58+
this has succeeded? A good goal describes something a user wants from
59+
their perspective, and does not include the implementation details
60+
from the proposal.
61+
62+
### Non-Goals
63+
64+
* Adjusting how the cluster-version operator treats `Available` and `versions` in ClusterOperator status.
65+
The CVO will still block on `Available=False` ClusterOperators, and will also still block until a ClusterOperator's `status.versions` reach the values requested by its release manifest.
66+
67+
* Adjusting whether `Degraded` ClusterOperator conditions are propagated through to the ClusterVersion `Failing` condition.
68+
As with the current install mode, the sad condition will be propagated through to `Failing=True`, unless outweighed by a more serious condition like `Available=False`.
69+
70+
## Proposal
71+
72+
The cluster-version operator currently has [a mode switch][cvo-mode-switch] that makes `Degraded` ClusterOperator a non-blocking condition that is still propagated through to `Failing`.
73+
This enhancement proposes making that an unconditional `UpdateEffectReport`, regardless of the CVO's current mode (installing, updating, reconciling, etc.).
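
As a rough illustration of the proposed decision, here is a minimal sketch. It is not the cluster-version operator's implementation: the `ClusterOperatorStatus` type, the `check` function, and the `"block"` / `"report"` effect names are hypothetical, with `"report"` standing in for the non-blocking `UpdateEffectReport` behavior described above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ClusterOperatorStatus:
    """Hypothetical, simplified view of a ClusterOperator's status."""
    conditions: Dict[str, str] = field(default_factory=dict)  # e.g. {"Available": "True"}
    versions: Dict[str, str] = field(default_factory=dict)    # e.g. {"operator": "4.1.0"}


def check(
    status: ClusterOperatorStatus,
    expected_versions: Dict[str, str],
    degraded_blocks: bool,
) -> Tuple[Optional[str], str]:
    """Return (effect, reason), where effect is "block", "report", or None.

    "block" stops the walk at this manifest. "report" lets the walk continue
    while the problem is still surfaced to the administrator.
    """
    # Versions requested by the release manifest must be reached first.
    for name, want in expected_versions.items():
        if status.versions.get(name) != want:
            return ("block", f"version {name!r} has not reached {want}")
    # Assumption for this sketch: anything other than Available=True blocks.
    if status.conditions.get("Available") != "True":
        return ("block", "not available")
    # Simplification for this sketch: only an explicit Degraded=True counts.
    if status.conditions.get("Degraded") == "True":
        return ("block" if degraded_blocks else "report", "degraded")
    return (None, "")


def degraded_blocks_today(mode: str) -> bool:
    """Mode switch as described in this document: install mode does not block."""
    return mode != "installing"


sad = ClusterOperatorStatus(
    conditions={"Available": "True", "Degraded": "True"},
    versions={"operator": "4.1.0"},
)
expected = {"operator": "4.1.0"}
print(check(sad, expected, degraded_blocks_today("updating")))  # current behavior
print(check(sad, expected, degraded_blocks=False))              # proposed behavior
```

Under the proposal, the `degraded_blocks` input would effectively always be false, while the `Available` and `versions` checks keep their blocking behavior.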
74+
75+
### Workflow Description
76+
77+
Cluster administrators will be largely unaware of this feature.
78+
They will no longer have ClusterVersion update progress slowed by `Degraded=True` ClusterOperators, so there will be less admin involvement there.
79+
They will continue to be notified of `Degraded=True` ClusterOperators via [the `warning` `ClusterOperatorDegraded` alert][ClusterOperatorDegraded] and the `Failing=True` ClusterVersion condition.
80+
81+
### API Extensions
82+
83+
No API extensions are needed for this proposal.
84+
85+
### Topology Considerations
86+
87+
#### Hypershift / Hosted Control Planes
88+
89+
HyperShift's ClusterOperator context is the same as standalone, so it will receive the same benefits from the same cluster-version operator code change, and does not need special consideration.
90+
91+
#### Standalone Clusters
92+
93+
The enhancement is expected to improve the update experience on standalone, by decoupling ClusterVersion update completion from recovering `Degraded=True` ClusterOperators, granting the cluster administrator the flexibility to address update speed and operator degradation independently.
94+
95+
#### Single-node Deployments or MicroShift
96+
97+
Single-node's ClusterOperator context is the same as standalone, so it will receive the same benefits from the same cluster-version operator code change, and does not need special consideration.
98+
This change is a minor tweak to existing CVO code, so it is not expected to impact resource consumption.
99+
100+
MicroShift updates are managed via RPMs, without a cluster-version operator, so it is not exposed to the ClusterVersion updates this enhancement is refining, and not affected by the changes proposed in this enhancement.
101+
102+
### Implementation Details/Notes/Constraints
103+
104+
The code change is expected to be a handful of lines, as discussed in [the *Proposal* section](#proposal), so there are no further implementation details needed.
105+
106+
### Risks and Mitigations
107+
108+
The risk would be that there are some ClusterOperators that currently rely on the cluster-version operator blocking during updates on ClusterOperators that are `Available=True`, `Degraded=True`, and which set the release manifest's expected `versions`.
109+
As discussed in [the *Motivation* section](#motivation), we're not currently aware of any such ClusterOperators.
110+
If any turn up, we can mitigate by [declaring conditional update risks](targeted-update-edge-blocking.md) using the existing `cluster_operator_conditions{condition="Degraded"}` PromQL metric, while teaching the relevant operators to set `Available=False` and/or withhold their `versions` bumps until the issue that needs to block further ClusterVersion update progress has been resolved.
111+
112+
How will security be reviewed and by whom?
113+
Unclear. Feedback welcome.
114+
115+
How will UX be reviewed and by whom?
116+
Unclear. Feedback welcome.
117+
118+
### Drawbacks
119+
120+
As discussed in [the *Risks* section](#risks-and-mitigations), the main drawback is changing behavior that we've had in place for many years.
121+
But we do not expect much customer pushback based on "hey, my update completed?! I expected it to stick on this sad component...".
122+
We do expect it to reduce customer frustration when they want the update to complete, but for reasons like administrative silos do not have the ability to recover a component from minor degradation themselves.
123+
124+
## Test Plan
125+
126+
**Note:** *Section not required until targeted at a release.*
127+
128+
Consider the following in developing a test plan for this enhancement:
129+
- Will there be e2e and integration tests, in addition to unit tests?
130+
- How will it be tested in isolation vs with other components?
131+
- What additional testing is necessary to support managed OpenShift service-based offerings?
132+
133+
No need to outline all of the test cases, just the general strategy. Anything
134+
that would count as tricky in the implementation and anything particularly
135+
challenging to test should be called out.
136+
137+
All code is expected to have adequate tests (eventually with coverage
138+
expectations).
139+
140+
## Graduation Criteria
141+
142+
There are no API changes proposed by this enhancement, which only affects sad-path handling, so we expect the code change to go straight to the next generally-available release, without feature gating or staged graduation.
143+
144+
### Dev Preview -> Tech Preview
145+
146+
Not applicable.
147+
148+
### Tech Preview -> GA
149+
150+
Not applicable.
151+
152+
### Removing a deprecated feature
153+
154+
Not applicable.
155+
156+
## Upgrade / Downgrade Strategy
157+
158+
This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatibility issues.
159+
160+
## Version Skew Strategy
161+
162+
This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatibility issues.
163+
164+
## Operational Aspects of API Extensions
165+
166+
There are no API changes proposed by this enhancement.
167+
168+
## Support Procedures
169+
170+
This enhancement is a small pivot in how the cluster-version operator processes ClusterOperator manifests during updates.
171+
As discussed in [the *Drawbacks* section](#drawbacks), we do not expect cluster admins to open support cases related to this change.
172+
173+
## Alternatives
174+
175+
We could continue with the current approach, and absorb the occasional friction it causes.
176+
177+
## Infrastructure Needed
178+
179+
No additional infrastructure is needed for this enhancement.
180+
181+
[ClusterOperatorDegraded]: https://github.com/openshift/cluster-version-operator/blob/820b74aa960717aae5431f783212066736806785/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L106-L124
182+
[cvo-mode-switch]: https://github.com/openshift/cluster-version-operator/blob/820b74aa960717aae5431f783212066736806785/pkg/cvo/internal/operatorstatus.go#L241-L245
183+
[kubelet-skew]: https://kubernetes.io/releases/version-skew-policy/#kubelet
