OTA-541: enhancements/update/do-not-block-on-degraded: New enhancement proposal #1719

Open: wants to merge 3 commits into base: master
2 changes: 1 addition & 1 deletion dev-guide/cluster-version-operator/dev/clusteroperator.md
@@ -233,7 +233,7 @@ Conditions determine when the CVO considers certain actions complete, the follow…
| Begin upgrade(patch) | any | any | any | any | any
| Begin upgrade(minor) | any | any | any | any | not false
| Begin upgrade (w/ force) | any | any | any | any | any
- | Upgrade completion[2]| newVersion(target version for the upgrade) | true | false | any | any
+ | Upgrade completion[2]| newVersion(target version for the upgrade) | true | any | any | any

[1] Install works on all components in parallel, it does not wait for any component to complete before starting another one.

3 changes: 2 additions & 1 deletion dev-guide/cluster-version-operator/user/reconciliation.md
@@ -157,7 +157,8 @@ The ClusterOperator builder only monitors the in-cluster object and blocks until
```

would block until the in-cluster ClusterOperator reported `operator` at version 4.1.0.
- * Not degraded (except during initialization, where we ignore the degraded status)
+ * [Prior to 4.19](/enhancements/update/do-not-block-on-degraded-true-clusteroperators.md), not degraded (except during initialization, where we ignore the degraded status).
+   Since 4.19, ClusterOperator `Degraded` conditions are no longer blocking.

### CustomResourceDefinition

211 changes: 211 additions & 0 deletions enhancements/update/do-not-block-on-degraded-true-clusteroperators.md
@@ -0,0 +1,211 @@
---
title: do-not-block-on-degraded-true-clusteroperators
authors:
- "@wking"
reviewers:
- "@PratikMahajan, update team lead"
- "@sdodson, update staff engineer"
approvers:
- "@PratikMahajan, update team lead"
api-approvers:
- None
creation-date: 2024-11-25
last-updated: 2025-01-31
tracking-link:
- https://issues.redhat.com/browse/OTA-540
---

# Do not block on Degraded=True ClusterOperators

## Summary

The cluster-version operator (CVO) uses an update mode when transitioning between releases, in which the manifest operands are [sorted into a task-node graph](/dev-guide/cluster-version-operator/user/reconciliation.md#manifest-graph) and the CVO walks the graph, reconciling each task-node.
Since 4.1, the cluster-version operator has blocked during update and reconcile modes (but not during install mode) on `Degraded=True` ClusterOperator.
This enhancement proposes ignoring `Degraded` when deciding whether to block on a ClusterOperator manifest.

## Motivation

The goal of blocking on manifests with sad resources is to avoid further destabilization.
For example, if we have not reconciled a namespace manifest or ServiceAccount RoleBinding, there's no point in trying to update the consuming operator Deployment.
Or if we are unable to update the Kube-API-server operator, we don't want to inject [unsupported kubelet skew][kubelet-skew] by asking the machine-config operator to update nodes.
Contributor:

we observed another kind of upgrade blocker here. Applying the infrastructures.config.openshift.io manifest failed as the CRD had introduced some validations and that needed the apiserver to be upgraded to support it. Unfortunately, the upgrade didn't progress and we had to manually step in to update the kube-apiserver to let the upgrade proceed. Is there a way to enhance these cases to at least let the apiserver upgrade before blocking?

Member Author:

> Is there a way to enhance these cases...

I've been trying to talk folks into the narrow Degraded handling pivot this enhancement currently covers since 2021. I accept that there may be other changes that we could make to help updates go more smoothly, but I'd personally rather limit the scope of this enhancement to the Degraded handling.


However, blocking the update on a sad resource has the downside that later manifest-graph task-nodes are not reconciled, while the CVO waits for the sad resource to return to happiness.
We maximize safety by blocking when further progress would be risky, while continuing when it would be safe and possibly helpful.
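
To make the trade-off concrete, here is a minimal, hypothetical sketch (simplified stand-ins, not the CVO's actual implementation or types) of how a blocked task-node keeps its dependent task-nodes from being reconciled:

```go
package main

import "fmt"

// taskNode is a hypothetical, simplified stand-in for the CVO's manifest-graph
// task-nodes; the real types live in the cluster-version operator's payload
// package and are more involved.
type taskNode struct {
	name     string
	blocked  bool        // e.g. a ClusterOperator manifest the CVO is still waiting on
	children []*taskNode // task-nodes that may only be reconciled after this one
}

// walk reconciles a node and, only on success, its children; a blocked node
// leaves its whole subtree untouched, which is the downside described above.
func walk(n *taskNode) {
	if n.blocked {
		fmt.Printf("%s: blocked; later task-nodes in this branch are not reconciled\n", n.name)
		return
	}
	fmt.Printf("%s: reconciled\n", n.name)
	for _, child := range n.children {
		walk(child)
	}
}

func main() {
	later := &taskNode{name: "later-manifest"}
	sad := &taskNode{name: "sad-clusteroperator-manifest", blocked: true, children: []*taskNode{later}}
	walk(sad)
}
```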

Our experience with `Degraded=True` blocks turns up cases like:

* A 4.6 cluster reporting `Degraded=True` because of an unreachable, user-provided node, with monitoring reporting `UpdatingnodeExporterFailed`, network reporting `RolloutHung`, and machine-config reporting `MachineConfigDaemonFailed`.
But those ClusterOperators were all still `Available=True`, and in 4.10 and later, monitoring workloads are guarded by PodDisruptionBudgets (PDBs).

### User Stories

* As a cluster administrator, I want the ability to defer recovering `Degraded=True` ClusterOperators without slowing ClusterVersion updates.

### Goals

ClusterVersion updates will no longer block on ClusterOperators solely based on `Degraded=True`.
Comment:

Does it mean that if no operator is unavailable, then the upgrade should always complete?

Member Author:

ClusterOperators aren't the only CVO-manifested resources, and if something else breaks like we fail to reconcile a RoleBinding or whatever, that will block further update progress. And for ClusterOperators, we'll still block on status.versions not being as far along as the manifest claimed, in addition to blocking if Available isn't True. Personally, status.versions seems like the main thing that's relevant, e.g. a component coming after the Kube API server knows it can use 4.18 APIs if the Kube API server has declared 4.18 versions. As an example of what the 4.18 Kube API server asks the CVO to wait on:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.18.0-rc.0-x86_64
Extracted release payload from digest sha256:054e75395dd0879e8c29cd059cf6b782742123177a303910bf78f28880431d1c created at 2024-12-02T21:11:00Z
$ yaml2json <manifests/0000_20_kube-apiserver-operator_07_clusteroperator.yaml | jq -c '.status.versions[]'
{"name":"operator","version":"4.18.0-rc.0"}
{"name":"raw-internal","version":"4.18.0-rc.0"}
{"name":"kube-apiserver","version":"1.31.3"}

A recent example of this being useful is openshift/machine-config-operator#4637, which got the CVO to block until the MCO had rolled out a single-arch -> multi-arch transition, without the MCO needing to touch its Degraded or Available conditions to slow the CVO down.

@jiajliu, Dec 4, 2024:

So could I say that if Failing=true for an upgrade, the reason should not be ClusterOperatorDegraded only?

Member Author:

No, we'll still propagate ClusterOperator(s)Degraded through to Failing, it just will no longer block the update's progress. So if the only issue Failing is talking about is ClusterOperator(s)Degraded, we expect the update to be moving towards completion, and not stalling.


### Non-Goals

* Adjusting how the cluster-version operator treats `Available` and `versions` in ClusterOperator status.
The CVO will still block on `Available=False` ClusterOperator, and will also still block on `status.versions` reported in the ClusterOperator's release manifest.

* Adjusting whether `Degraded` ClusterOperator conditions are propagated through to the ClusterVersion `Failing` condition.
As with the current install mode, the sad condition will be propagated through to `Failing=True`, unless outweighed by a more serious condition like `Available=False`.

## Proposal

The cluster-version operator currently has [a mode switch][cvo-degraded-mode-switch] that makes `Degraded` ClusterOperator a non-blocking condition that is still propagated through to `Failing`.
This enhancement proposes making that an unconditional `UpdateEffectReport`, regardless of the CVO's current mode (installing, updating, reconciling, etc.).
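
As a rough illustration of the pivot, here is a minimal sketch using simplified, hypothetical types (not the CVO's real `payload` package or its actual `UpdateEffect` handling) showing which branch changes and which gates remain untouched:

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the CVO's condition and update-effect
// handling; names here are illustrative only.
type conditionStatus string

const (
	conditionTrue  conditionStatus = "True"
	conditionFalse conditionStatus = "False"
)

type clusterOperator struct {
	available       conditionStatus
	degraded        conditionStatus
	versionsReached bool // status.versions match the release manifest
}

type updateEffect string

const (
	effectNone   updateEffect = "None"   // nothing to report
	effectReport updateEffect = "Report" // surface via the ClusterVersion Failing condition, keep going
	effectBlock  updateEffect = "Block"  // stop progressing past this manifest
)

// degradedEffect sketches the proposed pivot: Degraded=True used to block in
// update and reconcile modes (but not install mode); with this enhancement it
// is always reported without blocking.
func degradedEffect(co clusterOperator, installing, proposed bool) updateEffect {
	if co.degraded != conditionTrue {
		return effectNone
	}
	if installing || proposed {
		return effectReport
	}
	return effectBlock // pre-4.19 behavior during updates and reconciliation
}

// availabilityEffect is unchanged by this enhancement: Available=False and
// unmet versions still block.
func availabilityEffect(co clusterOperator) updateEffect {
	if co.available == conditionFalse || !co.versionsReached {
		return effectBlock
	}
	return effectNone
}

func main() {
	co := clusterOperator{available: conditionTrue, degraded: conditionTrue, versionsReached: true}
	fmt.Println("before this enhancement:", degradedEffect(co, false, false))
	fmt.Println("with this enhancement:  ", degradedEffect(co, false, true))
	fmt.Println("Available/versions gate (unchanged):", availabilityEffect(co))
}
```

The only behavioral change is the `Degraded` branch; the `Available` and `versions` gates described in the *Non-Goals* section are untouched.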
Member Author:

openshift/cluster-version-operator#482 is in flight with this change, if folks want to test pre-merge.


### Workflow Description

Cluster administrators will be largely unaware of this feature.
They will no longer have ClusterVersion update progress slowed by `Degraded=True` ClusterOperators, so there will be less admin involvement there.
They will continue to be notified of `Degraded=True` ClusterOperators via [the `warning` `ClusterOperatorDegraded` alert][ClusterOperatorDegraded] and the `Failing=True` ClusterVersion condition.

### API Extensions

No API extensions are needed for this proposal.

### Topology Considerations

#### Hypershift / Hosted Control Planes

HyperShift's ClusterOperator context is the same as standalone, so it will receive the same benefits from the same cluster-version operator code change, and does not need special consideration.

#### Standalone Clusters

The enhancement is expected to improve the update experience on standalone clusters by decoupling ClusterVersion update completion from recovering `Degraded=True` ClusterOperators, granting the cluster administrator the flexibility to address update speed and operator degradation independently.

#### Single-node Deployments or MicroShift

Single-node's ClusterOperator context is the same as standalone, so it will receive the same benefits from the same cluster-version operator code change, and does not need special consideration.
This change is a minor tweak to existing CVO code, so it is not expected to impact resource consumption.

MicroShift updates are managed via RPMs, without a cluster-version operator, so it is not exposed to the ClusterVersion updates this enhancement refines and is not affected by the proposed changes.

### Implementation Details/Notes/Constraints

The code change is expected to be a handful of lines, as discussed in [the *Proposal* section](#proposal), so there are no further implementation details needed.

### Risks and Mitigations

The risk is that some ClusterOperators currently rely on the cluster-version operator blocking during updates on ClusterOperators that are `Available=True` and `Degraded=True` and that have already reported the `versions` expected by the release manifest.
As discussed in [the *Motivation* section](#motivation), we are not currently aware of any such ClusterOperators.
If any turn up, we can mitigate by [declaring conditional update risks](targeted-update-edge-blocking.md) using the existing `cluster_operator_conditions{condition="Degraded"}` PromQL metric, while teaching the relevant operators to set `Available=False` and/or withhold their `versions` bumps until the issue that needs to block further ClusterVersion update progress has been resolved.

How will security be reviewed and by whom?
Unclear. Feedback welcome.

How will UX be reviewed and by whom?
Unclear. Feedback welcome.

### Drawbacks

As discussed in [the *Risks* section](#risks-and-mitigations), the main drawback is changing behavior that we've had in place for many years.
But we do not expect much customer pushback based on "hey, my update completed?! I expected it to stick on this sad component...".
We do expect it to reduce customer frustration when they want the update to complete but, for reasons like administrative silos, do not have the ability to recover a component from minor degradation themselves.

## Test Plan

**Note:** *Section not required until targeted at a release.*
Contributor @DavidHurta, Dec 6, 2024:

The enhancement and the tracking card OTA-541 are not targeted at a release. However, changes in the dev-guide/cluster-version-operator/user/reconciliation.md file suggest that the enhancement is targeted at the 4.19 release, and thus the Test Plan section should be addressed.

Member Author:

I'm not strongly opinionated on what the test plan looks like. We don't do a lot of intentional-sad-path update testing today in CI, and I'm fuzzy on what QE does in that space that could be expanded into this new space (or maybe they already test pushing a ClusterOperator component to Degraded=True mid update to see how the cluster handles that?).

Comment:

> test pushing a ClusterOperator component to Degraded=True mid update to see how the cluster handles that?

+1, that's also what I want to explore during testing. I also had some other immature checkpoints in mind when I first read this enhancement doc, but I still need some input from @wking to help me tidy them up. For example #1719 (comment).
I ask because there are already some cv.conditions checks in CI, and I'm thinking about whether we could update that logic to help catch issues once the feature is implemented.


Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
- What additional testing is necessary to support managed OpenShift service-based offerings?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

## Graduation Criteria

There are no API changes proposed by this enhancement, which only affects sad-path handling, so we expect the code change to go straight to the next generally-available release, without feature gating or staged graduation.

### Enhancement merge

To ensure we do not surprise anyone with this shift, we are collecting acks from at least one maintainer for each ClusterOperator, approving this pull request and the [existing semantics for `status.versions[name=operator]`](/dev-guide/cluster-version-operator/dev/clusteroperator.md#version):

> There MUST be a version with the name `operator`, which is watched by the CVO to know if a cluster operator has achieved the new level.
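
For context on what "achieved the new level" means, here is a minimal, hypothetical sketch (not the CVO's real implementation) of the versions comparison implied by that semantic, where the CVO keeps waiting until the in-cluster ClusterOperator reports every version declared in its release manifest:

```go
package main

import "fmt"

// operandVersion mirrors the name/version pairs a ClusterOperator reports in
// status.versions and that its release manifest declares; this is a simplified
// stand-in, not the real openshift/api type.
type operandVersion struct {
	name    string
	version string
}

// versionsReached reports whether every version listed in the release manifest
// (including the required name=operator entry) is already reported with the
// same value on the in-cluster ClusterOperator; until then the CVO keeps
// waiting on this manifest, independent of the Degraded condition.
func versionsReached(manifest, status []operandVersion) bool {
	reported := map[string]string{}
	for _, v := range status {
		reported[v.name] = v.version
	}
	for _, want := range manifest {
		if reported[want.name] != want.version {
			return false
		}
	}
	return true
}

func main() {
	manifest := []operandVersion{{name: "operator", version: "4.19.0"}}
	status := []operandVersion{{name: "operator", version: "4.18.0"}}
	fmt.Println("still waiting on versions:", !versionsReached(manifest, status))
}
```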

* [ ] authentication
Contributor:

I'm pretty new to the code base for cluster-authentication-operator but scanning through the code there is nothing that stands out in this operator that is concerning with this change.

Ack from @liouk or @ibihim would also be nice to have as an additional sanity check.

Member Author:

Checking internal org docs, the Auth team seems like they might be responsible for the service-ca ClusterOperator, in addition to this line's authentication ClusterOperator. In case those maintainers want to comment with something like:

I approve this pull and the existing status.versions[name=operator] semantics for the following ClusterOperators, where I'm a maintainer: authentication, service-ca.

or whatever, assuming they are ok making that assertion for the operators they maintain. Also fine if they want to say "I'm a maintainer for $CLUSTER_OPERATORS, and I'm not ok with this enhancement as it stands, because..." or whatever, I'm just trying to give folks a way to satisfy David's requested sign-off if they do happen to be on board.

Member:

Before I ack the authentication operator, I'd like to clarify the existing semantics for status.versions[name=operator].

As far as I understand, the operator sets its status.versions[name=operator] as follows:

  • operator boots up
  • it records its version in a version recorder
  • starts up its controllers (including controllers for the deployments of its operands, and a ClusterOperatorStatusController)
  • when the status controller runs, it will update the CO status field with its version

AFAIU, this does not guarantee correct handling of the mixed-version state, as described in the semantics:

> While any of your operands are still running software from the previous version then you are in a mixed version state, and you should continue to report the previous version. As soon as you can guarantee you are not and will not run any old versions of your operands, you can update the operator version on your ClusterOperator status.

When the operator starts running on the new version during an upgrade, it seems that it will update its version in the CO status, probably even before the operands have been upgraded to their new versions via the workload controllers. This seems to offend the mixed-version-state requirement as described above.

@wking any thoughts on this?

Member Author @wking, Feb 19, 2025:

I'm not familiar with the Auth operator's implementation, but checking CI to see if there's externally-measurable evidence of how it's currently working: https://amd64.ocp.releases.ci.openshift.org/ -> 4.18.0-rc.10 -> update from 4.17.17 -> Artifacts -> ... -> template-job artifacts:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/release-openshift-origin-installer-e2e-aws-upgrade/1891342561033850880/artifacts/e2e-aws-upgrade/clusteroperators.json | jq -c '.items[] | select(.metadata.name == "authentication").status.versions[]'
{"name":"operator","version":"4.18.0-rc.10"}
{"name":"oauth-apiserver","version":"4.18.0-rc.10"}
{"name":"oauth-openshift","version":"4.18.0-rc.10_openshift"}

You're currently asking the CVO to wait on operator and oauth-openshift, so the CVO doesn't care what you say for oauth-apiserver. Back to the CI artifacts to see how those gather-time values arrived:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/release-openshift-origin-installer-e2e-aws-upgrade/1891342561033850880/artifacts/e2e-aws-upgrade/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-authentication-operator" and .reason == "OperatorStatusChanged") | .firstTimestamp + " " + (.count | tostring) + " " + .message' | grep versions
2025-02-17T04:42:52Z 1 Status for clusteroperator/authentication changed: Degraded set to Unknown (""),Progressing set to Unknown (""),Available set to Unknown (""),Upgradeable set to Unknown (""),EvaluationConditionsDetected set to Unknown (""),status.relatedObjects changed from [] to [{"operator.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "infrastructures" "" "cluster"} {"config.openshift.io" "oauths" "" "cluster"} {"route.openshift.io" "routes" "openshift-authentication" "oauth-openshift"} {"" "services" "openshift-authentication" "oauth-openshift"} {"" "namespaces" "" "openshift-config"} {"" "namespaces" "" "openshift-config-managed"} {"" "namespaces" "" "openshift-authentication"} {"" "namespaces" "" "openshift-authentication-operator"} {"" "namespaces" "" "openshift-ingress"} {"" "namespaces" "" "openshift-oauth-apiserver"}],status.versions changed from [] to [{"operator" "4.17.17"}]
2025-02-17T04:44:44Z 1 Status for clusteroperator/authentication changed: status.versions changed from [{"operator" "4.17.17"}] to [{"operator" "4.17.17"} {"oauth-apiserver" "4.17.17"}]
2025-02-17T04:55:56Z 1 Status for clusteroperator/authentication changed: status.versions changed from [{"operator" "4.17.17"} {"oauth-apiserver" "4.17.17"}] to [{"operator" "4.17.17"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.17.17_openshift"}]
2025-02-17T05:38:09Z 1 Status for clusteroperator/authentication changed: status.versions changed from [{"operator" "4.17.17"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.17.17_openshift"}] to [{"operator" "4.18.0-rc.10"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.17.17_openshift"}]
2025-02-17T05:40:08Z 1 Status for clusteroperator/authentication changed: status.versions changed from [{"operator" "4.18.0-rc.10"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.17.17_openshift"}] to [{"operator" "4.18.0-rc.10"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.18.0-rc.10_openshift"}]
2025-02-17T05:41:43Z 1 Status for clusteroperator/authentication changed: status.versions changed from [{"operator" "4.18.0-rc.10"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.18.0-rc.10_openshift"}] to [{"operator" "4.18.0-rc.10"} {"oauth-apiserver" "4.18.0-rc.10"} {"oauth-openshift" "4.18.0-rc.10_openshift"}]

So:

  • 4:42:52, operator wakes up, sets the install version for versions[name="operator"]. Largely irrelevant, because during install-time, we're usually blocked on a slower Available, among the things the CVO waits on. And this enhancement proposal isn't suggesting changes to install-time behavior. But sure, if you wanted to be more conformant with the doc'ed semantics of operator, you could adjust things to not set operator this early.
  • 4:44:44, a few minutes later, oauth-apiserver is added with the install version. Still orthogonal-to-this-enhancement install-time behavior.
  • 4:55:56, ~11 minutes later, oauth-openshift set. Still orthogonal-to-this-enhancement install-time behavior.
  • 5:38:09, into the update now, operator bumped to 4.18.0-rc.10, but not the others. This is worth changing, and I've opened OCPBUGS-51059 to track.
  • 5:40:08, oauth-openshift bumped. Now the CVO will no longer block on versions in the transition to 4.18.
  • 5:41:43, oauth-apiserver bumped. Not sure what this tracks, but you aren't asking the CVO to wait on it. Maybe that's intentional? Or maybe you want to start asking the CVO to wait on it, by listing it in your ClusterOperator manifest?

Contributor @ibihim, Feb 25, 2025:

service-ca-operator looks good.

The version is set once the operand hits the expected generation at all replicas.

* [ ] baremetal
* [ ] cloud-controller-manager
Contributor:

To my knowledge there is no reason that a degradation of the CCMO would have an impact on later components in a way that would be spectacularly bad.

The most likely case would be that the ingress operator needs to recreate a service, and a broken CCM prevents that from happening. I believe this would just delay effectuation of a change, rather than actually break the cluster though.

Though should that inability be better represented through the available condition?

Member Author:

Scoping on Available is not all that consistent between ClusterOperators today, but yeah, if Service handling is part of the cloud controller's functionality, and it's broken for all Services, that sounds like it would trip this:

> Available=False means at least part of the component is non-functional, and that the condition requires immediate administrator intervention.

But for the purpose of updates, I don't expect things like "the ingress operator needs to recreate a service" to be update-triggered. I'd expect that to just be either "new customer workload wants a new Service" or "unrelated to the update, something deleted an existing Service, and now we need to create a replacement". Both Available=False and Degraded=True will summon admin attention, with different latency/urgency. But I'm not seeing how "slow down the current ClusterVersion update" would help reduce risk for broken Service creation.

Contributor:

Agreed, I think we are probably ok here

The one potential area for concern would be if the CIO suddenly decided that it was going to use the ALBO in the future (this is actually on the table) and started requiring services to be recreated during/post upgrade. Normally an admin would be responsible for actually deleting the service, but CCM would be needed to remove the actual LB and finalizer.

Odds of this being a problem that this EP exacerbates I think are low though

* [ ] cloud-credential
* [ ] cluster-api
Contributor:

This is tech preview currently; anything that gets broken by this EP can be fixed before we go GA.

* [ ] cluster-autoscaler
Contributor:

Do we care about the ability to autoscale new capacity when MCO is trying to reboot new nodes?

In theory, if the autoscaler wasn't working, then we may not be able to bring in new capacity, which may then halt the MCO roll out

I guess in the case that autoscaling was completely unavailable, we would want to be Available=False and block the update that way instead?

Member Author:

yeah, same Available context as discussed in the cloud-controller-manager thread. I don't think autoscaling comes into updates either, though. If you update into a broken autoscaler and lose autoscaling, that's obviously not good. But most MachineConfigPool updates don't rely on the autoscaler, right? They're updating existing Machines in place? So that can all move smoothly along regardless of autoscaler liveness, and the cluster admin can launch a new update or hack around the autoscaler issue to recover that orthogonally.
Even with this enhancement, we'd still block further update progress on an Available=False autoscaler (I'm only floating a change to Degraded). I'm just saying that regardless of Available or Degraded, I'm not seeing an autoscaler/MCO connection yet.

Contributor:

> They're updating existing Machines in place?

If your cluster is tightly packed, and you have no capacity to move the workloads to be able to complete a drain, then yes, autoscaling might come into an upgrade.

But yes, I think we are probably ok on the degraded front, and we need to make sure we are setting available correctly

* [ ] config-operator
* [ ] console
* [ ] control-plane-machine-set
Contributor:

If this CO goes degraded, it means the control plane is in a bad state. (Too many replicas, too few replicas, not enough ready replicas). I think that would be a good reason to stop the upgrade.

Should we be leveraging available for some of these? I suspect in most cases something else (KAS?) would likely go unavailable before we did

* [ ] csi-snapshot-controller
* [ ] dns
* [ ] etcd
* [ ] image-registry
* [ ] ingress
* [ ] insights
Contributor:

Ack. I think we should be fine when this enhancement is implemented. The degraded Insights Operator doesn't affect any other components in the cluster.

It's marked as degraded when it cannot upload the Insights data to the console.redhat.com Ingress (here an option is to disable the data gathering) or if any other connection (entitlements or cluster transfer) can't be made (here it depends on the HTTP response code > 500).

* [ ] kube-apiserver
* [ ] kube-controller-manager
* [ ] kube-scheduler
* [ ] kube-storage-version-migrator
* [ ] machine-api
Contributor:

Degradation here likely doesn't impact other COs, though it would tie into the cluster-autoscaler's ability to scale up.

* [ ] machine-approver
Contributor:

This CO sets its status Available=True and Degraded=False and has no concept of changing them once set.

Member Author:

that seems optimistic 😅 Is there nothing that could go wrong in the machine-approver component that might call for admin intervention? Or is all the current admin-summoning here routed through alerts or reliant on other components (e.g. if we stop approving new Machines, something on the machine-api side will let an admin know, so machine-approver can assume it's being reported).

Contributor:

> that seems optimistic

We've had around 6 years of that optimism so far and it's going well I guess?

We have alerting when there are too many pending CSRs, which pokes the admin sufficiently, and AFAIK that's the only real failure mode we have that needs admin intervention, so I guess we are ok for the moment.

* [x] machine-config, [acked](https://github.com/openshift/enhancements/pull/1719#discussion_r1934857504) by @yuqi-zhang
* [ ] marketplace
* [ ] monitoring
* [ ] network
* [ ] node-tuning
* [ ] olm
* [ ] openshift-apiserver
* [ ] openshift-controller-manager
* [ ] openshift-samples
* [ ] operator-lifecycle-manager
* [ ] operator-lifecycle-manager-catalog
* [ ] operator-lifecycle-manager-packageserver
* [ ] service-ca
* [ ] storage

### Dev Preview -> Tech Preview

Not applicable.

### Tech Preview -> GA

Not applicable.

### Removing a deprecated feature

Not applicable.

## Upgrade / Downgrade Strategy

This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatibility issues.

## Version Skew Strategy

This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatibility issues.

## Operational Aspects of API Extensions

There are no API changes proposed by this enhancement.

## Support Procedures

This enhancement is a small pivot in how the cluster-version operator processes ClusterOperator manifests during updates.
As discussed in [the *Drawbacks* section](#drawbacks), we do not expect cluster admins to open support cases related to this change.

## Alternatives

We could continue with the current approach, and absorb the occasional friction it causes.

## Infrastructure Needed

No additional infrastructure is needed for this enhancement.

[ClusterOperatorDegraded]: https://github.com/openshift/cluster-version-operator/blob/820b74aa960717aae5431f783212066736806785/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L106-L124
[cvo-degraded-mode-switch]: https://github.com/openshift/cluster-version-operator/blob/820b74aa960717aae5431f783212066736806785/pkg/cvo/internal/operatorstatus.go#L241-L245
[kubelet-skew]: https://kubernetes.io/releases/version-skew-policy/#kubelet