OTA-541: enhancements/update/do-not-block-on-degraded: New enhancement proposal #1719
We observed another kind of upgrade blocker here. Applying the `infrastructures.config.openshift.io` manifest failed because the CRD had introduced some validations that needed the apiserver to be upgraded to support them. Unfortunately, the upgrade didn't progress, and we had to step in manually to update the kube-apiserver to let the upgrade proceed. Is there a way to enhance these cases to at least let the apiserver upgrade before blocking?
I've been trying to talk folks into the narrow `Degraded`-handling pivot this enhancement currently covers since 2021. I accept that there may be other changes we could make to help updates go more smoothly, but I'd personally rather limit the scope of this enhancement to the `Degraded` handling.
Does that mean that if no operator is unavailable, the upgrade should always complete?
ClusterOperators aren't the only CVO-manifested resources, and if something else breaks, like failing to reconcile a RoleBinding, that will block further update progress. And for ClusterOperators, we'll still block on `status.versions` not being as far along as the manifest claims, in addition to blocking if `Available` isn't `True`. Personally, `status.versions` seems like the main thing that's relevant, e.g. a component coming after the Kube API server knows it can use 4.18 APIs if the Kube API server has declared 4.18 `versions`; the 4.18 Kube API server's ClusterOperator manifest declares the `versions` it asks the CVO to wait on. A recent example of this being useful is openshift/machine-config-operator#4637, which got the CVO to block until the MCO had rolled out a single-arch -> multi-arch transition, without the MCO needing to touch its `Degraded` or `Available` conditions to slow the CVO down.
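A minimal sketch of the version-gating idea described above, using hypothetical names (`NamedVersion`, `versionsCaughtUp`) rather than the CVO's actual implementation: the CVO keeps waiting on a ClusterOperator until every version named in its manifest shows up, at the expected level, in the live `status.versions`:

```go
// Hypothetical sketch, not the CVO's real code: gate update progress on
// status.versions matching what the ClusterOperator manifest declares.
package main

import "fmt"

// NamedVersion mirrors the shape of a ClusterOperator versions entry.
type NamedVersion struct {
	Name    string
	Version string
}

// versionsCaughtUp reports whether every version the manifest asks for is
// already reported in the live ClusterOperator status.
func versionsCaughtUp(manifest, status []NamedVersion) bool {
	reported := make(map[string]string, len(status))
	for _, v := range status {
		reported[v.Name] = v.Version
	}
	for _, want := range manifest {
		if reported[want.Name] != want.Version {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative data only: the manifest asks for 4.18.0 levels, but the
	// operand entry still reports 4.17.17, so the CVO would keep waiting.
	manifest := []NamedVersion{
		{Name: "operator", Version: "4.18.0"},
		{Name: "kube-apiserver", Version: "4.18.0"},
	}
	status := []NamedVersion{
		{Name: "operator", Version: "4.18.0"},
		{Name: "kube-apiserver", Version: "4.17.17"},
	}
	fmt.Println("caught up:", versionsCaughtUp(manifest, status)) // false
}
```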
So could I say that, for an upgrade, if `Failing=True`, the reason should not be `ClusterOperatorDegraded` only?
No, we'll still propagate `ClusterOperator(s)Degraded` through to `Failing`, it just will no longer block the update's progress. So if the only issue `Failing` is talking about is `ClusterOperator(s)Degraded`, we expect the update to be moving towards completion, not stalling.
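As a rough illustration of that split, here is a minimal sketch (assumed names like `OperatorState` and `summarize`, not the CVO's actual code) in which `Degraded` still feeds the `Failing` summary while only `Available=False` or lagging `versions` block further progress:

```go
// Hypothetical sketch of "report Failing, but don't block on Degraded alone".
package main

import "fmt"

type OperatorState struct {
	Name             string
	Available        bool
	Degraded         bool
	VersionsCaughtUp bool
}

// summarize returns whether ClusterVersion should report Failing=True and
// whether the CVO should stop pushing the update forward for this operator.
func summarize(op OperatorState) (failing bool, blockProgress bool) {
	failing = op.Degraded || !op.Available
	// Degraded alone no longer blocks; Available and versions still do.
	blockProgress = !op.Available || !op.VersionsCaughtUp
	return failing, blockProgress
}

func main() {
	op := OperatorState{Name: "authentication", Available: true, Degraded: true, VersionsCaughtUp: true}
	failing, block := summarize(op)
	fmt.Printf("%s: Failing=%v blockProgress=%v\n", op.Name, failing, block)
	// Prints: authentication: Failing=true blockProgress=false
}
```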
openshift/cluster-version-operator#482 is in flight with this change, if folks want to test pre-merge.
The enhancement and the tracking card OTA-541 are not targeted at a release. However, the changes to the `dev-guide/cluster-version-operator/user/reconciliation.md` file suggest that the enhancement is targeted at the 4.19 release, so the Test Plan section should be addressed.
I'm not strongly opinionated on what the test plan looks like. We don't do a lot of intentional-sad-path update testing in CI today, and I'm fuzzy on what QE does in that space that could be expanded into this new space (or maybe they already test pushing a ClusterOperator component to `Degraded=True` mid-update to see how the cluster handles that?).
+1, that's also what I want to explore during testing. I also had some other rough checkpoints in mind when I first read this enhancement doc, but I still need some input from @wking to help me tidy them up, for example #1719 (comment). I asked because there are already some ClusterVersion conditions checks in CI, and I'm thinking about whether we could update that logic to help catch issues once the feature is implemented.
I'm pretty new to the cluster-authentication-operator code base, but scanning through the code, nothing stands out in this operator as concerning with this change. An ack from @liouk or @ibihim would also be nice to have as an additional sanity check.
Checking internal org docs, the Auth team seems like they might be responsible for the `service-ca` ClusterOperator, in addition to this line's `authentication` ClusterOperator. Those maintainers are welcome to comment with an explicit ack, assuming they are ok making that assertion for the operators they maintain. Also fine if they want to say "I'm a maintainer for `$CLUSTER_OPERATORS`, and I'm not ok with this enhancement as it stands, because..." or whatever; I'm just trying to give folks a way to satisfy David's requested sign-off if they do happen to be on board.
Before I ack the authentication operator, I'd like to clarify the existing semantics for `status.versions[name=operator]`. As far as I understand, the operator sets its `status.versions[name=operator]` once it starts running the new version (via the `ClusterOperatorStatusController`). AFAIU, this does not guarantee the mixed state described in the semantics: when the operator starts running the new version during an upgrade, it seems that it will update its version in the ClusterOperator status, probably even before the operands have been upgraded to their new versions via the workload controllers. This seems to violate the mixed-version-state requirement as described above. @wking, any thoughts on this?
I'm not familiar with the Auth operator's implementation, but checking CI for externally-measurable evidence of how it's currently working: https://amd64.ocp.releases.ci.openshift.org/ -> 4.18.0-rc.10 -> update from 4.17.17 -> Artifacts -> ... -> template-job artifacts. You're currently asking the CVO to wait on `operator` and `oauth-openshift`, so the CVO doesn't care what you say for `oauth-apiserver`. Back to the CI artifacts to see how those gather-time values arrived. So:

- `versions[name="operator"]` is set early. Largely irrelevant, because during install time we're usually blocked on a slower `Available`, among the things the CVO waits on, and this enhancement proposal isn't suggesting changes to install-time behavior. But sure, if you wanted to be more conformant with the doc'ed semantics of `operator`, you could adjust things to not set `operator` this early.
- `oauth-apiserver` is added with the install version. Still orthogonal-to-this-enhancement install-time behavior.
- `oauth-openshift` is set. Still orthogonal-to-this-enhancement install-time behavior.
- `operator` is bumped to `4.18.0-rc.10`, but not the others. This is worth changing, and I've opened OCPBUGS-51059 to track it.
- `oauth-openshift` is bumped. Now the CVO will no longer block on `versions` in the transition to 4.18.
- `oauth-apiserver` is bumped. Not sure what this tracks, but you aren't asking the CVO to wait on it. Maybe that's intentional? Or maybe you want to start asking the CVO to wait on it, by listing it in your ClusterOperator manifest?
service-ca-operator looks good. The version is set once the operand hits the expected generation at all replicas.
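For reference, a minimal sketch of that "bump only after full rollout" gate, assuming a standard Deployment operand and a hypothetical helper name (`operandRolledOut`), not the service-ca-operator's actual code:

```go
// Hypothetical sketch: only report the new version once the operand Deployment
// has observed its latest generation and all replicas are updated and available.
package sketch

import appsv1 "k8s.io/api/apps/v1"

func operandRolledOut(d *appsv1.Deployment) bool {
	desired := int32(1)
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	return d.Status.ObservedGeneration >= d.Generation &&
		d.Status.UpdatedReplicas == desired &&
		d.Status.AvailableReplicas == desired
}
```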
To my knowledge, there is no reason that a degradation of the CCMO would have an impact on later components in a way that would be spectacularly bad. The most likely case would be that the ingress operator needs to recreate a Service, and a broken CCM prevents that from happening. I believe this would just delay effectuation of a change rather than actually break the cluster, though. That said, should that inability be better represented through the `Available` condition?
Scoping on `Available` is not all that consistent between ClusterOperators today, but yeah, if Service handling is part of the cloud controller's functionality and it's broken for all Services, that sounds like it would trip that condition. But for the purpose of updates, I don't expect things like "the ingress operator needs to recreate a Service" to be update-triggered. I'd expect that to be either "new customer workload wants a new Service" or "unrelated to the update, something deleted an existing Service, and now we need to create a replacement". Both `Available=False` and `Degraded=True` will summon admin attention, with different latency/urgency. But I'm not seeing how "slow down the current ClusterVersion update" would help reduce risk for broken Service creation.
Agreed, I think we are probably ok here. The one potential area for concern would be if the CIO suddenly decided that it was going to use the ALBO in the future (this is actually on the table) and started requiring Services to be recreated during/post upgrade. Normally an admin would be responsible for actually deleting the Service, but the CCM would be needed to remove the actual LB and finalizer. The odds of this being a problem that this EP exacerbates are low, though.
This is Tech Preview currently; anything that gets broken by this EP can be fixed before we go GA.
Do we care about the ability to autoscale new capacity when the MCO is trying to reboot nodes? In theory, if the autoscaler isn't working, we may not be able to bring in new capacity, which may then halt the MCO rollout. I guess in the case that autoscaling was completely unavailable, we would want to be `Available=False` and block the update that way instead?
Yeah, same `Available` context as discussed in the cloud-controller-manager thread. I don't think autoscaling comes into updates either, though. If you update into a broken autoscaler and lose autoscaling, that's obviously not good. But most MachineConfigPool updates don't rely on the autoscaler, right? They're updating existing Machines in place, so that can all move smoothly along regardless of autoscaler liveness, and the cluster admin can launch a new update or hack around the autoscaler issue to recover that orthogonally. Even with this enhancement, we'd still block further update progress on an `Available=False` autoscaler (I'm only floating a change to `Degraded`). I'm just saying that, regardless of `Available` or `Degraded`, I'm not seeing an autoscaler/MCO connection yet.
If your cluster is tightly packed and you have no capacity to move the workloads to be able to complete a drain, then yes, autoscaling might come into an upgrade. But yes, I think we are probably ok on the `Degraded` front, and we need to make sure we are setting `Available` correctly.
If this ClusterOperator goes degraded, it means the control plane is in a bad state (too many replicas, too few replicas, not enough ready replicas). I think that would be a good reason to stop the upgrade. Should we be leveraging `Available` for some of these? I suspect in most cases something else (the KAS?) would likely go unavailable before we did.
Ack. I think we should be fine when this enhancement is implemented. A degraded Insights Operator doesn't affect any other components in the cluster. It's marked as degraded when it cannot upload the Insights data to the console.redhat.com Ingress (here an option is to disable the data gathering) or when any other connection (entitlements or cluster transfer) can't be made (there it depends on the HTTP response code being above 500).
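A minimal sketch of those two degrade paths as described above, with hypothetical helper names (`uploadDegradesOperator`, `otherConnectionDegradesOperator`), not the insights-operator's actual code:

```go
// Hypothetical sketch of the two degrade paths described above: a failed
// upload to the console.redhat.com Ingress degrades the operator unless data
// gathering is disabled, while a failed entitlements/cluster-transfer request
// degrades it only when the HTTP response code is above 500.
package sketch

func uploadDegradesOperator(gatheringDisabled, uploadFailed bool) bool {
	return uploadFailed && !gatheringDisabled
}

func otherConnectionDegradesOperator(statusCode int) bool {
	return statusCode > 500
}
```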
Degradation here likely doesn't impact other ClusterOperators, though it would tie into the cluster-autoscaler's ability to scale up.
This ClusterOperator sets its status to `Available=True` and `Degraded=False` and has no concept of changing them once set.
That seems optimistic 😅 Is there nothing that could go wrong in the machine-approver component that might call for admin intervention? Or is all the current admin-summoning here routed through alerts or reliant on other components (e.g. if we stop approving new Machines, something on the machine-api side will let an admin know, so machine-approver can assume it's being reported)?
We've had around 6 years of that optimism so far and it's going well, I guess? We have alerts when there are too many pending CSRs, which pokes the admin sufficiently, and AFAIK that's the only real failure mode that needs admin intervention, so I guess we are ok for the moment.