OCPBUGS-23514: Failing=Unknown upon long CO updating #1165
Conversation
@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact:
The bug has been updated to refer to the pull request using the external bug tracker.
cc @dis016
Force-pushed from bd6ed4c to 577f975.
Testing with build1: and build2:
### upgrade to build1
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-nmgvdzt/release:latest --force --allow-explicit-upgrade
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 41m Working towards 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest:
711 of 903 done (78% complete), waiting on image-registry
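### (illustrative addition, not from the original run) the ClusterVersion can be watched while the update progresses:
$ oc get clusterversion version -w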
### upgrade to build2
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-zx89yxb/release:latest --force --allow-explicit-upgrade --allow-upgrade-with-warnings
### Issue1: After a couple of mins, we see "longer than expected" on etcd and kube-apiserver.
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 47m Working towards 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest:
111 of 903 done (12% complete), waiting on etcd, kube-apiserver over 30 minutes which is longer than expected
### Be patient: Expected result showed up.
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 99m Working towards 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected
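The same information is available from the Progressing condition itself; the query below is an illustrative addition, not part of the original test run:
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Progressing")]}{"\n"}'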
### Issue2: The status command showed nothing about image-registry.
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=operators
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment: Progressing
Target Version: 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest (from incomplete 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest)
Completion: 85% (29 operators updated, 0 updating, 5 waiting)
Duration: 56m (Est. Time Remaining: 1h22m)
Operator Health: 34 Healthy
Control Plane Nodes
NAME ASSESSMENT PHASE VERSION EST MESSAGE
ip-10-0-11-19.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-26-148.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-95-12.ec2.internal Outdated Pending 4.19.0-ec.2 ?
= Worker Upgrade =
WORKER POOL ASSESSMENT COMPLETION STATUS
worker Pending 0% (0/3) 3 Available, 0 Progressing, 0 Draining
Worker Pool Nodes: worker
NAME ASSESSMENT PHASE VERSION EST MESSAGE
ip-10-0-24-13.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-54-250.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-80-23.ec2.internal Outdated Pending 4.19.0-ec.2 ?
= Update Health =
SINCE LEVEL IMPACT MESSAGE
55m56s Warning None Previous update to 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest never completed, last complete update was 4.19.0-ec.2
Run with --details=health for additional description and links to related online documentation
$ oc get co kube-apiserver image-registry
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-apiserver 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest True False False 130m
image-registry 4.19.0-ec.2 True False False 124m
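For comparison, the image-registry ClusterOperator's reported versions and conditions can be dumped directly; these queries are illustrative additions, not from the original run:
$ oc get co image-registry -o jsonpath='{range .status.versions[*]}{.name}{"\t"}{.version}{"\n"}{end}'
$ oc get co image-registry -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'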
Force-pushed from 54e9f08 to 34a5a59.
Triggered build3: build 4.19,openshift/cluster-image-registry-operator#1184,openshift/cluster-version-operator#1165
Repeated the test, updating to build1 and then build3:
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-r67x3s2/release:latest --force --allow-explicit-upgrade --allow-upgrade-with-warnings
### The non-zero guard on the CO update start times seems to be working, as Issue1 is gone
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 34m Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 111 of 903 done (12% complete), waiting on etcd, kube-apiserver
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 75m Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest:
711 of 903 done (78% complete), waiting on image-registry
### Be patient and there it goes
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 90m Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected
$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected
Upgradeable=False
Reason: UpdateInProgress
Message: An update is already in progress and the details are in the Progressing condition
Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.19
warning: Cannot display available updates:
Reason: VersionNotFound
Message: Unable to retrieve available updates: currently reconciling cluster version 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest not found in the "candidate-4.19" channel
$ oc get co kube-apiserver image-registry
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-apiserver 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest True False False 121m
image-registry 4.19.0-ec.2 True False False 113m
### Issue2 is still there, as expected
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=all
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment: Progressing
Target Version: 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest (from incomplete 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest)
Completion: 85% (29 operators updated, 0 updating, 5 waiting)
Duration: 57m (Est. Time Remaining: 18m)
Operator Health: 34 Healthy
Control Plane Nodes
NAME ASSESSMENT PHASE VERSION EST MESSAGE
ip-10-0-11-67.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-34-60.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-71-225.ec2.internal Outdated Pending 4.19.0-ec.2 ?
= Worker Upgrade =
WORKER POOL ASSESSMENT COMPLETION STATUS
worker Pending 0% (0/3) 3 Available, 0 Progressing, 0 Draining
Worker Pool Nodes: worker
NAME ASSESSMENT PHASE VERSION EST MESSAGE
ip-10-0-24-151.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-57-13.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-97-93.ec2.internal Outdated Pending 4.19.0-ec.2 ?
= Update Health =
Message: Previous update to 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest never completed, last complete update was 4.19.0-ec.2
Since: 56m59s
Level: Warning
Impact: None
Reference: https://docs.openshift.com/container-platform/latest/updating/troubleshooting_updates/gathering-data-cluster-update.html#gathering-clusterversion-history-cli_troubleshooting_updates
Resources:
clusterversions.config.openshift.io: version
Description: Current update to 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest was initiated while the previous update to version 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest was still in progress
I think issue 2 above is caused by
Yeah, I agree - the crash is actually the easy case, because then there is at least some symptom that can be noticed and it is somewhat clear that if CVO says it is waiting for an
Surfacing the condition in the message makes the situation slightly better, but my concern is that we cannot easily consume this data for the Status API / command. I'd like to come up with at least something. Personally I would do
I do not want this for the reasons above.
I am going to do the following (the other three options above may lead to questions/trouble from users. I might come back to them if neither of the following goes through :knock :knock :knock):
SGTM
Force-pushed from 34a5a59 to 1c87755.
@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact:
The bug has been updated to refer to the pull request using the external bug tracker.
Test Scenario: Failing=Unknown when a slow update happens for the MCO operator for >90 minutes.
Step2: Create a build release with the MCO magic PR (openshift/machine-config-operator#4980)
Step3: Upgrade to the newly created build
Step4: The upgrade should get triggered and proceed normally. After some time (approximately >90 minutes for machine-config), CVO should report Failing=Unknown, reason=SlowClusterOperator and Progressing=True (see the check sketched after these steps).
The upgrade should never complete and should keep progressing.
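A minimal way to verify the expected condition during such a test (an illustrative query, not part of the original scenario write-up; while machine-config is slow it should report status Unknown with reason SlowClusterOperator):
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")]}{"\n"}'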
/label qe-approved
/retest-required
/override ci/prow/e2e-aws-ovn-techpreview
Tests passed but then the job tripped on post steps
/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
Tests passed but then the job tripped on infra issues in post steps
@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-aws-ovn-techpreview
@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
@petr-muller: petr-muller unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight openshift-staff-engineers.
Weird... /override ci/prow/e2e-agnostic-ovn-upgrade-into-change |
@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn-upgrade-into-change, ci/prow/e2e-aws-ovn-techpreview
@hongkailiu: The following test failed, say
/override ci/prow/e2e-aws-ovn-techpreview
@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-aws-ovn-techpreview
Merged commit b01f931 into openshift:main
@hongkailiu: Jira Issue OCPBUGS-23514: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-23514 has been moved to the MODIFIED state.
Fix included in accepted release 4.19.0-0.nightly-2025-04-04-170728
[ART PR BUILD NOTIFIER] Distgit: cluster-version-operator
When it takes too long (90m+ for machine-config and 30m+ for
the others) to upgrade a cluster operator, ClusterVersion shows
a message indicating that the upgrade might have hit some issue.

This covers the case in the related OCPBUGS-23538: for some
reason, the pod under the deployment that manages the CO hit
CrashLoopBackOff. The Deployment controller does not provide
useful conditions in this situation [1]; otherwise,
checkDeploymentHealth [2] would detect it.

Instead of the CVO figuring out the underlying pod's
CrashLoopBackOff, which would be better implemented by the
Deployment controller, the expectation is that the cluster
admin starts digging into the cluster when such a message
pops up.

In addition to the condition's message, we propagate
Failing=Unknown to make it available to other automation,
such as the update-status command.
[1]. kubernetes/kubernetes#106054
[2]. cluster-version-operator/lib/resourcebuilder/apps.go, lines 79 to 136 at commit 08c0459
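For the CrashLoopBackOff case described above, the digging expected from a cluster admin could start roughly as follows; the namespace and deployment name are examples for the image-registry operator and are assumptions, not taken from this PR:
$ oc get deployment -n openshift-image-registry cluster-image-registry-operator
$ oc get pods -n openshift-image-registry
$ oc describe pod -n openshift-image-registry <pod-in-CrashLoopBackOff>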