OCPBUGS-23514: Failing=Unknown upon long CO updating #1165

Merged
merged 3 commits into openshift:main from the OCPBUGS-23514 branch on Apr 19, 2025

Conversation

hongkailiu
Member

@hongkailiu hongkailiu commented Feb 28, 2025

When it takes too long (90m+ for machine-config and 30m+ for
others) to upgrade a cluster operator, ClusterVersion shows
a message indicating that the upgrade might have hit some issue.

This will cover the case in the related OCPBUGS-23538: for some
reason, the pod under the deployment that manages the CO hits
CrashLoopBackOff. The deployment controller does not surface useful
conditions in this situation [1]; otherwise, checkDeploymentHealth [2]
would detect it.

Instead of the CVO figuring out the underlying pod's
CrashLoopBackOff, which might be better implemented by the
deployment controller, the expectation is that the cluster admin
starts digging into the cluster when such a message pops up.

In addition to the condition's message, we propagate Failing=Unknown
to make the signal available to other automation, such as the
update-status command.

[1]. kubernetes/kubernetes#106054

[2].

func (b *builder) checkDeploymentHealth(ctx context.Context, deployment *appsv1.Deployment) error {
	if b.mode == InitializingMode {
		return nil
	}
	iden := fmt.Sprintf("%s/%s", deployment.Namespace, deployment.Name)
	if deployment.DeletionTimestamp != nil {
		return fmt.Errorf("deployment %s is being deleted", iden)
	}
	var availableCondition *appsv1.DeploymentCondition
	var progressingCondition *appsv1.DeploymentCondition
	var replicaFailureCondition *appsv1.DeploymentCondition
	for idx, dc := range deployment.Status.Conditions {
		switch dc.Type {
		case appsv1.DeploymentProgressing:
			progressingCondition = &deployment.Status.Conditions[idx]
		case appsv1.DeploymentAvailable:
			availableCondition = &deployment.Status.Conditions[idx]
		case appsv1.DeploymentReplicaFailure:
			replicaFailureCondition = &deployment.Status.Conditions[idx]
		}
	}
	if replicaFailureCondition != nil && replicaFailureCondition.Status == corev1.ConditionTrue {
		return &payload.UpdateError{
			Nested:  fmt.Errorf("deployment %s has some pods failing; unavailable replicas=%d", iden, deployment.Status.UnavailableReplicas),
			Reason:  "WorkloadNotProgressing",
			Message: fmt.Sprintf("deployment %s has a replica failure %s: %s", iden, replicaFailureCondition.Reason, replicaFailureCondition.Message),
			Name:    iden,
		}
	}
	if availableCondition != nil && availableCondition.Status == corev1.ConditionFalse && progressingCondition != nil && progressingCondition.Status == corev1.ConditionFalse {
		return &payload.UpdateError{
			Nested:  fmt.Errorf("deployment %s is not available and not progressing; updated replicas=%d of %d, available replicas=%d of %d", iden, deployment.Status.UpdatedReplicas, deployment.Status.Replicas, deployment.Status.AvailableReplicas, deployment.Status.Replicas),
			Reason:  "WorkloadNotAvailable",
			Message: fmt.Sprintf("deployment %s is not available %s (%s) or progressing %s (%s)", iden, availableCondition.Reason, availableCondition.Message, progressingCondition.Reason, progressingCondition.Message),
			Name:    iden,
		}
	}
	if progressingCondition != nil && progressingCondition.Status == corev1.ConditionFalse && progressingCondition.Reason == "ProgressDeadlineExceeded" {
		return &payload.UpdateError{
			Nested:  fmt.Errorf("deployment %s is %s=%s: %s: %s", iden, progressingCondition.Type, progressingCondition.Status, progressingCondition.Reason, progressingCondition.Message),
			Reason:  "WorkloadNotProgressing",
			Message: fmt.Sprintf("deployment %s is %s=%s: %s: %s", iden, progressingCondition.Type, progressingCondition.Status, progressingCondition.Reason, progressingCondition.Message),
			Name:    iden,
		}
	}
	if availableCondition == nil && progressingCondition == nil && replicaFailureCondition == nil {
		klog.Warningf("deployment %s is not setting any expected conditions, and is therefore in an unknown state", iden)
	}
	return nil
}
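
A minimal, self-contained sketch of the thresholding idea described above (not this PR's actual implementation; names such as slowThreshold, coUpdateStart, and waitingMessage are hypothetical) might look like this, including the guard that skips operators whose update start time has not been recorded yet:

package main

import (
	"fmt"
	"strings"
	"time"
)

// slowThreshold returns how long to wait on a cluster operator before
// hinting that the update is slower than expected: 90 minutes for
// machine-config, 30 minutes for the others.
func slowThreshold(name string) time.Duration {
	if name == "machine-config" {
		return 90 * time.Minute
	}
	return 30 * time.Minute
}

// waitingMessage builds the "waiting on ..." fragment of the Progressing
// message. coUpdateStart maps an operator name to the time the CVO started
// waiting on it; a zero time means no start has been recorded yet.
func waitingMessage(waitingOn []string, coUpdateStart map[string]time.Time, now time.Time) string {
	msg := fmt.Sprintf("waiting on %s", strings.Join(waitingOn, ", "))
	for _, name := range waitingOn {
		start := coUpdateStart[name]
		if start.IsZero() {
			continue // non-zero guard: no recorded start time, no hint
		}
		if threshold := slowThreshold(name); now.Sub(start) > threshold {
			msg += fmt.Sprintf(" over %d minutes which is longer than expected", int(threshold.Minutes()))
			break // one hint is enough for this sketch
		}
	}
	return msg
}

func main() {
	start := map[string]time.Time{"image-registry": time.Now().Add(-45 * time.Minute)}
	// Prints: waiting on image-registry over 30 minutes which is longer than expected
	fmt.Println(waitingMessage([]string{"image-registry"}, start, time.Now()))
}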

@openshift-ci-robot openshift-ci-robot added the jira/severity-moderate (Referenced Jira bug's severity is moderate for the branch this PR is targeting), jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type), and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) labels on Feb 28, 2025
@openshift-ci-robot
Contributor

@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

When it takes too long (90m+ for machine-config and 30m+ for others) to upgrade a cluster operator, ClusterVersion shows a message indicating that the upgrade might have hit some issue.

This will cover the case in the related OCPBUGS-23538: for some reason, the pod under the deployment that manages the CO hits CrashLoopBackOff. The deployment controller does not surface useful conditions in this situation [1]; otherwise, checkDeploymentHealth [2] would detect it.

Instead of the CVO figuring out the underlying pod's CrashLoopBackOff, which might be better implemented by the deployment controller, the expectation is that the cluster admin starts digging into the cluster when such a message pops up.

For now, we just modify the condition's message. We could propagate Failing=True if such requirements are collected from customers.

[1]. kubernetes/kubernetes#106054

[2]. (the checkDeploymentHealth snippet quoted in the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from petr-muller and wking February 28, 2025 01:56
@hongkailiu hongkailiu changed the title OCPBUGS-23514: Better a message upon long CO updating [wip]OCPBUGS-23514: Better a message upon long CO updating Feb 28, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Feb 28, 2025
@hongkailiu
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Feb 28, 2025
@openshift-ci-robot
Contributor

@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from jiajliu February 28, 2025 01:58
@jiajliu

jiajliu commented Feb 28, 2025

cc @dis016

@hongkailiu
Member Author

Testing with

launch 4.19.0-ec.2 aws

and build1:

build 4.19,openshift/cluster-image-registry-operator#1184

and build2:

build 4.19,openshift/cluster-image-registry-operator#1184,#1165

### upgrade to build1
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-nmgvdzt/release:latest --force --allow-explicit-upgrade

$ oc get clusterversion version                                               
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          41m     Working towards 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest:
 711 of 903 done (78% complete), waiting on image-registry

### upgrade to build2
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-zx89yxb/release:latest --force --allow-explicit-upgrade --allow-upgrade-with-warnings

### Issue1: After a couple of mins, we see "longer than expected" on etcd and kube-apiserver.
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          47m     Working towards 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest:
 111 of 903 done (12% complete), waiting on etcd, kube-apiserver over 30 minutes which is longer than expected

### Be patient: Expected result showed up.
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          99m     Working towards 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

### Issue2: status-command showed nothing about image-registry.
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=operators
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest (from incomplete 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest)
Completion:      85% (29 operators updated, 0 updating, 5 waiting)
Duration:        56m (Est. Time Remaining: 1h22m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-11-19.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-26-148.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-95-12.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-24-13.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-54-250.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-80-23.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?

= Update Health =
SINCE    LEVEL     IMPACT   MESSAGE
55m56s   Warning   None     Previous update to 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest never completed, last complete update was 4.19.0-ec.2

Run with --details=health for additional description and links to related online documentation

$ oc get co kube-apiserver image-registry
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest   True        False         False      130m
image-registry   4.19.0-ec.2                                            True        False         False      124m

@hongkailiu hongkailiu force-pushed the OCPBUGS-23514 branch 5 times, most recently from 54e9f08 to 34a5a59 on March 1, 2025 05:12
@hongkailiu
Member Author

Triggered build3:

build 4.19,openshift/cluster-image-registry-operator#1184,openshift/cluster-version-operator#1165

Repeated the test, updating to build1 and then build3:

$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-r67x3s2/release:latest --force --allow-explicit-upgrade --allow-upgrade-with-warnings

### the non-zero guard on the CO update start times seems to be working, as issue1 is gone
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          34m     Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 111 of 903 done (12% complete), waiting on etcd, kube-apiserver

$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          75m     Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest:
711 of 903 done (78% complete), waiting on image-registry

### be patient and there it goes
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          90m     Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

Upgradeable=False

  Reason: UpdateInProgress
  Message: An update is already in progress and the details are in the Progressing condition

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.19
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest not found in the "candidate-4.19" channel

$ oc get co kube-apiserver image-registry
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest   True        False         False      121m
image-registry   4.19.0-ec.2                                            True        False         False      113m

### issue2 is still there as expected
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=all
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest (from incomplete 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest)
Completion:      85% (29 operators updated, 0 updating, 5 waiting)
Duration:        57m (Est. Time Remaining: 18m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-11-67.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-34-60.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-71-225.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-24-151.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-57-13.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-97-93.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?

= Update Health =
Message: Previous update to 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest never completed, last complete update was 4.19.0-ec.2
  Since:       56m59s
  Level:       Warning
  Impact:      None
  Reference:   https://docs.openshift.com/container-platform/latest/updating/troubleshooting_updates/gathering-data-cluster-update.html#gathering-clusterversion-history-cli_troubleshooting_updates
  Resources:
    clusterversions.config.openshift.io: version
  Description: Current update to 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest was initiated while the previous update to version 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest was still in progress

I think issue 2 above is caused by --details=operators working only when a CO's Progressing=True. But for CO/image-registry, that was not the case at the time. We may want the status cmd to pick it up again from cv/version.
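
A toy illustration of that suspicion (purely hypothetical; the operator type and updatingOperators are made-up stand-ins, not code from oc): if the detail view lists only cluster operators whose own Progressing condition is True, then an operator that is merely outdated, as image-registry was above, never shows up:

package main

import "fmt"

// operator is a hypothetical, simplified stand-in for the per-operator
// summary a status-style command might build from ClusterOperator conditions.
type operator struct {
	name        string
	progressing bool
	version     string
}

// updatingOperators mimics the suspected --details=operators filter:
// only operators reporting Progressing=True are listed as updating.
func updatingOperators(ops []operator) []string {
	var out []string
	for _, op := range ops {
		if op.progressing {
			out = append(out, op.name)
		}
	}
	return out
}

func main() {
	ops := []operator{
		{name: "kube-apiserver", progressing: false, version: "4.19.0-0.test"},
		// image-registry is outdated but reports Progressing=False, so it is skipped.
		{name: "image-registry", progressing: false, version: "4.19.0-ec.2"},
	}
	fmt.Println(updatingOperators(ops)) // prints []; nothing about image-registry
}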

@hongkailiu hongkailiu changed the title [wip]OCPBUGS-23514: Better a message upon long CO updating OCPBUGS-23514: Better a message upon long CO updating Mar 1, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Mar 1, 2025
@petr-muller
Member

petr-muller commented Mar 6, 2025

Instead of the CVO figuring out the underlying pod's CrashLoopBackOff, which might be better implemented by the deployment controller, the expectation is that the cluster admin starts digging into the cluster when such a message pops up.

Yeah, I agree - the crash is actually the easy case, because then there is at least some symptom that can be noticed and it is somewhat clear that if CVO says it is waiting for an operator-x, and you discover it is not running, then something is wrong. I think our fix needs to handle the worst case of this failure case - CVO waits for operator-x to bump a version and it never does that, but there is no bad symptom otherwise - operator-x is running, not degraded, available. Just no progress (like the reproducer I made).

For now, we just modify the condition's message. We could propagate Failing=True if such requirements are collected from customers.

Surfacing the condition in the message makes the situation slightly better, but my concern is that we cannot easily consume this data for Status API / command. I want oc adm upgrade status to be at least a bit loud when there is a stuck update. In the current state, status cannot detect the stuck update, because all machine-readable parts of the API are identical to the happy update (Progressing=True, Failing=False).

I'd like to come up with at least something. Personally I would do Failing=True and would be happy about it :D Some proposals:

  • Failing=Unknown with a Reason=SlowClusterOperator or similar. Technically "unknown" describes the state well, but I am not sure whether consumers can handle the Unknown value well (would need to be tested with at least the web console). A small consumer-side sketch follows this list.
  • Progressing=Unknown with a Reason=SlowClusterOperator or similar. You could argue that if we wait for something to happen, and it does not happen for a long time, we do not know whether we are really progressing anymore. The downside is that Progressing now has a high-level meaning of "cluster started updating and has not finished yet", and it is likely that some consumers do not expect those semantics to change (they expect Progressing=True to stay set consistently until the update is completed).
  • Progressing=False with a Reason=SlowClusterOperator or similar. A stronger variant of the above: if we do not see progress, then we are not progressing. Same downside, but stronger (Progressing=True means an ongoing update, Progressing=False means no ongoing update).
  • Progressing=True with a Reason=SlowClusterOperator or similar. A minimal variant that does not change important signals but at least allows us to differentiate Reason=HappyProgress from Reason=SlowClusterOperator. Technically this does not meet the condition contract, because the reason should be tied to the last transition, not to the current fine state (it should say why Progressing went from False to True, not what the current fine-grained reason for being True is).
  • UpdateProcedureProgressing=False with a Reason=SlowClusterOperator. Basically a new condition that has semantics similar to Failing, but much finer. If it is True, it means that everything is 100% healthy from the CVO standpoint. If it is False, we are waiting for something to happen, and the Reason and Message would say what.
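
To make the consumer-side concern concrete, here is a small sketch of how a machine consumer could branch on the first option; classify is a hypothetical helper, not the status command's real logic:

package main

import "fmt"

// classify shows how a consumer could tell the proposed states apart from
// the Failing condition alone, without parsing the Progressing message.
func classify(failingStatus, failingReason string) string {
	switch {
	case failingStatus == "True":
		return "update is failing"
	case failingStatus == "Unknown" && failingReason == "SlowClusterOperator":
		return "update may be stuck: a cluster operator is slower than expected"
	default:
		return "update looks healthy so far"
	}
}

func main() {
	fmt.Println(classify("Unknown", "SlowClusterOperator"))
}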

We may want the status cmd to pick it up again from cv/version.

I do not want this for the reasons above.

@hongkailiu
Member Author

I am going to do the following (the other three options above may lead to questions/trouble from users; I might come back to them if neither of the following goes through :knock :knock :knock):

  • Failing=Unknown: I would go for this (Plan A) if console is happy with it OR the unhappiness of console can be easily fixed. (A condition-level sketch of Plan A follows this list.)
  • UpdateProcedureProgressing=False: Plan B, because learning/maintaining a new thing is always expensive for both us/maintainers and others/users.
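
A condition-level sketch of what Plan A amounts to, using the github.com/openshift/api/config/v1 types; setFailingUnknown and the message text are illustrative assumptions, not this PR's actual code:

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setFailingUnknown upserts a Failing=Unknown condition with
// Reason=SlowClusterOperator so that machine consumers such as
// `oc adm upgrade status` can tell a slow update apart from a happy one
// without parsing the Progressing message.
func setFailingUnknown(cv *configv1.ClusterVersion, message string, now metav1.Time) {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.ClusterStatusConditionType("Failing"),
		Status:             configv1.ConditionUnknown,
		Reason:             "SlowClusterOperator",
		Message:            message,
		LastTransitionTime: now,
	}
	for i := range cv.Status.Conditions {
		if cv.Status.Conditions[i].Type == cond.Type {
			cv.Status.Conditions[i] = cond
			return
		}
	}
	cv.Status.Conditions = append(cv.Status.Conditions, cond)
}

func main() {
	cv := &configv1.ClusterVersion{}
	setFailingUnknown(cv, "waiting on machine-config over 90 minutes which is longer than expected", metav1.Now())
	fmt.Printf("%+v\n", cv.Status.Conditions[0])
}

The point is that the Reason is machine-readable, so consumers do not have to parse free-form text to notice a possibly stuck update.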

@petr-muller
Member

SGTM

@openshift-ci-robot
Contributor

@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @dis016

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

When it takes too long (90m+ for machine-config and 30m+ for
others) to upgrade a cluster operator, ClusterVersion shows
a message indicating that the upgrade might have hit some issue.

This will cover the case in the related OCPBUGS-23538: for some
reason, the pod under the deployment that manages the CO hits
CrashLoopBackOff. The deployment controller does not surface useful
conditions in this situation [1]; otherwise, checkDeploymentHealth [2]
would detect it.

Instead of the CVO figuring out the underlying pod's
CrashLoopBackOff, which might be better implemented by the
deployment controller, the expectation is that the cluster admin
starts digging into the cluster when such a message pops up.

In addition to the condition's message, we propagate Failing=Unknown
to make the signal available to other automation, such as the
update-status command.

[1]. kubernetes/kubernetes#106054

[2]. (the checkDeploymentHealth snippet quoted in the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu hongkailiu changed the title OCPBUGS-23514: Better a message upon long CO updating OCPBUGS-23514: Failing=Unknown upon long CO updating Mar 7, 2025
@dis016

dis016 commented Apr 11, 2025

Test Scenario: Failing=Unknown when a slow update happens for the MCO operator for >90 minutes.
Step1: Install a cluster with 4.19.0-ec.2 and check the status of CVO.

dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion 
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        False         128m    Cluster version is 4.19.0-ec.2
dinesh@Dineshs-MacBook-Pro ~ % 
dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion version -o yaml | yq '.status.conditions[]|select(.type=="Failing" or .type=="Progressing")'
lastTransitionTime: "2025-04-11T09:54:04Z"
status: "False"
type: Failing
lastTransitionTime: "2025-04-11T10:34:48Z"
message: Cluster version is 4.19.0-ec.2
status: "False"
type: Progressing
dinesh@Dineshs-MacBook-Pro ~ % 

Step2: Create a release build with the MCO magic PR (openshift/machine-config-operator#4980)

build 4.19,openshift/machine-config-operator#4980,openshift/cluster-version-operator#1165

Step3: Upgrade to the newly created build

dinesh@Dineshs-MacBook-Pro ~ % oc adm upgrade --to-image=registry.build06.ci.openshift.org/ci-ln-x6iy65b/release@sha256:30a5bd2e60490e2fa6afc92ec0fa68b77a789a104de64da809e6b65094370c28  --allow-explicit-upgrade --force --allow-upgrade-with-warnings

warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-x6iy65b/release@sha256:30a5bd2e60490e2fa6afc92ec0fa68b77a789a104de64da809e6b65094370c28
dinesh@Dineshs-MacBook-Pro ~ % 

Step4: The upgrade should get triggered and proceed normally. After some time (approximately >90 minutes for machine-config), CVO should report Failing=Unknown with reason=SlowClusterOperator and Progressing=True.


dinesh@Dineshs-MacBook-Pro ~ % while true; do oc get clusterversion ; oc adm upgrade status; oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing" or .type=="Progressing")' ; oc get co machine-config ;  sleep 300; done
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          55s     Working towards 4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest: 111 of 908 done (12% complete), waiting on etcd, kube-apiserver
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest (from 4.19.0-ec.2)
Updating:        etcd
Completion:      3% (1 operators updated, 1 updating, 32 waiting)
Duration:        56s (Est. Time Remaining: 48m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                                                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
dis016-pwvwr-master-0.us-central1-a.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-master-1.us-central1-b.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-master-2.us-central1-c.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
dis016-pwvwr-worker-a-v7q7r   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-worker-b-5zcrc   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-worker-c-p52wb   Outdated     Pending   4.19.0-ec.2   ?     

= Update Health =
SINCE   LEVEL   IMPACT   MESSAGE
56s     Info    None     Update is proceeding well
{
  "lastTransitionTime": "2025-04-11T12:43:59Z",
  "status": "False",
  "type": "Failing"
}
{
  "lastTransitionTime": "2025-04-11T12:43:51Z",
  "message": "Working towards 4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest: 111 of 908 done (12% complete), waiting on etcd, kube-apiserver",
  "reason": "ClusterOperatorsUpdating",
  "status": "True",
  "type": "Progressing"
}
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.19.0-ec.2   True        False         False      9h  
...
...
...
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          148m    Working towards 4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest: 779 of 908 done (85% complete), waiting on machine-config over 90 minutes which is longer than expected
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Stalled
Target Version:  4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest (from 4.19.0-ec.2)
Completion:      97% (33 operators updated, 0 updating, 1 waiting)
Duration:        2h29m (Est. Time Remaining: N/A; estimate duration was 1h37m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                                                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
dis016-pwvwr-master-0.us-central1-a.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-master-1.us-central1-b.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-master-2.us-central1-c.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
dis016-pwvwr-worker-a-v7q7r   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-worker-b-5zcrc   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-worker-c-p52wb   Outdated     Pending   4.19.0-ec.2   ?     

= Update Health =
SINCE    LEVEL     IMPACT           MESSAGE
10m25s   Warning   Update Stalled   Cluster Version version is failing to proceed with the update (SlowClusterOperator)

Run with --details=health for additional description and links to related online documentation
{
  "lastTransitionTime": "2025-04-11T15:02:04Z",
  "message": "waiting on machine-config over 90 minutes which is longer than expected",
  "reason": "SlowClusterOperator",
  "status": "Unknown",
  "type": "Failing"
}
{
  "lastTransitionTime": "2025-04-11T12:43:51Z",
  "message": "Working towards 4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest: 779 of 908 done (85% complete), waiting on machine-config over 90 minutes which is longer than expected",
  "reason": "ClusterOperatorUpdating",
  "status": "True",
  "type": "Progressing"
}
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.19.0-ec.2   True        False         False      12h 

The upgrade should never complete and should keep progressing.
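
A hedged sketch of how this check could be automated, assuming the generated openshift/client-go config clientset and a kubeconfig at the default location; the helper name, the 3-hour deadline, and the 5-minute interval are illustrative choices that mirror the manual loop above:

package main

import (
	"context"
	"fmt"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForSlowClusterOperator polls the ClusterVersion until Failing=Unknown
// with Reason=SlowClusterOperator shows up, or the context expires.
func waitForSlowClusterOperator(ctx context.Context, client configclient.Interface, interval time.Duration) error {
	for {
		cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
		if err != nil {
			return err
		}
		for _, c := range cv.Status.Conditions {
			if c.Type == configv1.ClusterStatusConditionType("Failing") &&
				c.Status == configv1.ConditionUnknown &&
				c.Reason == "SlowClusterOperator" {
				fmt.Printf("observed expected condition: %s\n", c.Message)
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Hour)
	defer cancel()
	if err := waitForSlowClusterOperator(ctx, client, 5*time.Minute); err != nil {
		panic(err)
	}
}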

@dis016

dis016 commented Apr 11, 2025

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved label (Signifies that QE has signed off on this PR) on Apr 11, 2025
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b247772 and 2 for PR HEAD 8892f42 in total

@hongkailiu
Member Author

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b247772 and 2 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 5524608 and 1 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 5524608 and 2 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 4210eef and 2 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ee8b5ef and 2 for PR HEAD 8892f42 in total

2 similar comments
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ee8b5ef and 2 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ee8b5ef and 2 for PR HEAD 8892f42 in total

@petr-muller
Member

/override ci/prow/e2e-aws-ovn-techpreview

INFO[2025-04-16T10:55:54Z] Step e2e-aws-ovn-techpreview-openshift-e2e-test succeeded after 1h1m19s. 
INFO[2025-04-16T10:55:54Z] Step phase test succeeded after 1h1m19s. 

Tests passed but then the job tripped on post steps

@petr-muller
Member

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change

INFO[2025-04-16T12:07:38Z] Step e2e-agnostic-ovn-upgrade-into-change-openshift-e2e-test succeeded after 1h9m17s. 
INFO[2025-04-16T12:07:38Z] Step phase test succeeded after 1h9m17s. 

Tests passed but then the job tripped on infra issues in post steps

Contributor

openshift-ci bot commented Apr 16, 2025

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-aws-ovn-techpreview

In response to this:

/override ci/prow/e2e-aws-ovn-techpreview

INFO[2025-04-16T10:55:54Z] Step e2e-aws-ovn-techpreview-openshift-e2e-test succeeded after 1h1m19s. 
INFO[2025-04-16T10:55:54Z] Step phase test succeeded after 1h1m19s. 

Tests passed but then the job tripped on post steps

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Apr 16, 2025

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn-upgrade-into-change

In response to this:

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change

INFO[2025-04-16T12:07:38Z] Step e2e-agnostic-ovn-upgrade-into-change-openshift-e2e-test succeeded after 1h9m17s. 
INFO[2025-04-16T12:07:38Z] Step phase test succeeded after 1h9m17s. 

Tests passed but then the job tripped on infra issues in post steps

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@petr-muller
Member

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-aws-ovn-techpreview

Contributor

openshift-ci bot commented Apr 17, 2025

@petr-muller: petr-muller unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight openshift-staff-engineers.

In response to this:

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-aws-ovn-techpreview

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@petr-muller
Member

petr-muller commented Apr 17, 2025

Weird...

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-aws-ovn-techpreview

Contributor

openshift-ci bot commented Apr 17, 2025

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn-upgrade-into-change, ci/prow/e2e-aws-ovn-techpreview

In response to this:

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-aws-ovn-techpreview

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 1c90cbb and 2 for PR HEAD 8892f42 in total

2 similar comments
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 1c90cbb and 2 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 1c90cbb and 2 for PR HEAD 8892f42 in total

Contributor

openshift-ci bot commented Apr 18, 2025

@hongkailiu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-agnostic-operator-devpreview | Commit: 8892f42 | Required: false | Rerun command: /test e2e-agnostic-operator-devpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@petr-muller
Member

/override ci/prow/e2e-aws-ovn-techpreview

Contributor

openshift-ci bot commented Apr 19, 2025

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-aws-ovn-techpreview

In response to this:

/override ci/prow/e2e-aws-ovn-techpreview

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit b01f931 into openshift:main Apr 19, 2025
14 of 15 checks passed
@openshift-ci-robot
Contributor

@hongkailiu: Jira Issue OCPBUGS-23514: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-23514 has been moved to the MODIFIED state.

In response to this:

When it takes too long (90m+ for machine-config and 30m+ for
others) to upgrade a cluster operator, ClusterVersion shows
a message indicating that the upgrade might have hit some issue.

This will cover the case in the related OCPBUGS-23538: for some
reason, the pod under the deployment that manages the CO hits
CrashLoopBackOff. The deployment controller does not surface useful
conditions in this situation [1]; otherwise, checkDeploymentHealth [2]
would detect it.

Instead of the CVO figuring out the underlying pod's
CrashLoopBackOff, which might be better implemented by the
deployment controller, the expectation is that the cluster admin
starts digging into the cluster when such a message pops up.

In addition to the condition's message, we propagate Failing=Unknown
to make the signal available to other automation, such as the
update-status command.

[1]. kubernetes/kubernetes#106054

[2]. (the checkDeploymentHealth snippet quoted in the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.19.0-0.nightly-2025-04-04-170728

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator
This PR has been included in build cluster-version-operator-container-v4.19.0-202504191210.p0.gb01f931.assembly.stream.el9.
All builds following this will include this PR.
