OCPBUGS-23514: Failing=Unknown upon long CO updating #1165

Merged
merged 3 commits into openshift:main from the OCPBUGS-23514 branch on Apr 19, 2025

Conversation

hongkailiu
Member

@hongkailiu hongkailiu commented Feb 28, 2025

When it takes too long (90m+ for machine-config and 30m+ for
others) to upgrade a cluster operator, ClusterVersion shows
a message indicating that the upgrade might have hit some issue.

This will cover the case in the related OCPBUGS-23538: for some
reason, the pod under the deployment that manages the CO hits
CrashLoopBackOff. The deployment controller does not surface useful
conditions in this situation [1]; otherwise, checkDeploymentHealth [2]
would detect it.

Instead of the CVO figuring out the underlying pod's
CrashLoopBackOff, which might be better implemented by the
deployment controller, the expectation is that the cluster admin
starts digging into the cluster when such a message pops up.

In addition to the condition's message, we propagate Failing=Unknown
to make the signal available to other automation, such as the
update-status command.

[1]. kubernetes/kubernetes#106054

[2].

func (b *builder) checkDeploymentHealth(ctx context.Context, deployment *appsv1.Deployment) error {
	if b.mode == InitializingMode {
		return nil
	}
	iden := fmt.Sprintf("%s/%s", deployment.Namespace, deployment.Name)
	if deployment.DeletionTimestamp != nil {
		return fmt.Errorf("deployment %s is being deleted", iden)
	}
	var availableCondition *appsv1.DeploymentCondition
	var progressingCondition *appsv1.DeploymentCondition
	var replicaFailureCondition *appsv1.DeploymentCondition
	for idx, dc := range deployment.Status.Conditions {
		switch dc.Type {
		case appsv1.DeploymentProgressing:
			progressingCondition = &deployment.Status.Conditions[idx]
		case appsv1.DeploymentAvailable:
			availableCondition = &deployment.Status.Conditions[idx]
		case appsv1.DeploymentReplicaFailure:
			replicaFailureCondition = &deployment.Status.Conditions[idx]
		}
	}
	if replicaFailureCondition != nil && replicaFailureCondition.Status == corev1.ConditionTrue {
		return &payload.UpdateError{
			Nested:  fmt.Errorf("deployment %s has some pods failing; unavailable replicas=%d", iden, deployment.Status.UnavailableReplicas),
			Reason:  "WorkloadNotProgressing",
			Message: fmt.Sprintf("deployment %s has a replica failure %s: %s", iden, replicaFailureCondition.Reason, replicaFailureCondition.Message),
			Name:    iden,
		}
	}
	if availableCondition != nil && availableCondition.Status == corev1.ConditionFalse && progressingCondition != nil && progressingCondition.Status == corev1.ConditionFalse {
		return &payload.UpdateError{
			Nested:  fmt.Errorf("deployment %s is not available and not progressing; updated replicas=%d of %d, available replicas=%d of %d", iden, deployment.Status.UpdatedReplicas, deployment.Status.Replicas, deployment.Status.AvailableReplicas, deployment.Status.Replicas),
			Reason:  "WorkloadNotAvailable",
			Message: fmt.Sprintf("deployment %s is not available %s (%s) or progressing %s (%s)", iden, availableCondition.Reason, availableCondition.Message, progressingCondition.Reason, progressingCondition.Message),
			Name:    iden,
		}
	}
	if progressingCondition != nil && progressingCondition.Status == corev1.ConditionFalse && progressingCondition.Reason == "ProgressDeadlineExceeded" {
		return &payload.UpdateError{
			Nested:  fmt.Errorf("deployment %s is %s=%s: %s: %s", iden, progressingCondition.Type, progressingCondition.Status, progressingCondition.Reason, progressingCondition.Message),
			Reason:  "WorkloadNotProgressing",
			Message: fmt.Sprintf("deployment %s is %s=%s: %s: %s", iden, progressingCondition.Type, progressingCondition.Status, progressingCondition.Reason, progressingCondition.Message),
			Name:    iden,
		}
	}
	if availableCondition == nil && progressingCondition == nil && replicaFailureCondition == nil {
		klog.Warningf("deployment %s is not setting any expected conditions, and is therefore in an unknown state", iden)
	}
	return nil
}
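
A minimal, self-contained sketch of the thresholding idea described above (not this PR's actual implementation; names such as slowThreshold, coUpdateStart, and waitingMessage are hypothetical) might look like this, including the guard that skips operators whose update start time has not been recorded yet:

package main

import (
	"fmt"
	"strings"
	"time"
)

// slowThreshold returns how long to wait on a cluster operator before
// hinting that the update is slower than expected: 90 minutes for
// machine-config, 30 minutes for the others.
func slowThreshold(name string) time.Duration {
	if name == "machine-config" {
		return 90 * time.Minute
	}
	return 30 * time.Minute
}

// waitingMessage builds the "waiting on ..." fragment of the Progressing
// message. coUpdateStart maps an operator name to the time the CVO started
// waiting on it; a zero time means no start has been recorded yet.
func waitingMessage(waitingOn []string, coUpdateStart map[string]time.Time, now time.Time) string {
	msg := fmt.Sprintf("waiting on %s", strings.Join(waitingOn, ", "))
	for _, name := range waitingOn {
		start := coUpdateStart[name]
		if start.IsZero() {
			continue // non-zero guard: no recorded start time, no hint
		}
		if threshold := slowThreshold(name); now.Sub(start) > threshold {
			msg += fmt.Sprintf(" over %d minutes which is longer than expected", int(threshold.Minutes()))
			break // one hint is enough for this sketch
		}
	}
	return msg
}

func main() {
	start := map[string]time.Time{"image-registry": time.Now().Add(-45 * time.Minute)}
	// Prints: waiting on image-registry over 30 minutes which is longer than expected
	fmt.Println(waitingMessage([]string{"image-registry"}, start, time.Now()))
}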

@openshift-ci-robot openshift-ci-robot added the jira/severity-moderate (Referenced Jira bug's severity is moderate for the branch this PR is targeting), jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type), and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) labels on Feb 28, 2025
@openshift-ci-robot
Contributor

@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

When it takes too long (90m+ for machine-config and 30m+ for others) to upgrade a cluster operator, ClusterVersion shows a message indicating that the upgrade might have hit some issue.

This will cover the case in the related OCPBUGS-23538: for some reason, the pod under the deployment that manages the CO hits CrashLoopBackOff. The deployment controller does not surface useful conditions in this situation [1]; otherwise, checkDeploymentHealth [2] would detect it.

Instead of the CVO figuring out the underlying pod's CrashLoopBackOff, which might be better implemented by the deployment controller, the expectation is that the cluster admin starts digging into the cluster when such a message pops up.

For now, we just modify the condition's message. We could propagate Failing=True if such requirements are collected from customers.

[1]. kubernetes/kubernetes#106054

[2]. (the checkDeploymentHealth snippet quoted in the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from petr-muller and wking February 28, 2025 01:56
@hongkailiu hongkailiu changed the title OCPBUGS-23514: Better a message upon long CO updating [wip]OCPBUGS-23514: Better a message upon long CO updating Feb 28, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Feb 28, 2025
@hongkailiu
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Feb 28, 2025
@openshift-ci-robot
Contributor

@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from jiajliu February 28, 2025 01:58
@jiajliu

jiajliu commented Feb 28, 2025

cc @dis016

@hongkailiu
Member Author

Testing with

launch 4.19.0-ec.2 aws

and build1:

build 4.19,openshift/cluster-image-registry-operator#1184

and build2:

build 4.19,openshift/cluster-image-registry-operator#1184,#1165

### upgrade to build1
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-nmgvdzt/release:latest --force --allow-explicit-upgrade

$ oc get clusterversion version                                               
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          41m     Working towards 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest:
 711 of 903 done (78% complete), waiting on image-registry

### upgrade to build2
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-zx89yxb/release:latest --force --allow-explicit-upgrade --allow-upgrade-with-warnings

### Issue1: After a couple of mins, we see "longer than expected" on etcd and kube-apiserver.
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          47m     Working towards 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest:
 111 of 903 done (12% complete), waiting on etcd, kube-apiserver over 30 minutes which is longer than expected

### Be patient: Expected result showed up.
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          99m     Working towards 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

### Issue2: status-command showed nothing about image-registry.
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=operators
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest (from incomplete 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest)
Completion:      85% (29 operators updated, 0 updating, 5 waiting)
Duration:        56m (Est. Time Remaining: 1h22m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-11-19.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-26-148.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-95-12.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-24-13.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-54-250.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-80-23.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?

= Update Health =
SINCE    LEVEL     IMPACT   MESSAGE
55m56s   Warning   None     Previous update to 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest never completed, last complete update was 4.19.0-ec.2

Run with --details=health for additional description and links to related online documentation

$ oc get co kube-apiserver image-registry
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest   True        False         False      130m
image-registry   4.19.0-ec.2                                            True        False         False      124m

@hongkailiu hongkailiu force-pushed the OCPBUGS-23514 branch 5 times, most recently from 54e9f08 to 34a5a59 on March 1, 2025 05:12
@hongkailiu
Member Author

Triggered build3:

build 4.19,openshift/cluster-image-registry-operator#1184,openshift/cluster-version-operator#1165

Repeated the test, updating to build1 and then build3:

$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-r67x3s2/release:latest --force --allow-explicit-upgrade --allow-upgrade-with-warnings

### the non-zero guard on the CO update start times seems to be working, as issue1 is gone
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          34m     Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 111 of 903 done (12% complete), waiting on etcd, kube-apiserver

$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          75m     Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest:
711 of 903 done (78% complete), waiting on image-registry

### be patient and there it goes
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          90m     Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

Upgradeable=False

  Reason: UpdateInProgress
  Message: An update is already in progress and the details are in the Progressing condition

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.19
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest not found in the "candidate-4.19" channel

$ oc get co kube-apiserver image-registry
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest   True        False         False      121m
image-registry   4.19.0-ec.2                                            True        False         False      113m

### issue2 is still there as expected
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=all
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest (from incomplete 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest)
Completion:      85% (29 operators updated, 0 updating, 5 waiting)
Duration:        57m (Est. Time Remaining: 18m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-11-67.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-34-60.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-71-225.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-24-151.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-57-13.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-97-93.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?

= Update Health =
Message: Previous update to 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest never completed, last complete update was 4.19.0-ec.2
  Since:       56m59s
  Level:       Warning
  Impact:      None
  Reference:   https://docs.openshift.com/container-platform/latest/updating/troubleshooting_updates/gathering-data-cluster-update.html#gathering-clusterversion-history-cli_troubleshooting_updates
  Resources:
    clusterversions.config.openshift.io: version
  Description: Current update to 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest was initiated while the previous update to version 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest was still in progress

I think issue 2 above is caused by --details=operators working only when a CO's Progressing=True. But for CO/image-registry, that was not the case at the time. We may want the status cmd to pick it up again from cv/version.
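
A toy illustration of that suspicion (purely hypothetical; the operator type and updatingOperators are made-up stand-ins, not code from oc): if the detail view lists only cluster operators whose own Progressing condition is True, then an operator that is merely outdated, as image-registry was above, never shows up:

package main

import "fmt"

// operator is a hypothetical, simplified stand-in for the per-operator
// summary a status-style command might build from ClusterOperator conditions.
type operator struct {
	name        string
	progressing bool
	version     string
}

// updatingOperators mimics the suspected --details=operators filter:
// only operators reporting Progressing=True are listed as updating.
func updatingOperators(ops []operator) []string {
	var out []string
	for _, op := range ops {
		if op.progressing {
			out = append(out, op.name)
		}
	}
	return out
}

func main() {
	ops := []operator{
		{name: "kube-apiserver", progressing: false, version: "4.19.0-0.test"},
		// image-registry is outdated but reports Progressing=False, so it is skipped.
		{name: "image-registry", progressing: false, version: "4.19.0-ec.2"},
	}
	fmt.Println(updatingOperators(ops)) // prints []; nothing about image-registry
}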

@hongkailiu hongkailiu changed the title [wip]OCPBUGS-23514: Better a message upon long CO updating OCPBUGS-23514: Better a message upon long CO updating Mar 1, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Mar 1, 2025
@petr-muller
Member

petr-muller commented Mar 6, 2025

Instead of the CVO figuring out the underlying pod's CrashLoopBackOff, which might be better implemented by the deployment controller, the expectation is that the cluster admin starts digging into the cluster when such a message pops up.

Yeah, I agree - the crash is actually the easy case, because then there is at least some symptom that can be noticed and it is somewhat clear that if CVO says it is waiting for an operator-x, and you discover it is not running, then something is wrong. I think our fix needs to handle the worst case of this failure case - CVO waits for operator-x to bump a version and it never does that, but there is no bad symptom otherwise - operator-x is running, not degraded, available. Just no progress (like the reproducer I made).

For now, we just modify the condition's message. We could propagate Failing=True if such requirements are collected from customers.

Surfacing the condition in the message makes the situation slightly better, but my concern is that we cannot easily consume this data for Status API / command. I want oc adm upgrade status to be at least a bit loud when there is a stuck update. In the current state, status cannot detect the stuck update, because all machine-readable parts of the API are identical to the happy update (Progressing=True, Failing=False).

I'd like to come up with at least something. Personally I would do Failing=True and would be happy about it :D Some proposals:

  • Failing=Unknown with a Reason=SlowClusterOperator or similar. Technically "unknown" describes the state well, but I am not sure whether consumers can handle the Unknown value well (would need to be tested with at least the web console). A small consumer-side sketch follows this list.
  • Progressing=Unknown with a Reason=SlowClusterOperator or similar. You could argue that if we wait for something to happen, and it does not happen for a long time, we do not know whether we are really progressing anymore. The downside is that Progressing now has a high-level meaning of "cluster started updating and has not finished yet", and it is likely that some consumers do not expect those semantics to change (they expect Progressing=True to stay set consistently until the update is completed).
  • Progressing=False with a Reason=SlowClusterOperator or similar. A stronger variant of the above: if we do not see progress, then we are not progressing. Same downside, but stronger (Progressing=True means an ongoing update, Progressing=False means no ongoing update).
  • Progressing=True with a Reason=SlowClusterOperator or similar. A minimal variant that does not change important signals but at least allows us to differentiate Reason=HappyProgress from Reason=SlowClusterOperator. Technically this does not meet the condition contract, because the reason should be tied to the last transition, not to the current fine state (it should say why Progressing went from False to True, not what the current fine-grained reason for being True is).
  • UpdateProcedureProgressing=False with a Reason=SlowClusterOperator. Basically a new condition that has semantics similar to Failing, but much finer. If it is True, it means that everything is 100% healthy from the CVO standpoint. If it is False, we are waiting for something to happen, and the Reason and Message would say what.
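
To make the consumer-side concern concrete, here is a small sketch of how a machine consumer could branch on the first option; classify is a hypothetical helper, not the status command's real logic:

package main

import "fmt"

// classify shows how a consumer could tell the proposed states apart from
// the Failing condition alone, without parsing the Progressing message.
func classify(failingStatus, failingReason string) string {
	switch {
	case failingStatus == "True":
		return "update is failing"
	case failingStatus == "Unknown" && failingReason == "SlowClusterOperator":
		return "update may be stuck: a cluster operator is slower than expected"
	default:
		return "update looks healthy so far"
	}
}

func main() {
	fmt.Println(classify("Unknown", "SlowClusterOperator"))
}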

We may want the status cmd to pick it up again from cv/version.

I do not want this for the reasons above.

@hongkailiu
Member Author

I am going to do the following (the other three options above may lead to questions/trouble from users; I might come back to them if neither of the following goes through :knock :knock :knock):

  • Failing=Unknown: I would go for this (Plan A) if console is happy with it OR the unhappiness of console can be easily fixed. (A condition-level sketch of Plan A follows this list.)
  • UpdateProcedureProgressing=False: Plan B, because learning/maintaining a new thing is always expensive for both us/maintainers and others/users.
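
A condition-level sketch of what Plan A amounts to, using the github.com/openshift/api/config/v1 types; setFailingUnknown and the message text are illustrative assumptions, not this PR's actual code:

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setFailingUnknown upserts a Failing=Unknown condition with
// Reason=SlowClusterOperator so that machine consumers such as
// `oc adm upgrade status` can tell a slow update apart from a happy one
// without parsing the Progressing message.
func setFailingUnknown(cv *configv1.ClusterVersion, message string, now metav1.Time) {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.ClusterStatusConditionType("Failing"),
		Status:             configv1.ConditionUnknown,
		Reason:             "SlowClusterOperator",
		Message:            message,
		LastTransitionTime: now,
	}
	for i := range cv.Status.Conditions {
		if cv.Status.Conditions[i].Type == cond.Type {
			cv.Status.Conditions[i] = cond
			return
		}
	}
	cv.Status.Conditions = append(cv.Status.Conditions, cond)
}

func main() {
	cv := &configv1.ClusterVersion{}
	setFailingUnknown(cv, "waiting on machine-config over 90 minutes which is longer than expected", metav1.Now())
	fmt.Printf("%+v\n", cv.Status.Conditions[0])
}

The point is that the Reason is machine-readable, so consumers do not have to parse free-form text to notice a possibly stuck update.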

@petr-muller
Member

SGTM

@openshift-ci-robot
Contributor

@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @dis016

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

When it takes too long (90m+ for machine-config and 30m+ for
others) to upgrade a cluster operator, ClusterVersion shows
a message indicating that the upgrade might have hit some issue.

This will cover the case in the related OCPBUGS-23538: for some
reason, the pod under the deployment that manages the CO hits
CrashLoopBackOff. The deployment controller does not surface useful
conditions in this situation [1]; otherwise, checkDeploymentHealth [2]
would detect it.

Instead of the CVO figuring out the underlying pod's
CrashLoopBackOff, which might be better implemented by the
deployment controller, the expectation is that the cluster admin
starts digging into the cluster when such a message pops up.

In addition to the condition's message, we propagate Failing=Unknown
to make the signal available to other automation, such as the
update-status command.

[1]. kubernetes/kubernetes#106054

[2]. (the checkDeploymentHealth snippet quoted in the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu hongkailiu changed the title OCPBUGS-23514: Better a message upon long CO updating OCPBUGS-23514: Failing=Unknown upon long CO updating Mar 7, 2025
@dis016

dis016 commented Apr 11, 2025

Test Scenario: Failing=Unknown when a slow update happens for the MCO operator for >90 minutes.
Step1: Install a cluster with 4.19.0-ec.2 and check the status of CVO.

dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion 
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        False         128m    Cluster version is 4.19.0-ec.2
dinesh@Dineshs-MacBook-Pro ~ % 
dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion version -o yaml | yq '.status.conditions[]|select(.type=="Failing" or .type=="Progressing")'
lastTransitionTime: "2025-04-11T09:54:04Z"
status: "False"
type: Failing
lastTransitionTime: "2025-04-11T10:34:48Z"
message: Cluster version is 4.19.0-ec.2
status: "False"
type: Progressing
dinesh@Dineshs-MacBook-Pro ~ % 

Step2: Create a release build with the MCO magic PR (openshift/machine-config-operator#4980)

build 4.19,openshift/machine-config-operator#4980,openshift/cluster-version-operator#1165

Step3: Upgrade to the newly created build

dinesh@Dineshs-MacBook-Pro ~ % oc adm upgrade --to-image=registry.build06.ci.openshift.org/ci-ln-x6iy65b/release@sha256:30a5bd2e60490e2fa6afc92ec0fa68b77a789a104de64da809e6b65094370c28  --allow-explicit-upgrade --force --allow-upgrade-with-warnings

warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-x6iy65b/release@sha256:30a5bd2e60490e2fa6afc92ec0fa68b77a789a104de64da809e6b65094370c28
dinesh@Dineshs-MacBook-Pro ~ % 

Step4: The upgrade should get triggered and proceed normally. After some time (approximately >90 minutes for machine-config), CVO should report Failing=Unknown with reason=SlowClusterOperator and Progressing=True.


dinesh@Dineshs-MacBook-Pro ~ % while true; do oc get clusterversion ; oc adm upgrade status; oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing" or .type=="Progressing")' ; oc get co machine-config ;  sleep 300; done
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          55s     Working towards 4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest: 111 of 908 done (12% complete), waiting on etcd, kube-apiserver
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest (from 4.19.0-ec.2)
Updating:        etcd
Completion:      3% (1 operators updated, 1 updating, 32 waiting)
Duration:        56s (Est. Time Remaining: 48m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                                                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
dis016-pwvwr-master-0.us-central1-a.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-master-1.us-central1-b.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-master-2.us-central1-c.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
dis016-pwvwr-worker-a-v7q7r   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-worker-b-5zcrc   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-worker-c-p52wb   Outdated     Pending   4.19.0-ec.2   ?     

= Update Health =
SINCE   LEVEL   IMPACT   MESSAGE
56s     Info    None     Update is proceeding well
{
  "lastTransitionTime": "2025-04-11T12:43:59Z",
  "status": "False",
  "type": "Failing"
}
{
  "lastTransitionTime": "2025-04-11T12:43:51Z",
  "message": "Working towards 4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest: 111 of 908 done (12% complete), waiting on etcd, kube-apiserver",
  "reason": "ClusterOperatorsUpdating",
  "status": "True",
  "type": "Progressing"
}
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.19.0-ec.2   True        False         False      9h  
...
...
...
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          148m    Working towards 4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest: 779 of 908 done (85% complete), waiting on machine-config over 90 minutes which is longer than expected
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Stalled
Target Version:  4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest (from 4.19.0-ec.2)
Completion:      97% (33 operators updated, 0 updating, 1 waiting)
Duration:        2h29m (Est. Time Remaining: N/A; estimate duration was 1h37m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                                                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
dis016-pwvwr-master-0.us-central1-a.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-master-1.us-central1-b.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-master-2.us-central1-c.c.openshift-qe.internal   Outdated     Pending   4.19.0-ec.2   ?     

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
dis016-pwvwr-worker-a-v7q7r   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-worker-b-5zcrc   Outdated     Pending   4.19.0-ec.2   ?     
dis016-pwvwr-worker-c-p52wb   Outdated     Pending   4.19.0-ec.2   ?     

= Update Health =
SINCE    LEVEL     IMPACT           MESSAGE
10m25s   Warning   Update Stalled   Cluster Version version is failing to proceed with the update (SlowClusterOperator)

Run with --details=health for additional description and links to related online documentation
{
  "lastTransitionTime": "2025-04-11T15:02:04Z",
  "message": "waiting on machine-config over 90 minutes which is longer than expected",
  "reason": "SlowClusterOperator",
  "status": "Unknown",
  "type": "Failing"
}
{
  "lastTransitionTime": "2025-04-11T12:43:51Z",
  "message": "Working towards 4.19.0-0.2025-04-11-110331-test-ci-ln-x6iy65b-latest: 779 of 908 done (85% complete), waiting on machine-config over 90 minutes which is longer than expected",
  "reason": "ClusterOperatorUpdating",
  "status": "True",
  "type": "Progressing"
}
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.19.0-ec.2   True        False         False      12h 

The upgrade should never complete and should keep progressing.
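
A hedged sketch of how this check could be automated, assuming the generated openshift/client-go config clientset and a kubeconfig at the default location; the helper name, the 3-hour deadline, and the 5-minute interval are illustrative choices that mirror the manual loop above:

package main

import (
	"context"
	"fmt"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForSlowClusterOperator polls the ClusterVersion until Failing=Unknown
// with Reason=SlowClusterOperator shows up, or the context expires.
func waitForSlowClusterOperator(ctx context.Context, client configclient.Interface, interval time.Duration) error {
	for {
		cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
		if err != nil {
			return err
		}
		for _, c := range cv.Status.Conditions {
			if c.Type == configv1.ClusterStatusConditionType("Failing") &&
				c.Status == configv1.ConditionUnknown &&
				c.Reason == "SlowClusterOperator" {
				fmt.Printf("observed expected condition: %s\n", c.Message)
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Hour)
	defer cancel()
	if err := waitForSlowClusterOperator(ctx, client, 5*time.Minute); err != nil {
		panic(err)
	}
}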

@dis016

dis016 commented Apr 11, 2025

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved label (Signifies that QE has signed off on this PR) on Apr 11, 2025
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b247772 and 2 for PR HEAD 8892f42 in total

@hongkailiu
Member Author

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b247772 and 2 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 5524608 and 1 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 5524608 and 2 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 4210eef and 2 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ee8b5ef and 2 for PR HEAD 8892f42 in total

2 similar comments
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ee8b5ef and 2 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ee8b5ef and 2 for PR HEAD 8892f42 in total

@petr-muller
Member

/override ci/prow/e2e-aws-ovn-techpreview

INFO[2025-04-16T10:55:54Z] Step e2e-aws-ovn-techpreview-openshift-e2e-test succeeded after 1h1m19s. 
INFO[2025-04-16T10:55:54Z] Step phase test succeeded after 1h1m19s. 

Tests passed but then the job tripped on post steps

@petr-muller
Member

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change

INFO[2025-04-16T12:07:38Z] Step e2e-agnostic-ovn-upgrade-into-change-openshift-e2e-test succeeded after 1h9m17s. 
INFO[2025-04-16T12:07:38Z] Step phase test succeeded after 1h9m17s. 

Tests passed but then the job tripped on infra issues in post steps

Contributor

openshift-ci bot commented Apr 16, 2025

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-aws-ovn-techpreview

In response to this:

/override ci/prow/e2e-aws-ovn-techpreview

INFO[2025-04-16T10:55:54Z] Step e2e-aws-ovn-techpreview-openshift-e2e-test succeeded after 1h1m19s. 
INFO[2025-04-16T10:55:54Z] Step phase test succeeded after 1h1m19s. 

Tests passed but then the job tripped on post steps

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Apr 16, 2025

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn-upgrade-into-change

In response to this:

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change

INFO[2025-04-16T12:07:38Z] Step e2e-agnostic-ovn-upgrade-into-change-openshift-e2e-test succeeded after 1h9m17s. 
INFO[2025-04-16T12:07:38Z] Step phase test succeeded after 1h9m17s. 

Tests passed but then the job tripped on infra issues in post steps

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@petr-muller
Member

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-aws-ovn-techpreview

Contributor

openshift-ci bot commented Apr 17, 2025

@petr-muller: petr-muller unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight openshift-staff-engineers.

In response to this:

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-aws-ovn-techpreview

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@petr-muller
Member

petr-muller commented Apr 17, 2025

Weird...

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-aws-ovn-techpreview

Contributor

openshift-ci bot commented Apr 17, 2025

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn-upgrade-into-change, ci/prow/e2e-aws-ovn-techpreview

In response to this:

/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-aws-ovn-techpreview

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 1c90cbb and 2 for PR HEAD 8892f42 in total

2 similar comments
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 1c90cbb and 2 for PR HEAD 8892f42 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 1c90cbb and 2 for PR HEAD 8892f42 in total

Contributor

openshift-ci bot commented Apr 18, 2025

@hongkailiu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-agnostic-operator-devpreview | Commit: 8892f42 | Required: false | Rerun command: /test e2e-agnostic-operator-devpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@petr-muller
Member

/override ci/prow/e2e-aws-ovn-techpreview

Contributor

openshift-ci bot commented Apr 19, 2025

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-aws-ovn-techpreview

In response to this:

/override ci/prow/e2e-aws-ovn-techpreview

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit b01f931 into openshift:main Apr 19, 2025
14 of 15 checks passed
@openshift-ci-robot
Contributor

@hongkailiu: Jira Issue OCPBUGS-23514: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-23514 has been moved to the MODIFIED state.

In response to this:

When it takes too long (90m+ for machine-config and 30m+ for
others) to upgrade a cluster operator, ClusterVersion shows
a message indicating that the upgrade might have hit some issue.

This will cover the case in the related OCPBUGS-23538: for some
reason, the pod under the deployment that manages the CO hits
CrashLoopBackOff. The deployment controller does not surface useful
conditions in this situation [1]; otherwise, checkDeploymentHealth [2]
would detect it.

Instead of the CVO figuring out the underlying pod's
CrashLoopBackOff, which might be better implemented by the
deployment controller, the expectation is that the cluster admin
starts digging into the cluster when such a message pops up.

In addition to the condition's message, we propagate Failing=Unknown
to make the signal available to other automation, such as the
update-status command.

[1]. kubernetes/kubernetes#106054

[2]. (the checkDeploymentHealth snippet quoted in the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.19.0-0.nightly-2025-04-04-170728

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator
This PR has been included in build cluster-version-operator-container-v4.19.0-202504191210.p0.gb01f931.assembly.stream.el9.
All builds following this will include this PR.
