Unexpected Transient Degraded Status Change During Application Rolling Deployments in Argo CD 2.13.1 #21198

Open · atilsensalduz opened this issue on Dec 16, 2024 · 4 comments

Labels: bug (Something isn't working), component:health-check, more-information-needed (Further information is requested), regression (Bug is a regression, should be handled with high priority), version:2.13 (Latest confirmed affected version is 2.13)

@atilsensalduz (Contributor) commented:

After upgrading Argo CD to version 2.13.1, I've started observing unusual status behavior in my applications during rolling deployments or restarts. Specifically:

- The application status briefly changes to Degraded and then immediately back to Healthy, even though I can't identify any actual health issue in the deployment's components.
- I've reviewed the application events and found no readiness or liveness probe failures, nor any other indicator of degraded health in the resources.
- This behavior triggers Degraded-state notifications, and since there's no option to send notifications for the transition from Degraded back to Healthy, I end up receiving alerts for transient states that resolve themselves within seconds (see the events below and the trigger sketch after them).

```
Normal  ResourceUpdated  69s    argocd-application-controller  Updated health status: Healthy -> Progressing
Normal  ResourceUpdated  42s    argocd-application-controller  Updated health status: Progressing -> Healthy
Normal  ResourceUpdated  37s    argocd-application-controller  Updated health status: Healthy -> Degraded
Normal  ResourceUpdated  37s    argocd-application-controller  Updated health status: Degraded -> Healthy
```
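
For context, our degraded alert comes from the standard health-degraded style of trigger in argocd-notifications-cm. A minimal sketch (the `app-degraded` template name is illustrative, not our exact config):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # Fires on any transition into Degraded, including a flap that
  # reverts to Healthy a second later, as in the events above.
  trigger.on-health-degraded: |
    - description: Application has degraded
      send:
      - app-degraded
      when: app.status.health.status == 'Degraded'
```

Since the trigger only evaluates the current health status, every transient Degraded blip produces a notification.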

This issue was not observed before the upgrade and only seems to happen during rolling deployments or restarts.
Has anyone encountered similar behavior, or could this be a regression or a configuration issue with the health checks in 2.13.1? Any advice on fixing or mitigating it would be greatly appreciated!
Thanks in advance for your help!

atilsensalduz added the bug label on Dec 16, 2024
@neiljain commented on Dec 18, 2024:

We've observed the same issue with 2.13.2 as well:

{"application":"app***","dest-namespace":"default","dest-server":"https://kubernetes.default.svc","level":"info","msg":"Updated health status: Progressing -\u003e Healthy","reason":"ResourceUpdated","time":"2024-12-18T00:49:20Z","type":"Normal"}
{"application":"app***","dest-namespace":"default","dest-server":"https://kubernetes.default.svc","level":"info","msg":"Updated health status: Healthy -\u003e Degraded","reason":"ResourceUpdated","time":"2024-12-18T00:50:20Z","type":"Normal"}
{"application":"app***","dest-namespace":"default","dest-server":"https://kubernetes.default.svc","level":"info","msg":"Updated health status: Degraded -\u003e Healthy","reason":"ResourceUpdated","time":"2024-12-18T00:50:21Z","type":"Normal"}

todaywasawesome added the regression label on Dec 19, 2024
crenshaw-dev added the version:2.13 label on Dec 19, 2024
@todaywasawesome (Contributor) commented:

Thank you @andrii-korotkov-verkada for volunteering to investigate.

@atilsensalduz @neiljain Can you share what your apps look like and what health checks your resources are using?

@atilsensalduz (Contributor, Author) commented:

Hey guys, our applications follow a structure similar to the one below. We've encountered the same issue across different applications; it isn't specific to any one of them.

Argo CD application:

```yaml
kind: Application
metadata:
  labels:
    app.kubernetes.io/part-of: api
    argocd.argoproj.io/instance: aws-api
  name: dev
  namespace: argocd
spec:
  destination:
    name: aws-api
    namespace: dev
  project: api
  source:
    helm:
      releaseName: dev
      valueFiles:
      - values.yaml
    path: charts/dev
    repoURL: [email protected]
    targetRevision: main
  syncPolicy:
    automated:
      allowEmpty: true
      prune: true
      selfHeal: true
status:
  controllerNamespace: argocd
  health:
    status: Healthy
  history:
  operationState:
    message: successfully synced (no more tasks)
    operation:
      initiatedBy:
        automated: true
      retry:
        limit: 5
      sync:
        prune: true
    phase: Succeeded
    syncResult:
      resources:
      - group: policy
        hookPhase: Succeeded
        kind: PodDisruptionBudget
        message: PodDisruptionBudget has SufficientPods
        name: dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: ""
        hookPhase: Succeeded
        kind: ServiceAccount
        message: serviceaccount/dev-deployment-restart-sa unchanged
        name: dev-deployment-restart-sa
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: ""
        hookPhase: Succeeded
        kind: ConfigMap
        message: configmap/dora-metrics-script-dev unchanged
        name: dora-metrics-script-dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: ""
        hookPhase: Succeeded
        kind: ConfigMap
        message: configmap/dev configured
        name: dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: ""
        hookPhase: Succeeded
        kind: ConfigMap
        message: configmap/grafana-annotation-script-dev unchanged
        name: grafana-annotation-script-dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: rbac.authorization.k8s.io
        hookPhase: Succeeded
        kind: ClusterRoleBinding
        message: clusterrolebinding.rbac.authorization.k8s.io/dev-discovery
          reconciled. clusterrolebinding.rbac.authorization.k8s.io/dev-discovery
          unchanged
        name: dev-discovery
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: rbac.authorization.k8s.io
        hookPhase: Succeeded
        kind: Role
        message: role.rbac.authorization.k8s.io/dev-deployment-restart-role
          reconciled. role.rbac.authorization.k8s.io/dev-deployment-restart-role
          unchanged
        name: dev-deployment-restart-role
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: rbac.authorization.k8s.io
        hookPhase: Succeeded
        kind: RoleBinding
        message: rolebinding.rbac.authorization.k8s.io/dev-deployment-restart-rolebinding
          reconciled. rolebinding.rbac.authorization.k8s.io/dev-deployment-restart-rolebinding
          unchanged
        name: dev-deployment-restart-rolebinding
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: ""
        hookPhase: Succeeded
        kind: Service
        message: service/dev unchanged
        name: dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: apps
        hookPhase: Succeeded
        kind: Deployment
        message: deployment.apps/dev configured
        name: dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: autoscaling
        hookPhase: Succeeded
        kind: HorizontalPodAutoscaler
        message: recommended size matches current size
        name: dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v2
      - group: batch
        hookPhase: Succeeded
        kind: CronJob
        message: cronjob.batch/rolling-restart-job unchanged
        name: rolling-restart-job
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: external-secrets.io
        hookPhase: Succeeded
        kind: ExternalSecret
        message: Secret was synced
        name: dev-external-secret
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1beta1
      - group: batch
        hookPhase: Succeeded
        hookType: PostSync
        kind: Job
        message: job.batch/dora-dev-c589e19-postsync-1733733244 created
        name: dora-dev-c589e19-postsync-1733733244
        namespace: dev
        syncPhase: PostSync
        version: v1
      - group: batch
        hookPhase: Succeeded
        hookType: PostSync
        kind: Job
        message: job.batch/dev-c589e19-postsync-1733733244 created
        name: dev-c589e19-postsync-1733733244
        namespace: dev
        syncPhase: PostSync
        version: v1
      source:
        helm:
          valueFiles:
          - values.yaml
        path: charts/dev
        repoURL: [email protected]:argocd.git
        targetRevision: main
  resources:
  - kind: ConfigMap
    name: dev
    namespace: dev
    status: Synced
    version: v1
  - kind: ConfigMap
    name: dora-metrics-script-dev
    namespace: dev
    status: Synced
    version: v1
  - kind: ConfigMap
    name: grafana-annotation-script-dev
    namespace: dev
    status: Synced
    version: v1
  - health:
      status: Healthy
    kind: Service
    name: dev
    namespace: dev
    status: Synced
    version: v1
  - kind: ServiceAccount
    name: dev-deployment-restart-sa
    namespace: dev
    status: Synced
    version: v1
  - group: apps
    health:
      status: Healthy
    kind: Deployment
    name: dev
    namespace: dev
    status: Synced
    version: v1
  - group: autoscaling
    health:
      message: recommended size matches current size
      status: Healthy
    kind: HorizontalPodAutoscaler
    name: dev
    namespace: dev
    status: Synced
    version: v2
  - group: batch
    kind: CronJob
    name: rolling-restart-job
    namespace: dev
    status: Synced
    version: v1
  - group: external-secrets.io
    health:
      message: Secret was synced
      status: Healthy
    kind: ExternalSecret
    name: dev-external-secret
    namespace: dev
    status: Synced
    version: v1beta1
  - group: policy
    health:
      message: PodDisruptionBudget has SufficientPods
      status: Healthy
    kind: PodDisruptionBudget
    name: dev
    namespace: dev
    status: Synced
    version: v1
  - group: rbac.authorization.k8s.io
    kind: ClusterRoleBinding
    name: dev-discovery
    status: Synced
    version: v1
  - group: rbac.authorization.k8s.io
    kind: Role
    name: dev-deployment-restart-role
    namespace: dev
    status: Synced
    version: v1
  - group: rbac.authorization.k8s.io
    kind: RoleBinding
    name: dev-deployment-restart-rolebinding
    namespace: dev
    status: Synced
    version: v1
  sourceType: Helm
  summary:
    images:
    - bitnami/kubectl:latest
    - dev:5.13.0
  sync:
    comparedTo:
      destination:
        namespace: dev
      source:
        helm:
          parameters:
          valueFiles:
          - values.yaml
        path: charts/dev
        repoURL: [email protected]:argocd.git
        targetRevision: main
    status: Synced
```


Probes:

```yaml
startupProbe:
  httpGet: 
    path: '/{{ include ".getMainSubPath" $ }}/health/liveness'
    port: http
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 2
  successThreshold: 1
  failureThreshold: 12
livenessProbe:
  httpGet:
    path: '/{{ include ".getMainSubPath" $ }}/health/liveness'
    port: http
  initialDelaySeconds: 30
  timeoutSeconds: 2
  successThreshold: 1
  periodSeconds: 30
  failureThreshold: 10
readinessProbe:
  httpGet:
    path: '/{{ include ".getMainSubPath" $ }}/health/readiness'
    port: http
  initialDelaySeconds: 30
  timeoutSeconds: 2
  successThreshold: 1
  periodSeconds: 30
  failureThreshold: 10
```
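
As a possible stopgap while this is investigated, we're considering overriding the built-in Deployment health assessment with a custom Lua check in argocd-cm, so a mid-rollout blip can't surface as Degraded. A rough sketch only, not something we've deployed (and note it would also mask genuine Degraded states):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Replaces the built-in apps/Deployment health check. Reports
  # Healthy only when all replicas are updated and available, and
  # Progressing otherwise, so it never returns Degraded mid-rollout.
  resource.customizations.health.apps_Deployment: |
    hs = {}
    if obj.status ~= nil and obj.spec.replicas ~= nil then
      if obj.status.updatedReplicas == obj.spec.replicas and obj.status.availableReplicas == obj.spec.replicas then
        hs.status = "Healthy"
        hs.message = "All replicas are updated and available"
        return hs
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for rollout to complete"
    return hs
```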

@andrii-korotkov-verkada (Contributor) commented:

Can you enable debug logs for the application controller and share all the logs relevant to the application, please?
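
For example (assuming the standard install layout), setting the controller log level in argocd-cmd-params-cm and restarting the application controller should enable them:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # The application controller reads this on startup, so restart it
  # (e.g. kubectl -n argocd rollout restart statefulset argocd-application-controller)
  # after applying the change.
  controller.log.level: "debug"
```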

andrii-korotkov-verkada added the more-information-needed label on Dec 20, 2024