Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflow is sometimes interrupted #14022

Open
3 of 4 tasks
panicboat opened this issue Dec 20, 2024 · 2 comments
Open
3 of 4 tasks

Workflow is sometimes interrupted #14022

panicboat opened this issue Dec 20, 2024 · 2 comments

Comments

@panicboat
Copy link
Contributor

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

A CronWorkflow running every 10 minutes occasionally gave the error `Stopped with strategy 'Terminate'' and detected cronworkflows_concurrencypolicy_triggered. We checked the workflow executed immediately afterwards and found no elements triggering concurrencyPolicy.

Failed Workflow
monitor-argo-workflows-1734498600.txt

Workflow immediately after
monitor-argo-workflows-1734498000.txt

Looking at the failed workflow, it is interrupted 10 seconds after start (2024-12-18T05:00:10Z), but has not met the next schedule (2024-12-18T05:10:00Z) of CronWorkflow.

    monitor-argo-workflows-1734498000-4221058204:
      id: monitor-argo-workflows-1734498000-4221058204
      name: monitor-argo-workflows-1734498000(0)
      displayName: monitor-argo-workflows-1734498000(0)
      type: Pod
      templateName: main
      templateScope: local/monitor-argo-workflows-1734498000
      phase: Failed
      message: "workflow shutdown with strategy:  Terminate"
      startedAt: "2024-12-18T05:00:00Z"
      finishedAt: "2024-12-18T05:00:10Z"
      progress: 0/1
      nodeFlag:
        retried: true

Version(s)

v3.6.2

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: monitor-argo-workflows
spec:
  suspend: false
  schedule: "*/10 * * * *"
  timezone: "Asia/Tokyo"
  concurrencyPolicy: Replace
  startingDeadlineSeconds: 60
  workflowSpec:
    entrypoint: main
    activeDeadlineSeconds: 480
    templates:
      - name: main
        retryStrategy:
          retryPolicy: Always
        container:
          name: main
          image: public.ecr.aws/docker/library/busybox:latest
          command: [echo]
          args: ["hello world"]
          resources:
            limits:
              cpu: 100m
              memory: 32Mi
            requests:
              cpu: 100m
              memory: 32Mi

Logs from the workflow controller

I got this from cloudwatch, so the format may not be what it is supposed to be
---
| @timestamp | pod | log |
| --- | --- | --- |
| 2024-12-18 05:00:41.020 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:41.019Z" level=info msg="Successfully request 'edge-workflow/monitor-argo-workflows-1734498000' to be deleted" |
| 2024-12-18 05:00:41.001 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:41.001Z" level=info msg="Deleting garbage collected workflow 'edge-workflow/monitor-argo-workflows-1734498000'" |
| 2024-12-18 05:00:15.395 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:15.394Z" level=info msg="cleaning up pod" action=deletePod key=edge-workflow/monitor-argo-workflows-1734498000-main-4221058204/deletePod |
| 2024-12-18 05:00:10.435 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.435Z" level=info msg="Queueing Failed workflow edge-workflow/monitor-argo-workflows-1734498000 for delete in 30s due to TTL" |
| 2024-12-18 05:00:10.394 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.394Z" level=info msg="archiving workflow" namespace=edge-workflow uid=ca75c8ee-0609-4f16-994c-7fe168f8f1eb workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.376 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.376Z" level=info msg="Workflow update successful" namespace=edge-workflow phase=Failed resourceVersion=410659688 workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.364 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.364Z" level=info msg="cleaning up pod" action=deletePod key=edge-workflow/monitor-argo-workflows-1734498000-main-4221058204/deletePod |
| 2024-12-18 05:00:10.358 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.358Z" level=info msg="cleaning up pod" action=terminateContainers key=edge-workflow/monitor-argo-workflows-1734498000-main-4221058204/terminateContainers |
| 2024-12-18 05:00:10.354 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.354Z" level=info msg="Running OnExit handler: exit-handler" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.354 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.354Z" level=info msg="Updated phase Running -> Failed" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.354 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.354Z" level=info msg="Updated message  -> Stopped with strategy 'Terminate'" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.354 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.354Z" level=info msg="Marking workflow completed" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.354 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.354Z" level=info msg="Marking workflow as pending archiving" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.353 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.352Z" level=info msg="Task-result reconciliation" namespace=edge-workflow numObjs=0 workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.353 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.353Z" level=info msg="node changed" namespace=edge-workflow new.message="Unschedulable: 0/8 nodes are available: 1 node(s) had untolerated taint {quipper.dev/karpenter: deprecated-amd64-no-disruption-nodepool}, 3 node(s) didn't match Pod's node affinity/selector, 4 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling." new.phase=Pending new.progress=0/1 nodeID=monitor-argo-workflows-1734498000-4221058204 old.message= old.phase=Pending old.progress=0/1 workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.353 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.353Z" level=info msg="Terminating pod as part of workflow shutdown" namespace=edge-workflow podName=monitor-argo-workflows-1734498000-main-4221058204 shutdownStrategy=Terminate workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.353 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.353Z" level=info msg="node monitor-argo-workflows-1734498000-4221058204 phase Pending -> Failed" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.353 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.353Z" level=info msg="node monitor-argo-workflows-1734498000-4221058204 message: workflow shutdown with strategy:  Terminate" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.353 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.353Z" level=info msg="node monitor-argo-workflows-1734498000-4221058204 finished: 2024-12-18 05:00:10.353200692 +0000 UTC" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.353 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.353Z" level=info msg="Stopped with strategy 'Terminate'" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.353 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.353Z" level=info msg="node monitor-argo-workflows-1734498000 phase Running -> Failed" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.353 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.353Z" level=info msg="node monitor-argo-workflows-1734498000 message: Stopped with strategy 'Terminate'" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.353 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.353Z" level=info msg="node monitor-argo-workflows-1734498000 finished: 2024-12-18 05:00:10.353807224 +0000 UTC" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:10.351 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:10.351Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=410659555 namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.475 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.474Z" level=info msg="Update retry attempt 1 successful" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.475 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.474Z" level=info msg="Workflow update successful" namespace=edge-workflow phase=Running resourceVersion=410659555 workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.436 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.436Z" level=warning msg="Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"monitor-argo-workflows-1734498000\": the object has been modified; please apply your changes to the latest version and try again Conflict" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.436 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.436Z" level=info msg="Re-applying updates on latest version and retrying update" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.423 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.422Z" level=info msg=reconcileAgentPod namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.422 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.422Z" level=info msg="Created pod: monitor-argo-workflows-1734498000(0) (monitor-argo-workflows-1734498000-main-4221058204)" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.422 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.422Z" level=info msg="TaskSet Reconciliation" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.411 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.411Z" level=info msg="stopping 'monitor-argo-workflows-1734498000'" namespace=edge-workflow workflow=monitor-argo-workflows |
| 2024-12-18 05:00:00.280 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.279Z" level=info msg="stopping 'monitor-argo-workflows-1734498000'" namespace=edge-workflow workflow=monitor-argo-workflows |
| 2024-12-18 05:00:00.164 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.164Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.164 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.164Z" level=info msg="Retry node monitor-argo-workflows-1734498000 initialized Running" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.164 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.164Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.164 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.164Z" level=info msg="Pod node monitor-argo-workflows-1734498000-4221058204 initialized Pending" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.164 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.164Z" level=warning msg="couldn't get boundaryTemplate through nodeName monitor-argo-workflows-1734498000(0)" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.163 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.163Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.161 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.161Z" level=info msg="Task-result reconciliation" namespace=edge-workflow numObjs=0 workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.161 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.161Z" level=info msg="Updated phase  -> Running" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.152 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.152Z" level=info msg="Processing workflow" Phase= ResourceVersion=410659527 namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
---

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@jswxstw
Copy link
Member

jswxstw commented Dec 20, 2024

| 2024-12-18 05:00:00.436 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.436Z" level=warning msg="Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"monitor-argo-workflows-1734498000\": the object has been modified; please apply your changes to the latest version and try again Conflict" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |
| 2024-12-18 05:00:00.436 | workflow-controller-987cc9b94-fr88f | time="2024-12-18T05:00:00.436Z" level=info msg="Re-applying updates on latest version and retrying update" namespace=edge-workflow workflow=monitor-argo-workflows-1734498000 |

The logs indicate that a conflict occurred while updating the workflow. Is it possible that the current workflow was terminated somewhere else?

@panicboat
Copy link
Contributor Author

Thanks for confirming.
No elements were found to interrupt this Workflow.
The Workflow manifest attached to the body of the Issue is for this and the immediate runtime, but it did not take long enough for the ConcurrencyPolicy to have any impact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants