Job retry not being understood/not working? #2866

zoza1982 · 2023-08-10T17:48:09Z

Hi there, I'm looking for a little help. I am testing the Armada Quickstart for my use cases and am having trouble understanding the scheduler's job retry functionality.

Reading through scheduler.go, I understood that if a job fails, it will be retried until the maximum number of retries is reached.

So I am running this pod, which randomly exits with a return code of either 0 or 1 to emulate job failures:

queue: queue-a
jobSetId: job-set-1
jobs:
  - priority: 0
    podSpec:
      terminationGracePeriodSeconds: 0
      restartPolicy: OnFailure
      containers:
        - name: sleeper
          image: alpine:latest
          command:
            - sh
          args:
            - -c
            - sleep $(( (RANDOM % 60) + 10 )); exit $(( RANDOM % 2 ))
          resources:
            limits:
              memory: 128Mi
              cpu: 0.2
            requests:
              memory: 128Mi
              cpu: 0.2

Say I submit 4 of these jobs: 2 succeed and 2 fail. The ones that fail are never retried, and Armada cleans up all records of them from Kubernetes after ~10 minutes.

In addition, I tried to use restartPolicy: OnFailure to rely on the Kubernetes-native feature, which also did not work. I ran kubectl get po <failed_pod> -o yaml and found that restartPolicy: OnFailure was not passed through; it was set to "Never" instead.

Please explain how job retry works at the job (or whatever) level, without using the Kubernetes-native restartPolicy: OnFailure, and how I can make this example work.

Any help would be appreciated as I could not find any info/existing complaints on this topic!
Thanks!

kannon92 · 2023-08-11T12:41:01Z

Armada retries jobs that didn't start correctly. We don't do restarts once the pod is running. It is an interesting feature request though!

Typically Armada is coupled with some kind of workflow engine, so Airflow (for example) would handle restarts if a job failed.
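To illustrate what "handled outside Armada" could look like without a full workflow engine, here is a minimal Python sketch of a client-side resubmission loop. It assumes a hypothetical `submit_and_wait` callable that you would write yourself: it submits the job spec to Armada, blocks until the job reaches a terminal state, and returns True on success. Neither that callable nor `run_with_retries` is part of Armada's API.

```python
import time
from typing import Callable


def run_with_retries(
    submit_and_wait: Callable[[], bool],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> int:
    """Resubmit a job until it succeeds or the attempts run out.

    submit_and_wait is a hypothetical, user-supplied callable that
    submits one job and returns True if it succeeded, False if it failed.
    Returns the number of attempts used; raises RuntimeError if every
    attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if submit_and_wait():
            return attempt
        if attempt < max_attempts:
            # Linear backoff before the next resubmission.
            time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"job failed after {max_attempts} attempts")


if __name__ == "__main__":
    # Stand-in for a real job: fails twice, then succeeds.
    outcomes = iter([False, False, True])
    attempts = run_with_retries(lambda: next(outcomes), max_attempts=5)
    print(f"succeeded after {attempts} attempts")
```

Each retry here is a brand-new job submission, so it does not depend on restartPolicy being passed through to the pod.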
