
Optional SIGTERM at Job Failure #60

Open
Xtigyro opened this issue Apr 12, 2020 · 18 comments

@Xtigyro

Xtigyro commented Apr 12, 2020

I would like to THANK YOU for this brilliant piece of software!

I have a use case in which supercronic acts as the entry point of a container. This means the container needs to be restarted whenever the job run by supercronic fails.

@krallin
Collaborator

krallin commented Apr 19, 2020

Can you clarify your use case a little bit, and explain what you're using Supercronic for here? I'm a little confused by exactly what you're trying to achieve. Thanks!

@Xtigyro
Author

Xtigyro commented Apr 19, 2020

@krallin At Puppetlabs we're about to include supercronic in the official r10k Docker image. We need it so we can run r10k as a sidecar container in K8s using our Puppet Server Helm chart.

When the r10k run fails - we need the container itself to reflect that in its run status. It would be handy if the container marked its run as failed - so that K8s restarts it, making it clear that there was a problem with the r10k run.

Right now we work around this as follows (https://github.com/Xtigyro/puppetserver-helm-chart/blob/edge/templates/r10k-code.configmap.yaml#L25):

    # note: "exec" would replace the shell and skip the status handling
    # below, so the entry point is run as a regular command instead
    /docker-entrypoint.sh deploy environment --config /etc/puppetlabs/puppet/r10k_code.yaml \
      --puppetfile {{ template "r10k.code.args" . }} > ~/.r10k_code_cronjob.out 2>&1
    retVal=$?
    if [ "$retVal" -eq 0 ]; then
      touch ~/.r10k_code_cronjob.success > /dev/null 2>&1
    else
      rm -f ~/.r10k_code_cronjob.success > /dev/null 2>&1
    fi
    exit $retVal
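
For completeness, the resulting marker file can then be consumed by a K8s exec probe. A sketch only - the absolute path is an assumption, since `~` is not expanded in probe commands:

```yaml
readinessProbe:
  exec:
    # assumed home directory of the container user; adjust to your image
    command: ["test", "-f", "/home/puppet/.r10k_code_cronjob.success"]
  periodSeconds: 30
```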

CC: @underscorgan @slconley

@krallin
Collaborator

krallin commented Apr 19, 2020

So, you're saying you want to run r10k on a periodic basis, and you want the container to exit if it ever fails. Is that correct?

@Xtigyro
Author

Xtigyro commented Apr 19, 2020

Yes, correct.
Thanks for your time, BTW!

@Xtigyro
Author

Xtigyro commented Apr 26, 2020

@krallin Another option would be for supercronic to make the status of the scheduled jobs externally visible - and I mean not only by logging to stdout.

If you set K8s readiness/liveness probes for the container that supercronic is always running in - you could check the state of the jobs (just examples):

  • from a supercronic local log file that matches the name of the job.
  • by running supercronic status [name_of_your_job].
  • through an HTTP API of supercronic where the status of the scheduled jobs is available.
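
As an illustration of the first option, a probe command could grep a per-job status file. This is purely a sketch - supercronic writes no such file today, and the path, file format, and `job_healthy` helper are all made up:

```shell
#!/bin/sh
# Hypothetical per-job health check. Assumes a (not yet existing) status
# file that supercronic would write after each run of the named job.
job_healthy() {
  # succeed only if the status file exists and records an "ok" run
  [ -f "$1" ] && grep -q '^ok$' "$1"
}

# Demo against two simulated status files
printf 'ok\n' > /tmp/r10k_code.status
job_healthy /tmp/r10k_code.status && echo "r10k_code: healthy"

printf 'failed\n' > /tmp/broken_job.status
job_healthy /tmp/broken_job.status || echo "broken_job: unhealthy"
```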

CC: @underscorgan @slconley @Iristyle

@hartwork

The SIGTERM feature requested in the first post can already be emulated by wrapping the command in the crontab line in bash -c "original command here || kill -15 $PPID", where $PPID is the process ID of supercronic as provided by the shell. For a demo:

# ./supercronic -debug <(echo '* * * * * * * bash -c "false || kill -15 $PPID"')
INFO[2020-11-14T15:36:12+01:00] read crontab: /dev/fd/63                     
DEBU[2020-11-14T15:36:12+01:00] try parse(7): * * * * * * * bash -c "false || kill -15 $PPID"[0:13] = * * * * * * * 
DEBU[2020-11-14T15:36:12+01:00] job will run next at 2020-11-14 15:36:13 +0100 CET  job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] starting                                      iteration=0 job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] received terminated, shutting down           
INFO[2020-11-14T15:36:13+01:00] waiting for jobs to finish                   
INFO[2020-11-14T15:36:13+01:00] job succeeded                                 iteration=0 job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
DEBU[2020-11-14T15:36:13+01:00] job will run next at 2020-11-14 15:36:14 +0100 CET  job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
DEBU[2020-11-14T15:36:13+01:00] shutting down                                 job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] exiting

Would that be too brittle?

At a bird's-eye level, terminating supercronic on failures seems to be intended as a means of signaling to the outside world. From a monitoring point of view, rather than waiting for someone to signal "I'm dead", checking for repeated "I'm alive" would probably make for more robust monitoring. So if the context is monitoring, either the existing Prometheus HTTP endpoint or a new HTTP status endpoint would probably serve the bigger picture best. What do you think?

@hartwork

@Xtigyro any thoughts?

@Xtigyro
Author

Xtigyro commented Nov 24, 2020

@hartwork It's required so Kubernetes can monitor the Liveness/Readiness probes. A new HTTP status endpoint will be perfect!
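
For reference, a vanilla K8s httpGet probe only inspects the HTTP status code, so a hypothetical status endpoint would be wired roughly like this - the /status path and the port are assumptions, as no such endpoint exists in supercronic today:

```yaml
livenessProbe:
  httpGet:
    path: /status   # hypothetical supercronic status endpoint
    port: 9746      # assumed port
  periodSeconds: 30
  failureThreshold: 1
```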

@hartwork

@Xtigyro what about the bash wrapper and the existing Prometheus endpoint?

@Xtigyro
Author

Xtigyro commented Nov 24, 2020

@hartwork The Bash wrapper is a bit dirty but can be used. The Prom endpoint cannot work with a vanilla K8s cluster.

@krallin
Collaborator

krallin commented Nov 27, 2020

@Xtigyro : can you clarify what you mean by:

The Prom endpoint cannot work with a vanilla K8s cluster.

I'm not sure I follow what cannot work there. FWIW, I do think exposing a status API would be reasonable to do, but I also would like to make sure that whatever does not work with the Prometheus endpoint would work with a status API :)

@hartwork

hartwork commented Nov 27, 2020

The Prom endpoint cannot work with a vanilla K8s cluster.

I think he's referring to Kubernetes liveness HTTP requests. Kubernetes looks at HTTP status codes and considers 200 <= x < 400 a success. It doesn't seem to support checking for the absence of the word supercronic_failed_executions in the HTTP response body. @Xtigyro could you confirm or correct me on this?

It might be possible to build a Kubernetes liveness command that does the same, based on nothing but Bash, wget, and supercronic's existing Prometheus metrics endpoint. I imagine something like this would work well in practice:

bash -c '( set -o pipefail ; wget -O- http://localhost:9746/metrics 2>/dev/null | fgrep supercronic_ | { ! grep -q failed_executions ; } )'

What do you think?
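
In a pod spec, the check above could be wired as an exec liveness probe, roughly like this sketch - the timings are arbitrary, and the port assumes supercronic was started with -prometheus-listen-address localhost:9746:

```yaml
livenessProbe:
  exec:
    command:
      - bash
      - -c
      - "set -o pipefail; wget -O- http://localhost:9746/metrics 2>/dev/null | fgrep supercronic_ | { ! grep -q failed_executions ; }"
  periodSeconds: 60
  failureThreshold: 3
```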

@Xtigyro
Author

Xtigyro commented Nov 28, 2020

@hartwork @krallin Yup - spot on - that was exactly my point.

I'm gonna test that command and let you know - it looks alright to me, should be useful.

On a side note - in the Puppet Server Helm chart we've worked around it like this: https://github.com/puppetlabs/puppetserver-helm-chart/blob/master/templates/r10k-code.configmap.yaml#L20_L40

CC: @underscorgan @slconley @Iristyle @scottcressi @mwaggett @nwolfe @adrienthebo @dhollinger @binford2k @alannab

@Xtigyro
Author

Xtigyro commented Nov 28, 2020

http://localhost:9746/metrics

@hartwork A question - is this exposed by default?

~ $ wget -O- http://localhost:9746/metrics
Connecting to localhost:9746 (127.0.0.1:9746)
wget: can't connect to remote host (127.0.0.1): Connection refused

@hartwork

@hartwork A question - is this exposed by default?

It needs something like supercronic -prometheus-listen-address localhost:9746 [..] but that's all. The :9746 can be omitted if your copy of supercronic is fresh enough to include PR #81.

@Xtigyro
Author

Xtigyro commented Dec 5, 2020

Guys, replacing one Shell command with another doesn't make enough of a difference for us.

One of these might improve the current situation:

  • from a supercronic local log file that matches the name of the job.
  • by running supercronic status [name_of_your_job].
  • through an HTTP API of supercronic where the status of the scheduled jobs is available.

If any of those is too time-consuming to develop - we will completely understand, and we should just close the feature request.

CC: @underscorgan @slconley @Iristyle @scottcressi @mwaggett @nwolfe @adrienthebo @dhollinger @binford2k @alannab

@hartwork

hartwork commented Dec 5, 2020

Guys, replacing one Shell command with another doesn't make enough of a difference for us.

@Xtigyro could you elaborate? You asked for a liveness check for vanilla kubernetes and it seems like the command above would satisfy that request. If it's not the actual set of requirements, please share the actual current set of requirements. Thank you.

@Xtigyro
Author

Xtigyro commented Dec 6, 2020

The current workaround we use is essentially a much simpler shell command, used in a different way as well - so it's more resource-efficient. We were looking for a real status health check compatible with K8s probes.

The current situation is good enough anyway - so if such a per-job status health check won't be beneficial for supercronic in general, then we'd better just leave it the way it is. It's still a great piece of software which we're gonna keep using.

All good, in other words. 🙇‍♂️
