Optional SIGTERM at Job Failure #60
Can you clarify your use case a little bit, and explain what you're using Supercronic for here? I'm a little confused by exactly what you're trying to achieve. Thanks!
@krallin At Puppetlabs we're about to include supercronic in the Puppet Server Helm chart. When the run of r10k fails, we need the container to exit so it can be restarted. Right now we work around this through (https://github.com/Xtigyro/puppetserver-helm-chart/blob/edge/templates/r10k-code.configmap.yaml#L25):

/docker-entrypoint.sh deploy environment --config /etc/puppetlabs/puppet/r10k_code.yaml \
    --puppetfile {{ template "r10k.code.args" . }} > ~/.r10k_code_cronjob.out 2>&1
retVal=$?
if [ "$retVal" -eq "0" ]; then
  touch ~/.r10k_code_cronjob.success > /dev/null 2>&1
else
  rm ~/.r10k_code_cronjob.success > /dev/null 2>&1
fi
exit $retVal
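(For reference, a success-marker file like that can be fed straight into an exec-style Kubernetes liveness probe so the container is restarted once a run fails. A minimal sketch, assuming the probe runs in the same container and with the same home directory as the wrapper script above; the timings are arbitrary placeholders:)

livenessProbe:
  exec:
    command:
      - sh
      - -c
      # healthy only while the success marker written by the wrapper above exists
      - test -f ~/.r10k_code_cronjob.success
  initialDelaySeconds: 60   # give the first cron run time to complete
  periodSeconds: 30
  failureThreshold: 3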
So, you're saying you want to run r10k on a periodic basis, and you want the container to exit if it ever fails. Is that correct?
Yes, correct.
@krallin Another option could be a per-job status health check exposed by supercronic. If you set readiness/liveness K8s probes for the container that consume it, Kubernetes can restart the container whenever a job fails.
The SIGTERM feature that is being asked for in the first post can already be emulated by wrapping the command in the crontab line so that a failed run sends SIGTERM to supercronic itself, e.g. bash -c "<command> || kill -15 $PPID":

# ./supercronic -debug <(echo '* * * * * * * bash -c "false || kill -15 $PPID"')
INFO[2020-11-14T15:36:12+01:00] read crontab: /dev/fd/63
DEBU[2020-11-14T15:36:12+01:00] try parse(7): * * * * * * * bash -c "false || kill -15 $PPID"[0:13] = * * * * * * *
DEBU[2020-11-14T15:36:12+01:00] job will run next at 2020-11-14 15:36:13 +0100 CET job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] starting iteration=0 job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] received terminated, shutting down
INFO[2020-11-14T15:36:13+01:00] waiting for jobs to finish
INFO[2020-11-14T15:36:13+01:00] job succeeded iteration=0 job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
DEBU[2020-11-14T15:36:13+01:00] job will run next at 2020-11-14 15:36:14 +0100 CET job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
DEBU[2020-11-14T15:36:13+01:00] shutting down job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] exiting

Would that be too brittle?

From a bird's-eye view, terminating supercronic on failures seems to be intended as a means of signaling to the outside world. From a monitoring point of view, rather than waiting for someone to signal "I'm dead", checking for repeated "I'm alive" would probably make for more robust monitoring. So if the context is monitoring, either the existing Prometheus HTTP endpoint or a new HTTP status endpoint would probably serve the bigger picture best. What do you think?
@Xtigyro any thoughts?
@hartwork It's required so that Kubernetes can monitor the container through liveness/readiness probes. A new HTTP status endpoint will be perfect!
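(For illustration only: on the Kubernetes side, a probe against such an endpoint could look like the sketch below. The /status path is hypothetical - supercronic does not provide this endpoint today - and the port and timings are placeholders.)

livenessProbe:
  httpGet:
    path: /status   # hypothetical endpoint, not implemented in supercronic
    port: 9746      # placeholder, borrowed from the Prometheus metrics port mentioned below
  periodSeconds: 30
  failureThreshold: 3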
@Xtigyro what about the bash wrapper and the existing Prometheus endpoint?
@hartwork The Bash wrapper is a bit dirty but can be used. The Prom endpoint cannot work with a vanilla K8s cluster.
@Xtigyro: can you clarify what you mean by "The Prom endpoint cannot work with a vanilla K8s cluster"?

I'm not sure I follow what cannot work there. FWIW, I do think exposing a status API would be reasonable to do, but I also would like to make sure that whatever does not work with the Prometheus endpoint would work with a status API :)
I think he refers to Kubernetes liveness HTTP requests. Kubernetes looks at HTTP status codes and considers any code from 200 to 399 a success and everything else a failure.

It might be possible to build a Kubernetes liveness command to do the same, based on nothing but Bash, wget and supercronic's existing Prometheus metrics endpoint. I imagine something like this would work well in practice:

bash -c '( set -o pipefail ; wget -O- http://localhost:9746/metrics 2>/dev/null | fgrep supercronic_ | { ! grep -q failed_executions ; } )'

What do you think?
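(To sketch how that one-liner could be wired into a vanilla cluster without needing the HTTP probe type at all, an exec-style probe can run it directly. This is untested; the port and metric name come from the command above, and the timings are placeholders.)

livenessProbe:
  exec:
    command:
      - bash
      - -c
      # healthy while supercronic_ metrics are served and none of them mentions failed_executions
      - '( set -o pipefail ; wget -O- http://localhost:9746/metrics 2>/dev/null | fgrep supercronic_ | { ! grep -q failed_executions ; } )'
  periodSeconds: 30
  failureThreshold: 1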
@hartwork @krallin Yup - spot on - that was exactly my point. I'm gonna test that command and let you know - it looks alright to me and should be useful.

On a side note - in the Puppet Server Helm chart we've worked around it like this: https://github.com/puppetlabs/puppetserver-helm-chart/blob/master/templates/r10k-code.configmap.yaml#L20_L40

CC: @underscorgan @slconley @Iristyle @scottcressi @mwaggett @nwolfe @adrienthebo @dhollinger @binford2k @alannab
@hartwork A question - is this exposed by default?

~ $ wget -O- http://localhost:9746/metrics
Connecting to localhost:9746 (127.0.0.1:9746)
wget: can't connect to remote host (127.0.0.1): Connection refused
Guys, replacing one shell command with another doesn't make enough of a difference for us. One of these might improve the current situation:

If any of those are too time-consuming to develop - we will completely understand and we should just close the feature request.

CC: @underscorgan @slconley @Iristyle @scottcressi @mwaggett @nwolfe @adrienthebo @dhollinger @binford2k @alannab
@Xtigyro could you elaborate? You asked for a liveness check for vanilla Kubernetes, and it seems like the command above would satisfy that request. If that's not the actual set of requirements, please share the current one. Thank you.
The current workaround we use is essentially a much simpler shell command which is also used in a different way - so it's more resource efficient. We were looking for a real status health check compatible with K8s probes. The current situation is good enough anyway - so if such a per-job status health check won't be beneficial in general for supercronic, we can simply close this feature request.

All good, in other words. 🙇♂️
I would like to THANK YOU for this brilliant piece of software!

I have a use case in which supercronic acts as the entry-point of a container. That leads to the necessity to restart the container if the job that is run by supercronic fails.