Hello,
I'm trying to create an alert based on supercronic's Prometheus metrics that notifies me when a job execution fails. If a cron job has exactly one failed execution, the alert doesn't fire. The reason seems to be that the `supercronic_failed_executions` counter isn't initialized when the process starts, so the metric isn't available in Prometheus until at least one failure has occurred. When the first failure happens, I cannot detect a change because there is nothing to compare against: the metric has only a single value. With at least two failures the alert fires, but I would really like to know about the first failure too. Initializing all counters to 0 would solve this. Also see the first part of this blog post: https://blog.doit-intl.com/making-peace-with-prometheus-rate-43a3ea75c4cf.
My alert rule is `supercronic_failed_executions - supercronic_failed_executions offset 10m > 1`. Something like `supercronic_failed_executions > 1` would work but isn't useful, because such an alert keeps firing until the process is restarted, even if later executions succeed.
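
To make the problem concrete, here is a rough Python sketch of how a "current minus offset" comparison behaves when the counter only appears at its first increment. It is a simplified model, not how Prometheus or supercronic is implemented: one sample per minute, no staleness handling, and all function names are made up for illustration. It also assumes a threshold of `> 0` rather than the `> 1` in the rule above, since a single failure changes the counter by exactly 1.

```python
from typing import Dict, List, Set

Series = Dict[int, float]  # minute -> counter value (hypothetical model)


def build_series(failure_minutes: Set[int], total_minutes: int, zero_init: bool) -> Series:
    """Simulate one scrape per minute of a failure counter.

    Without zero_init the series only exists from the first failure onward,
    which mirrors the behaviour described in this issue.
    """
    series: Series = {}
    count = 0
    for minute in range(total_minutes):
        if minute in failure_minutes:
            count += 1
        if zero_init or count > 0:
            series[minute] = float(count)
    return series


def alert_fires(series: Series, now: int, offset: int = 10, threshold: float = 0.0) -> bool:
    """Model `metric - metric offset 10m > threshold`.

    If either side has no sample, the subtraction yields no result,
    so the alert cannot fire.
    """
    current = series.get(now)
    earlier = series.get(now - offset)
    if current is None or earlier is None:
        return False
    return current - earlier > threshold


def firing_minutes(series: Series, total_minutes: int) -> List[int]:
    """Return every minute at which the modelled alert would be firing."""
    return [m for m in range(total_minutes) if alert_fires(series, m)]


if __name__ == "__main__":
    failures = {20}  # exactly one failed execution, at minute 20
    lazy = build_series(failures, 40, zero_init=False)
    eager = build_series(failures, 40, zero_init=True)
    print("without zero init:", firing_minutes(lazy, 40))
    print("with zero init:   ", firing_minutes(eager, 40))
```

In this model the uninitialized counter never triggers the alert for a single failure, because by the time an offset sample exists it already has the same value as the current one. With the counter exported as 0 from startup, the alert fires for the ten minutes following the failure and then clears.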