Need help in writing alert rule for batch jobs #613
Replies: 2 comments 1 reply
-
It's kind of weird to use the status code as the value of the metric. Having said that, you shouldn't see gaps in your metric. A metric pushed to the Pushgateway (PGW) stays there until it is overwritten by the next push. You appear to always push all the metrics. Maybe both of your batch jobs push to the same group? If so, you might reconsider that, or use the Add function in client_golang so that only metrics of the same name are overwritten. Once you have fixed that, it should be pretty straightforward to write alerts on any metric value being != 200 for the last hour.
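As a rough illustration of the two suggestions above (separate groups per job, and "add" semantics), here is a minimal standard-library sketch. It is not the client_golang implementation. The Pushgateway address, job names, and the region grouping label are made up for the example, and it assumes label values without slashes. It relies on the Pushgateway HTTP API, where a PUT to a group URL replaces the whole group while a POST replaces only metrics with the same name.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// groupURL builds a Pushgateway push URL for a job plus extra grouping labels.
// Labels are passed as alternating name/value pairs so their order is stable.
// Assumption: values contain no slashes (those need the base64 URL form).
func groupURL(base, job string, labelPairs ...string) string {
	var b strings.Builder
	b.WriteString(strings.TrimRight(base, "/"))
	b.WriteString("/metrics/job/" + url.PathEscape(job))
	for i := 0; i+1 < len(labelPairs); i += 2 {
		b.WriteString("/" + url.PathEscape(labelPairs[i]))
		b.WriteString("/" + url.PathEscape(labelPairs[i+1]))
	}
	return b.String()
}

// pushAdd sends metrics in the text exposition format with POST, which
// replaces only metrics of the same name within the group. A PUT to the
// same URL would replace every metric in the group.
func pushAdd(target, body string) error {
	resp, err := http.Post(target, "text/plain; version=0.0.4", strings.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("push to %s failed: %s", target, resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical address, job names, and region: each job gets its own
	// group, so one job's push cannot wipe the other's metrics.
	for _, job := range []string{"batch_a", "batch_b"} {
		fmt.Println(groupURL("http://pushgateway.example:9091", job, "region", "us-east-1"))
	}
}
```

With distinct job names (or any other distinguishing grouping label), each batch job writes to its own group, so a push from one job leaves the other job's metrics in place.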
-
Very generally, a status code should be in a label. However, this is a very general Prometheus metrics topic, not very suitable for the very specific Pushgateway discussions area on GitHub. I recommend using one of the other community channels, where there are also far more people available to discuss those topics.
The Pushgateway doesn't implement a distributed counter (in other words, it's not statsd-like). It doesn't count events.
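To make "a status code should be in a label" concrete, here is a small sketch that renders one sample in the Prometheus text exposition format, with the code as a label and a constant value of 1. The metric name batch_job_last_status and the label name code are hypothetical choices for the example, not names taken from the exporter in question.

```go
package main

import (
	"fmt"
	"strconv"
)

// statusSample renders a single gauge sample in the Prometheus text
// exposition format. The status code is carried in the "code" label
// and the sample value is always 1.
func statusSample(metric string, code int) string {
	return fmt.Sprintf("# TYPE %s gauge\n%s{code=%q} 1\n",
		metric, metric, strconv.Itoa(code))
}

func main() {
	// Hypothetical metric name; this text is what the batch job would push.
	fmt.Print(statusSample("batch_job_last_status", 500))
}
```

With the code in a label, an alert expression can select on it with a label matcher such as `batch_job_last_status{code!="200"}` instead of comparing the sample value.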
-
I have some batch jobs running in three different regions. Each of them will:
* I have the full error code to error description mapping - it's text based and can easily be converted to JSON, CSV, etc. if needed
These metrics are being pushed to a PushGateway instance (a shared instance for all regions). Every 30 seconds, my Prometheus instance will scrape all these metrics.
Now the problem is that my batch jobs are not long-running processes like a web server, where metrics are usually available at a /metrics endpoint, so all the metrics above are only available for a short period of time in my Prometheus. There are a lot of "gaps" or "blanks" in my Prometheus data, which blocks me from achieving what I want.
What I want are:
* [...] run_type and region)
* [...] annotations area) what are the recent errors based on the metrics return value above

Please tell me whether this is possible and what the alert rules should look like. If it's not possible with the current metric types and labels, please tell me the correct metric type I need to use and which labels should be added. Since I own and wrote the "exporter" part, I can change it.
FWIW, I'm using this Golang function to push metrics from my batch job to the PushGateway instance.