[Bug]: ‘alert -P’ gets an Unknown if everything is fine #65

wattebausch · 2024-12-10T09:28:15Z

Please try to fill out as much of the information below as you can. Thank you!

Yes, I've searched similar issues on GitHub and didn't find any.

Which version contains the bug?

0.2.0 & development

Describe the bug

Hello there

We use check_prometheus in the ‘alert -P’ method. We only want to output errors. If we run without ‘-P’ the SMS/email will be too long with over 100 checks when an error occurs.

We have compared version ‘v0.2.0’ and ‘development’. We noticed that the versions behave differently with ‘alert’. However, both are the same, if our Prometheus is all set to ‘Inactive’, i.e. OK, then we get an ‘UNKNOWN’ back

Can you check if this is the same for you?
Thanks for your great work

How to recreate the bug?

Version: 0.2.0 (Check against a cluster without alerts)

./check_prometheus_v0.2.0_Linux_x86_64 -H metrics.clean-cluster.example.com -p 80 alert 
[OK] - Alerts inactive | total=123 firing=0 pending=0 inactive=123

./check_prometheus_v0.2.0_Linux_x86_64 -H metrics.clean-cluster.example.com -p 80 alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive
 | 

# echo $?
3

Version: 0.2.0 (check against a cluster with problem)

./check_prometheus_v0.2.0_Linux_x86_64 -H metrics.some-probelm-cluster.example.com -p 80 alert 
[CRITICAL] - 126 Alerts: 1 Firing - 0 Pending - 125 Inactive
 \_[OK] [FluxHelmReleaseNotReady] is inactive
 \_[OK] [FluxGitRepositorySyncFailed] is inactive
...
 \_[CRITICAL] [Watchdog] is firing - value: 1.00
 \_[OK] [InfoInhibitor] is inactive
..
 \_[OK] [K3sCertificateExpiration] is inactive
 | total=126 firing=1 pending=0 inactive=125

./check_prometheus_v0.2.0_Linux_x86_64 -H metrics.some-probelm-cluster.example.com -p 80 alert -P
[CRITICAL] - 1 Alerts: 1 Firing - 0 Pending - 0 Inactive
 \_[CRITICAL] [Watchdog] is firing - value: 1.00
 |

Version: development (Check against a cluster without alerts)

./check_prometheus_develop -H metrics.clean-cluster.example.com -p 80 alert 
[OK] - 125 Alerts: 0 Firing - 0 Pending - 125 Inactive
\_ [OK] [FluxHelmReleaseNotReady] is inactive
\_ [OK] [FluxGitRepositorySyncFailed] is inactive
\_ [OK] [InfoInhibitor] is inactive
..
\_ [OK] [K3sCertificateExpiration] is inactive
|total=125 firing=0 pending=0 inactive=125

./check_prometheus_develop -H metrics.clean-cluster.example.com -p 80 alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

# echo $?
3

Version: development (check against a cluster with problem)

./check_prometheus_develop -H metrics.some-probelm-cluster.example.com  -p 80 alert 
[CRITICAL] - 126 Alerts: 1 Firing - 0 Pending - 125 Inactive
\_ [OK] [FluxHelmReleaseNotReady] is inactive
\_ [OK] [FluxGitRepositorySyncFailed] is inactive
..
\_ [CRITICAL] [Watchdog] is firing - value: 1.00
\_ [OK] [InfoInhibitor] is inactive
..
\_ [OK] [K3sCertificateExpiration] is inactive
|total=126 firing=1 pending=0 inactive=125

./check_prometheus_develop -H metrics.some-probelm-cluster.example.com  -p 80 alert -P
[CRITICAL] - 1 Alerts: 1 Firing - 0 Pending - 0 Inactive
\_ [CRITICAL] [Watchdog] is firing - value: 1.00

RincewindsHat · 2024-12-10T12:32:27Z

Hi @wattebausch,
To summarize: You would like to only see Problems in the Output and not alerts which are OK?
Did I read that correctly?

wattebausch · 2024-12-10T12:48:35Z

Yes.
if I have understood it correctly, alert -P should be. But as I see it, the exit code is 3 (Unknown) and not 0 (OK)
-P, --problems Display only alerts which status is not inactive/OK

./check_prometheus_develop -H metrics.clean-cluster.example.com -p 80 alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

# echo $?
3

./check_prometheus_v0.2.0_Linux_x86_64 -H metrics.clean-cluster.example.com -p 80 alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive
 | 

# echo $?
3

martialblog · 2024-12-11T07:40:45Z

Hi, right now that is just the logic of the plugin.

I think the reasoning behind it was, if there's nothing there, we cannot determine the status. But I agree that it is somewhat odd behavior. If there are no alert rule defined it would also return UNKNOWN.

We can improve that, just gotta keep all possible scenarios in mind for example:

	All Alerts	--problems	--name
No Alerts in API	OK	OK	Unknown, since we cannot determine the status of the wanted alert
Alerts in API	Worst State	Worst State	Unknown, since we cannot determine the status of the wanted alert

What does everybody think?

martialblog · 2024-12-11T07:43:57Z

@wattebausch Thanks again for your feedback. It's very helpful!

RincewindsHat · 2024-12-11T12:23:04Z

Hi, right now that is just the logic of the plugin.

I think the reasoning behind it was, if there's nothing there, we cannot determine the status. But I agree that it is somewhat odd behavior. If there are no alert rule defined it would also return UNKNOWN.

We can improve that, just gotta keep all possible scenarios in mind for example:
All Alerts --problems --name
No Alerts in API OK OK Unknown, since we cannot determine the status of the wanted alert
Alerts in API Worst State Worst State Unknown, since we cannot determine the status of the wanted alert

What does everybody think?

Without knowing much about prometheus this sounds about right, when checking for alerts it is probably OK if there are none.

martialblog · 2024-12-16T13:54:01Z

@wattebausch I reworked the logic a bit, you can check out the new code here #66

This should be more expected and flexible behavior

wattebausch · 2024-12-16T15:07:19Z

@wattebausch I reworked the logic a bit, you can check out the new code here #66

This should be more expected and flexible behavior

Check my Version

git checkout
Your branch is up to date with 'origin/no-alerts-logic'.

./check_prometheus --version
check_prometheus version development
commit: HEAD
date: latest

i don't know whether the first four lines should appear in the help. FYI

./check_prometheus alert --help
Checks the status of a Prometheus alert and evaluates the status of the alert:
firing = 2
pending = 1
inactive = 0

Usage:
  check_prometheus alert [flags]

Examples:
...
...
Flags:
  -h, --help                     help for alert
  -n, --name strings             The name of one or more specific alerts to check.
                                 This parameter can be repeated e.G.: '--name alert1 --name alert2'
                                 If no name is given, all alerts will be evaluated
  -T, --no-alerts-state string   State to assign when no alerts are found (0, 1, 2, 3, OK, WARNING, CRITICAL, UNKNOWN). If not set this defaults to OK (default "OK")
  -P, --problems                 Display only alerts which status is not inactive/OK. Note that in combination with the --name flag this might result in no alerts being displayed

Global Flags:
  -b, --bearer string      Specify the Bearer Token for server authentication (CHECK_PROMETHEUS_BEARER)
...

Unfortunately, the behaviour is still the same. As far as I can see, I have compiled the correct version
Should the ‘Inactive’ not be the number of checks. in my case 125?

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T 0
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T OK
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T 1
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T CRITICAL
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P --no-alerts-state OK
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P --no-alerts-state 0
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T OK ; echo $?
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

3

Here I check whether, if there is ‘Firing’, it does not mistakenly go to OK. perfect, works as it should.

./check_prometheus -H metrics.some-probelm-cluster.example.com -p 80 alert -P -T OK
[CRITICAL] - 1 Alerts: 1 Firing - 0 Pending - 0 Inactive
\_ [CRITICAL] [Watchdog] is firing - value: 1.00

./check_prometheus -H metrics.some-probelm-cluster.example.com -p 80 alert -P -T 0
[CRITICAL] - 1 Alerts: 1 Firing - 0 Pending - 0 Inactive
\_ [CRITICAL] [Watchdog] is firing - value: 1.00

martialblog · 2024-12-17T08:05:32Z

Thanks for the feedback, I'll have a another look

martialblog · 2024-12-18T17:30:40Z

Hi,

these are the results when checking Prometheus with the new logic:

# Starting Prometheus without any alert rules defined:

go run main.go alert
[OK] - No alerts defined

go run main.go alert -T 3
[UNKNOWN] - No alerts defined

go run main.go alert -T 2
[CRITICAL] - No alerts defined

go run main.go alert -T foo
[UNKNOWN] - invalid value for --no-alerts-state: foo (*errors.errorString)

# When an alert is defined:

go run main.go alert -T 2
[OK] - 1 Alerts: 0 Firing - 0 Pending - 1 Inactive
\_ [OK] [NodeHasMemoryPressure] is inactive

# When you are looking for a specific alert and alerts are defined:
# These counts only include the names that match the --name parameters

go run main.go alert -T 2 --name foo
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

go run main.go alert -T 1 --name foo --name NodeHasMemoryPressure
[OK] - 1 Alerts: 0 Firing - 0 Pending - 1 Inactive
\_ [OK] [NodeHasMemoryPressure] is inactive
|total=1 firing=0 pending=0 inactive=1

# When you are looking for a specific alert and no alerts are defined:

go run main.go alert -T 2 --name foo
[UNKNOWN] - No such alert defined

As far as I can tell, this is the behavior we want.

wattebausch · 2024-12-19T13:20:17Z

Hi @martialblog many thanks.
i have tried again, but it doesn't work for me. have i done something wrong? do you have any tips for me?
we use it with the --problem "-P" flag

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T 0
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

go run main.go -H metrics.clean-cluster.example.com -p 80 alert -P 
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

exit status 3

go run main.go -H metrics.clean-cluster.example.com -p 80 alert -P -T 0
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

exit status 3

go version go1.22.2 linux/amd64

RincewindsHat · 2024-12-20T04:13:57Z

@wattebausch Did you use the current main branch? I do get:

./check_prometheus alert
[OK] - No alerts defined

martialblog · 2024-12-20T06:55:33Z

Haven't tested it with the "-P" flag, but it's applied after checking if there are any alerts. I can give it another look.

wattebausch · 2024-12-20T07:56:41Z

Did you use the current main branch? I do get:

Yes, I made another git clone

>git pull 
Already up to date.

>git log
commit a2cc3770ec1ec2714b8abdd7766fcfe690a56ea9 (HEAD -> main, origin/main, origin/HEAD)
Merge: 6d8740a 1318445
Author: Markus Opolka <[email protected]>
Date:   Thu Dec 19 09:01:20 2024 +0100

    Merge pull request #69 from NETWAYS/fix/update_x_net_html
    
    Update dependencies

commit 6d8740a3ff0fc4ff90c4eb443952d4356df15cd8
Merge: 67fb855 fa2e5bd

./check_prometheus alert
[OK] - No alerts defined

we use kubernetes, i can only test it from our icinga server, not on local. our hosts all have multiple rules. should i delete all rules so that the prometheus is empty?

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert
[OK] - 127 Alerts: 0 Firing - 0 Pending - 127 Inactive
\_ [OK] [FluxHelmReleaseNotReady] is inactive
\_ [OK] [FluxGitRepositorySyncFailed] is inactive
...

martialblog · 2024-12-20T08:16:57Z

I just started a simple Prometheus locally, with a minimal config:

podman run -ti --rm -v $(pwd):/etc/prometheus -p 9090:9090 docker.io/prom/prometheus:v3.0.0

# With alerts defined:
$ go run main.go alert 

[OK] - 1 Alerts: 0 Firing - 0 Pending - 1 Inactive
\_ [OK] [NodeHasMemoryPressure] is inactive

# With alerts defined and problems only
$ go run main.go alert -P

[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

go run main.go alert -P -T 1
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

# With nothing defined:

$ go run main.go alert
[OK] - No alerts defined

$ go run main.go alert -P
[OK] - No alerts defined

So yeah something isn't right yet.
These guys here should behave differently:

# With alerts defined and problems only
$ go run main.go alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive
# Should be OK

go run main.go alert -P -T 1
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive
# Should be WARN

wattebausch added bug Something isn't working needs-triage Needs to be triaged labels Dec 10, 2024

martialblog removed the needs-triage Needs to be triaged label Dec 11, 2024

martialblog added this to the v0.3.0 milestone Dec 11, 2024

martialblog self-assigned this Dec 16, 2024

martialblog mentioned this issue Dec 16, 2024

Rework logic when there are no alerts defined #66

Merged

wattebausch changed the title ~~[Bug]: ‘alert -p’ gets an Unknown if everything is fine~~ [Bug]: ‘alert -P’ gets an Unknown if everything is fine Dec 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: ‘alert -P’ gets an Unknown if everything is fine #65

[Bug]: ‘alert -P’ gets an Unknown if everything is fine #65

wattebausch commented Dec 10, 2024

RincewindsHat commented Dec 10, 2024

wattebausch commented Dec 10, 2024

martialblog commented Dec 11, 2024

martialblog commented Dec 11, 2024

RincewindsHat commented Dec 11, 2024

martialblog commented Dec 16, 2024

wattebausch commented Dec 16, 2024 •

edited

Loading

martialblog commented Dec 17, 2024

martialblog commented Dec 18, 2024

wattebausch commented Dec 19, 2024 •

edited

Loading

RincewindsHat commented Dec 20, 2024

martialblog commented Dec 20, 2024

wattebausch commented Dec 20, 2024

martialblog commented Dec 20, 2024

[Bug]: ‘alert -P’ gets an Unknown if everything is fine #65

[Bug]: ‘alert -P’ gets an Unknown if everything is fine #65

Comments

wattebausch commented Dec 10, 2024

Please try to fill out as much of the information below as you can. Thank you!

Which version contains the bug?

Describe the bug

How to recreate the bug?

Version: 0.2.0 (Check against a cluster without alerts)

Version: 0.2.0 (check against a cluster with problem)

Version: development (Check against a cluster without alerts)

Version: development (check against a cluster with problem)

RincewindsHat commented Dec 10, 2024

wattebausch commented Dec 10, 2024

martialblog commented Dec 11, 2024

martialblog commented Dec 11, 2024

RincewindsHat commented Dec 11, 2024

martialblog commented Dec 16, 2024

wattebausch commented Dec 16, 2024 • edited Loading

martialblog commented Dec 17, 2024

martialblog commented Dec 18, 2024

wattebausch commented Dec 19, 2024 • edited Loading

RincewindsHat commented Dec 20, 2024

martialblog commented Dec 20, 2024

wattebausch commented Dec 20, 2024

martialblog commented Dec 20, 2024

wattebausch commented Dec 16, 2024 •

edited

Loading

wattebausch commented Dec 19, 2024 •

edited

Loading