Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: ‘alert -P’ gets an Unknown if everything is fine #65

Open
1 task done
wattebausch opened this issue Dec 10, 2024 · 14 comments
Open
1 task done

[Bug]: ‘alert -P’ gets an Unknown if everything is fine #65

wattebausch opened this issue Dec 10, 2024 · 14 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@wattebausch
Copy link

Please try to fill out as much of the information below as you can. Thank you!

  • Yes, I've searched similar issues on GitHub and didn't find any.

Which version contains the bug?

0.2.0 & development

Describe the bug

Hello there

We use check_prometheus in the ‘alert -P’ method. We only want to output errors. If we run without ‘-P’ the SMS/email will be too long with over 100 checks when an error occurs.

We have compared version ‘v0.2.0’ and ‘development’. We noticed that the versions behave differently with ‘alert’. However, both are the same, if our Prometheus is all set to ‘Inactive’, i.e. OK, then we get an ‘UNKNOWN’ back

Can you check if this is the same for you?
Thanks for your great work

How to recreate the bug?

Version: 0.2.0 (Check against a cluster without alerts)

./check_prometheus_v0.2.0_Linux_x86_64 -H metrics.clean-cluster.example.com -p 80 alert 
[OK] - Alerts inactive | total=123 firing=0 pending=0 inactive=123
./check_prometheus_v0.2.0_Linux_x86_64 -H metrics.clean-cluster.example.com -p 80 alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive
 | 

# echo $?
3

Version: 0.2.0 (check against a cluster with problem)

./check_prometheus_v0.2.0_Linux_x86_64 -H metrics.some-probelm-cluster.example.com -p 80 alert 
[CRITICAL] - 126 Alerts: 1 Firing - 0 Pending - 125 Inactive
 \_[OK] [FluxHelmReleaseNotReady] is inactive
 \_[OK] [FluxGitRepositorySyncFailed] is inactive
...
 \_[CRITICAL] [Watchdog] is firing - value: 1.00
 \_[OK] [InfoInhibitor] is inactive
..
 \_[OK] [K3sCertificateExpiration] is inactive
 | total=126 firing=1 pending=0 inactive=125
./check_prometheus_v0.2.0_Linux_x86_64 -H metrics.some-probelm-cluster.example.com -p 80 alert -P
[CRITICAL] - 1 Alerts: 1 Firing - 0 Pending - 0 Inactive
 \_[CRITICAL] [Watchdog] is firing - value: 1.00
 | 

Version: development (Check against a cluster without alerts)

./check_prometheus_develop -H metrics.clean-cluster.example.com -p 80 alert 
[OK] - 125 Alerts: 0 Firing - 0 Pending - 125 Inactive
\_ [OK] [FluxHelmReleaseNotReady] is inactive
\_ [OK] [FluxGitRepositorySyncFailed] is inactive
\_ [OK] [InfoInhibitor] is inactive
..
\_ [OK] [K3sCertificateExpiration] is inactive
|total=125 firing=0 pending=0 inactive=125
./check_prometheus_develop -H metrics.clean-cluster.example.com -p 80 alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

# echo $?
3

Version: development (check against a cluster with problem)

./check_prometheus_develop -H metrics.some-probelm-cluster.example.com  -p 80 alert 
[CRITICAL] - 126 Alerts: 1 Firing - 0 Pending - 125 Inactive
\_ [OK] [FluxHelmReleaseNotReady] is inactive
\_ [OK] [FluxGitRepositorySyncFailed] is inactive
..
\_ [CRITICAL] [Watchdog] is firing - value: 1.00
\_ [OK] [InfoInhibitor] is inactive
..
\_ [OK] [K3sCertificateExpiration] is inactive
|total=126 firing=1 pending=0 inactive=125
./check_prometheus_develop -H metrics.some-probelm-cluster.example.com  -p 80 alert -P
[CRITICAL] - 1 Alerts: 1 Firing - 0 Pending - 0 Inactive
\_ [CRITICAL] [Watchdog] is firing - value: 1.00
@wattebausch wattebausch added bug Something isn't working needs-triage Needs to be triaged labels Dec 10, 2024
@RincewindsHat
Copy link
Member

Hi @wattebausch,
To summarize: You would like to only see Problems in the Output and not alerts which are OK?
Did I read that correctly?

@wattebausch
Copy link
Author

Yes.
if I have understood it correctly, alert -P should be. But as I see it, the exit code is 3 (Unknown) and not 0 (OK)
-P, --problems Display only alerts which status is not inactive/OK

./check_prometheus_develop -H metrics.clean-cluster.example.com -p 80 alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

# echo $?
3
./check_prometheus_v0.2.0_Linux_x86_64 -H metrics.clean-cluster.example.com -p 80 alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive
 | 

# echo $?
3

@martialblog martialblog removed the needs-triage Needs to be triaged label Dec 11, 2024
@martialblog
Copy link
Member

Hi, right now that is just the logic of the plugin.

I think the reasoning behind it was, if there's nothing there, we cannot determine the status. But I agree that it is somewhat odd behavior. If there are no alert rule defined it would also return UNKNOWN.

We can improve that, just gotta keep all possible scenarios in mind for example:

All Alerts --problems --name
No Alerts in API OK OK Unknown, since we cannot determine the status of the wanted alert
Alerts in API Worst State Worst State Unknown, since we cannot determine the status of the wanted alert

What does everybody think?

@martialblog martialblog added this to the v0.3.0 milestone Dec 11, 2024
@martialblog
Copy link
Member

@wattebausch Thanks again for your feedback. It's very helpful!

@RincewindsHat
Copy link
Member

Hi, right now that is just the logic of the plugin.

I think the reasoning behind it was, if there's nothing there, we cannot determine the status. But I agree that it is somewhat odd behavior. If there are no alert rule defined it would also return UNKNOWN.

We can improve that, just gotta keep all possible scenarios in mind for example:
All Alerts --problems --name
No Alerts in API OK OK Unknown, since we cannot determine the status of the wanted alert
Alerts in API Worst State Worst State Unknown, since we cannot determine the status of the wanted alert

What does everybody think?

Without knowing much about prometheus this sounds about right, when checking for alerts it is probably OK if there are none.

@martialblog
Copy link
Member

@wattebausch I reworked the logic a bit, you can check out the new code here #66

This should be more expected and flexible behavior

@wattebausch
Copy link
Author

wattebausch commented Dec 16, 2024

@wattebausch I reworked the logic a bit, you can check out the new code here #66

This should be more expected and flexible behavior

Check my Version

git checkout
Your branch is up to date with 'origin/no-alerts-logic'.
./check_prometheus --version
check_prometheus version development
commit: HEAD
date: latest

i don't know whether the first four lines should appear in the help. FYI

./check_prometheus alert --help
Checks the status of a Prometheus alert and evaluates the status of the alert:
firing = 2
pending = 1
inactive = 0

Usage:
  check_prometheus alert [flags]

Examples:
...
...
Flags:
  -h, --help                     help for alert
  -n, --name strings             The name of one or more specific alerts to check.
                                 This parameter can be repeated e.G.: '--name alert1 --name alert2'
                                 If no name is given, all alerts will be evaluated
  -T, --no-alerts-state string   State to assign when no alerts are found (0, 1, 2, 3, OK, WARNING, CRITICAL, UNKNOWN). If not set this defaults to OK (default "OK")
  -P, --problems                 Display only alerts which status is not inactive/OK. Note that in combination with the --name flag this might result in no alerts being displayed

Global Flags:
  -b, --bearer string      Specify the Bearer Token for server authentication (CHECK_PROMETHEUS_BEARER)
...

Unfortunately, the behaviour is still the same. As far as I can see, I have compiled the correct version
Should the ‘Inactive’ not be the number of checks. in my case 125?

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T 0
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T OK
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T 1
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T CRITICAL
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P --no-alerts-state OK
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P --no-alerts-state 0
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T OK ; echo $?
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

3

Here I check whether, if there is ‘Firing’, it does not mistakenly go to OK. perfect, works as it should.

./check_prometheus -H metrics.some-probelm-cluster.example.com -p 80 alert -P -T OK
[CRITICAL] - 1 Alerts: 1 Firing - 0 Pending - 0 Inactive
\_ [CRITICAL] [Watchdog] is firing - value: 1.00
./check_prometheus -H metrics.some-probelm-cluster.example.com -p 80 alert -P -T 0
[CRITICAL] - 1 Alerts: 1 Firing - 0 Pending - 0 Inactive
\_ [CRITICAL] [Watchdog] is firing - value: 1.00

@martialblog
Copy link
Member

Thanks for the feedback, I'll have a another look

@martialblog
Copy link
Member

Hi,

these are the results when checking Prometheus with the new logic:

# Starting Prometheus without any alert rules defined:

go run main.go alert
[OK] - No alerts defined

go run main.go alert -T 3
[UNKNOWN] - No alerts defined

go run main.go alert -T 2
[CRITICAL] - No alerts defined

go run main.go alert -T foo
[UNKNOWN] - invalid value for --no-alerts-state: foo (*errors.errorString)

# When an alert is defined:

go run main.go alert -T 2
[OK] - 1 Alerts: 0 Firing - 0 Pending - 1 Inactive
\_ [OK] [NodeHasMemoryPressure] is inactive

# When you are looking for a specific alert and alerts are defined:
# These counts only include the names that match the --name parameters

go run main.go alert -T 2 --name foo
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

go run main.go alert -T 1 --name foo --name NodeHasMemoryPressure
[OK] - 1 Alerts: 0 Firing - 0 Pending - 1 Inactive
\_ [OK] [NodeHasMemoryPressure] is inactive
|total=1 firing=0 pending=0 inactive=1

# When you are looking for a specific alert and no alerts are defined:

go run main.go alert -T 2 --name foo
[UNKNOWN] - No such alert defined

As far as I can tell, this is the behavior we want.

@wattebausch
Copy link
Author

wattebausch commented Dec 19, 2024

Hi @martialblog many thanks.
i have tried again, but it doesn't work for me. have i done something wrong? do you have any tips for me?
we use it with the --problem "-P" flag

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive
./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert -P -T 0
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive
go run main.go -H metrics.clean-cluster.example.com -p 80 alert -P 
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

exit status 3
go run main.go -H metrics.clean-cluster.example.com -p 80 alert -P -T 0
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

exit status 3
go version go1.22.2 linux/amd64

@RincewindsHat
Copy link
Member

@wattebausch Did you use the current main branch? I do get:

./check_prometheus alert
[OK] - No alerts defined

@martialblog
Copy link
Member

Haven't tested it with the "-P" flag, but it's applied after checking if there are any alerts. I can give it another look.

@wattebausch
Copy link
Author

Did you use the current main branch? I do get:

Yes, I made another git clone

>git pull 
Already up to date.

>git log
commit a2cc3770ec1ec2714b8abdd7766fcfe690a56ea9 (HEAD -> main, origin/main, origin/HEAD)
Merge: 6d8740a 1318445
Author: Markus Opolka <[email protected]>
Date:   Thu Dec 19 09:01:20 2024 +0100

    Merge pull request #69 from NETWAYS/fix/update_x_net_html
    
    Update dependencies

commit 6d8740a3ff0fc4ff90c4eb443952d4356df15cd8
Merge: 67fb855 fa2e5bd

./check_prometheus alert
[OK] - No alerts defined

we use kubernetes, i can only test it from our icinga server, not on local. our hosts all have multiple rules. should i delete all rules so that the prometheus is empty?

./check_prometheus -H metrics.clean-cluster.example.com -p 80 alert
[OK] - 127 Alerts: 0 Firing - 0 Pending - 127 Inactive
\_ [OK] [FluxHelmReleaseNotReady] is inactive
\_ [OK] [FluxGitRepositorySyncFailed] is inactive
...

@martialblog
Copy link
Member

I just started a simple Prometheus locally, with a minimal config:

podman run -ti --rm -v $(pwd):/etc/prometheus -p 9090:9090 docker.io/prom/prometheus:v3.0.0
# With alerts defined:
$ go run main.go alert 

[OK] - 1 Alerts: 0 Firing - 0 Pending - 1 Inactive
\_ [OK] [NodeHasMemoryPressure] is inactive

# With alerts defined and problems only
$ go run main.go alert -P

[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

go run main.go alert -P -T 1
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive

# With nothing defined:

$ go run main.go alert
[OK] - No alerts defined

$ go run main.go alert -P
[OK] - No alerts defined

So yeah something isn't right yet.
These guys here should behave differently:

# With alerts defined and problems only
$ go run main.go alert -P
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive
# Should be OK

go run main.go alert -P -T 1
[UNKNOWN] - 0 Alerts: 0 Firing - 0 Pending - 0 Inactive
# Should be WARN

@wattebausch wattebausch changed the title [Bug]: ‘alert -p’ gets an Unknown if everything is fine [Bug]: ‘alert -P’ gets an Unknown if everything is fine Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants