False positive status reason by timing issues in alert rules #1410

TeodorSAP · 2024-08-30T09:25:52Z

We observed single data points where the health status was reported as unhealthy although that was not the case. The main cause is that the status evaluation is based on two rule evaluations, which may run at slightly different points in time. For example, the rule for beginning ingestion might already have evaluated to true while the rule for starting export has not been evaluated yet, so there is a short time window in which the status is seen as "no export possible".

The alerting for the LogPipeline was already refactored (#1397), and no false positives have been observed there since.

The existing self-monitor alerts, based on PromQL queries, should be refactored such that:

A single PromQL query (that is, one alert) is mapped to each unhealthy condition, so no Go code is involved in logically combining the firing alerts
The for clause is used to avoid timing issues (where justified)
Alerts in the firing state are used strictly for negative/unhealthy scenarios

This refactoring should follow the changes introduced in #1397.

github-actions · 2024-10-30T00:10:25Z

This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs.
Thank you for your contributions.

github-actions · 2024-11-07T00:10:15Z

This issue has been automatically closed due to the lack of recent activity.
/lifecycle rotten

TeodorSAP changed the title ~~Refactor existing SelfMon / Prometheus alerts~~ Refactor existing Self-Monitoring (Prometheus) alerts Aug 30, 2024

TeodorSAP added area/logs LogPipeline area/metrics MetricPipeline area/traces TracePipeline kind/chore Categorizes issue or PR as related to a chore. labels Aug 30, 2024

TeodorSAP changed the title ~~Refactor existing Self-Monitoring (Prometheus) alerts~~ Refactor existing Self-Monitor (Prometheus) alerts Aug 30, 2024

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 7, 2024

kyma-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 7, 2024

a-thaler removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 7, 2024

a-thaler reopened this Nov 7, 2024

hisarbalik self-assigned this Dec 19, 2024

a-thaler added kind/bug Categorizes issue or PR as related to a bug. and removed kind/chore Categorizes issue or PR as related to a chore. labels Dec 20, 2024

a-thaler changed the title ~~Refactor existing Self-Monitor (Prometheus) alerts~~ False positive status reason by timing issues in alert rules Dec 20, 2024

hisarbalik mentioned this issue Dec 27, 2024

fix: False positive status reason by timing issues in alert rules #1713

Closed

6 tasks

skhalash mentioned this issue Jan 9, 2025

fix: Flaky self-monitoring conditions #1735

Open

6 tasks
