False positive status reason caused by timing issues in alert rules #1410
Labels: area/logs (LogPipeline), area/metrics (MetricPipeline), area/traces (TracePipeline), kind/bug
We observed single data points where the health status was reported as negative although the pipeline was healthy. The main cause is that the status evaluation is based on two rule evaluations, which may run at slightly different points in time. The rule for the beginning ingestion may already have evaluated to true while the rule for the starting export has not been evaluated yet, which leaves a short time window in which the status is seen as "no export possible".
The alerting was already refactored for the LogPipeline (#1397), and no false positives have been observed there since.
The existing self-monitor alerts, based on PromQL queries, should be refactored such that:
- the `for` clause is used to avoid timing issues (where justified)
- the `firing` state is strictly used for negative/unhealthy scenarios

This refactoring should follow the changes introduced in: #1397
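To illustrate the two points above, here is a minimal sketch of a Prometheus alerting rule. It is not taken from #1397. The group name, alert name, and both metric names are hypothetical placeholders, and the `5m` duration is an assumed value that would need tuning. The rule fires only for the unhealthy case (data is received but nothing is exported), and the `for` clause keeps it in the `pending` state until the condition has held across several consecutive evaluations, so a single evaluation that catches ingestion before export does not flip the status.

```yaml
groups:
  - name: self-monitor-sketch            # hypothetical group name
    rules:
      - alert: PipelineNoDataExported    # hypothetical alert name
        # Unhealthy scenario only: data is being received, but nothing is exported.
        # Both metric names below are placeholders, not the real self-monitor metrics.
        expr: |
          sum(rate(hypothetical_receiver_accepted_total[5m])) > 0
          and
          sum(rate(hypothetical_exporter_sent_total[5m])) == 0
        # Assumed duration: the alert stays "pending" until the expression has been
        # continuously true for this long, and only then moves to "firing".
        for: 5m
```

With this shape, a consumer of the alert state can treat `firing` as the only unhealthy signal and ignore `pending`, which covers the short window described in the issue.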