
Loki alerts are fired for each principal application instead of only once #159

Open
cbartz opened this issue Aug 5, 2024 · 5 comments

@cbartz
Contributor

cbartz commented Aug 5, 2024

Bug Description

We have the following alert rule to detect runner crashes for a particular self-hosted runner deployment: https://github.com/canonical/github-runner-operator/blob/b70a5353deb280738339f5878e8fa57c45c3cc78/src/loki_alert_rules/failure.rules#L4-L11

The Grafana agent injects topology labels into this rule:

kubectl exec -ti loki-0 -c loki --   cat /loki/rules/fake/prod-github-runner_4572b1cc_grafana-agent_alert.rules
...
- name: foo_bar_5572b1cc_grafana-agent_foo-bar_5572b1cc_edge_github-runner-failure_alerts_alerts
  rules:
  - alert: Crashed runner
    annotations:
      summary: A runner in unit {{ $labels.juju_unit }} crashed.
    expr: (sum_over_time({filename="/var/log/github-runner-metrics.log", juju_application="grafana-agent",
      juju_charm="grafana-agent", juju_model="foo-bar", juju_model_uuid="4572b1cc-0a39-40b7-818d-c68ed553f11a"}
      | json event="event",crashed_runners="crashed_runners" | event="reconciliation"
      | unwrap crashed_runners[1h]) > 0)
    for: 0s
    labels:
      juju_application: edge
      juju_charm: github-runner
      juju_model: foo-bar
      juju_model_uuid: 5572b1cc-0a39-40b7-818d-c68ed553f11a
      severity: high
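
For context, here is a minimal sketch of what such a rule might look like in the charm source before the Grafana agent injects the topology labels; the group name and expression below are illustrative and not copied from the linked failure.rules file:

# Illustrative sketch only -- not the actual contents of failure.rules.
# The charm ships a plain Loki rule; on relation, the Grafana agent adds the
# juju_* topology labels to both the stream selector in expr and the labels block.
groups:
  - name: github-runner-failure_alerts
    rules:
      - alert: Crashed runner
        expr: >
          (sum_over_time({filename="/var/log/github-runner-metrics.log"}
            | json event="event",crashed_runners="crashed_runners"
            | event="reconciliation"
            | unwrap crashed_runners[1h]) > 0)
        for: 0s
        labels:
          severity: high
        annotations:
          summary: A runner in unit {{ $labels.juju_unit }} crashed.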

The Grafana agent is integrated with 3 principal applications. Once an alert is triggered, it is duplicated across all 3 principal applications (although it only applies to one). The alerts all carry juju_application=<name of the principal charm> instead of juju_application=grafana-agent (which is what Grafana and Loki display for the underlying log stream).

It appears that the principal application name is used for the juju_application label in the fired alert instead of the subordinate's name; if the subordinate name were used, the otherwise identical alerts would be deduplicated into one.
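
To make the deduplication point concrete, here is a hedged illustration of the label sets Alertmanager ends up seeing (application names are placeholders, not taken from the actual deployment):

# Illustrative label sets only. Alertmanager deduplicates alerts with identical
# label sets, so these count as three separate alerts:
alertname="Crashed runner", juju_application="app-a", juju_model="foo-bar", severity="high"
alertname="Crashed runner", juju_application="app-b", juju_model="foo-bar", severity="high"
alertname="Crashed runner", juju_application="app-c", juju_model="foo-bar", severity="high"
# If all three carried juju_application="grafana-agent" instead, the label sets
# would be identical and only a single alert would be notified.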

To Reproduce

Create a simple alert rule for a charm. Deploy the charm three times and integrate each deployment with the grafana agent, which in turn should be integrated with loki, and loki with the alertmanager. Note that the alert is triggered for each principal application, not just once for the grafana agent.
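
As a hedged sketch, the reproduction could look roughly like the following Juju commands (charm and application names are placeholders; Loki and Alertmanager usually run in a separate Kubernetes model, and the cross-model wiring is omitted for brevity):

# Illustrative reproduction sketch; "my-charm" stands in for any charm that ships
# a Loki alert rule, and endpoint names are left for juju to resolve.
juju deploy my-charm app-a
juju deploy my-charm app-b
juju deploy my-charm app-c
juju deploy grafana-agent
juju integrate app-a grafana-agent
juju integrate app-b grafana-agent
juju integrate app-c grafana-agent
# Assumes loki and alertmanager are reachable from this model (e.g. via
# cross-model relations); the exact wiring is deployment-specific.
juju integrate grafana-agent loki
juju integrate loki alertmanager
# Trigger the alert condition once and observe one fired alert per principal
# application (app-a, app-b, app-c) instead of a single alert.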

Environment

juju 3.1.8, grafana-agent rev 164

Relevant log output

not applicable

Additional context

No response

@sed-i
Contributor

sed-i commented Aug 15, 2024

Multiple principals are not yet supported this way. See #11.

Juju supports deploying multiple principals to the same machine (--to), but if we relate them to the same subordinate, or even if we deploy the same subordinate under two different names, the snaps would overwrite/uninstall each other, and so would the grafana agent config file.

Also, we're not interested in parallel installs, because we want only one instance of the grafana agent.

In the current juju model for subordinates, we would need separate charms doing "delta charming" on the shared grafana-agent.yaml, and we would also need to figure out from the config file whether to install/uninstall the snap.

@sed-i
Contributor

sed-i commented Aug 15, 2024

Btw, does each principal come with its own rules? If not, then relating gagent to just one of the principals would be enough to get the node-exporter stuff.

@cbartz
Contributor Author

cbartz commented Aug 27, 2024

Multiple principals are not yet supported this way. See #11.

Juju supports deploying multiple principals to the same machine (--to), but if we relate them to the same subordinate, or even if we deploy the same subordinate under two different names, the snaps would overwrite/uninstall each other, and so would the grafana agent config file.

@sed-i

I think we are misunderstanding each other: the principals are all on different machines.
Each principal is a deployment of the same charm. Please see https://pastebin.canonical.com/p/NnYwT4ybv5/ for the actual deployment: you can see three github-runner charm deployments on different machines, each integrated with the same grafana agent.

Btw, does each principal come with its own rules? If not, then relating gagent to just one of the principals would be enough to get the node-exporter stuff.

Yes, they all have the same rules. But they all have to be integrated with grafana-agent. E.g. we have three charm deployments A, B, and C, and they are all integrated with grafana-agent X. If we only integrated A with grafana-agent X, then the logs and metrics for B and C would be missing.

@lucabello
Contributor

Could you give us a screenshot of Alertmanager showing the alerts firing with the labels expanded, and a screenshot of Prometheus showing the results of querying that metric name?

@cbartz
Contributor Author

cbartz commented Sep 9, 2024

Hi @lucabello, sure, here are the screenshots.

The first three screenshots show the alert notifications in the Mattermost channel:

Screenshot from 2024-09-09 09-54-57
Screenshot from 2024-09-09 09-55-17
Screenshot from 2024-09-09 09-55-23

You can see that the alert is reported for the large, noble-large and xlarge applications. The reason is that these applications are all integrated with the same grafana agent.

The log line (which comes from Loki) is emitted only once, by the same grafana-agent, and it carries grafana-agent as its juju_application label.
Screenshot from 2024-09-09 10-00-00

So there is a drift between the value of juju_application in the fired alerts and the value of juju_application in the logs, which causes the duplication of the fired alert. If the fired alerts also carried grafana-agent as juju_application, they would not be duplicated.
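
For reference, the labels the log stream actually carries can be checked directly in Grafana's Explore view (or with logcli) using a query derived from the rule's expr above; the selector below is copied from the rendered rule, so the model name and UUID may need adjusting for your environment:

{filename="/var/log/github-runner-metrics.log", juju_application="grafana-agent",
  juju_model="foo-bar", juju_model_uuid="4572b1cc-0a39-40b7-818d-c68ed553f11a"}
  | json event="event", crashed_runners="crashed_runners"
  | event="reconciliation"

The juju_application shown on the returned streams is grafana-agent, while the fired alerts carry the principal application names, which is exactly the drift described above.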
