
Loki alerts are fired for each principal application instead of only once #159

Open
cbartz opened this issue Aug 5, 2024 · 5 comments

@cbartz
Contributor

cbartz commented Aug 5, 2024

Bug Description

We have the following alert rule to detect runner crashes for a particular self-hosted runner deployment: https://github.com/canonical/github-runner-operator/blob/b70a5353deb280738339f5878e8fa57c45c3cc78/src/loki_alert_rules/failure.rules#L4-L11

The Grafana agent injects topology labels into this rule:

kubectl exec -ti loki-0 -c loki --   cat /loki/rules/fake/prod-github-runner_4572b1cc_grafana-agent_alert.rules
...
- name: foo_bar_5572b1cc_grafana-agent_foo-bar_5572b1cc_edge_github-runner-failure_alerts_alerts
  rules:
  - alert: Crashed runner
    annotations:
      summary: A runner in unit {{ $labels.juju_unit }} crashed.
    expr: (sum_over_time({filename="/var/log/github-runner-metrics.log", juju_application="grafana-agent",
      juju_charm="grafana-agent", juju_model="foo-bar", juju_model_uuid="4572b1cc-0a39-40b7-818d-c68ed553f11a"}
      | json event="event",crashed_runners="crashed_runners" | event="reconciliation"
      | unwrap crashed_runners[1h]) > 0)
    for: 0s
    labels:
      juju_application: edge
      juju_charm: github-runner
      juju_model: foo-bar
      juju_model_uuid: 5572b1cc-0a39-40b7-818d-c68ed553f11a
      severity: high
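
For context, here is a minimal sketch of what such a rule might look like in the charm source before the Grafana agent injects the topology labels; the group name and expression below are illustrative and not copied from the linked failure.rules file:

# Illustrative sketch only -- not the actual contents of failure.rules.
# The charm ships a plain Loki rule; on relation, the Grafana agent adds the
# juju_* topology labels to both the stream selector in expr and the labels block.
groups:
  - name: github-runner-failure_alerts
    rules:
      - alert: Crashed runner
        expr: >
          (sum_over_time({filename="/var/log/github-runner-metrics.log"}
            | json event="event",crashed_runners="crashed_runners"
            | event="reconciliation"
            | unwrap crashed_runners[1h]) > 0)
        for: 0s
        labels:
          severity: high
        annotations:
          summary: A runner in unit {{ $labels.juju_unit }} crashed.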

The Grafana agent is integrated with 3 principal applications. Once an alert is triggered, it is duplicated across all 3 principal applications (although it only applies to one). The alerts all carry juju_application=<name of the principal charm> instead of juju_application=grafana-agent (which is what Grafana and Loki display for the underlying log stream).

It appears that the principal application name is used for the juju_application label in the fired alert instead of the subordinate's name; if the subordinate name were used, the otherwise identical alerts would be deduplicated into one.
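
To make the deduplication point concrete, here is a hedged illustration of the label sets Alertmanager ends up seeing (application names are placeholders, not taken from the actual deployment):

# Illustrative label sets only. Alertmanager deduplicates alerts with identical
# label sets, so these count as three separate alerts:
alertname="Crashed runner", juju_application="app-a", juju_model="foo-bar", severity="high"
alertname="Crashed runner", juju_application="app-b", juju_model="foo-bar", severity="high"
alertname="Crashed runner", juju_application="app-c", juju_model="foo-bar", severity="high"
# If all three carried juju_application="grafana-agent" instead, the label sets
# would be identical and only a single alert would be notified.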

To Reproduce

Create a simple alert rule for a charm. Deploy the charm three times and integrate each deployment with the grafana agent, which in turn should be integrated with loki, and loki with the alertmanager. Note that the alert is triggered for each principal application, not just once for the grafana agent.
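
As a hedged sketch, the reproduction could look roughly like the following Juju commands (charm and application names are placeholders; Loki and Alertmanager usually run in a separate Kubernetes model, and the cross-model wiring is omitted for brevity):

# Illustrative reproduction sketch; "my-charm" stands in for any charm that ships
# a Loki alert rule, and endpoint names are left for juju to resolve.
juju deploy my-charm app-a
juju deploy my-charm app-b
juju deploy my-charm app-c
juju deploy grafana-agent
juju integrate app-a grafana-agent
juju integrate app-b grafana-agent
juju integrate app-c grafana-agent
# Assumes loki and alertmanager are reachable from this model (e.g. via
# cross-model relations); the exact wiring is deployment-specific.
juju integrate grafana-agent loki
juju integrate loki alertmanager
# Trigger the alert condition once and observe one fired alert per principal
# application (app-a, app-b, app-c) instead of a single alert.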

Environment

juju 3.1.8, grafana-agent rev 164

Relevant log output

not applicable

Additional context

No response

@sed-i
Contributor

sed-i commented Aug 15, 2024

Multiple principals are not yet supported this way. See #11.

Juju supports deploying multiple principals to the same machine (--to), but if we relate them to the same subordinate, or even if we deploy the same subordinate under two different names, the snaps would overwrite/uninstall each other, and so would the grafana agent config file.

Also, we're not interested in parallel installs, because we want only one instance of the grafana agent.

In the current juju model for subordinates, we would need separate charms doing "delta charming" on the shared grafana-agent.yaml, and we would also need to figure out from the config file whether to install/uninstall the snap.

@sed-i
Contributor

sed-i commented Aug 15, 2024

Btw, does each principal come with its own rules? If not, then relating gagent to just one of the principals would be enough to get the node-exporter stuff.

@cbartz
Contributor Author

cbartz commented Aug 27, 2024

Multiple principals are not yet supported this way. See #11.

Juju supports deploying multiple principals to the same machine (--to), but if we relate them to the same subordinate, or even if we deploy the same subordinate under two different names, the snaps would overwrite/uninstall each other, and so would the grafana agent config file.

@sed-i

I think we are misunderstanding each other: the principals are all on different machines.
Each principal is a deployment of the same charm. Please see https://pastebin.canonical.com/p/NnYwT4ybv5/ for the actual deployment: you can see three github-runner charm deployments on different machines, each integrated with the same grafana agent.

Btw, does each principal come with its own rules? If not, then relating gagent to just one of the principals would be enough to get the node-exporter stuff.

Yes, they all have the same rules. But they all have to be integrated with grafana-agent. E.g. we have three charm deployments A, B, and C, and they are all integrated with grafana-agent X. If we only integrated A with grafana-agent X, then the logs and metrics for B and C would be missing.

@lucabello
Contributor

Could you give us a screenshot of Alertmanager showing the alerts firing with the labels expanded, and a screenshot of Prometheus showing the results of querying that metric name?

@cbartz
Contributor Author

cbartz commented Sep 9, 2024

Hi @lucabello, sure, here are the screenshots.

The first three screenshots show the alert notifications in the Mattermost channel:

Screenshot from 2024-09-09 09-54-57
Screenshot from 2024-09-09 09-55-17
Screenshot from 2024-09-09 09-55-23

You can see that the alert is reported for the large, noble-large and xlarge applications. The reason is that these applications are all integrated with the same grafana agent.

The log line (which comes from Loki) is emitted only once, by the same grafana-agent, and it carries grafana-agent as its juju_application label.
Screenshot from 2024-09-09 10-00-00

So there is a drift between the value of juju_application in the fired alerts and the value of juju_application in the logs, which causes the duplication of the fired alert. If the fired alerts also carried grafana-agent as juju_application, they would not be duplicated.
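
For reference, the labels the log stream actually carries can be checked directly in Grafana's Explore view (or with logcli) using a query derived from the rule's expr above; the selector below is copied from the rendered rule, so the model name and UUID may need adjusting for your environment:

{filename="/var/log/github-runner-metrics.log", juju_application="grafana-agent",
  juju_model="foo-bar", juju_model_uuid="4572b1cc-0a39-40b7-818d-c68ed553f11a"}
  | json event="event", crashed_runners="crashed_runners"
  | event="reconciliation"

The juju_application shown on the returned streams is grafana-agent, while the fired alerts carry the principal application names, which is exactly the drift described above.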
