Skip to content

Commit

Permalink
Add decision for routing OpenShift/Kubernetes base alerts
Browse files Browse the repository at this point in the history
  • Loading branch information
HappyTetrahedron committed Sep 14, 2023
1 parent f87117e commit 7a63438
Show file tree
Hide file tree
Showing 2 changed files with 72 additions and 0 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
= Multi-team alert routing for base alerts

== Problem

Currently, all base alerts configured through component openshift4-monitoring are routed to the same team.
However, for some alerts (such as KubeJobFailed), this isn't always correct, since there might be jobs that are managed by other teams.
It would be better to route those alerts dynamically based on some criteria, such as a namespace label.

=== Goals

* Selectively route base alerts to the appropriate teams based on namespace ownership
* Define how "namespace ownership" can be either configured or inferred

=== Non-Goals

* Define alert routing for custom alerts
* Handle alerts in namespaces that aren't syn-managed

== Proposals

=== Add namespace labels to relevant alerts

One option is to modify the alert queries to join the `kube_namespace_labels` metric to the result, such that the information therein can be used in the (statically defined) routing rules.
This implies that namespace ownership is configured by setting a specific label on the namespace in question.
No suitable label currently exists, so all components would need to be updated to provide one, for example based on component ownership.

However, not all alert rules have the namespace label, since not all are namespace scoped. Blindly joining the namespace label metric will therefore lead to some broken alerts.
This problem can be solved by creating a new configuration option in `component-openshift4-monitoring`, in which a list of alerts and a list of labels can be specified.
All listed alert rules will be modified to append the listed labels.

`component-openshift4-monitoring` already modifies the base OpenShift alert rules by creating copies of them.
Modifying the alert rules in this way would be straightforward.
The challenging part is figuring out which rules to modify, which would be specified through configuration.
The alert routing rules will still be configured in the same way as they presently are.

If the modified alert queries are written in such a way that the ownership team label is renamed to `syn_team`, the existing alert routing rules should work without modification.

=== Generate routing rules for relevant namespaces based on cluster hierarchy data

With this option, namespace ownership is inferred from component ownership:
Components with a dedicated team for operative responsibility should be configured with a `syn.teams` label, and components generally have a `namespace` label already.
From this, a mapping between namespaces and teams can be inferred from the cluster hierarchy.

`component-openshift4-monitoring` would then generate routing rules for each team, which match any alerts where the `namespace` label is present and matches one of the team's namespaces.

With this approach, the routing rules would be generated by `component-openshift4-monitoring`.
Previously, they were configured statically (though still through the component) by specifying `parameters.openshift4_monitoring.alertManagerConfig`.
In our setup, this configuration is provided as a template in `commodore-defaults`, and is included into the cluster hierarchy where necessary.

`component-openshift4-monitoring` should expose a way to configure which team's alerts should go to which alert receiver, so it can generate the rules accordingly.
There should be a default receiver, for teams where no explicit receiver was provided.
The new configuration can continue to be provided as a global template.

If the complete new configuration can be provided under `parameters.openshift4_monitoring.alertManagerConfig`, tenant repos won't require any changes.
Only `commodore-defaults` will need to be updated with the new configuration.

== Decision

`component-openshift4-monitoring` will generate dynamic alert routing rules based on cluster hierarchy data.

== Rationale

With this approach, the monitoring component can leverage the existing source of truth for component (instance) ownership to route the base alerts.
The other approach would necessitate duplicating this information in the namespace labels, or updating all of the components to expose this information in a namespace label.

In addition, the alternative approach requires maintaining a list of base alerts which have the namespace label.
The chosen approach doesn't require such a list, so it doesn't carry a risk of forgetting certain alerts.

== References

* https://syn.tools/syn/SDDs/0030-argocd-multitenancy.html#_design_proposal[SDD 0030 on ArgoCD multitenancy and assigning operative responsibility to components]
1 change: 1 addition & 0 deletions docs/modules/ROOT/partials/nav.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -220,6 +220,7 @@
** xref:oc4:ROOT:explanations/decisions/syn-argocd-sharing.adoc[]
** xref:oc4:ROOT:explanations/decisions/multi-instance-argocd.adoc[]
** xref:oc4:ROOT:explanations/decisions/multi-team-alert-routing.adoc[]
*** xref:oc4:ROOT:explanations/decisions/multi-team-alert-routing-base-alerts.adoc[]
** xref:oc4:ROOT:explanations/decisions/shipping-metrics-to-centralized-instance.adoc[]
** xref:oc4:ROOT:explanations/decisions/scheduled-mr-merges.adoc[]
** xref:oc4:ROOT:explanations/decisions/subscription-tracking.adoc[]

0 comments on commit 7a63438

Please sign in to comment.