diff --git a/docs/modules/ROOT/pages/explanations/decisions/multi-team-alert-routing-base-alerts.adoc b/docs/modules/ROOT/pages/explanations/decisions/multi-team-alert-routing-base-alerts.adoc new file mode 100644 index 00000000..088ebbda --- /dev/null +++ b/docs/modules/ROOT/pages/explanations/decisions/multi-team-alert-routing-base-alerts.adoc @@ -0,0 +1,71 @@ += Multi-team alert routing for base alerts + +== Problem + +Currently, all base alerts configured through component openshift4-monitoring are routed to the same team. +However, for some alerts (such as KubeJobFailed), this isn't always correct, since there might be jobs that are managed by other teams. +It would be better to route those alerts dynamically based on some criteria, such as a namespace label. + +=== Goals + +* Selectively route base alerts to the appropriate teams based on namespace ownership +* Define how "namespace ownership" can be either configured or inferred + +=== Non-Goals + +* Define alert routing for custom alerts +* Handle alerts in namespaces that aren't syn-managed + +== Proposals + +=== Add namespace labels to relevant alerts + +One option is to modify the alert queries to join the `kube_namespace_labels` metric to the result, such that the information therein can be used in the (statically defined) routing rules. +This implies that namespace ownership is configured by setting a specific label on the namespace in question. +No suitable label currently exists, so all components would need to be updated to provide one, for example based on component ownership. + +However, not all alert rules have the namespace label, since not all are namespace scoped. Blindly joining the namespace label metric will therefore lead to some broken alerts. +This problem can be solved by creating a new configuration option in `component-openshift4-monitoring`, in which a list of alerts and a list of labels can be specified. +All listed alert rules will be modified to append the listed labels. + +`component-openshift4-monitoring` already modifies the base OpenShift alert rules by creating copies of them. +Modifying the alert rules in this way would be straightforward. +The challenging part is figuring out which rules to modify, which would be specified through configuration. +The alert routing rules will still be configured in the same way as they presently are. + +If the modified alert queries are written in such a way that the ownership team label is renamed to `syn_team`, the existing alert routing rules should work without modification. + +=== Generate routing rules for relevant namespaces based on cluster hierarchy data + +With this option, namespace ownership is inferred from component ownership: +Components with a dedicated team for operative responsibility should be configured with a `syn.teams` label, and components generally have a `namespace` label already. +From this, a mapping between namespaces and teams can be inferred from the cluster hierarchy. + +`component-openshift4-monitoring` would then generate routing rules for each team, which match any alerts where the `namespace` label is present and matches one of the team's namespaces. + +With this approach, the routing rules would be generated by `component-openshift4-monitoring`. +Previously, they were configured statically (though still through the component) by specifying `parameters.openshift4_monitoring.alertManagerConfig`. +In our setup, this configuration is provided as a template in `commodore-defaults`, and is included into the cluster hierarchy where necessary. + +`component-openshift4-monitoring` should expose a way to configure which team's alerts should go to which alert receiver, so it can generate the rules accordingly. +There should be a default receiver, for teams where no explicit receiver was provided. +The new configuration can continue to be provided as a global template. + +If the complete new configuration can be provided under `parameters.openshift4_monitoring.alertManagerConfig`, tenant repos won't require any changes. +Only `commodore-defaults` will need to be updated with the new configuration. + +== Decision + +`component-openshift4-monitoring` will generate dynamic alert routing rules based on cluster hierarchy data. + +== Rationale + +With this approach, the monitoring component can leverage the existing source of truth for component (instance) ownership to route the base alerts. +The other approach would necessitate duplicating this information in the namespace labels, or updating all of the components to expose this information in a namespace label. + +In addition, the alternative approach requires maintaining a list of base alerts which have the namespace label. +The chosen approach doesn't require such a list, so it doesn't carry a risk of forgetting certain alerts. + +== References + +* https://syn.tools/syn/SDDs/0030-argocd-multitenancy.html#_design_proposal[SDD 0030 on ArgoCD multitenancy and assigning operative responsibility to components] diff --git a/docs/modules/ROOT/partials/nav.adoc b/docs/modules/ROOT/partials/nav.adoc index 50cb22df..b8bce993 100644 --- a/docs/modules/ROOT/partials/nav.adoc +++ b/docs/modules/ROOT/partials/nav.adoc @@ -220,6 +220,7 @@ ** xref:oc4:ROOT:explanations/decisions/syn-argocd-sharing.adoc[] ** xref:oc4:ROOT:explanations/decisions/multi-instance-argocd.adoc[] ** xref:oc4:ROOT:explanations/decisions/multi-team-alert-routing.adoc[] +*** xref:oc4:ROOT:explanations/decisions/multi-team-alert-routing-base-alerts.adoc[] ** xref:oc4:ROOT:explanations/decisions/shipping-metrics-to-centralized-instance.adoc[] ** xref:oc4:ROOT:explanations/decisions/scheduled-mr-merges.adoc[] ** xref:oc4:ROOT:explanations/decisions/subscription-tracking.adoc[]