Commit

add build team availability alerts
- Add GitHubAppFailureAlert for build-service
- Add QuayFailureAlert for image-controller
tnevrlka committed Aug 27, 2024
1 parent aa08353 commit 3f5cb93
Showing 4 changed files with 105 additions and 0 deletions.
22 changes: 22 additions & 0 deletions rhobs/alerting/data_plane/prometheus.build_service_alerts.yaml
@@ -0,0 +1,22 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rhtap-build-service-github-app-alerting
  labels:
    tenant: rhtap
spec:
  groups:
  - name: build_service_github_app_alerts
    interval: 1m
    rules:
    - alert: GitHubAppFailureAlert
      expr: absent_over_time(konflux_up{service="github", check="build-service"}[5m])
      labels:
        severity: warning
      annotations:
        summary: "'konflux_up' availability metric missing for GitHub App in build-service."
        description: >-
          The 'konflux_up' availability metric for the GitHub App in the build-service has not been reported for check {{ $labels.check }} on service {{ $labels.service }} for over 5 minutes, indicating a possible service disruption.
        team: build
        alert_team_handle: <!subteam^S03DM1RL0TF>
        runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/build-service/availability_github_app.md
23 changes: 23 additions & 0 deletions rhobs/alerting/data_plane/prometheus.image_controller_alerts.yaml
@@ -0,0 +1,23 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rhtap-image-controller-quay-alerting
  labels:
    tenant: rhtap
spec:
  groups:
  - name: image_controller_quay_alerts
    interval: 1m
    rules:
    - alert: QuayFailureAlert
      expr: absent_over_time(konflux_up{service="quay", check="image-controller"}[5m])
      labels:
        severity: warning
      annotations:
        summary: Availability metric 'konflux_up' missing for Quay in image-controller.
        description: >-
          The availability metric 'konflux_up' is missing for the Quay service in the image-controller
          for more than 5 minutes, indicating a potential service failure.
        team: build
        alert_team_handle: <!subteam^S03DM1RL0TF>
        runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/image-controller/availability_quay.md
31 changes: 31 additions & 0 deletions test/promql/tests/data_plane/github_app_test.yaml
@@ -0,0 +1,31 @@
rule_files:
  - prometheus.build_service_alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: konflux_up{service="github", check="build-service"}
        values: '1 _x6 1'
    alert_rule_test:
      - alertname: GitHubAppFailureAlert
        eval_time: 5m
        exp_alerts: []
      - alertname: GitHubAppFailureAlert
        eval_time: 6m
        exp_alerts:
          - exp_labels:
              severity: warning
              check: build-service
              service: github
            exp_annotations:
              summary: "'konflux_up' availability metric missing for GitHub App in build-service."
              description: >-
                The 'konflux_up' availability metric for the GitHub App in the build-service has not been reported for check build-service on service github for over 5 minutes, indicating a possible service disruption.
              team: build
              alert_team_handle: <!subteam^S03DM1RL0TF>
              runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/build-service/availability_github_app.md
      - alertname: GitHubAppFailureAlert
        eval_time: 7m
        exp_alerts: []
29 changes: 29 additions & 0 deletions test/promql/tests/data_plane/quay_failure_test.yaml
@@ -0,0 +1,29 @@
rule_files:
  - prometheus.image_controller_alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: konflux_up{service="quay", check="image-controller"}
        values: '1 _x6 1'
    alert_rule_test:
      - alertname: QuayFailureAlert
        eval_time: 5m
        exp_alerts: []
      - alertname: QuayFailureAlert
        eval_time: 6m
        exp_alerts:
          - exp_labels:
              severity: warning
              check: image-controller
              service: quay
            exp_annotations:
              summary: Availability metric 'konflux_up' missing for Quay in image-controller.
              description: >-
                The availability metric 'konflux_up' is missing for the Quay service in the image-controller
                for more than 5 minutes, indicating a potential service failure.
              team: build
              alert_team_handle: <!subteam^S03DM1RL0TF>
              runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/image-controller/availability_quay.md
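
Reviewer note: both test files rely on the same timeline (`'1 _x6 1'` at a 1m interval, checked at 5m, 6m and 7m). The sketch below is a rough Python model of why the alert is expected only at 6m. It is not Prometheus code: `expand_series` and `absent_over_time` are hypothetical helpers written for this note, they cover only the `_` and `_xN` notation used here, and the closed left window boundary (`left_closed=True`) is an assumption rather than something the commit states.

```python
from typing import List, Optional


def expand_series(values: str) -> List[Optional[float]]:
    """Expand a tiny subset of the promtool series notation.

    Handles literal numbers, '_' (one missing sample) and '_xN'
    (N missing samples). Other forms such as 'a+bxn' are not modelled.
    """
    out: List[Optional[float]] = []
    for token in values.split():
        if token == "_":
            out.append(None)
        elif token.startswith("_x"):
            out.extend([None] * int(token[2:]))
        else:
            out.append(float(token))
    return out


def absent_over_time(
    samples: List[Optional[float]],
    eval_minute: int,
    window_minutes: int = 5,
    left_closed: bool = True,
) -> bool:
    """Return True when no sample falls inside the lookback window.

    Sample i is assumed to be scraped at minute i (interval: 1m).
    left_closed=True counts a sample that is exactly window_minutes old
    (assumption made for this sketch).
    """
    start = eval_minute - window_minutes
    for minute, value in enumerate(samples):
        if value is None:
            continue
        after_start = start <= minute if left_closed else start < minute
        if after_start and minute <= eval_minute:
            return False
    return True


samples = expand_series("1 _x6 1")  # one sample, six gaps, one sample
firing = {t: absent_over_time(samples, t) for t in (5, 6, 7)}
print(firing)  # {5: False, 6: True, 7: False}
```

In this model the sample at 0m is still inside the window at 5m, the window is empty at 6m, and the sample at 7m clears the alert, which matches the `exp_alerts` entries above. With `left_closed=False` the model would already report the series as absent at 5m, so the 5m expectation depends on how the window boundary is treated.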
