Commit

add basic alerting for Kueue (red-hat-data-services#258)
Signed-off-by: Kevin <[email protected]>
KPostOffice authored and zdtsw committed May 10, 2024
1 parent 17c131a commit d6e5e96
Showing 1 changed file with 33 additions and 0 deletions.
33 changes: 33 additions & 0 deletions config/monitoring/prometheus/apps/prometheus-configs.yaml
@@ -309,6 +309,25 @@ data:
          target_label: __address__
          replacement: ${1}:8080
      - job_name: 'Kueue Operator'
        honor_labels: true
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - <odh_application_namespace>
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            regex: ^(kueue-metrics-service)$
            target_label: kubernetes_name
            action: keep
          - source_labels: [__address__]
            regex: (.+):(\d+)
            target_label: __address__
            replacement: ${1}:8080
      - job_name: 'TrustyAI Controller Manager'
        honor_labels: true
        metrics_path: /metrics
@@ -1240,6 +1259,20 @@ data:
              triage: 'https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/blob/main/RHODS/Distributed-Workloads/kuberay-operator-availability.md'
              summary: Alerting for KubeRay Operator
  kueue-alerting.rules: |
    groups:
      - name: Distributed Workloads Kueue
        interval: 1m
        rules:
          - alert: Kueue Operator is not running
            expr: absent(up{job=~'Kueue Operator'}) or up{job=~'Kueue Operator'} != 1
            labels:
              severity: warning
            annotations:
              description: This alert fires when the Kueue Operator is not running.
              triage: 'https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/blob/main/RHODS/Distributed-Workloads/kueue-operator-availability.md'
              summary: Alerting for Kueue Operator
  workbenches-recording.rules: |
    groups:
      - name: SLOs - Notebook Controller
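
A promtool unit test is one way to check that the new rule fires when the scrape target is down. The sketch below assumes the body of `kueue-alerting.rules` has been copied out of the ConfigMap into a standalone file named `kueue-alerting.rules.yaml` (a hypothetical filename, not part of this commit). The input series values and the evaluation time are illustrative as well.

```yaml
# Sketch of a promtool unit test for the Kueue alert (not part of this commit).
# Assumption: kueue-alerting.rules.yaml holds the rule group from the ConfigMap.
rule_files:
  - kueue-alerting.rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulate a Kueue target that is discovered but reports as down (up == 0).
    input_series:
      - series: 'up{job="Kueue Operator"}'
        values: '0 0 0'
    alert_rule_test:
      # The rule has no "for" clause, so it is expected to fire on evaluation.
      - eval_time: 2m
        alertname: Kueue Operator is not running
        exp_alerts:
          - exp_labels:
              severity: warning
              job: Kueue Operator
            exp_annotations:
              description: This alert fires when the Kueue Operator is not running.
              triage: 'https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/blob/main/RHODS/Distributed-Workloads/kueue-operator-availability.md'
              summary: Alerting for Kueue Operator
```

Under those assumptions it could be run with `promtool test rules kueue-alert-test.yaml`, where the test filename is likewise hypothetical. The `absent(...)` branch of the expression is not exercised here; covering it would need a second test case with no `up` series for the job.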