Added KFTO alerting rule
Bobbins228 committed May 10, 2024
1 parent 0fbd57c commit 941538f
Showing 1 changed file with 33 additions and 0 deletions.
33 changes: 33 additions & 0 deletions config/monitoring/prometheus/apps/prometheus-configs.yaml
@@ -308,6 +308,25 @@ data:
          regex: (.+):(\d+)
          target_label: __address__
          replacement: ${1}:8080
    - job_name: 'KubeFlow Training Operator'
      honor_labels: true
      metrics_path: /metrics
      scheme: http
      kubernetes_sd_configs:
        - role: endpoints
          namespaces:
            names:
              - <odh_application_namespace>
      relabel_configs:
        - source_labels: [__meta_kubernetes_service_name]
          regex: ^(training-operator-metrics-service)$
          target_label: kubernetes_name
          action: keep
        - source_labels: [__address__]
          regex: (.+):(\d+)
          target_label: __address__
          replacement: ${1}:8080
    - job_name: 'Kueue Operator'
      honor_labels: true
@@ -549,6 +568,20 @@ data:
              triage: 'https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/blob/main/RHODS/Distributed-Workloads/codeflare-operator-absent-over-time.md'
              summary: Alerting for CodeFlare Operator
  trainingoperator-alerting.rules: |
    groups:
      - name: KubeFlow Training Operator
        interval: 1m
        rules:
          - alert: KubeFlow Training Operator is not running
            expr: absent(up{job=~'KubeFlow Training Operator'}) or up{job=~'KubeFlow Training Operator'} != 1
            labels:
              severity: warning
            annotations:
              description: This alert fires when the KubeFlow Training Operator is not running.
              triage: 'https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/blob/main/RHODS/Distributed-Workloads/training-operator-availability.md'
              summary: Alerting for KubeFlow Training Operator
  rhods-dashboard-recording.rules: |
    groups:
      - name: SLOs - ODH Dashboard
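
Note: the new alert has two branches, `absent(up{...})` for a target that has disappeared and `up{...} != 1` for a target that is present but failing its scrape. One way to exercise both is a promtool rule unit test. The sketch below is not part of this commit and has not been run against this repository. It assumes the block under `trainingoperator-alerting.rules` has been copied out of the ConfigMap into a standalone file; both file names are hypothetical. The second case expects only the `severity` label because `absent()` derives labels from equality matchers, and the rule uses the regex matcher `=~`.

```yaml
# Hypothetical file: trainingoperator-alerting.test.yaml
# Run with: promtool test rules trainingoperator-alerting.test.yaml
rule_files:
  # Assumption: the rule group from the ConfigMap key
  # `trainingoperator-alerting.rules` saved as its own file.
  - trainingoperator-alerting.rules.yaml

evaluation_interval: 1m

tests:
  # Case 1: the target is scraped, then starts failing (up drops to 0).
  - interval: 1m
    input_series:
      - series: 'up{job="KubeFlow Training Operator"}'
        values: '1 1 0 0'
    alert_rule_test:
      # At 1m the target is still up, so no alert is expected.
      - eval_time: 1m
        alertname: KubeFlow Training Operator is not running
      # At 3m up is 0, so the `up != 1` branch should fire.
      - eval_time: 3m
        alertname: KubeFlow Training Operator is not running
        exp_alerts:
          - exp_labels:
              severity: warning
              job: KubeFlow Training Operator
            exp_annotations:
              description: This alert fires when the KubeFlow Training Operator is not running.
              triage: 'https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/blob/main/RHODS/Distributed-Workloads/training-operator-availability.md'
              summary: Alerting for KubeFlow Training Operator

  # Case 2: no `up` series exists for the job, so the absent() branch should fire.
  - interval: 1m
    input_series:
      # An unrelated job, present only so the test has some input data.
      - series: 'up{job="Kueue Operator"}'
        values: '1 1 1'
    alert_rule_test:
      - eval_time: 2m
        alertname: KubeFlow Training Operator is not running
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              description: This alert fires when the KubeFlow Training Operator is not running.
              triage: 'https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/blob/main/RHODS/Distributed-Workloads/training-operator-availability.md'
              summary: Alerting for KubeFlow Training Operator
```

Because the rule has no `for:` clause, the alert is expected to fire on the first evaluation where either branch returns a result, which is why the test checks a single evaluation time per state.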