Konflux Observability

This repository contains the following definitions for Konflux:

Prometheus rules (deployed to RHOBS)
Grafana dashboards (deployed to AppSRE's Grafana)
Availability exporters

Alerting Rules

The repository contains Prometheus alert rules files for monitoring Konflux data plane clusters along with their tests.

The different alerting rules in this repository are:

SLO Alerts

SLO (Service Level Objective) alert rules are rules defined to monitor and alert when a service or system is not meeting its specified service level objectives.

Usage Guidelines:

Apply the slo label to alerts directly associated with Service Level Objectives. These alerts typically indicate issues affecting the performance or reliability of the service.

Benefits of Using the `slo` Label:

Using the slo label facilitates quicker incident response by promptly identifying and addressing issues that impact service level objectives.

How to Apply the `slo` Label:

Add slo: "true" under the labels section of any alerting rule.

labels:
    severity: critical
    slo: "true"

Note

SLO alerts should be labeled with severity: critical.
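As a rough illustration, a complete alerting rule carrying both labels might look like the sketch below. The group name, alert name, service name, expression, threshold, and durations are all hypothetical and chosen only for the example. It is written as a plain Prometheus rule group; the rule files in this repository may wrap their groups in a different top-level structure.

```yaml
# Hypothetical example: names, expression, and thresholds are illustrative only.
groups:
  - name: example_slo_alerts
    rules:
      - alert: ExampleServiceAvailabilityLow
        # Fires when the 30-minute average of the availability metric
        # stays below 0.9 for at least 10 minutes.
        expr: avg_over_time(konflux_up{service="example-service"}[30m]) < 0.9
        for: 10m
        labels:
          severity: critical  # SLO alerts use critical severity
          slo: "true"         # marks the alert as an SLO alert
        annotations:
          summary: "Example service availability is below 90%"
```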

Miscellaneous Alerts

Alerts lacking the slo: "true" label are considered non-SLO, miscellaneous or misc alerts.

Such alerting rules notify about issues that require attention but do not directly affect the Service Level Objectives defined by any service.

Availability Metric Alerts

These are non-SLO alerts that fire if the konflux_up metric is missing for any expected combination of the service and check labels across the different environments.
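One way such a check could be expressed is with PromQL's absent() function, as in the sketch below. The group name, alert name, severity, and duration are assumptions made for the example, and the actual rules in this repository may be structured differently. The service and check values are the ones used in the recording rule example later in this document.

```yaml
# Hypothetical example: not the repository's actual rule.
groups:
  - name: example_availability_metric_alerts
    rules:
      - alert: ExampleKonfluxUpMissing
        # absent() returns a value only when no series matches the selector,
        # so the alert fires when this service/check combination is missing.
        expr: absent(konflux_up{service="grafana", check="prometheus-appstudio-ds"})
        for: 5m
        labels:
          severity: warning  # assumed severity for a non-SLO alert
        annotations:
          summary: "konflux_up is missing for service grafana, check prometheus-appstudio-ds"
```

No slo label is set, so the alert is treated as a non-SLO alert.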

Alerts Tagging

Teams receive updates on alerts relevant to them through Slack notifications, where the team's handle is tagged in the alert message.

Usage Guidelines:

Apply the alert_team_handle and team annotations to SLO alerts in order to get notified about them.

How to Apply the `alert_team_handle` Annotation:

Add the alert_team_handle key to the annotations section of any alerting rule, with the relevant team's Slack group handle as its value. The format of the Slack handle is <!subteam^-slack_group_id-> (e.g., <!subteam^S041261DDEW>). To obtain the Slack group ID, click the team's group handle, then click the three dots, and select "Copy group ID".

Make sure to also add the team annotation with the name of the relevant team for readability.

annotations:
  summary: "PipelineRunFinish to SnapshotInProgress time exceeded"
  alert_team_handle: <!subteam^S04S21ECL8K>
  team: o11y

Recording Rules

Recording rules allow us to precompute frequently needed or computationally expensive expressions and save their results as a new set of time series. They are the go-to approach for speeding up queries that take too long to return.

Rules located in the recording rules directory are deployed to RHOBS, which makes them available in AppSRE Grafana.

Rules should be created together with their unit tests.
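A unit test for a recording rule could look roughly like the sketch below. It assumes that the rules checker accepts the Prometheus promtool unit test format, and that a rule file (here given the hypothetical name example_recording_rules.yaml) records konflux_up from grafana_ds_up with an added service="grafana" label, matching the mapping shown in the Availability Exporter Recording Rules section.

```yaml
# Hypothetical test file: the rule file name and layout are assumptions.
rule_files:
  - example_recording_rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulated input: the exporter reports the data source as up.
    input_series:
      - series: 'grafana_ds_up{check="prometheus-appstudio-ds"}'
        values: '1 1 1'
    # The recorded metric should carry both the service and check labels.
    promql_expr_test:
      - expr: konflux_up
        eval_time: 2m
        exp_samples:
          - labels: 'konflux_up{service="grafana", check="prometheus-appstudio-ds"}'
            value: 1
```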

Faster Selective Rule Testing

To accelerate the development and validation of specific alerting or recording rules, a selective testing mechanism is available. This allows you to run the rules checker (e.g., obsctl-reloader-rules-checker, or as configured in the Makefile via the CMD variable) only on a chosen set of rule files and their corresponding test case files, rather than processing the entire suite. This is particularly useful for quick validation of changes and a faster feedback loop during development.

The selective testing process involves:

Creating a temporary, isolated environment.
Copying only the specified rule and test files into this environment.
Executing the rules checker against these isolated files.
Automatically cleaning up the temporary environment afterward.

Usage

You can invoke this selective test by running the following make command:

make selective-check-and-test RULE_FILES="<space_separated_rule_files>" TEST_CASE_FILES="<space_separated_test_files>"

Updating Alert and Recording Rules

Alert rules for data plane clusters and recording rules are deployed by app-interface to RHOBS, where the metrics are also forwarded. To deploy them, app-interface references the location of the rules together with a git reference (a branch name or a commit hash).

app-interface holds separate references for the staging and production RHOBS instances (which monitor the Konflux staging and production deployments).

The staging environment references the main branch of this repo, so rule changes reaching that branch are automatically deployed to RHOBS.

The production environment keeps the reference to the rules as a commit hash (rather than a branch). This means that any changes to the rules will not take effect until the references are updated.

Steps for updating the rules:

Merge the necessary changes to this repository - alerts, recording rules and tests.
Verify that the rules are visible as expected in AppSRE Grafana.
Once the changes are ready to be promoted to production, update the alerting rules production reference and/or the recording rules production reference in app-interface to the commit hash of the changes you made.

Grafana Dashboards

Refer to the app-interface instructions to learn how to develop AppSRE dashboards for Konflux. This repository serves as versioned storage for the dashboard definitions and nothing more.

Dashboards are automatically deployed to the staging AppSRE Grafana when merged into the main branch. Deploying to production requires updating a commit reference in app-interface.

Note: The dashboard UID must be unique within each Grafana instance. Make sure to either modify it by changing a few characters or delete the test dashboard in the staging instance. If the test dashboard is kept and the UID is not updated, glitches will occur in Grafana as it juggles between the two dashboards with identical UIDs.

Adding Metrics and Labels

Only a subset of the metrics and labels available within the Konflux clusters is forwarded to RHOBS. If additional metrics or labels are needed, add them by following the steps described in Troubleshooting Missing Metrics and Labels.

Availability Exporters

To evaluate the overall availability of the Konflux ecosystem, we need to establish the availability of each of its components.

By leveraging the existing Konflux monitoring stack based on Prometheus, we create Prometheus exporters that generate metrics that are scraped by the User Workload Monitoring Prometheus instance and remote-written to RHOBS.

Availability Exporter Example

The o11y team provides an example availability exporter that can be used as a reference, especially when the exporter is external to the code it monitors.

For more detailed documentation on Availability exporters

Availability Exporter Recording Rules

Teams that use their own metric format for exporters need to translate it into the standard metric format using recording rules.

These recording rules should be put in the rhobs/recording folder.

The standard format is a single availability metric, konflux_up, with the labels service and check. In each time series, service holds the name of the originating service and check holds the name of the availability check it performed.

The konflux_up metric should return either 0 or 1 based on the availability of the component or service: 1 if the service is up, 0 otherwise.

The recording rule example provided here has the following format:

grafana_ds_up(check=prometheus-appstudio-ds) -> konflux_up(service=grafana, check=prometheus-appstudio-ds)

See detailed documentation on recording rules.
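As a minimal sketch of the mapping above, a recording rule could record konflux_up from grafana_ds_up and attach the service label, while the check label is carried over from the source series. The group name is hypothetical, and the rule is written as a plain Prometheus rule group, which may differ from the top-level structure used by the files in rhobs/recording.

```yaml
# Hypothetical sketch of the grafana_ds_up -> konflux_up translation.
groups:
  - name: example_grafana_availability
    rules:
      - record: konflux_up
        # The source series already carries the check label
        # and reports 0 or 1.
        expr: grafana_ds_up
        labels:
          service: grafana  # added to every recorded series
```

Using the labels field keeps the expression unchanged and only adds the label that the standard format requires.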

Support

Slack: #forum-konflux-o11y
