Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: new observability enhancement #42

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 74 additions & 0 deletions proposals/observability/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
---
title: Observability
linktitle: Observability
description: Observability for Jenkins X
weight: 80
---

# Observability for Jenkins X

One of the goals of Jenkins X is to glue the "right" components together, and give you a nice experience, where you don't need to spend a lot of time configuring a low-level component to get DNS or certificates for example.
But for the moment there is one big topic which is missing: observability.

Our goal here is to package observability as part of Jenkins X, for:
- the platform itself
- "devops" indicators

## Platform Observability

This is the first step of this project: setup an observability stack so that we can:
- get the logs of all the Jenkins X components: Tekton, Lighthouse, cert-manager, ...
- get the metrics - as dashboards - for all these components
- setup alerting using logs-based or metrics-based rules

People should be able to quickly:
- get a "big picture" of the cluster health
- find all issues/errors
- get alerted if something goes wrong: a certificate which can't be renewed, a pod which can't start, ...

## "devops" indicators

This is the second step of this project: use the observability stack to collect and visualize Continuous Delivery Indicators.

We can collect all kinds of events happening in the clusters:
- Pull Requests
- Pipelines
- Releases
- Deployments
- ...

and then use this data to calculate indicators such as the 4 key devops metrics:
- mean lead time
- deployment frequency
- mean time to recover
- change failure rate

and we can even go 1 step further, and use the [SPACE framework](https://queue.acm.org/detail.cfm?id=3454124k) to present system metrics.

Our goal here is to give people insights into their workflows/processes, so that they can continuously improve them.

## Implementation

The selected observability stack is Grafana, because:
- it's open-source
- has support for logs/metrics/traces
- is written in go
- has a low memory footprint
- great kubernetes integration

So we're using:
- [Promtail](https://grafana.com/docs/loki/latest/clients/promtail/) to collect the logs from all running containers
- promtail is deployed as a daemonset on every node of the cluster, so that it can read the Kubernetes log files
- [Loki](https://grafana.com/docs/loki/latest/) to ingest the logs
- [Prometheus](https://prometheus.io/) to collect and ingest the metrics from the running pods
- [Grafana](https://grafana.com/docs/grafana/latest/) to visualize logs & metrics

The CD Indicators collector will be written as a go application
- watching for kubernetes events on the PipelineActivities
- watching for Kubernetes events on the Releases
- watching for github events through lighthouse - configured as an external plugin in lighthouse
And writting data into a PostgreSQL database, which will be setup as a datasource in Grafana.

Grafana dashboards will be dynamically loaded from ConfigMaps, which will enable:
- an easy packaging of Jenkins X own dashboards: https://github.com/jenkins-x-charts/grafana-dashboard
- Jenkins X users to easily write their own dashboards, and manage them in a gitops-friendly workflow