This terraform module performs some adjustments on a prometheus configuration and syncs it with an etcd key prefix.
It is meant to:
- Be compatible with the way we operate prometheus, by continuously updating its configuration at runtime against the content of an etcd key prefix
- Make some repetitive boilerplate prometheus rules/alerts configurations more DRY
- Be flexible enough to support unmanaged configuration outside the boilerplate that it manages
Supported kinds of boilerplate include:
- Node exporter rules and alerts for vms (number of hosts detected, cpu, ram, disks)
- Terracd jobs metrics and alerts (to get the interval since the last plan/apply and a threshold value that will trigger an alert)
The module takes the following input variables:
- config: Content of the entrypoint prometheus.yml configuration file, which will be generated from this value. The module adds rule_files entries for the rule files it generates and otherwise leaves the content as is.
- fs_path: Path where the prometheus configuration will be generated prior to synchronizing it with etcd. Beyond generating the prometheus.yml file there, boilerplate rule files will be generated in the rules subdirectory.
- etcd_key_prefix: Etcd prefix where the processed prometheus configuration will be synchronized.
- node_exporter_jobs: List of node exporter jobs to generate boilerplate for. Each entry should take the following keys:
  - tag: Tag for the node exporter job. It should consist of words separated by dashes. The job is expected to be called <tag>-node-exporter
  - expected_count: Expected number of instances associated with the job
  - memory_usage_threshold: Maximum memory usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
  - cpu_usage_threshold: Maximum cpu usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
  - expected_disks_count: Expected number of disks (ex: 2). An alert will be triggered if the number of disks doesn't match. Can be set to -1 to disable the alert.
  - disk_space_usage_threshold: Maximum disk space usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
  - disk_io_usage_threshold: Maximum disk io usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
  - alert_labels: Map of string keys and values corresponding to labels to add to all the job's alerts.
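
As an illustration of the keys described for node_exporter_jobs, here is a minimal sketch of a value that could be passed to that variable. The tag, counts, thresholds and label are hypothetical, and the value types (numbers for counts and percentages) are an assumption based on the descriptions above rather than on the module's variable definitions.

```hcl
# Hypothetical values for illustration only.
locals {
  node_exporter_jobs = [
    {
      # The matching prometheus job is expected to be called
      # "kubernetes-workers-node-exporter".
      tag                        = "kubernetes-workers"
      expected_count             = 3
      memory_usage_threshold     = 90 # percentage
      cpu_usage_threshold        = 90 # percentage
      expected_disks_count       = 2  # -1 disables the disk count alert
      disk_space_usage_threshold = 90 # percentage
      disk_io_usage_threshold    = 90 # percentage
      alert_labels = {
        team = "ops" # assumed label, added to all alerts of this job
      }
    }
  ]
}
```
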
- blackbox_exporter_jobs: List of blackbox tcp/http exporter jobs to generate boilerplate for. Each entry should take the following keys:
  - tag: Tag for the blackbox exporter job. It should consist of words separated by dashes. The job is expected to be called <tag>-blackbox-exporter
  - unavailability_tolerance: Duration the service can be unavailable before an alert triggers. The duration is a string formatted the way prometheus expects in the for field of alert rules.
  - max_acceptable_latency: Duration in seconds indicating the maximum acceptable response time for the service. If the service continuously takes longer than this to respond for an interval of time longer than unavailability_tolerance, a slow service alert will be triggered.
  - cert_renewal_window: Delay in days indicating the expected renewal window for the tls certificate provided by the service. If the certificate the service provides expires within a delay shorter than this window, an alert will be triggered to indicate the certificate wasn't renewed properly.
  - has_tls: Boolean indicating whether the service expects a tls connection. If false, alerts for the cert renewal window and tls version will not be set.
  - expect_recent_tls: Boolean indicating whether the service is expected to use tls version 1.3. If set to true and the service uses a version of tls older than 1.3, an alert will be triggered.
  - alert_labels: Map of string keys and values corresponding to labels to add to all the job's alerts.
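
The blackbox_exporter_jobs keys can be sketched in the same way. All values below are hypothetical. The numeric types for max_acceptable_latency and cert_renewal_window are assumed from the "in seconds" and "in days" wording above.

```hcl
# Hypothetical values for illustration only.
locals {
  blackbox_exporter_jobs = [
    {
      # The matching prometheus job is expected to be called
      # "public-api-blackbox-exporter".
      tag                      = "public-api"
      unavailability_tolerance = "5m" # same format as the "for" field of an alert rule
      max_acceptable_latency   = 2    # seconds
      has_tls                  = true
      cert_renewal_window      = 15   # days; only relevant when has_tls is true
      expect_recent_tls        = true # alert if the service uses tls older than 1.3
      alert_labels = {
        team = "ops" # assumed label
      }
    }
  ]
}
```
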
- terracd_jobs: List of terracd jobs to generate boilerplate for. Each entry should take the following keys:
  - tag: Tag for the terracd job. It should correspond to the job name.
  - plan_interval_threshold: Interval threshold after which an alert will be triggered if a plan or apply command did not run successfully. Used to diagnose a broken or non-running pipeline.
  - apply_interval_threshold: Interval threshold after which an alert will be triggered if an apply command did not run successfully. Used to detect a pipeline that was left in plan and never put back on apply.
  - unit: Base time unit to use (minute or hour). It determines how the thresholds are interpreted and whether the rules are processed in minutes or hours.
  - alert_labels: Map of string keys and values corresponding to labels to add to all the job's alerts.
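
A terracd_jobs entry could look like the sketch below. The job name and thresholds are hypothetical, and the thresholds are assumed to be plain numbers expressed in the chosen unit.

```hcl
# Hypothetical values for illustration only.
locals {
  terracd_jobs = [
    {
      tag  = "infra-pipeline" # should correspond to the terracd job name
      unit = "hour"           # "minute" or "hour"

      # With unit = "hour": alert if no plan or apply succeeded in the last 24 hours
      plan_interval_threshold = 24
      # With unit = "hour": alert if no apply succeeded in the last 72 hours
      apply_interval_threshold = 72

      alert_labels = {
        team = "ops" # assumed label
      }
    }
  ]
}
```

Keeping apply_interval_threshold larger than plan_interval_threshold, as in this sketch, leaves room for a pipeline to sit in plan mode for a while before the second alert fires.
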
- kubernetes_cluster_jobs: List of kubernetes cluster jobs to generate boilerplate for. Each entry should take the following keys:
  - tag: Tag for the kubernetes cluster job. It should correspond to the cluster name.
  - expected_services: List of expected deployments that should have a certain number of long running instances. Each entry should have the following keys:
    - namespace: Namespace where the service is expected to run
    - name: Name of the service. It should match the kubernetes deployment name.
    - expected_min_count: Minimum expected number of instances that should be running.
    - expected_start_delay: Expected delay before an instance is started. Running instances that have been around for less than that delay won't be considered running.
    - alert_labels: Extra labels to add to alerts triggered for the service.
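
Since kubernetes_cluster_jobs nests a list of services inside each cluster entry, a sketch of its shape may help. The cluster, namespace and deployment names are hypothetical, and the unit of expected_start_delay is not stated above, so the value used here (a number of seconds) is purely an assumption.

```hcl
# Hypothetical values for illustration only.
locals {
  kubernetes_cluster_jobs = [
    {
      tag = "dev-cluster" # should correspond to the cluster name
      expected_services = [
        {
          namespace          = "ingress"
          name               = "ingress-controller" # should match the deployment name
          expected_min_count = 2
          # Assumption: expressed in seconds. Check the module's variable
          # definition for the actual format.
          expected_start_delay = 60
          alert_labels = {
            team = "platform" # assumed label
          }
        }
      ]
    }
  ]
}
```
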
- minio_cluster_jobs: List of minio cluster jobs to generate boilerplate for. Each entry should take the following key:
  - tag: Tag for the minio cluster job. It should correspond to the cluster name.
- etcd_exporter_jobs: List of etcd exporter jobs to generate boilerplate for. Each entry should take the following keys:
  - tag: Tag for the etcd exporter job. It should consist of words separated by dashes. The job is expected to be called <tag>-etcd-exporter
  - expected_count: Expected number of etcd members associated with the job
  - max_learn_time: Maximum expected time for an etcd learner to catch up.
  - max_db_size: Maximum expected data size (note that etcd has its own limit of 8GiB)
  - alert_labels: Map of string keys and values corresponding to labels to add to all the job's alerts.
- vault_exporter_jobs: List of Vault telemetry jobs to generate boilerplate for. Each entry should take the following keys:
  - tag: Tag for the Vault telemetry job. It should correspond to the job name.
  - expected_unsealed_count: Expected number of unsealed Vault nodes in the cluster. An alert will be triggered if the number of unsealed nodes drops below this value.
  - alert_labels: Map of string keys and values corresponding to labels to add to all the job's alerts.
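
Putting the inputs together, a module call might look like the following minimal sketch. The module source, file paths, key prefix and job values are placeholders, not values taken from this repository. The sketch also assumes that the job lists left out here are optional, and it omits any provider configuration the module may require.

```hcl
module "prometheus_configs" {
  # Placeholder: point this at wherever the module lives in your setup.
  source = "./modules/prometheus-configuration"

  # Content of the entrypoint prometheus.yml. The module only adds
  # rule_files entries for the rule files it generates.
  config = file("${path.module}/prometheus.yml")

  # Local directory where prometheus.yml and the rules subdirectory are
  # generated before being synchronized with etcd (hypothetical path).
  fs_path = "${path.module}/generated/prometheus"

  # Etcd prefix that receives the processed configuration (hypothetical prefix).
  etcd_key_prefix = "/prometheus/configs/"

  vault_exporter_jobs = [
    {
      tag                     = "vault" # should correspond to the job name
      expected_unsealed_count = 3       # alert if fewer nodes are unsealed
      alert_labels = {
        severity = "critical" # assumed label
      }
    }
  ]
}
```
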
For a usage example, see: https://github.com/Ferlab-Ste-Justine/kvm-dev-orchestrations/blob/main/prometheus/prometheus-configs.tf