Steps to Visualize Litmus Metrics and Generate Notifications #6
LitmusChaos Infra Setup
You can either install the latest chaos-operator (1.13.5) directly in the desired cluster, or set up the Litmus portal control plane, in which case the operator is installed as part of the agent registration process (2.0.0-beta7).
Case-1: Chaos-Operator Setup
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.5.yaml
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-admin-rbac.yaml
kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus --namespace=litmus
Case-2: Litmus 2.0.0-Beta(7) Setup
kubectl apply -f https://litmuschaos.github.io/litmus/2.0.0-Beta/litmus-2.0.0-Beta.yaml
kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install litmus-portal litmuschaos/litmus-2-0-0-beta --namespace litmus --devel
Verify that the Litmus chaos operator (and the control plane components, in the case of 2.0.0-Beta) are up and running.
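One way to script that check is sketched below. It assumes `kubectl` already points at the target cluster and that the pods live in the `litmus` namespace used above. The helper name `all_pods_running` is made up for this illustration and is not part of Litmus; it only looks at the STATUS column, not at READY.

```shell
# Hypothetical helper (not part of Litmus): reads the table printed by
# `kubectl get pods` on stdin and succeeds only if at least one pod is
# listed and every listed pod shows STATUS "Running".
all_pods_running() {
  awk 'NR > 1 { seen = 1; if ($3 != "Running") bad = 1 }
       END { if (seen && !bad) exit 0; exit 1 }'
}

# Usage, assuming kubectl is configured for the cluster:
#   kubectl get pods -n litmus | all_pods_running && echo "litmus is up"
```

An empty namespace counts as a failure here, which is usually what you want right after an install.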
Monitoring Infra Setup
kubectl create ns monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prom prometheus-community/kube-prometheus-stack --namespace monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaos-exporter
  template:
    metadata:
      labels:
        app: chaos-exporter
    spec:
      serviceAccountName: litmus
      containers:
      - image: litmuschaos/chaos-exporter:1.13.5
        imagePullPolicy: Always
        name: chaos-exporter
        env:
        - name: TSDB_SCRAPE_INTERVAL
          value: "30"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  ports:
  - port: 8080
    name: tcp
    protocol: TCP
    targetPort: 8080
  selector:
    app: chaos-exporter
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: chaos-exporter
    name: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  endpoints:
  - interval: 1s
    port: tcp
  jobLabel: name
  namespaceSelector:
    matchNames:
    - litmus
  selector:
    matchLabels:
      app: chaos-exporter
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
data:
  blackbox.yaml: |
    modules:
      http_2xx:
        http:
          no_follow_redirects: false
          preferred_ip_protocol: ip4
          valid_http_versions:
          - HTTP/1.1
          - HTTP/2
          valid_status_codes: []
        prober: http
        timeout: 5s
---
kind: Service
apiVersion: v1
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 9115
    protocol: TCP
  selector:
    app: prometheus-blackbox-exporter
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-blackbox-exporter
  template:
    metadata:
      labels:
        app: prometheus-blackbox-exporter
    spec:
      restartPolicy: Always
      containers:
      - name: blackbox-exporter
        image: "prom/blackbox-exporter:v0.15.1"
        imagePullPolicy: IfNotPresent
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
        args:
        - "--config.file=/config/blackbox.yaml"
        resources:
          {}
        ports:
        - containerPort: 9115
          name: http
        livenessProbe:
          httpGet:
            path: /health
            port: http
        readinessProbe:
          httpGet:
            path: /health
            port: http
        volumeMounts:
        - mountPath: /config
          name: config
      - name: configmap-reload
        image: "jimmidyson/configmap-reload:v0.2.2"
        imagePullPolicy: "IfNotPresent"
        securityContext:
          runAsNonRoot: true
          runAsUser: 65534
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9115/-/reload
        resources:
          {}
        volumeMounts:
        - mountPath: /etc/config
          name: config
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: prometheus-blackbox-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    name: prometheus-blackbox-exporter
    k8s-app: prometheus-blackbox-exporter
  name: prometheus-blackbox-exporter
  namespace: monitoring
spec:
  endpoints:
  - interval: 1s
    path: /probe
    port: http
    params:
      module:
      - http_2xx
      target:
      - "helloservice.demospace.svc.cluster.local:9000"
    metricRelabelings:
    - action: replace
      regex: (.*)
      replacement: my_local_service
      sourceLabels:
      - __param_target
      targetLabel: target
  selector:
    matchLabels:
      app: prometheus-blackbox-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  annotations:
    meta.helm.sh/release-name: prom
    meta.helm.sh/release-namespace: monitoring
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 16.5.0
    chart: kube-prometheus-stack-16.5.0
    heritage: Helm
    release: prom
  name: prom-kube-prometheus-stack-prometheus
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: prom-kube-prometheus-stack-alertmanager
      namespace: monitoring
      pathPrefix: /
      port: web
  enableAdminAPI: false
  evaluationInterval: 10s
  externalUrl: http://prom-kube-prometheus-stack-prometheus.monitoring:9090
  image: quay.io/prometheus/prometheus:v2.27.1
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      release: prom
  portName: web
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      release: prom
  replicas: 1
  retention: 10d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      app: kube-prometheus-stack
      release: prom
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prom-kube-prometheus-stack-prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchExpressions:
    - key: k8s-app
      operator: In
      values:
      - chaos-exporter
      - prometheus-blackbox-exporter
  shards: 1
  version: v2.27.1
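The `serviceMonitorSelector` in the Prometheus resource picks up ServiceMonitors whose `k8s-app` label is one of the two listed values. The same match can be written as a set-based label selector for `kubectl`, which is a quick way to confirm that both ServiceMonitors carry the expected label. A small sketch: `selector_in` is a hypothetical helper name, and the commented command assumes the ServiceMonitor CRD from kube-prometheus-stack is installed.

```shell
# Hypothetical helper: builds a set-based label selector of the form
# "key in (v1,v2,...)", the kubectl equivalent of a matchExpressions
# entry with operator "In". Runs in a subshell so IFS is not leaked.
selector_in() (
  key=$1
  shift
  IFS=,
  printf '%s in (%s)\n' "$key" "$*"
)

# Usage, assuming kubectl is configured for the cluster:
#   kubectl get servicemonitors --all-namespaces \
#     -l "$(selector_in k8s-app chaos-exporter prometheus-blackbox-exporter)"
```

If that listing misses either ServiceMonitor, Prometheus will not scrape the corresponding exporter.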
Alerting Configuration
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-name: prom
    meta.helm.sh/release-namespace: monitoring
    prometheus-operator-validated: "true"
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 16.5.0
    chart: kube-prometheus-stack-16.5.0
    heritage: Helm
    release: prom
  name: prom-kube-prometheus-stack-alertmanager.rules
spec:
  groups:
  - name: alertmanager.rules
    rules:
    - alert: LitmusExpFailureAlert
      annotations:
        message: |
          Chaos test {{ $labels.chaosengine_name }} has failed in namespace {{ $labels.chaosresult_namespace }} with probe success percentage {{ $labels.probe_success_percentage }}
      expr: litmuschaos_experiment_verdict{chaosresult_verdict="Fail",endpoint="tcp",job="chaos-monitor",service="chaos-monitor"}
        > 0
      labels:
        severity: critical

global:
  resolve_timeout: 1m
receivers:
- name: slack-notifications
  slack_configs:
  - api_url: <redacted>
    channel: '#litmus-alerts'
    icon_url: https://raw.githubusercontent.com/litmuschaos/icons/master/litmus.png
    title: 'LitmusChaos Monitoring Event Notification'
    text: >-
      {{ range .Alerts }}
        *Description:* {{ .Annotations.message }}
        *Details:*
        {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
        {{ end }}
      {{ end }}
route:
  receiver: slack-notifications
  group_by: ['alertname']
  group_wait: 15s
  group_interval: 30s
  routes:
  - receiver: 'slack-notifications'
    match:
      severity: slack
templates:
- /etc/alertmanager/config/*.tmpl
Visualize Application Metrics Interleaved With Chaos Metrics On Grafana
avg_over_time(probe_success{job="prometheus-blackbox-exporter", namespace="monitoring"}[60s:1s])*100
avg_over_time(probe_duration_seconds{job="prometheus-blackbox-exporter", namespace="monitoring"}[60s:1s])*1000
Under normal circumstances, these values equal
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
},
{
"datasource": "Prometheus",
"enable": true,
"expr": "litmuschaos_awaited_experiments{chaosresult_namespace=\"litmus\",endpoint=\"tcp\",job=\"chaos-monitor\",namespace=\"litmus\",service=\"chaos-monitor\"}",
"hide": false,
"iconColor": "#C4162A",
"name": "Show Chaos Period",
"showIn": 0,
"step": "5s"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 0,
"id": 25,
"links": [],
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": null,
"fieldConfig": {
"defaults": {},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 0
},
"hiddenSeries": false,
"id": 2,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.5.5",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "avg_over_time(probe_success{job=\"prometheus-blackbox-exporter\", namespace=\"monitoring\"}[60s:1s])*100",
"interval": "",
"legendFormat": "Probe Success percentage",
"refId": "A"
}
],
"thresholds": [
{
"$$hashKey": "object:172",
"colorMode": "critical",
"fill": true,
"line": true,
"op": "lt",
"value": 95,
"yaxis": "left"
}
],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Probe Success Percentage",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"$$hashKey": "object:147",
"format": "short",
"label": null,
"logBase": 1,
"max": "100",
"min": "0",
"show": true
},
{
"$$hashKey": "object:148",
"format": "short",
"label": null,
"logBase": 1,
"max": "1",
"min": "0",
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": null,
"fieldConfig": {
"defaults": {},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"hiddenSeries": false,
"id": 4,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.5.5",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "avg_over_time(probe_duration_seconds{job=\"prometheus-blackbox-exporter\", namespace=\"monitoring\"}[60s:1s])*1000",
"interval": "",
"legendFormat": "Service Access Latency",
"refId": "A"
}
],
"thresholds": [
{
"$$hashKey": "object:262",
"colorMode": "critical",
"fill": true,
"line": true,
"op": "gt",
"value": 20,
"yaxis": "left"
}
],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Access Duration (in ms)",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"$$hashKey": "object:218",
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"$$hashKey": "object:219",
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"schemaVersion": 27,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-5m",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "PodtatoHead-BlackBox-Exporter",
"uid": "V8yDu66Gk",
"version": 2
}
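To make the first panel query concrete: `probe_success` is 1 when a probe passes and 0 when it fails, so averaging it over the window and multiplying by 100 gives a success percentage. The sketch below reproduces that arithmetic on a hand-written list of samples. `success_pct` is a hypothetical helper and the sample values are invented for illustration; it does not query Prometheus.

```shell
# Hypothetical helper: takes 0/1 probe samples as arguments and prints
# their mean scaled to a percentage, mirroring
# avg_over_time(probe_success[...]) * 100.
success_pct() {
  printf '%s\n' "$@" |
    awk '{ sum += $1; n++ } END { if (n) printf "%.2f\n", 100 * sum / n }'
}

# Three passing probes and one failing probe in the window:
success_pct 1 1 1 0   # prints 75.00
```

With the dashboard's threshold of 95, a 60-sample window can absorb at most three failed probes (57/60 = 95%) before the panel shows the critical region.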
Prepare a Chaos Scenario
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.13.5?file=charts/generic/pod-delete/experiment.yaml" -n litmus
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-chaos
  namespace: litmus
spec:
  appinfo:
    appns: 'demospace'
    applabel: 'app=helloservice'
    appkind: 'deployment'
  annotationCheck: 'false'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  monitoring: false
  jobCleanUpPolicy: 'retain'
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: http-probe
            type: "httpProbe"
            httpProbe/inputs:
              url: "http://104.154.133.35:31798"
              insecureSkipVerify: false
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: "Continuous"
            runProperties:
              probeTimeout: 1
              interval: 1
              retry: 1
              probePollingInterval: 1
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '10'
            # set chaos interval (in sec) as desired
            - name: CHAOS_INTERVAL
              value: '10'
            # pod failures without '--force' & default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'
Trigger, Visualize & Receive Notifications on Chaos
kubectl get pods -n litmus
NAME                                 READY   STATUS    RESTARTS   AGE
chaos-monitor-758c6b7f7c-vwxhw       1/1     Running   0          25h
chaos-operator-ce-5ffd8d8c8b-6hx7j   1/1     Running   0          2d23h
hello-chaos-runner                   1/1     Running   0          8s
pod-delete-n2e1yq-g2q9q              1/1     Running   0          6s

kubectl get pods -n demospace
NAME                            READY   STATUS        RESTARTS   AGE
helloservice-79869dd7f6-jbmn8   0/1     Terminating   0          20m
helloservice-79869dd7f6-z7ctn   1/1     Running       0          2s
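Once the experiment pod finishes, the verdict can be read from the ChaosResult resource. The sketch below assumes the usual Litmus 1.x naming of `<engine name>-<experiment name>` for the ChaosResult and the `.status.experimentStatus.verdict` field; both helper names are hypothetical, so check them against your Litmus version.

```shell
# Hypothetical helper: derives the ChaosResult name from the engine and
# experiment names (assumed convention: <engine>-<experiment>).
chaosresult_name() {
  printf '%s-%s\n' "$1" "$2"
}

# Hypothetical helper: prints the verdict recorded in the ChaosResult.
# Arguments: engine name, experiment name, namespace of the engine.
chaos_verdict() {
  kubectl get chaosresult "$(chaosresult_name "$1" "$2")" -n "$3" \
    -o jsonpath='{.status.experimentStatus.verdict}'
}

# Usage, assuming kubectl is configured for the cluster:
#   chaos_verdict hello-chaos pod-delete litmus
```

A `Fail` verdict here is the condition the `LitmusExpFailureAlert` rule keys on through the `chaosresult_verdict` label.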
Tips / Gotchas During Setup
There are a few things to note when setting up the observability stack described in this exercise.
You can adjust these values to suit your overall environment and other services, as long as the combination you choose still works within the principles explained.
There are other approaches to set up alerts, for example from Grafana. These will be added in this thread soon!
@ksatchit You may use the following GitHub release tag from the repository to install the sample application.
Hi, do you have any example for it?
The sock shop Grafana dashboard has sample alerts set up, which can be used as a reference.
https://github.com/litmuschaos/litmus/blob/master/monitoring/grafana-dashboards/sock-shop/README.md
https://docs.litmuschaos.io/docs/integrations/grafana#fault-injection-and-system-failure-alerts
Introduction
This issue contains the steps for setting up Prometheus to scrape Litmus metrics and for instrumenting application dashboards on Grafana with these metrics (as annotations). We will also cover the steps needed to receive Slack notifications/alerts based on Litmus metrics (especially for chaos experiment failures).
Application Setup
Let us launch a simple application to test the chaos observability stack being set up. In this case, I've used CNCF's podtato-head hello service for ease of use and illustration.