Merge pull request #57 from simonpasquier/update-faq
Rearrange the OCP monitoring FAQ page
simonpasquier authored Aug 3, 2023
2 parents 233be2f + 59c7e33 commit 919e468
Showing 1 changed file with 8 additions and 12 deletions.
20 changes: 8 additions & 12 deletions content/Products/OpenshiftMonitoring/faq.md
@@ -1,17 +1,13 @@
-# Debugging the default OpenShift 4.x in-cluster Monitoring: Platform Monitoring and User Workload Monitoring Products
-
-## Overview
+# Frequently asked questions

This serves as a collection of resources related to frequently asked questions around configuring and debugging the in-cluster monitoring stack. In particular, it applies to two OpenShift projects:

* [Platform Cluster Monitoring - PM](https://docs.openshift.com/container-platform/latest/monitoring/understanding-the-monitoring-stack.html#understanding-the-monitoring-stack_understanding-the-monitoring-stack)
* [User Workload Monitoring - UWM](https://docs.openshift.com/container-platform/latest/monitoring/enabling-monitoring-for-user-defined-projects.html)
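
UWM is not enabled out of the box; as described in the UWM documentation linked above, it is switched on through the `cluster-monitoring-config` ConfigMap. A minimal sketch (assuming cluster-admin permissions):

```
# Enable monitoring for user-defined projects (UWM) by setting
# enableUserWorkload in the cluster-monitoring-config ConfigMap.
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF
```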

-## Troubleshooting when faced with potential symptoms:
-
-### ServiceMonitor created but no metrics from the service
+## How do I understand why targets aren't discovered and metrics are missing?

-Both `PM` and `UWM` monitoring stack relies on the `ServiceMonitor` custom resource in order to tell Prometheus what endpoints to scrape.
+Both `PM` and `UWM` monitoring stacks rely on the `ServiceMonitor` and `PodMonitor` custom resources in order to tell Prometheus which endpoints to scrape.

The examples below show the namespace `openshift-monitoring`, which can be replaced with `openshift-user-workload-monitoring` when dealing with `UWM`.
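
As a first check, it is worth confirming that the monitor objects actually exist in the namespace being debugged; a minimal sketch (the monitor name `example-app` is hypothetical):

```
# List the ServiceMonitor and PodMonitor objects that prometheus-operator
# turns into scrape configurations for this namespace.
oc -n openshift-monitoring get servicemonitors,podmonitors

# Inspect one monitor to verify its label selector and endpoint port.
oc -n openshift-monitoring get servicemonitor example-app -o yaml
```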

@@ -49,7 +45,7 @@

Assuming it does exist, we know `prometheus-operator` is doing its job. Doub

Check the service discovery endpoint to ensure Prometheus can discover the target. It will need the appropriate RBAC to do so. An example can be found [here](https://github.com/openshift/cluster-monitoring-operator/blob/23201e012586d4864ca23593621f843179c47412/assets/prometheus-k8s/role-specific-namespaces.yaml#L35-L50).
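
One way to verify this from the Prometheus side is to query its targets API directly; a hedged sketch, reusing the `prometheus-k8s-0` pod as in the other examples on this page:

```
# Dump the targets Prometheus has discovered, including their health and lastError.
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
  curl -s http://localhost:9090/api/v1/targets > targets.json

# activeTargets lists everything currently scraped; droppedTargets lists targets
# discarded during relabelling. A target missing entirely usually points at
# selector or RBAC problems.
```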

-### TargetDown alert firing
+## How do I troubleshoot the TargetDown alert?

We have, in the past, seen cases where the `TargetDown` alert was firing even though all endpoints appeared to be up. The following commands fetch some useful metrics to help identify the cause.

@@ -83,13 +79,13 @@

```
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-0.json
oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-1.json
```
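
To cross-check which targets Prometheus itself currently considers down, the `up` metric can be queried in the same way; a minimal sketch:

```
# Count down targets per job/namespace/service (an empty result means nothing is down).
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
  curl http://localhost:9090/api/v1/query \
  --data-urlencode 'query=count by (job, namespace, service) (up == 0)' > down.json
```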

-## Debugging high CPU usage
+## How do I troubleshoot high CPU usage of Prometheus?

Often, when "high" CPU usage or spikes are identified, this can be a symptom of expensive rules.

A good place to start the investigation is the `/rules` endpoint of Prometheus: identify rules with excessive evaluation times and analyse the queries that might be contributing to the problem.
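
A hedged sketch of pulling per-rule evaluation times from the API (assuming `jq` is available on the workstation):

```
# Dump the rules along with their last evaluation duration.
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
  curl -s http://localhost:9090/api/v1/rules > rules.json

# Print the slowest rules first; evaluationTime is in seconds.
jq -r '.data.groups[].rules[] | [.evaluationTime, .name] | @tsv' rules.json | sort -rn | head -20
```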

-### Obtaining CPU profiles
+## How do I retrieve CPU profiles?

In cases where excessive CPU usage is being reported, it might be useful to obtain [Pprof profiles](https://github.com/google/pprof/blob/02619b876842e0d0afb5e5580d3a374dad740edb/doc/README.md) from the Prometheus containers over a short time span.

@@ -109,7 +105,7 @@

```
while [ $duration -ne 0 ]; do
  # ... (loop body collapsed in this view) ...
done
```
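
The collected profiles can then be analysed locally; for example (assuming a local Go toolchain and a profile saved as `cpu.pprof`):

```
# Show the functions consuming the most CPU time in the profile.
go tool pprof -top cpu.pprof

# Or browse the profile interactively (flame graph, call graph) at http://localhost:8080.
go tool pprof -http=:8080 cpu.pprof
```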

-## Debugging high memory usage
+## How do I debug high memory usage?

The following queries might prove useful for debugging.

@@ -129,7 +125,7 @@

```
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \
  -- curl http://localhost:9090/api/v1/query --data-urlencode \
  'query=sort_desc(sum by (pod,namespace) (max without(instance) (container_memory_working_set_bytes{namespace=~"openshift-monitoring|openshift-user-workload-monitoring", container=""})))' > memory.json
```
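
The resulting JSON can be summarised with `jq`, for example:

```
# Print pods ordered by working-set memory; the query above already sorts descending.
jq -r '.data.result[] | [.value[1], .metric.namespace, .metric.pod] | @tsv' memory.json | head -20
```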

-### Obtaining memory profiles
+## How do I get memory profiles?

In cases where excessive memory is being reported, it might be useful to obtain [Pprof profiles](https://github.com/google/pprof/blob/02619b876842e0d0afb5e5580d3a374dad740edb/doc/README.md) from the Prometheus containers over a short time span.
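
A minimal sketch of grabbing a single heap profile from one replica (file names are arbitrary):

```
# Fetch an in-use heap profile from the Prometheus container.
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
  curl -s http://localhost:9090/debug/pprof/heap > heap.pprof

# Inspect it locally, assuming a Go toolchain is available.
go tool pprof -top heap.pprof
```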
