From ce3262f3facf03cddc5ead1422391fb89dd8decc Mon Sep 17 00:00:00 2001
From: Michael Bridgen
Date: Tue, 24 Jan 2023 22:23:42 +0000
Subject: [PATCH] Clarify type and meaning of stacks_* metrics (#402)

* Clarify type and meaning of stacks_* metrics

The stacks_failing metric is created as a GaugeVec in the Go code,
which represents a set of time series distinguished by labels (in this
case, "namespace" and "name"). But each of these time series is of
type `gauge`, so the documentation is misleading in referring to them
as `gaugevec` (which is not a kind of metric).

I've simplified the verbiage a little, in passing.

Addresses #399.

* Reset stacks_failing gauge when stack deleted

The stacks_failing metric is a set of gauges, each labelled with the
namespace and name of a Stack object. The controller sets a gauge to
`1` when its Stack object is given a state of "failed", and `0` for
"succeeded". A query aggregating over the labels will get the count of
failed stacks.

However, once a Stack is deleted, the gauge keeps its last value, and
if the Stack was failing, it will still be included in the count.

So, this commit resets the gauge to `0` when a Stack is deleted (if it
had a state at all).

Signed-off-by: Michael Bridgen
---
 CHANGELOG.md                    | 2 ++
 docs/metrics.md                 | 4 ++--
 pkg/controller/stack/metrics.go | 8 ++++++++
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index ca37dc53..f8304d95 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -17,6 +17,8 @@ CHANGELOG
 - When a Stack uses a Flux source, but the source has no artifact to download, park the Stack
   until the source has been updated, rather than retrying
   [#359](https://github.com/pulumi/pulumi-kubernetes-operator/pull/359)
+- Correct the stacks_failing metric in the case of a stack being deleted after failing
+  [#402](https://github.com/pulumi/pulumi-kubernetes-operator/pull/402)
 
 ## 1.10.1 (2022-10-25)
 
diff --git a/docs/metrics.md b/docs/metrics.md
index 62020767..45e7d8a4 100644
--- a/docs/metrics.md
+++ b/docs/metrics.md
@@ -20,8 +20,8 @@ Once the above are created, Prometheus will update its target scraping rules to
 
 The current implementation explicitly emits the following metrics:
 
-1. `stacks_active` - `gauge` that tracks the number of currently registered stacks managed by the system
-2. `stacks_failing` - `gaugevec` that provides information about stacks currently failing (`stack.status.lastUpdate.state` is `failed`)
+1. `stacks_active` - a `gauge` time series that reports the number of currently registered stacks managed by the system
+2. `stacks_failing` - a set of `gauge` time series, labelled by namespace, that gives the number of stacks currently failing (`stack.status.lastUpdate.state` is `failed`)
 
 In addition, we find tracking the following metrics emitted by the controller-runtime would be useful to track:
 
diff --git a/pkg/controller/stack/metrics.go b/pkg/controller/stack/metrics.go
index 0470223f..e9340973 100644
--- a/pkg/controller/stack/metrics.go
+++ b/pkg/controller/stack/metrics.go
@@ -66,4 +66,12 @@ func updateStackCallback(oldObj, newObj interface{}) {
 
 func deleteStackCallback(oldObj interface{}) {
 	numStacks.Dec()
+	oldStack, ok := oldObj.(*pulumiv1.Stack)
+	if !ok {
+		return
+	}
+	// assume that if there was a status recorded, this gauge exists
+	if oldStack.Status.LastUpdate != nil {
+		numStacksFailing.With(prometheus.Labels{"namespace": oldStack.Namespace, "name": oldStack.Name}).Set(0)
+	}
 }
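
Note for reviewers: the effect of the reset can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `labels`, `gaugeSet`, and `failingAfterDelete` are hypothetical names invented for illustration, the map is a toy stand-in for a labelled set of gauges rather than the Prometheus client API, and the "default"/"my-stack" values are made up. It shows why a sum over the labels keeps counting a deleted stack unless its gauge is set back to 0.

```go
package main

import "fmt"

// labels is a hypothetical stand-in for the {namespace, name} label pair.
type labels struct{ namespace, name string }

// gaugeSet is a toy model of a labelled set of gauges.
// It is NOT the Prometheus client; it only mimics "one value per label pair".
type gaugeSet map[labels]float64

// sum mimics a query that aggregates over all label values.
func (g gaugeSet) sum() float64 {
	total := 0.0
	for _, v := range g {
		total += v
	}
	return total
}

// failingAfterDelete records one failed stack, then simulates its deletion,
// and returns the aggregated count an observer would see afterwards.
func failingAfterDelete(resetOnDelete bool) float64 {
	g := gaugeSet{}
	key := labels{namespace: "default", name: "my-stack"} // made-up values

	g[key] = 1 // the stack reaches state "failed"

	// The stack is deleted here. Without a reset, the gauge keeps its last value.
	if resetOnDelete {
		g[key] = 0 // what this patch does in deleteStackCallback
	}
	return g.sum()
}

func main() {
	fmt.Println(failingAfterDelete(false)) // stale gauge is still counted
	fmt.Println(failingAfterDelete(true))  // reset gauge no longer contributes
}
```

In this model the series for the deleted stack still exists after the reset, with value 0, which matches the `Set(0)` approach taken in the patch; removing the series entirely would be a different design choice.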