From ce3262f3facf03cddc5ead1422391fb89dd8decc Mon Sep 17 00:00:00 2001
From: Michael Bridgen <mbridgen@pulumi.com>
Date: Tue, 24 Jan 2023 22:23:42 +0000
Subject: [PATCH] Clarify type and meaning of stacks_* metrics (#402)

* Clarify type and meaning of stacks_* metrics

The stacks_failing metric is created as a GaugeVec in the Go code, which represents a set of time series distinguished by labels (in this case, "namespace" and "name"). But each of these time series is of type `gauge`, so the documentation is misleading when it refers to them as `gaugevec` (which is not a kind of metric).

I've simplified the verbiage a little, in passing.

Addresses #399.

* Reset stacks_failing gauge when stack deleted

The stacks_failing metric is a set of gauges, each labelled with the
namespace and name of a Stack object. The controller sets a gauge to `1`
when its Stack object is given a state of "failed", and `0` for
"succeeded". A query that sums over the labels therefore gets the count
of failed stacks.

However, once a Stack is deleted, its gauge keeps its last value, and if
the Stack was failing, it is still included in the count. So, this
commit resets the gauge to `0` when a Stack is deleted (if it had a
state at all).
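
To illustrate the stale-gauge problem, here is a minimal, dependency-free
sketch. It is not the controller code: `gaugeSet` and `labelKey` are
hypothetical stand-ins for the Prometheus GaugeVec and its "namespace"
and "name" labels, and `Sum` stands in for a query that aggregates over
the labels.

```go
package main

import "fmt"

// labelKey identifies one time series (hypothetical stand-in for the
// "namespace" and "name" labels).
type labelKey struct {
	Namespace string
	Name      string
}

// gaugeSet is a hypothetical stand-in for a GaugeVec: one gauge value
// per label combination.
type gaugeSet map[labelKey]float64

// Sum mimics a query that aggregates over all labels.
func (g gaugeSet) Sum() float64 {
	total := 0.0
	for _, v := range g {
		total += v
	}
	return total
}

func main() {
	failing := gaugeSet{}
	appA := labelKey{Namespace: "default", Name: "app-a"}
	appB := labelKey{Namespace: "default", Name: "app-b"}

	failing[appA] = 1 // last update failed
	failing[appB] = 0 // last update succeeded
	fmt.Println(failing.Sum()) // prints 1

	// Suppose app-a is now deleted. Nothing touches its gauge, so it
	// keeps its last value and is still counted.
	fmt.Println(failing.Sum()) // prints 1

	// Resetting the gauge on deletion corrects the count.
	failing[appA] = 0
	fmt.Println(failing.Sum()) // prints 0
}
```

The sketch sets the gauge to `0` rather than removing the entry, which
matches the approach this commit takes.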

Signed-off-by: Michael Bridgen <mbridgen@pulumi.com>
---
 CHANGELOG.md                    | 2 ++
 docs/metrics.md                 | 4 ++--
 pkg/controller/stack/metrics.go | 8 ++++++++
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index ca37dc53..f8304d95 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -17,6 +17,8 @@ CHANGELOG
 - When a Stack uses a Flux source, but the source has no artifact to download, park the Stack until
   the source has been updated, rather than retrying
   [#359](https://github.com/pulumi/pulumi-kubernetes-operator/pull/359)
+- Correct the stacks_failing metric in the case of a stack being deleted after failing
+  [#402](https://github.com/pulumi/pulumi-kubernetes-operator/pull/402)
 
 ## 1.10.1 (2022-10-25)
 
diff --git a/docs/metrics.md b/docs/metrics.md
index 62020767..45e7d8a4 100644
--- a/docs/metrics.md
+++ b/docs/metrics.md
@@ -20,8 +20,8 @@ Once the above are created, Prometheus will update its target scraping rules to
 
 The current implementation explicitly emits the following metrics:
 
-1. `stacks_active` - `gauge` that tracks the number of currently registered stacks managed by the system
-2. `stacks_failing` - `gaugevec` that provides information about stacks currently failing (`stack.status.lastUpdate.state` is `failed`)
+1. `stacks_active` - a `gauge` time series that reports the number of currently registered stacks managed by the system
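+2. `stacks_failing` - a set of `gauge` time series, labelled by namespace and name, each set to `1` when that stack's last update failed (`stack.status.lastUpdate.state` is `failed`) and `0` when it succeeded; summing over the labels gives the number of stacks currently failing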
 
 In addition, we find tracking the following metrics emitted by the controller-runtime would be useful to track:
 
diff --git a/pkg/controller/stack/metrics.go b/pkg/controller/stack/metrics.go
index 0470223f..e9340973 100644
--- a/pkg/controller/stack/metrics.go
+++ b/pkg/controller/stack/metrics.go
@@ -66,4 +66,12 @@ func updateStackCallback(oldObj, newObj interface{}) {
 
 func deleteStackCallback(oldObj interface{}) {
 	numStacks.Dec()
+	oldStack, ok := oldObj.(*pulumiv1.Stack)
+	if !ok {
+		return
+	}
+	// assume that if there was a status recorded, this gauge exists
+	if oldStack.Status.LastUpdate != nil {
+		numStacksFailing.With(prometheus.Labels{"namespace": oldStack.Namespace, "name": oldStack.Name}).Set(0)
+	}
 }