Add CICD metrics #1681

christophe-kamphaus-jemmic · 2024-12-13T18:07:20Z

Fixes #1600

Changes

This PR adds metrics for CICD systems and related attributes.

Merge requirement checklist

CONTRIBUTING.md guidelines followed.
Change log entry added, according to the guidelines in When to add a changelog entry.
- If your PR does not need a change log, start the PR title with [chore]
schema-next.yaml updated with changes to existing conventions. → NA

.vscode/settings.json

docs/attributes-registry/cicd.md

docs/cicd/cicd-metrics.md

model/cicd/metrics.yaml

This ensures that all result values are nouns.

carlosalberto · 2025-01-10T23:51:02Z

Overall LGTM. A small (non-blocking) question: were cicd.pipeline.run.queued and cicd.pipeline.run.active considered as a single metric, with a state or similar label? Not proposing that, just curious on whether that was considered, as they are very similar.

christophe-kamphaus-jemmic · 2025-01-11T09:20:55Z

No, I did not consider it until now. It's only with the latest changes that the similarity between these two metrics became apparent.
The cicd.pipeline.run.duration and cicd.pipeline.run.time_in_queue metrics also seem very similar.

I'm open to combining these two metric pairs and distinguishing them with a phase attribute if additional reviewers are in favor of this change.

model/cicd/registry.yaml

lmolkova · 2025-01-17T22:40:47Z

model/cicd/metrics.yaml

+        requirement_level: required
+  - id: metric.cicd.pipeline.run.errors
+    type: metric
+    metric_name: cicd.pipeline.run.errors


Could it be derived from metric.cicd.pipeline.run.duration? It could be if there is just one error per run, but I assume it's not the case, right?

There might be errors in a pipeline run that are non-fatal, i.e. they are suppressed, or multiple stages running in parallel could each have an error.
So this error count might not be the same as the count of metric.cicd.pipeline.run.duration with run result failure.

We did think that this metric might either be derived as a span metric from a run trace or reported directly by the CICD system controller.
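To illustrate why the two counts can diverge, here is a minimal sketch in Python. The `Run` record and its fields are hypothetical and not part of the conventions; the sketch only assumes that one run can report several errors, some of them suppressed.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical record of one finished pipeline run."""
    result: str  # e.g. "success" or "failure"
    error_types: list = field(default_factory=list)  # every error seen, fatal or not


runs = [
    Run("success", ["flaky-test"]),          # suppressed, non-fatal error
    Run("failure", ["stage-a", "stage-b"]),  # two parallel stages failed
    Run("success"),
]

# Number of duration measurements whose result is "failure".
failed_runs = sum(1 for r in runs if r.result == "failure")

# What an error counter would report: one increment per error.
error_count = sum(len(r.error_types) for r in runs)
```

With these made-up runs, one run fails but three errors are counted, so the error metric cannot be recovered from the duration metric alone.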

Make sure this is included in the brief

let's make sure it's described in the yaml

I have added this as a note in 02e044a.

lmolkova · 2025-01-17T22:51:14Z

Not a strong opinion, but I like the idea of merging cicd.pipeline.run.queued and cicd.pipeline.run.active to something like cicd.pipeline.run.count with cicd.pipeline.run.state attribute.

Also the cicd.pipeline.run.duration and cicd.pipeline.run.time_in_queue seem very similar.

I don't think those are similar to cicd.pipeline.run.queued|active:

- These metrics could easily have value ranges that differ by orders of magnitude (e.g. time in queue is hopefully tiny, while pipeline duration is measured in minutes or hours).
- You'd configure different alerts for them, and different people would pay attention to their values (in my anecdotal BigCo experience, workers are managed by a separate team/org and I have little control over them, while the pipeline and its duration are controlled by dev teams).

distinguish them by the new cicd.pipeline.run.state attribute

christophe-kamphaus-jemmic · 2025-01-19T20:52:42Z

I have combined cicd.pipeline.run.active and cicd.pipeline.run.queued into cicd.pipeline.run.active, distinguishing them with the cicd.pipeline.run.state attribute, in 1d88f59.
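As a rough sketch of what the combined metric means for an instrumentation, the Python below tracks runs per state in a plain `Counter`. The `active_runs` store and the `transition` helper are hypothetical stand-ins for an up-down counter and are not taken from this PR; the state values are borrowed from the examples later in the thread.

```python
from collections import Counter

# Hypothetical in-memory stand-in for an up-down counter keyed by attributes.
active_runs = Counter()


def transition(pipeline, old_state, new_state):
    """Move one run from old_state to new_state (None means 'not tracked')."""
    if old_state is not None:
        active_runs[(pipeline, old_state)] -= 1
    if new_state is not None:
        active_runs[(pipeline, new_state)] += 1


transition("Example1", None, "pending")         # first run is queued
transition("Example1", None, "pending")         # second run is queued
transition("Example1", "pending", "executing")  # first run starts executing
```

A single metric then answers both questions: filtering on the state attribute gives the queued runs, the executing runs, or any state added later.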

christophe-kamphaus-jemmic · 2025-01-19T20:58:02Z

model/cicd/metrics.yaml

+        requirement_level: required
+      - ref: cicd.pipeline.run.state
+        requirement_level: required
+  - id: metric.cicd.pipeline.run.time_in_queue


I thought about combining cicd.pipeline.run.duration and cicd.pipeline.run.time_in_queue into cicd.pipeline.run.duration and distinguishing with the cicd.pipeline.run.state attribute.

The advantages of combining the metrics are that

- the metric centers around the pipeline run,
- it is vendor-neutral (it does not presuppose the existence of a queue), and
- it is extensible by adding states.

The disadvantages are

- the issue Liudmila mentioned: time_in_queue and the duration spent executing can differ by orders of magnitude. This could be worked around by filtering on the cicd.pipeline.run.state attribute in chart and alert queries, i.e. having separate charts and alerts for the two cases.
- the two metrics differ in their use of cicd.pipeline.result: either cicd.pipeline.result would need to be made conditionally required (i.e. set only for the cicd.pipeline.run.state in which the result becomes known), or a pending result would need to be added.

If I understand correctly, an execution of a pipeline would be something like this:

{ "cicd": { "run": { "duration": 60, "state": "pending" } } }

{ "cicd": { "run": { "duration": 60, "state": "starting" } } }

{ "cicd": { "run": { "duration": 60, "state": "executing" } } }

{ "cicd": { "run": { "duration": 60, "state": "finalizing" } } }

I was thinking that the duration would record only the time spent in that state.
So the following records could be recorded in the cicd.pipeline.run.duration metric:

Success

attributes:
  cicd.pipeline:
    name: Example1
    run.state: pending
value: 3

attributes:
  cicd.pipeline:
    name: Example1
    run.state: executing
    result: success
value: 60

attributes:
  cicd.pipeline:
    name: Example1
    run.state: finalizing
value: 2

Failure & no clean up

attributes:
  cicd.pipeline:
    name: Example1
    run.state: pending
value: 2

attributes:
  cicd.pipeline:
    name: Example1
    run.state: executing
    result: failure
  error.type: Task non-zero exit code
value: 50

Cancellation during initialization & clean up (no execution)

attributes:
  cicd.pipeline:
    name: Example1
    run.state: pending
    result: cancellation
  error.type: User cancellation
value: 1

attributes:
  cicd.pipeline:
    name: Example1
    run.state: finalizing
value: 4
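The records above could be produced by something like the following Python sketch. The `duration_records` helper and its parameters are hypothetical; it only illustrates the idea that each state gets its own measurement and that cicd.pipeline.result is attached solely to the state in which the result becomes known.

```python
def duration_records(pipeline, state_durations, result=None,
                     result_state=None, error_type=None):
    """Build one record per state; the result is attached only to result_state."""
    records = []
    for state, seconds in state_durations:
        attrs = {
            "cicd.pipeline.name": pipeline,
            "cicd.pipeline.run.state": state,
        }
        if result is not None and state == result_state:
            attrs["cicd.pipeline.result"] = result
            if error_type is not None:
                attrs["error.type"] = error_type
        records.append({"attributes": attrs, "value": seconds})
    return records


# The "Success" case from the thread: 3 s pending, 60 s executing, 2 s finalizing.
success = duration_records(
    "Example1",
    [("pending", 3), ("executing", 60), ("finalizing", 2)],
    result="success",
    result_state="executing",
)

# Separate charts or alerts per state reduce to a filter on the state attribute.
executing_only = [
    r["value"] for r in success
    if r["attributes"]["cicd.pipeline.run.state"] == "executing"
]
```

Filtering on cicd.pipeline.run.state, as in the last statement, is the workaround mentioned above for the differing orders of magnitude.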

Feedback from SemConv meeting:

Either combine both the cicd.pipeline.run.active and cicd.pipeline.run.duration or neither.

The issue of different orders of magnitude can be worked around by defining buckets carefully or fixed by using exponential histograms.
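A small Python sketch of why exponential histograms cope with values of very different magnitude. It assumes the base-2 exponential bucket layout, where `base = 2 ** (2 ** -scale)` and bucket `i` covers `(base**i, base**(i+1)]`; the `bucket_index` helper is illustrative only, and the sample values are the made-up 3 s queue time and a one-hour run.

```python
import math


def bucket_index(value, scale):
    """Index i of the bucket (base**i, base**(i+1)] holding value,
    where base = 2 ** (2 ** -scale). Assumes value > 0."""
    return math.ceil(math.log2(value) * 2**scale) - 1


# At scale 0 the base is 2, so the buckets are (1, 2], (2, 4], (4, 8], ...
queue_bucket = bucket_index(3, 0)     # a 3 s wait in the queue
run_bucket = bucket_index(3600, 0)    # a one-hour pipeline run
```

Both values land in buckets of comparable relative width without any hand-tuned boundaries, which is the property that makes a single duration metric workable across states.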

I have merged the metrics into cicd.pipeline.run.duration in f244911.

Please take a look at how I made cicd.pipeline.result conditionally_required. Does that look ok?

…e.run.duration we make the distinction using the attribute cicd.pipeline.run.state

This makes it clear that this count might not match the cicd.pipeline.run.duration count for result failure.

christophe-kamphaus-jemmic added 7 commits December 13, 2024 11:32

[cicd] add pipeline run duration metric

72b225b

[cicd] add cicd queue metrics

1b10459

[cicd] add cicd worker count metric

d820586

[cicd] add cicd error count

a2edc16

Update vscode settings to align with markdown-toc --no-first-h1

3e1537b

[cicd] update examples of cicd.pipeline.result

7d4affa

[cicd] add changelog entry

70f2bd6

christophe-kamphaus-jemmic requested review from a team as code owners December 13, 2024 18:07

christophe-kamphaus-jemmic added 2 commits December 13, 2024 19:20

[cicd] update brief to add missing article

1c4e584

[cicd] improve metric brief

2f9b857

christophe-kamphaus-jemmic commented Dec 13, 2024
