Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CICD metrics #1681

Open
wants to merge 41 commits into
base: main
Choose a base branch
from

Conversation

christophe-kamphaus-jemmic
Copy link
Contributor

Fixes #1600

Changes

This PR adds metrics for CICD systems and related attributes.

Merge requirement checklist

docs/attributes-registry/cicd.md Outdated Show resolved Hide resolved
docs/cicd/cicd-metrics.md Outdated Show resolved Hide resolved
@carlosalberto
Copy link
Contributor

Overall LGTM. A small (non-blocking) question: were cicd.pipeline.run.queued and cicd.pipeline.run.active considered as a single metric, with a state or similar label? Not proposing that, just curious on whether that was considered, as they are very similar.

@christophe-kamphaus-jemmic
Copy link
Contributor Author

Overall LGTM. A small (non-blocking) question: were cicd.pipeline.run.queued and cicd.pipeline.run.active considered as a single metric, with a state or similar label? Not proposing that, just curious on whether that was considered, as they are very similar.

No I did not consider it until now. It's only with the latest changes that the similarity between these two metrics became apparent.
Also the cicd.pipeline.run.duration and cicd.pipeline.run.time_in_queue seem very similar.

I'm open to combine these two metric pairs and distinguishing them with phase attribute if additional reviewers are in favor of this change.

requirement_level: required
- id: metric.cicd.pipeline.run.errors
type: metric
metric_name: cicd.pipeline.run.errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be derived from metric.cicd.pipeline.run.duration? It could be if there is just one error per run, but I assume it's not the case, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be errors in a pipeline run that are non fatal, ie they are suppressed or in a parallel stage multiple stages could have an error.
Ie this error count might not be the same as the count of metric.cicd.pipeline.run.duration with run result failure.

We did think that this metric might either be derived as a span metric from a run trace or reported directly by the CICD system controller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be errors in a pipeline run that are non fatal, ie they are suppressed or in a parallel stage multiple stages could have an error.
Ie this error count might not be the same as the count of metric.cicd.pipeline.run.duration with run result

Make sure this is included in the brief

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make sure it's described in the yaml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added this as a note in 02e044a.

@lmolkova
Copy link
Contributor

lmolkova commented Jan 17, 2025

Overall LGTM. A small (non-blocking) question: were cicd.pipeline.run.queued and cicd.pipeline.run.active considered as a single metric, with a state or similar label? Not proposing that, just curious on whether that was considered, as they are very similar.

Not a strong opinion, but I like the idea of merging cicd.pipeline.run.queued and cicd.pipeline.run.active to something like cicd.pipeline.run.count with cicd.pipeline.run.state attribute.

Also the cicd.pipeline.run.duration and cicd.pipeline.run.time_in_queue seem very similar.

I don't think those are similar to cicd.pipeline.run.queued|active

  • these metrics could easily have different range (different order of magnitude) of values (e.g. time in queue is hopefully tiny and pipeline duration is measured in minutes or hours)
  • you'd configure different alerts for them and different people would pay attention to their values (in my BigCo anecdotal experience, workers are managed by a separate team/org and I have little control over them, while pipeline and its duration is controlled by dev teams)

@christophe-kamphaus-jemmic
Copy link
Contributor Author

I have combined cicd.pipeline.run.active and cicd.pipeline.run.queued into cicd.pipeline.run.active distinguishing them with the cicd.pipeline.run.state attribute in 1d88f59.

requirement_level: required
- ref: cicd.pipeline.run.state
requirement_level: required
- id: metric.cicd.pipeline.run.time_in_queue
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about combining cicd.pipeline.run.duration and cicd.pipeline.run.time_in_queue into cicd.pipeline.run.duration and distinguishing with the cicd.pipeline.run.state attribute.

The advantages of combining the metrics are that

  • the metric centers around the pipeline run
  • is vendor-neutral (not supposing the existing of a queue)
  • extensible by using additional states

The disadvantages are

  • the issue Liudmila mentioned of the different orders of magnitude of time_in_queue vs duration spent executing.
    This could be worked around by filtering on the cicd.pipeline.run.state attribute for display on chart and alert query, ie having separate charts and alerts for both cases.
  • both metrics differ in the cicd.pipeline.result: either cicd.pipeline.result would need to be made conditionally required (ie only being set for the cicd.pipeline.run.state in which the cicd.pipeline.result becomes known) or a pending result would need to be added.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, an execution of a pipeline would be something like this:

{
    "cicd": {
        "run": {
            "duration": 60,
            "state": "pending"
        }
    }
}
{
    "cicd": {
        "run": {
            "duration": 60,
            "state": "starting"
        }
    }
}
{
    "cicd": {
        "run": {
            "duration": 60,
            "state": "executing"
        }
    }
}
{
    "cicd": {
        "run": {
            "duration": 60,
            "state": "finalizing"
        }
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that the duration would record only the time spent in that state.
So we could have the following records being recorded in the cicd.pipeline.run.duration metric:

Success

attributes:
  cicd.pipeline:
    name: Example1
    run.state: pending
value: 3
attributes:
  cicd.pipeline:
    name: Example1
    run.state: executing
    result: success
value: 60
attributes:
  cicd.pipeline:
    name: Example1
    run.state: finalizing
value: 2

Failure & no clean up

attributes:
  cicd.pipeline:
    name: Example1
    run.state: pending
value: 2
attributes:
  cicd.pipeline:
    name: Example1
    run.state: executing
    result: failure
  error.type: Task non-zero exit code
value: 50

Cancellation during initialization & clean up (no execution)

attributes:
  cicd.pipeline:
    name: Example1
    run.state: pending
    result: cancellation
  error.type: User cancellation
value: 1
attributes:
  cicd.pipeline:
    name: Example1
    run.state: finalizing
value: 4

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feedback from SemConv meeting:

Either combine both the cicd.pipeline.run.active and cicd.pipeline.run.duration or neither.

The issue of different orders of magnitude can be worked around by defining buckets carefully or fixed by using exponential histograms.


I have made the metric merge into cicd.pipeline.run.duration in f244911

Please take a look at how I made cicd.pipeline.result conditionally_required. Does that look ok?

@trask trask added this to the v1.30.0 milestone Jan 20, 2025
…e.run.duration

we make the distinction using the attribute cicd.pipeline.run.state
This makes it clear that this count might not match the
cicd.pipeline.run.duration count for result failure.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Ready to be Merged
Development

Successfully merging this pull request may close these issues.

Add metrics for CICD job queues
10 participants