-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CICD metrics #1681
base: main
Are you sure you want to change the base?
Add CICD metrics #1681
Conversation
This ensures that all result values are nouns.
Overall LGTM. A small (non-blocking) question: were |
No I did not consider it until now. It's only with the latest changes that the similarity between these two metrics became apparent. I'm open to combine these two metric pairs and distinguishing them with |
requirement_level: required | ||
- id: metric.cicd.pipeline.run.errors | ||
type: metric | ||
metric_name: cicd.pipeline.run.errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be derived from metric.cicd.pipeline.run.duration
? It could be if there is just one error per run, but I assume it's not the case, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be errors in a pipeline run that are non fatal, ie they are suppressed or in a parallel stage multiple stages could have an error.
Ie this error count might not be the same as the count of metric.cicd.pipeline.run.duration
with run result failure
.
We did think that this metric might either be derived as a span metric from a run trace or reported directly by the CICD system controller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be errors in a pipeline run that are non fatal, ie they are suppressed or in a parallel stage multiple stages could have an error.
Ie this error count might not be the same as the count of metric.cicd.pipeline.run.duration with run result
Make sure this is included in the brief
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make sure it's described in the yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added this as a note in 02e044a.
Not a strong opinion, but I like the idea of merging
I don't think those are similar to
|
distinguish them by the new cicd.pipeline.run.state attribute
I have combined |
model/cicd/metrics.yaml
Outdated
requirement_level: required | ||
- ref: cicd.pipeline.run.state | ||
requirement_level: required | ||
- id: metric.cicd.pipeline.run.time_in_queue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about combining cicd.pipeline.run.duration
and cicd.pipeline.run.time_in_queue
into cicd.pipeline.run.duration
and distinguishing with the cicd.pipeline.run.state
attribute.
The advantages of combining the metrics are that
- the metric centers around the pipeline run
- is vendor-neutral (not supposing the existing of a queue)
- extensible by using additional states
The disadvantages are
- the issue Liudmila mentioned of the different orders of magnitude of time_in_queue vs duration spent executing.
This could be worked around by filtering on thecicd.pipeline.run.state
attribute for display on chart and alert query, ie having separate charts and alerts for both cases. - both metrics differ in the
cicd.pipeline.result
: eithercicd.pipeline.result
would need to be made conditionally required (ie only being set for thecicd.pipeline.run.state
in which thecicd.pipeline.result
becomes known) or a pending result would need to be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, an execution of a pipeline would be something like this:
{
"cicd": {
"run": {
"duration": 60,
"state": "pending"
}
}
}
{
"cicd": {
"run": {
"duration": 60,
"state": "starting"
}
}
}
{
"cicd": {
"run": {
"duration": 60,
"state": "executing"
}
}
}
{
"cicd": {
"run": {
"duration": 60,
"state": "finalizing"
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that the duration would record only the time spent in that state.
So we could have the following records being recorded in the cicd.pipeline.run.duration
metric:
Success
attributes:
cicd.pipeline:
name: Example1
run.state: pending
value: 3
attributes:
cicd.pipeline:
name: Example1
run.state: executing
result: success
value: 60
attributes:
cicd.pipeline:
name: Example1
run.state: finalizing
value: 2
Failure & no clean up
attributes:
cicd.pipeline:
name: Example1
run.state: pending
value: 2
attributes:
cicd.pipeline:
name: Example1
run.state: executing
result: failure
error.type: Task non-zero exit code
value: 50
Cancellation during initialization & clean up (no execution)
attributes:
cicd.pipeline:
name: Example1
run.state: pending
result: cancellation
error.type: User cancellation
value: 1
attributes:
cicd.pipeline:
name: Example1
run.state: finalizing
value: 4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feedback from SemConv meeting:
Either combine both the cicd.pipeline.run.active
and cicd.pipeline.run.duration
or neither.
The issue of different orders of magnitude can be worked around by defining buckets carefully or fixed by using exponential histograms.
I have made the metric merge into cicd.pipeline.run.duration
in f244911
Please take a look at how I made cicd.pipeline.result
conditionally_required. Does that look ok?
…e.run.duration we make the distinction using the attribute cicd.pipeline.run.state
This makes it clear that this count might not match the cicd.pipeline.run.duration count for result failure.
Fixes #1600
Changes
This PR adds metrics for CICD systems and related attributes.
Merge requirement checklist
[chore]