
[sdk] Metrics refactor #5131

Closed
wants to merge 19 commits

Conversation

CodeBlanch
Member

@CodeBlanch CodeBlanch commented Dec 6, 2023

Changes

  • Refactor metrics to remove duplicated code!
  • AggregationType has been replaced with a more flexible [Flags] enum, and it is no longer stored on MetricPoint, which reduces the overall memory footprint of the SDK.
  • Removed the int used for locking on histogram buckets & exponential buckets in favor of the one on the optional components, which also slightly lowers the overall memory footprint of the SDK.
  • Fixed some bugs, like exemplars being processed where they shouldn't be, likely due to copy/paste errors.
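To picture the first bullet, a [Flags]-style behavior enum looks something like this. The member names and values below are purely illustrative, not the PR's actual definitions:

```csharp
using System;

// Hypothetical sketch of a [Flags] enum replacing AggregationType.
// Each bit describes one independent "behavior" a metric point can have,
// so combinations compose instead of requiring one enum member per permutation.
[Flags]
internal enum MetricBehaviors
{
    None = 0,
    Counter = 1 << 0,
    Gauge = 1 << 1,
    Histogram = 1 << 2,
    DeltaTemporality = 1 << 3,
    CumulativeTemporality = 1 << 4,
    RecordMinMax = 1 << 5,
    Exemplars = 1 << 6,
}
```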

Overview

The current MetricPoint implementation has a lot of spaghetti code 🍝 which has been copy/pasted almost entirely into different methods that add tweaks for exemplars, min/max recording, lock differences, etc. There are also a lot of artificial concepts like "SingleStream" mode and downright wacky things like metric super lists (List<List<Metric>>).

What this PR seeks to do is rein in all of that code while maintaining or improving the performance.

How this is done is by utilizing the JIT/compiler to do the specialization work for us. In .NET, if you declare something generic and feed it a value type/struct as the type argument, the JIT will generate a specialized implementation for it. During that specialization the JIT is able to elide/remove code when it determines something will always be false. For Native AOT this is done at compile time (there is no JIT).
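The specialization trick can be sketched like this (the names here are illustrative only, not the PR's actual types):

```csharp
using System.Runtime.CompilerServices;

// Behaviors are expressed as empty structs implementing a marker interface.
// Because TBehavior is a struct, the JIT emits a separate specialized body of
// RecordMeasurement per behavior, where default(TBehavior).LockRequired is a
// JIT-time constant and the dead branch is removed entirely.
public interface ILockBehavior
{
    bool LockRequired { get; }
}

public struct LockFreeBehavior : ILockBehavior
{
    public bool LockRequired => false;
}

public struct LockingBehavior : ILockBehavior
{
    public bool LockRequired => true;
}

public static class Recorder
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static string RecordMeasurement<TBehavior>(long value)
        where TBehavior : struct, ILockBehavior
    {
        if (default(TBehavior).LockRequired)
        {
            // Only present in the LockingBehavior specialization.
            return $"locked add of {value}";
        }

        // Only present in the LockFreeBehavior specialization.
        return $"interlocked add of {value}";
    }
}
```

So `Recorder.RecordMeasurement<LockFreeBehavior>(1)` compiles down to just the lock-free path, with no runtime branch on the behavior.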

What's the catch? There's always a catch!

“There are no solutions. There are only trade-offs.” ― Thomas Sowell

The trade-off here is we need value types/structs which capture all the possible permutations. These need to be statically reachable so they are preserved during AOT compilation. There are a lot of "behaviors" throughout the SDK, so there are a lot of permutations, which means there are a lot of structs! Supporting AOT added its own complexity.

Benchmarks

  • New code:
| Method | AggregationTemporality | Mean | Error | StdDev | Allocated |
|---|---|---:|---:|---:|---:|
| CounterHotPath | Cumulative | 9.602 ns | 0.1498 ns | 0.1402 ns | - |
| CounterWith1LabelsHotPath | Cumulative | 36.595 ns | 0.5499 ns | 0.5144 ns | - |
| CounterWith3LabelsHotPath | Cumulative | 74.093 ns | 0.6904 ns | 0.5765 ns | - |
| CounterWith5LabelsHotPath | Cumulative | 124.899 ns | 1.9911 ns | 1.7651 ns | - |
| CounterWith6LabelsHotPath | Cumulative | 144.470 ns | 2.1833 ns | 2.0423 ns | - |
| CounterWith7LabelsHotPath | Cumulative | 162.113 ns | 1.5812 ns | 1.4017 ns | - |
| CounterHotPath | Delta | 9.934 ns | 0.0919 ns | 0.0860 ns | - |
| CounterWith1LabelsHotPath | Delta | 37.793 ns | 0.6613 ns | 0.6186 ns | - |
| CounterWith3LabelsHotPath | Delta | 76.518 ns | 1.4816 ns | 1.5853 ns | - |
| CounterWith5LabelsHotPath | Delta | 135.717 ns | 2.4324 ns | 2.2753 ns | - |
| CounterWith6LabelsHotPath | Delta | 148.603 ns | 2.6981 ns | 2.5238 ns | - |
| CounterWith7LabelsHotPath | Delta | 163.807 ns | 2.4840 ns | 2.3235 ns | - |
  • Old code:
| Method | AggregationTemporality | Mean | Error | StdDev | Allocated |
|---|---|---:|---:|---:|---:|
| CounterHotPath | Cumulative | 10.04 ns | 0.108 ns | 0.101 ns | - |
| CounterWith1LabelsHotPath | Cumulative | 37.30 ns | 0.522 ns | 0.488 ns | - |
| CounterWith3LabelsHotPath | Cumulative | 75.42 ns | 0.612 ns | 0.542 ns | - |
| CounterWith5LabelsHotPath | Cumulative | 125.83 ns | 1.349 ns | 1.262 ns | - |
| CounterWith6LabelsHotPath | Cumulative | 143.11 ns | 2.734 ns | 2.558 ns | - |
| CounterWith7LabelsHotPath | Cumulative | 166.01 ns | 2.178 ns | 2.037 ns | - |
| CounterHotPath | Delta | 14.36 ns | 0.269 ns | 0.252 ns | - |
| CounterWith1LabelsHotPath | Delta | 42.44 ns | 0.227 ns | 0.177 ns | - |
| CounterWith3LabelsHotPath | Delta | 79.54 ns | 1.037 ns | 0.970 ns | - |
| CounterWith5LabelsHotPath | Delta | 127.64 ns | 1.274 ns | 1.192 ns | - |
| CounterWith6LabelsHotPath | Delta | 144.53 ns | 1.225 ns | 1.146 ns | - |
| CounterWith7LabelsHotPath | Delta | 175.06 ns | 3.456 ns | 3.549 ns | - |

Note: The Delta benchmarks seem to be speeding up a lot. That is kind of a lie. They are speeding up, yes, but only because I fixed the code: today we do some work for the "metric point reclaim" feature even when it isn't turned on, and most of the speedup comes from that fix.

Examples of JIT output

Note: To capture this I set the DOTNET_JitDisasm environment variable to *RecordMeasurement* and then ran the "MetricsBenchmarks" suite.
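For anyone wanting to reproduce the capture, the setup is roughly this (the project path and filter below are illustrative, not taken from this PR):

```shell
# Ask the JIT to dump disassembly for any method matching the pattern.
export DOTNET_JitDisasm='*RecordMeasurement*'

# Run the benchmarks; Tier0 and Tier1 listings are written to stdout as
# methods are (re)compiled.
dotnet run -c Release --project test/Benchmarks -- --filter '*MetricsBenchmarks*'
```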

A quick refresher: the JIT uses a tiered process. The first tier (Tier0) is the initial compilation, performed the first time some code runs. It is meant to be fast and only applies minor optimizations. It also injects some instrumentation used to inform later optimizations (this is known as dynamic PGO). If some code is executed enough it gets re-compiled at the second tier (Tier1), where every optimization in the book is thrown at the code, including the data from the instrumentation.

  • Tier1 assembly for CumulativeTemporality + DeltaCounterLong:
; Total bytes of code 302
; Assembly listing for method OpenTelemetry.Metrics.MetricMeasurementHandler`2[OpenTelemetry.Metrics.MetricAggregatorBehaviorDefinitions+CumulativeTemporality,OpenTelemetry.Metrics.MetricPointBehaviorDefinitions+DeltaCounterLong]:RecordMeasurement[long](OpenTelemetry.Metrics.AggregatorStore,long,System.ReadOnlySpan`1[System.Collections.Generic.KeyValuePair`2[System.String,System.Object]]):this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rbp based frame
; fully interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 408192
; 6 inlinees with PGO data; 5 single block inlinees; 4 inlinees without PGO data
G_M000_IG01:                ;; offset=0x0000
       push     rbp
       push     r14
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 48
       lea      rbp, [rsp+0x50]
       mov      qword ptr [rbp-0x30], rsp
       mov      gword ptr [rbp+0x18], rdx
       mov      rcx, rdx
       mov      rbx, r8
 
G_M000_IG02:                ;; offset=0x001D
       mov      rsi, bword ptr [r9]
       mov      edi, dword ptr [r9+0x08]
       cmp      byte  ptr [rcx], cl
       test     edi, edi
       jne      SHORT G_M000_IG07
       call     [OpenTelemetry.Metrics.AggregatorStore:InitializeZeroTagPointIfNotInitialized():this]
       xor      r14d, r14d
 
G_M000_IG03:                ;; offset=0x0033
       test     r14d, r14d
       jl       SHORT G_M000_IG06
 
G_M000_IG04:                ;; offset=0x0038
       mov      rcx, gword ptr [rbp+0x18]
       mov      rax, gword ptr [rcx+0x58]
       cmp      r14d, dword ptr [rax+0x08]
       jae      SHORT G_M000_IG05
       mov      edx, r14d
       shl      rdx, 6
       lea      rax, bword ptr [rax+rdx+0x10]
       cmp      byte  ptr [rax], al
       lea      rdx, bword ptr [rax+0x20]
       lock     
       add      qword ptr [rdx], rbx
       mov      dword ptr [rax+0x1C], 1
       jmp      G_M000_IG15
 
G_M000_IG05:                ;; offset=0x0068
       call     CORINFO_HELP_RNGCHKFAIL
 
G_M000_IG06:                ;; offset=0x006D
       mov      rcx, gword ptr [rbp+0x18]
       call     [OpenTelemetry.Metrics.AggregatorStore:DropMeasurementDefault():this]
       jmp      G_M000_IG15
 
G_M000_IG07:                ;; offset=0x007C
       mov      ecx, 2
       call     CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE_NOCTOR_OPTIMIZED
       mov      rax, gword ptr [rax]
       test     rax, rax
       jne      SHORT G_M000_IG08
       mov      rcx, 0x7FFF2E9F2B88
       call     CORINFO_HELP_NEWSFAST
       mov      r14, rax
       mov      rcx, r14
       call     [OpenTelemetry.Metrics.ThreadStaticStorage:.ctor():this]
       mov      ecx, 2
       call     CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE_NOCTOR_OPTIMIZED
       mov      rcx, rax
       mov      rdx, r14
       call     CORINFO_HELP_ASSIGN_REF
       mov      rax, r14
 
G_M000_IG08:                ;; offset=0x00C1
       mov      edx, edi
       cmp      byte  ptr [rax], al
       cmp      edx, 8
       jg       SHORT G_M000_IG09
       mov      rcx, gword ptr [rax+0x08]
       lea      eax, [rdx-0x01]
       cmp      eax, dword ptr [rcx+0x08]
       jae      SHORT G_M000_IG05
       dec      edx
       mov      rdx, gword ptr [rcx+8*rdx+0x10]
       mov      r14, gword ptr [rdx+0x08]
       jmp      SHORT G_M000_IG10
 
G_M000_IG09:                ;; offset=0x00E3
       movsxd   rdx, edx
       mov      rcx, 0x7FFF2E68F620
       call     CORINFO_HELP_NEWARR_1_VC
       mov      r14, rax
 
G_M000_IG10:                ;; offset=0x00F8
       mov      r8, r14
       test     r8, r8
       jne      SHORT G_M000_IG11
       xor      rcx, rcx
       xor      edx, edx
       jmp      SHORT G_M000_IG12
 
G_M000_IG11:                ;; offset=0x0106
       lea      rcx, bword ptr [r8+0x10]
       mov      edx, dword ptr [r8+0x08]
 
G_M000_IG12:                ;; offset=0x010E
       cmp      edi, edx
       ja       SHORT G_M000_IG13
       mov      r8d, edi
       shl      r8, 4
       mov      rdx, rsi
       call     [System.Buffer:BulkMoveWithWriteBarrier(byref,byref,ulong)]
       jmp      SHORT G_M000_IG14
 
G_M000_IG13:                ;; offset=0x0124
       call     [System.ThrowHelper:ThrowArgumentException_DestinationTooShort()]
       int3     
 
G_M000_IG14:                ;; offset=0x012B
       mov      rcx, gword ptr [rbp+0x18]
       mov      rax, gword ptr [rcx+0x70]
       mov      rdx, r14
       mov      r8d, edi
       mov      rcx, gword ptr [rax+0x08]
       call     [rax+0x18]System.Func`3[System.__Canon,int,int]:Invoke(System.__Canon,int):int:this
       mov      r14d, eax
       jmp      G_M000_IG03
 
G_M000_IG15:                ;; offset=0x0148
       add      rsp, 48
       pop      rbx
       pop      rsi
       pop      rdi
       pop      r14
       pop      rbp
       ret      
 
G_M000_IG16:                ;; offset=0x0153
       push     rbp
       push     r14
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 48
       mov      rbp, qword ptr [rcx+0x20]
       mov      qword ptr [rsp+0x20], rbp
       lea      rbp, [rbp+0x50]
 
G_M000_IG17:                ;; offset=0x016A
       mov      rcx, gword ptr [rbp+0x18]
       cmp      dword ptr [rcx], ecx
       call     [OpenTelemetry.Metrics.AggregatorStore:DropMeasurementUnhandledException():this]
       lea      rax, G_M000_IG15
 
G_M000_IG18:                ;; offset=0x017D
       add      rsp, 48
       pop      rbx
       pop      rsi
       pop      rdi
       pop      r14
       pop      rbp
       ret

Lots of inlining going on. No calls on the hot path. The key bit is...

G_M000_IG04:                ;; offset=0x0038
       mov      rcx, gword ptr [rbp+0x18]
       mov      rax, gword ptr [rcx+0x58]
       cmp      r14d, dword ptr [rax+0x08]
       jae      SHORT G_M000_IG05
       mov      edx, r14d
       shl      rdx, 6
       lea      rax, bword ptr [rax+rdx+0x10]
       cmp      byte  ptr [rax], al
       lea      rdx, bword ptr [rax+0x20]
       lock     
       add      qword ptr [rdx], rbx
       mov      dword ptr [rax+0x1C], 1
       jmp      G_M000_IG15

I'm not an expert in x64 assembly, but that looks to me like the inlined version of the code which actually performs the modifications to MetricPoint. More or less this code:

Interlocked.Add(ref metricPoint.RunningValue.AsLong, value);

metricPoint.MetricPointStatus = MetricPointStatus.CollectPending;

That is what was left after everything got elided/removed from these methods:

    private static void RecoredMeasurementOnLongCounterOrGaugeMetricPoint(ref MetricPoint metricPoint, long value)
    {
        if (LockRequired)
        {
            if (IsDeltaAggregation)
            {
                unchecked
                {
                    metricPoint.RunningValue.AsLong += value;
                }
            }
            else
            {
                Debug.Assert(IsCumulativeAggregation, "IsCumulativeAggregation was false");

                metricPoint.RunningValue.AsLong = value;
            }
        }
        else if (IsDeltaAggregation)
        {
            Interlocked.Add(ref metricPoint.RunningValue.AsLong, value);
        }
        else
        {
            Debug.Assert(IsCumulativeAggregation, "IsCumulativeAggregation was false");

            Interlocked.Exchange(ref metricPoint.RunningValue.AsLong, value);
        }
    }
Which compiled down to:

; Total bytes of code 392
; Assembly listing for method OpenTelemetry.Metrics.MetricMeasurementHandler`2[OpenTelemetry.Metrics.MetricAggregatorBehaviorDefinitions+CumulativeTemporality,OpenTelemetry.Metrics.MetricPointBehaviorDefinitions+DeltaCounterLong]:RecordMeasurementOnMetricPoint[long](OpenTelemetry.Metrics.AggregatorStore,byref,long,System.ReadOnlySpan`1[System.Collections.Generic.KeyValuePair`2[System.String,System.Object]]):this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rsp based frame
; partially interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 417408
; 1 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
G_M000_IG01:                ;; offset=0x0000
 
G_M000_IG02:                ;; offset=0x0000
       cmp      byte  ptr [r8], r8b
       add      r8, 32
       lock     
       add      qword ptr [r8], r9
 
G_M000_IG03:                ;; offset=0x000B
       ret 
And this method:

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void CompleteRecordMeasurement(ref MetricPoint metricPoint)
    {
        metricPoint.MetricPointStatus = MetricPointStatus.CollectPending;

        if (IsDeltaTemporality && ReclaimMetricPoints)
        {
            Interlocked.Decrement(ref metricPoint.ReferenceCount);
        }
    }
Which compiled down to:

; Total bytes of code 12
; Assembly listing for method OpenTelemetry.Metrics.MetricMeasurementHandler`2[OpenTelemetry.Metrics.MetricAggregatorBehaviorDefinitions+CumulativeTemporality,OpenTelemetry.Metrics.MetricPointBehaviorDefinitions+DeltaCounterLong]:CompleteRecordMeasurement(byref) (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rsp based frame
; partially interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 382976
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
G_M000_IG01:                ;; offset=0x0000
 
G_M000_IG02:                ;; offset=0x0000
       mov      dword ptr [rcx+0x1C], 1
 
G_M000_IG03:                ;; offset=0x0007
       ret  

Merge requirement checklist

  • CONTRIBUTING guidelines followed (nullable enabled, static analysis, etc.)
  • Unit tests added/updated
  • Appropriate CHANGELOG.md files updated for non-trivial changes

@CodeBlanch CodeBlanch requested a review from a team December 6, 2023 01:26

codecov bot commented Dec 6, 2023

Codecov Report

Attention: 55 lines in your changes are missing coverage. Please review.

Comparison: base (6250307) 83.38% vs. head (c4e9a0f) 84.39%.
Report is 25 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5131      +/-   ##
==========================================
+ Coverage   83.38%   84.39%   +1.01%     
==========================================
  Files         297      276      -21     
  Lines       12531    11724     -807     
==========================================
- Hits        10449     9895     -554     
+ Misses       2082     1829     -253     
| Flag | Coverage Δ |
|---|---|
| unittests | ? |
| unittests-Instrumentation-Experimental | 25.79% <0.00%> (?) |
| unittests-Instrumentation-Stable | 25.77% <0.00%> (?) |
| unittests-Solution-Experimental | 84.38% <91.37%> (?) |
| unittests-Solution-Stable | 84.36% <91.37%> (?) |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| ...ing/Implementation/OpenTelemetryMetricsListener.cs | 90.90% <100.00%> (+0.43%) ⬆️ |
| ...lemetry/Metrics/Base2ExponentialBucketHistogram.cs | 100.00% <ø> (ø) |
| .../OpenTelemetry/Metrics/CompositeMetricReaderExt.cs | 100.00% <100.00%> (+27.08%) ⬆️ |
| src/OpenTelemetry/Metrics/HistogramBuckets.cs | 100.00% <ø> (ø) |
| ...mentHandling/MetricAggregatorBehaviorsAttribute.cs | 100.00% <100.00%> (ø) |
| ...surementHandling/MetricMeasurementHandlerHelper.cs | 100.00% <100.00%> (ø) |
| ...asurementHandling/MetricPointBehaviorsAttribute.cs | 100.00% <100.00%> (ø) |
| ...Telemetry/Metrics/MetricPointOptionalComponents.cs | 86.66% <100.00%> (+8.88%) ⬆️ |
| src/OpenTelemetry/Metrics/MetricReaderExt.cs | 91.08% <100.00%> (+1.08%) ⬆️ |
| src/OpenTelemetry/Metrics/MetricPoint.cs | 88.37% <96.29%> (+19.90%) ⬆️ |
... and 5 more

... and 32 files with indirect coverage changes

@cijothomas
Member

Amazing! Will need some time to review, but I assume this won't be merged until after 1.7 stable is out, right?

@CodeBlanch
Member Author

@cijothomas

I assume this won't be merged until after 1.7 stable is out, right?

Not something we have discussed yet but I think that would be reasonable.


This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

@github-actions github-actions bot added the Stale Issues and pull requests which have been flagged for closing due to inactivity label Dec 15, 2023
@github-actions github-actions bot removed the Stale Issues and pull requests which have been flagged for closing due to inactivity label Dec 16, 2023

@github-actions github-actions bot added the Stale Issues and pull requests which have been flagged for closing due to inactivity label Dec 23, 2023

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Dec 31, 2023
@CodeBlanch CodeBlanch reopened this Jan 8, 2024
@github-actions github-actions bot removed the Stale Issues and pull requests which have been flagged for closing due to inactivity label Jan 9, 2024

@github-actions github-actions bot added the Stale Issues and pull requests which have been flagged for closing due to inactivity label Jan 16, 2024
@alanwest alanwest removed the Stale Issues and pull requests which have been flagged for closing due to inactivity label Jan 16, 2024

@github-actions github-actions bot added the Stale Issues and pull requests which have been flagged for closing due to inactivity label Jan 24, 2024

github-actions bot commented Feb 1, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.
