Performance regression looping over large arrays #114047

richardcocks · 2025-03-30T16:53:42Z

Description

While performing a comparison of memcmp, ReadOnlySpan<T>SequenceEqual and a naive for loop, I noticed a weird performance regression between .NET 8 and .NET 9 / 10.

The naive for loop of course supposed to be efficient, but it slowed down dramatically between .NET 8 and .NET 9.

It's possible that this is a benchmark.NET issue not a dotnet runtime issue, but I'm not sure how to diagnose whether that is the case. I'd be happy to receive instruction to verify that.

BenchmarkDotNet v0.14.1-develop (2025-03-30), Windows 10 (10.0.19045.5679/22H2/2022Update)
AMD Ryzen 7 3800X 3.90GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.100-preview.2.25164.34
  [Host]    : .NET 10.0.0 (10.0.25.16302), X64 RyuJIT AVX2
  .NET 10.0 : .NET 10.0.0 (10.0.25.16302), X64 RyuJIT AVX2
  .NET 6.0  : .NET 6.0.36 (6.0.3624.51421), X64 RyuJIT AVX2
  .NET 7.0  : .NET 7.0.20 (7.0.2024.26716), X64 RyuJIT AVX2
  .NET 8.0  : .NET 8.0.14 (8.0.1425.11118), X64 RyuJIT AVX2
  .NET 9.0  : .NET 9.0.3 (9.0.325.11113), X64 RyuJIT AVX2

Method	Job	Runtime	Length	Mean	Error	StdDev	Ratio	RatioSD	Allocated	Alloc Ratio
Loop	.NET 6.0	.NET 6.0	1073741824	273.5 ms	5.30 ms	5.21 ms	0.98	0.02	1648 B	8.24
Loop	.NET 7.0	.NET 7.0	1073741824	523.3 ms	10.18 ms	9.52 ms	1.87	0.04	600 B	3.00
Loop	.NET 8.0	.NET 8.0	1073741824	279.3 ms	4.65 ms	3.88 ms	1.00	0.02	200 B	1.00
Loop	.NET 9.0	.NET 9.0	1073741824	524.4 ms	5.10 ms	4.52 ms	1.88	0.03	400 B	2.00
Loop	.NET 10.0	.NET 10.0	1073741824	537.7 ms	9.20 ms	8.16 ms	1.93	0.04	400 B	2.00

The results are strange to say the least. Allocations are least in .NET 8.0 by a factor of 2, but runtime is minimised in both .NET 6 (despite more memory allocation) and .NET 8.

Source code to generate this is at https://github.com/richardcocks/dotnet8_9_10_regression.

The jist of it is:

    [Benchmark]
    public void Loop()
    {
        EqualsLoop(first, second);
    }

    public static bool EqualsLoop(byte[] b1, byte[] b2)
    {

        if (b1.Length != b2.Length) return false;
        for (int i = 0; i < b1.Length; i++)
        {
            if (b1[i] != b2[i]) return false;
        }

        return true;
    }

Running command: dotnet run -c:Release -f net10.0

The text was updated successfully, but these errors were encountered:

dotnet-policy-service · 2025-03-30T16:54:14Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

AndyAyersMS · 2025-03-30T19:47:30Z

What are the BDN invocation and iteration counts for those runs? When the per-invocation time exceeds 100ms or so, BDN's default strategy may not end up measuring the most optimized code.

richardcocks · 2025-03-30T21:13:44Z

OddRegression-20250330-174626.log

@AndyAyersMS I've attached the detailed log file, the pertinent bit I think is:

// * Detailed results *
OddRegression.Loop: .NET 10.0(Runtime=.NET 10.0) [Length=1073741824]
Runtime = .NET 10.0.0 (10.0.25.16302), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 537.700 ms, StdErr = 2.181 ms (0.41%), N = 14, StdDev = 8.159 ms
Min = 528.663 ms, Q1 = 531.305 ms, Median = 536.337 ms, Q3 = 541.698 ms, Max = 558.171 ms
IQR = 10.394 ms, LowerFence = 515.714 ms, UpperFence = 557.289 ms
ConfidenceInterval = [528.496 ms; 546.903 ms] (CI 99.9%), Margin = 9.204 ms (1.71% of Mean)
Skewness = 1.03, Kurtosis = 3.24, MValue = 2
-------------------- Histogram --------------------
[524.220 ms ; 539.969 ms) | @@@@@@@@@@
[539.969 ms ; 562.614 ms) | @@@@
---------------------------------------------------

and

OddRegression.Loop: .NET 6.0(Runtime=.NET 6.0) [Length=1073741824]
Runtime = .NET 6.0.36 (6.0.3624.51421), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 273.482 ms, StdErr = 1.301 ms (0.48%), N = 16, StdDev = 5.206 ms
Min = 265.906 ms, Q1 = 269.867 ms, Median = 271.708 ms, Q3 = 276.424 ms, Max = 283.116 ms
IQR = 6.557 ms, LowerFence = 260.032 ms, UpperFence = 286.259 ms
ConfidenceInterval = [268.182 ms; 278.783 ms] (CI 99.9%), Margin = 5.301 ms (1.94% of Mean)
Skewness = 0.51, Kurtosis = 1.94, MValue = 2
-------------------- Histogram --------------------
[263.195 ms ; 268.840 ms) | @@
[268.840 ms ; 279.354 ms) | @@@@@@@@@@@
[279.354 ms ; 285.828 ms) | @@@
---------------------------------------------------

I'd be happy to re-run with different strategy / job type to verify under better conditions.

AndyAyersMS · 2025-03-30T22:08:54Z

The key part is:

// *** Execute ***
// Launch: 1 / 1
// Execute: dotnet "dotnet8_9_10_regression-.NET 9.0-5.dll" --anonymousPipes 832 2680 --benchmarkName "OddRegression.Loop(Length: 1073741824)" --job ".NET 9.0" --benchmarkId 0 in C:\Users\Rich\Documents\code\dotnet8_9_10_regression\bin\Release\net10.0\dotnet8_9_10_regression-.NET 9.0-5\bin\Release\net9.0
// BeforeAnythingElse

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.14.1-develop (2025-03-30)
// Runtime=.NET 9.0.3 (9.0.325.11113), X64 RyuJIT AVX2
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX2,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
// Job: .NET 9.0

OverheadJitting  1: 1 op, 139000.00 ns, 139.0000 us/op
WorkloadJitting  1: 1 op, 518367200.00 ns, 518.3672 ms/op

OverheadJitting  2: 1 op, 100.00 ns, 100.0000 ns/op
WorkloadJitting  2: 1 op, 518718400.00 ns, 518.7184 ms/op

WorkloadWarmup   1: 1 op, 519250100.00 ns, 519.2501 ms/op
WorkloadWarmup   2: 1 op, 521098500.00 ns, 521.0985 ms/op
WorkloadWarmup   3: 1 op, 528780000.00 ns, 528.7800 ms/op
WorkloadWarmup   4: 1 op, 528735800.00 ns, 528.7358 ms/op
WorkloadWarmup   5: 1 op, 531150000.00 ns, 531.1500 ms/op
WorkloadWarmup   6: 1 op, 524142900.00 ns, 524.1429 ms/op

// BeforeActualRun
WorkloadActual   1: 1 op, 521549400.00 ns, 521.5494 ms/op
WorkloadActual   2: 1 op, 526467300.00 ns, 526.4673 ms/op
WorkloadActual   3: 1 op, 517454100.00 ns, 517.4541 ms/op
WorkloadActual   4: 1 op, 525269700.00 ns, 525.2697 ms/op
WorkloadActual   5: 1 op, 518909500.00 ns, 518.9095 ms/op
WorkloadActual   6: 1 op, 524525400.00 ns, 524.5254 ms/op
WorkloadActual   7: 1 op, 523900900.00 ns, 523.9009 ms/op
WorkloadActual   8: 1 op, 525895200.00 ns, 525.8952 ms/op
WorkloadActual   9: 1 op, 525107200.00 ns, 525.1072 ms/op
WorkloadActual  10: 1 op, 532946200.00 ns, 532.9462 ms/op
WorkloadActual  11: 1 op, 531242600.00 ns, 531.2426 ms/op
WorkloadActual  12: 1 op, 543491500.00 ns, 543.4915 ms/op
WorkloadActual  13: 1 op, 528027400.00 ns, 528.0274 ms/op
WorkloadActual  14: 1 op, 520351600.00 ns, 520.3516 ms/op
WorkloadActual  15: 1 op, 519735500.00 ns, 519.7355 ms/op

// AfterActualRun
WorkloadResult   1: 1 op, 521549400.00 ns, 521.5494 ms/op
WorkloadResult   2: 1 op, 526467300.00 ns, 526.4673 ms/op
WorkloadResult   3: 1 op, 517454100.00 ns, 517.4541 ms/op
WorkloadResult   4: 1 op, 525269700.00 ns, 525.2697 ms/op
WorkloadResult   5: 1 op, 518909500.00 ns, 518.9095 ms/op
WorkloadResult   6: 1 op, 524525400.00 ns, 524.5254 ms/op
WorkloadResult   7: 1 op, 523900900.00 ns, 523.9009 ms/op
WorkloadResult   8: 1 op, 525895200.00 ns, 525.8952 ms/op
WorkloadResult   9: 1 op, 525107200.00 ns, 525.1072 ms/op
WorkloadResult  10: 1 op, 532946200.00 ns, 532.9462 ms/op
WorkloadResult  11: 1 op, 531242600.00 ns, 531.2426 ms/op
WorkloadResult  12: 1 op, 528027400.00 ns, 528.0274 ms/op
WorkloadResult  13: 1 op, 520351600.00 ns, 520.3516 ms/op
WorkloadResult  14: 1 op, 519735500.00 ns, 519.7355 ms/op

(from .Net9, others are similar) Here BDN invokes your benchmark method about 37 times. That is not enough to ensure it is fully tiered up.

Try adding something like:

--invocationCount 10 --iterationCount 15

and it should invoke the benchmark enough times to reliably reach Tier1.

That being said, there may be a regression here related to OSR codegen (which is likely what BDN is measuring by default). We have changed our minds on the OSR/PGO interaction over time. So I will also take a look.

richardcocks · 2025-03-30T22:29:33Z

Thank you, that's useful to know.

I set InvocationCount and IterationCount both to 20, and halfway through the warmup there was indeed a step-change in behaviour:

WorkloadWarmup   1: 20 op, 10468300800.00 ns, 523.4150 ms/op
WorkloadWarmup   2: 20 op, 10498538100.00 ns, 524.9269 ms/op
WorkloadWarmup   3: 20 op, 10481983800.00 ns, 524.0992 ms/op
WorkloadWarmup   4: 20 op, 5606436400.00 ns, 280.3218 ms/op
WorkloadWarmup   5: 20 op, 5632784700.00 ns, 281.6392 ms/op
WorkloadWarmup   6: 20 op, 5615803900.00 ns, 280.7902 ms/op

richardcocks · 2025-03-30T22:43:37Z

Final results from that run:


BenchmarkDotNet v0.14.1-develop (2025-03-30), Windows 10 (10.0.19045.5679/22H2/2022Update)
AMD Ryzen 7 3800X 3.90GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.100-preview.2.25164.34
  [Host]    : .NET 10.0.0 (10.0.25.16302), X64 RyuJIT AVX2
  .NET 10.0 : .NET 10.0.0 (10.0.25.16302), X64 RyuJIT AVX2
  .NET 6.0  : .NET 6.0.36 (6.0.3624.51421), X64 RyuJIT AVX2
  .NET 7.0  : .NET 7.0.20 (7.0.2024.26716), X64 RyuJIT AVX2
  .NET 8.0  : .NET 8.0.14 (8.0.1425.11118), X64 RyuJIT AVX2
  .NET 9.0  : .NET 9.0.3 (9.0.325.11113), X64 RyuJIT AVX2

InvocationCount=20  IterationCount=20  UnrollFactor=1

Method	Job	Runtime	Length	Mean	Error	StdDev	Ratio	Allocated	Alloc Ratio
Loop	.NET 6.0	.NET 6.0	1073741824	277.2 ms	1.70 ms	1.82 ms	1.00	32 B	1.60
Loop	.NET 7.0	.NET 7.0	1073741824	276.7 ms	1.83 ms	1.88 ms	1.00	30 B	1.50
Loop	.NET 8.0	.NET 8.0	1073741824	276.9 ms	0.77 ms	0.89 ms	1.00	20 B	1.00
Loop	.NET 9.0	.NET 9.0	1073741824	274.7 ms	0.95 ms	1.10 ms	0.99	20 B	1.00
Loop	.NET 10.0	.NET 10.0	1073741824	279.9 ms	1.94 ms	2.16 ms	1.01	20 B	1.00

So I'm pleased to report there's no meaningful performance regression after all.

( Full Log attached. )

OddRegression-20250330-232624.log

teo-tsirpanis · 2025-03-30T23:01:15Z

I wonder what are these 20 bytes of allocated memory. 🤔

timcassell · 2025-03-30T23:04:40Z

I wonder what are these 20 bytes of allocated memory. 🤔

Probably related to dotnet/BenchmarkDotNet#2562

EgorBo · 2025-03-31T11:41:41Z

That being said, there may be a regression here related to OSR codegen (which is likely what BDN is measuring by default). We have changed our minds on the OSR/PGO interaction over time. So I will also take a look.

@AndyAyersMS Here is the codegen diff between OSR and Tier1: https://www.diffchecker.com/aLqfTllg/ Looks like bounds checks is what causing the regression. Tier1 clones the loop and then if (b1.Length != b2.Length) return false; check even helps to remove the slow loop as never reachable.

AndyAyersMS · 2025-03-31T14:36:19Z

I'm curious if the OSR code has changed release to release... seems like in the "fast" results with default BDN maybe we got it right? Let me profile first and make sure we're in OSR for all the variants.

AndyAyersMS · 2025-03-31T15:23:55Z

Using BenchmarkSwitcher and

--runtimes net6.0 net8.0 net9.0 net10.0 --f * -p ETW --invocationCount 1 --iterationCount 15 --unrollFactor 1

Profiling shows more or less what I expected.

;; 10.0
99.50%   7.517E+07   OSR      [r114047]OddRegression.EqualsLoop(unsigned int8[],unsigned int8[])

;; 9.0
99.36%   7.723E+07   OSR      [r114047]OddRegression.EqualsLoop(unsigned int8[],unsigned int8[])

;; 8.0
99.48%   4.398E+07   OSR      [r114047]OddRegression.EqualsLoop(unsigned int8[],unsigned int8[])

;; 6.0
99.76%   5.421E+07   FullOpt  [r114047]OddRegression.EqualsLoop(unsigned int8[],unsigned int8[])

So OSR in 8.0 was able to do better, and indeed in 8.0 we clone and 10.0 we don't.

AndyAyersMS · 2025-03-31T15:34:44Z

In 10 we have the following loop:

***************  Natural loop graph
L00 header: BB05
  Members (3): [BB05..BB07]
  Entry: BB10 -> BB05
  Exit: BB05 -> BB03; BB07 -> BB11
  Back: BB07 -> BB05

------------ BB10 [0010] [00E..???) -> BB05(1) (always), preds={BB01} succs={BB05}

------------ BB05 [0003] [00E..016) -> BB03(0.005),BB06(0.995) (cond), preds={BB07,BB10} succs={BB06,BB03}

***** BB05 [0003]
STMT00004 ( 0x00E[E-] ... 0x014 )
               [000024] ---XG+-----                         *  JTRUE     void  
               [000023] J--XG+-N---                         \--*  NE        int   
               [000042] ---XG+-----                            +--*  COMMA     ubyte 
               [000036] ---X-+-----                            |  +--*  BOUNDS_CHECK_Rng void  
               [000016] -----+-----                            |  |  +--*  LCL_VAR   int    V02 loc0         
               [000035] ---X-+-----                            |  |  \--*  ARR_LENGTH int   
               [000015] -----+-----                            |  |     \--*  LCL_VAR   ref    V00 arg0         
               [000043] n---G+-----                            |  \--*  IND       ubyte 
               [000041] -----+-----                            |     \--*  ARR_ADDR  byref ubyte[]
               [000040] -----+-----                            |        \--*  ADD       byref 
               [000033] -----+-----                            |           +--*  LCL_VAR   ref    V00 arg0         
               [000039] -----+-----                            |           \--*  ADD       long  
               [000037] -----+---U-                            |              +--*  CAST      long <- uint
               [000034] -----+-----                            |              |  \--*  LCL_VAR   int    V02 loc0         
               [000038] -----+-----                            |              \--*  CNS_INT   long   16
               [000053] ---XG+-----                            \--*  COMMA     ubyte 
               [000047] ---X-+-----                               +--*  BOUNDS_CHECK_Rng void  
               [000020] -----+-----                               |  +--*  LCL_VAR   int    V02 loc0         
               [000046] ---X-+-----                               |  \--*  ARR_LENGTH int   
               [000019] -----+-----                               |     \--*  LCL_VAR   ref    V01 arg1         
               [000054] n---G+-----                               \--*  IND       ubyte 
               [000052] -----+-----                                  \--*  ARR_ADDR  byref ubyte[]
               [000051] -----+-----                                     \--*  ADD       byref 
               [000044] -----+-----                                        +--*  LCL_VAR   ref    V01 arg1         
               [000050] -----+-----                                        \--*  ADD       long  
               [000048] -----+---U-                                           +--*  CAST      long <- uint
               [000045] -----+-----                                           |  \--*  LCL_VAR   int    V02 loc0         
               [000049] -----+-----                                           \--*  CNS_INT   long   16

------------ BB06 [0005] [018..01C) -> BB07(1) (always), preds={BB05} succs={BB07}

***** BB06 [0005]
STMT00005 ( 0x018[E-] ... 0x01B )
               [000028] DA---+-----                         *  STORE_LCL_VAR int    V02 loc0         
               [000027] -----+-----                         \--*  ADD       int   
               [000025] -----+-----                            +--*  LCL_VAR   int    V02 loc0         
               [000026] -----+-----                            \--*  CNS_INT   int    1

------------ BB07 [0006] [01C..022) -> BB05(0.99405),BB11(0.00595) (cond), preds={BB06} succs={BB11,BB05}

***** BB07 [0006]
STMT00003 ( 0x01C[E-] ... 0x020 )
     ( 11,  9) [000014] ---X-+-----                         *  JTRUE     void  
     (  9,  7) [000013] J--X-+-N---                         \--*  GT        int   
     (  5,  4) [000012] ---X-+-----                            +--*  ARR_LENGTH int   
     (  3,  2) [000011] -----+-----                            |  \--*  LCL_VAR   ref    V00 arg0         
     (  3,  2) [000010] -----+-----                            \--*  LCL_VAR   int    V02 loc0

and we fail to realize V02 is an IV:

*************** In optCloneLoops()
Considering loop L00 to clone for optimizations.
Analyzing iteration for L00 with header BB05
  Preheader = BB10
  Checking exiting block BB05
    Could not extract an IV
  Checking exiting block BB07
    Could not extract an IV
  Could not find any IV
Loop cloning: rejecting loop L00. Could not analyze iteration.
------------------------------------------------------------

No sure why yet, I recall a while back we needed to find the IV initialization, which won't be there for OSR methods, but I thought we fixed that.

AndyAyersMS · 2025-03-31T16:23:50Z

Could be the result of #97122 or similar.

JulieLeeMSFT · 2025-04-15T22:38:29Z

@jakobbotsch, PTAL.

richardcocks added the tenet-performance Performance related issue label Mar 30, 2025

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 30, 2025

dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Mar 30, 2025

AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Mar 31, 2025

JulieLeeMSFT assigned jakobbotsch Apr 15, 2025

JulieLeeMSFT added this to the 10.0.0 milestone Apr 15, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regression looping over large arrays #114047

Performance regression looping over large arrays #114047

richardcocks commented Mar 30, 2025 •

edited

Loading

dotnet-policy-service bot commented Mar 30, 2025

AndyAyersMS commented Mar 30, 2025

richardcocks commented Mar 30, 2025

AndyAyersMS commented Mar 30, 2025

richardcocks commented Mar 30, 2025

richardcocks commented Mar 30, 2025

teo-tsirpanis commented Mar 30, 2025

timcassell commented Mar 30, 2025

EgorBo commented Mar 31, 2025 •

edited

Loading

AndyAyersMS commented Mar 31, 2025

AndyAyersMS commented Mar 31, 2025

AndyAyersMS commented Mar 31, 2025

AndyAyersMS commented Mar 31, 2025

JulieLeeMSFT commented Apr 15, 2025

Performance regression looping over large arrays #114047

Performance regression looping over large arrays #114047

Comments

richardcocks commented Mar 30, 2025 • edited Loading

Description

dotnet-policy-service bot commented Mar 30, 2025

AndyAyersMS commented Mar 30, 2025

richardcocks commented Mar 30, 2025

AndyAyersMS commented Mar 30, 2025

richardcocks commented Mar 30, 2025

richardcocks commented Mar 30, 2025

teo-tsirpanis commented Mar 30, 2025

timcassell commented Mar 30, 2025

EgorBo commented Mar 31, 2025 • edited Loading

AndyAyersMS commented Mar 31, 2025

AndyAyersMS commented Mar 31, 2025

AndyAyersMS commented Mar 31, 2025

AndyAyersMS commented Mar 31, 2025

JulieLeeMSFT commented Apr 15, 2025

richardcocks commented Mar 30, 2025 •

edited

Loading

EgorBo commented Mar 31, 2025 •

edited

Loading