Skip to content

Performance regression looping over large arrays #114047

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
richardcocks opened this issue Mar 30, 2025 · 14 comments
Open

Performance regression looping over large arrays #114047

richardcocks opened this issue Mar 30, 2025 · 14 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@richardcocks
Copy link

richardcocks commented Mar 30, 2025

Description

While performing a comparison of memcmp, ReadOnlySpan<T>SequenceEqual and a naive for loop, I noticed a weird performance regression between .NET 8 and .NET 9 / 10.

The naive for loop of course supposed to be efficient, but it slowed down dramatically between .NET 8 and .NET 9.

It's possible that this is a benchmark.NET issue not a dotnet runtime issue, but I'm not sure how to diagnose whether that is the case. I'd be happy to receive instruction to verify that.

BenchmarkDotNet v0.14.1-develop (2025-03-30), Windows 10 (10.0.19045.5679/22H2/2022Update)
AMD Ryzen 7 3800X 3.90GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.100-preview.2.25164.34
  [Host]    : .NET 10.0.0 (10.0.25.16302), X64 RyuJIT AVX2
  .NET 10.0 : .NET 10.0.0 (10.0.25.16302), X64 RyuJIT AVX2
  .NET 6.0  : .NET 6.0.36 (6.0.3624.51421), X64 RyuJIT AVX2
  .NET 7.0  : .NET 7.0.20 (7.0.2024.26716), X64 RyuJIT AVX2
  .NET 8.0  : .NET 8.0.14 (8.0.1425.11118), X64 RyuJIT AVX2
  .NET 9.0  : .NET 9.0.3 (9.0.325.11113), X64 RyuJIT AVX2
Method Job Runtime Length Mean Error StdDev Ratio RatioSD Allocated Alloc Ratio
Loop .NET 6.0 .NET 6.0 1073741824 273.5 ms 5.30 ms 5.21 ms 0.98 0.02 1648 B 8.24
Loop .NET 7.0 .NET 7.0 1073741824 523.3 ms 10.18 ms 9.52 ms 1.87 0.04 600 B 3.00
Loop .NET 8.0 .NET 8.0 1073741824 279.3 ms 4.65 ms 3.88 ms 1.00 0.02 200 B 1.00
Loop .NET 9.0 .NET 9.0 1073741824 524.4 ms 5.10 ms 4.52 ms 1.88 0.03 400 B 2.00
Loop .NET 10.0 .NET 10.0 1073741824 537.7 ms 9.20 ms 8.16 ms 1.93 0.04 400 B 2.00

The results are strange to say the least. Allocations are least in .NET 8.0 by a factor of 2, but runtime is minimised in both .NET 6 (despite more memory allocation) and .NET 8.

Source code to generate this is at https://github.com/richardcocks/dotnet8_9_10_regression.

The jist of it is:

    [Benchmark]
    public void Loop()
    {
        EqualsLoop(first, second);
    }

    public static bool EqualsLoop(byte[] b1, byte[] b2)
    {

        if (b1.Length != b2.Length) return false;
        for (int i = 0; i < b1.Length; i++)
        {
            if (b1[i] != b2[i]) return false;
        }

        return true;
    }

Running command: dotnet run -c:Release -f net10.0

@richardcocks richardcocks added the tenet-performance Performance related issue label Mar 30, 2025
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 30, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Mar 30, 2025
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Copy link
Member

What are the BDN invocation and iteration counts for those runs? When the per-invocation time exceeds 100ms or so, BDN's default strategy may not end up measuring the most optimized code.

@richardcocks
Copy link
Author

OddRegression-20250330-174626.log

@AndyAyersMS I've attached the detailed log file, the pertinent bit I think is:

// * Detailed results *
OddRegression.Loop: .NET 10.0(Runtime=.NET 10.0) [Length=1073741824]
Runtime = .NET 10.0.0 (10.0.25.16302), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 537.700 ms, StdErr = 2.181 ms (0.41%), N = 14, StdDev = 8.159 ms
Min = 528.663 ms, Q1 = 531.305 ms, Median = 536.337 ms, Q3 = 541.698 ms, Max = 558.171 ms
IQR = 10.394 ms, LowerFence = 515.714 ms, UpperFence = 557.289 ms
ConfidenceInterval = [528.496 ms; 546.903 ms] (CI 99.9%), Margin = 9.204 ms (1.71% of Mean)
Skewness = 1.03, Kurtosis = 3.24, MValue = 2
-------------------- Histogram --------------------
[524.220 ms ; 539.969 ms) | @@@@@@@@@@
[539.969 ms ; 562.614 ms) | @@@@
---------------------------------------------------

and

OddRegression.Loop: .NET 6.0(Runtime=.NET 6.0) [Length=1073741824]
Runtime = .NET 6.0.36 (6.0.3624.51421), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 273.482 ms, StdErr = 1.301 ms (0.48%), N = 16, StdDev = 5.206 ms
Min = 265.906 ms, Q1 = 269.867 ms, Median = 271.708 ms, Q3 = 276.424 ms, Max = 283.116 ms
IQR = 6.557 ms, LowerFence = 260.032 ms, UpperFence = 286.259 ms
ConfidenceInterval = [268.182 ms; 278.783 ms] (CI 99.9%), Margin = 5.301 ms (1.94% of Mean)
Skewness = 0.51, Kurtosis = 1.94, MValue = 2
-------------------- Histogram --------------------
[263.195 ms ; 268.840 ms) | @@
[268.840 ms ; 279.354 ms) | @@@@@@@@@@@
[279.354 ms ; 285.828 ms) | @@@
---------------------------------------------------

I'd be happy to re-run with different strategy / job type to verify under better conditions.

@AndyAyersMS
Copy link
Member

The key part is:

// *** Execute ***
// Launch: 1 / 1
// Execute: dotnet "dotnet8_9_10_regression-.NET 9.0-5.dll" --anonymousPipes 832 2680 --benchmarkName "OddRegression.Loop(Length: 1073741824)" --job ".NET 9.0" --benchmarkId 0 in C:\Users\Rich\Documents\code\dotnet8_9_10_regression\bin\Release\net10.0\dotnet8_9_10_regression-.NET 9.0-5\bin\Release\net9.0
// BeforeAnythingElse

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.14.1-develop (2025-03-30)
// Runtime=.NET 9.0.3 (9.0.325.11113), X64 RyuJIT AVX2
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX2,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
// Job: .NET 9.0

OverheadJitting  1: 1 op, 139000.00 ns, 139.0000 us/op
WorkloadJitting  1: 1 op, 518367200.00 ns, 518.3672 ms/op

OverheadJitting  2: 1 op, 100.00 ns, 100.0000 ns/op
WorkloadJitting  2: 1 op, 518718400.00 ns, 518.7184 ms/op

WorkloadWarmup   1: 1 op, 519250100.00 ns, 519.2501 ms/op
WorkloadWarmup   2: 1 op, 521098500.00 ns, 521.0985 ms/op
WorkloadWarmup   3: 1 op, 528780000.00 ns, 528.7800 ms/op
WorkloadWarmup   4: 1 op, 528735800.00 ns, 528.7358 ms/op
WorkloadWarmup   5: 1 op, 531150000.00 ns, 531.1500 ms/op
WorkloadWarmup   6: 1 op, 524142900.00 ns, 524.1429 ms/op

// BeforeActualRun
WorkloadActual   1: 1 op, 521549400.00 ns, 521.5494 ms/op
WorkloadActual   2: 1 op, 526467300.00 ns, 526.4673 ms/op
WorkloadActual   3: 1 op, 517454100.00 ns, 517.4541 ms/op
WorkloadActual   4: 1 op, 525269700.00 ns, 525.2697 ms/op
WorkloadActual   5: 1 op, 518909500.00 ns, 518.9095 ms/op
WorkloadActual   6: 1 op, 524525400.00 ns, 524.5254 ms/op
WorkloadActual   7: 1 op, 523900900.00 ns, 523.9009 ms/op
WorkloadActual   8: 1 op, 525895200.00 ns, 525.8952 ms/op
WorkloadActual   9: 1 op, 525107200.00 ns, 525.1072 ms/op
WorkloadActual  10: 1 op, 532946200.00 ns, 532.9462 ms/op
WorkloadActual  11: 1 op, 531242600.00 ns, 531.2426 ms/op
WorkloadActual  12: 1 op, 543491500.00 ns, 543.4915 ms/op
WorkloadActual  13: 1 op, 528027400.00 ns, 528.0274 ms/op
WorkloadActual  14: 1 op, 520351600.00 ns, 520.3516 ms/op
WorkloadActual  15: 1 op, 519735500.00 ns, 519.7355 ms/op

// AfterActualRun
WorkloadResult   1: 1 op, 521549400.00 ns, 521.5494 ms/op
WorkloadResult   2: 1 op, 526467300.00 ns, 526.4673 ms/op
WorkloadResult   3: 1 op, 517454100.00 ns, 517.4541 ms/op
WorkloadResult   4: 1 op, 525269700.00 ns, 525.2697 ms/op
WorkloadResult   5: 1 op, 518909500.00 ns, 518.9095 ms/op
WorkloadResult   6: 1 op, 524525400.00 ns, 524.5254 ms/op
WorkloadResult   7: 1 op, 523900900.00 ns, 523.9009 ms/op
WorkloadResult   8: 1 op, 525895200.00 ns, 525.8952 ms/op
WorkloadResult   9: 1 op, 525107200.00 ns, 525.1072 ms/op
WorkloadResult  10: 1 op, 532946200.00 ns, 532.9462 ms/op
WorkloadResult  11: 1 op, 531242600.00 ns, 531.2426 ms/op
WorkloadResult  12: 1 op, 528027400.00 ns, 528.0274 ms/op
WorkloadResult  13: 1 op, 520351600.00 ns, 520.3516 ms/op
WorkloadResult  14: 1 op, 519735500.00 ns, 519.7355 ms/op

(from .Net9, others are similar) Here BDN invokes your benchmark method about 37 times. That is not enough to ensure it is fully tiered up.

Try adding something like:

--invocationCount 10 --iterationCount 15

and it should invoke the benchmark enough times to reliably reach Tier1.

That being said, there may be a regression here related to OSR codegen (which is likely what BDN is measuring by default). We have changed our minds on the OSR/PGO interaction over time. So I will also take a look.

@richardcocks
Copy link
Author

Thank you, that's useful to know.

I set InvocationCount and IterationCount both to 20, and halfway through the warmup there was indeed a step-change in behaviour:

WorkloadWarmup   1: 20 op, 10468300800.00 ns, 523.4150 ms/op
WorkloadWarmup   2: 20 op, 10498538100.00 ns, 524.9269 ms/op
WorkloadWarmup   3: 20 op, 10481983800.00 ns, 524.0992 ms/op
WorkloadWarmup   4: 20 op, 5606436400.00 ns, 280.3218 ms/op
WorkloadWarmup   5: 20 op, 5632784700.00 ns, 281.6392 ms/op
WorkloadWarmup   6: 20 op, 5615803900.00 ns, 280.7902 ms/op

@richardcocks
Copy link
Author

Final results from that run:


BenchmarkDotNet v0.14.1-develop (2025-03-30), Windows 10 (10.0.19045.5679/22H2/2022Update)
AMD Ryzen 7 3800X 3.90GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.100-preview.2.25164.34
  [Host]    : .NET 10.0.0 (10.0.25.16302), X64 RyuJIT AVX2
  .NET 10.0 : .NET 10.0.0 (10.0.25.16302), X64 RyuJIT AVX2
  .NET 6.0  : .NET 6.0.36 (6.0.3624.51421), X64 RyuJIT AVX2
  .NET 7.0  : .NET 7.0.20 (7.0.2024.26716), X64 RyuJIT AVX2
  .NET 8.0  : .NET 8.0.14 (8.0.1425.11118), X64 RyuJIT AVX2
  .NET 9.0  : .NET 9.0.3 (9.0.325.11113), X64 RyuJIT AVX2

InvocationCount=20  IterationCount=20  UnrollFactor=1  

Method Job Runtime Length Mean Error StdDev Ratio Allocated Alloc Ratio
Loop .NET 6.0 .NET 6.0 1073741824 277.2 ms 1.70 ms 1.82 ms 1.00 32 B 1.60
Loop .NET 7.0 .NET 7.0 1073741824 276.7 ms 1.83 ms 1.88 ms 1.00 30 B 1.50
Loop .NET 8.0 .NET 8.0 1073741824 276.9 ms 0.77 ms 0.89 ms 1.00 20 B 1.00
Loop .NET 9.0 .NET 9.0 1073741824 274.7 ms 0.95 ms 1.10 ms 0.99 20 B 1.00
Loop .NET 10.0 .NET 10.0 1073741824 279.9 ms 1.94 ms 2.16 ms 1.01 20 B 1.00

So I'm pleased to report there's no meaningful performance regression after all.

( Full Log attached. )

OddRegression-20250330-232624.log

@teo-tsirpanis
Copy link
Contributor

I wonder what are these 20 bytes of allocated memory. 🤔

@timcassell
Copy link

I wonder what are these 20 bytes of allocated memory. 🤔

Probably related to dotnet/BenchmarkDotNet#2562

@EgorBo
Copy link
Member

EgorBo commented Mar 31, 2025

That being said, there may be a regression here related to OSR codegen (which is likely what BDN is measuring by default). We have changed our minds on the OSR/PGO interaction over time. So I will also take a look.

@AndyAyersMS Here is the codegen diff between OSR and Tier1: https://www.diffchecker.com/aLqfTllg/ Looks like bounds checks is what causing the regression. Tier1 clones the loop and then if (b1.Length != b2.Length) return false; check even helps to remove the slow loop as never reachable.

@AndyAyersMS
Copy link
Member

I'm curious if the OSR code has changed release to release... seems like in the "fast" results with default BDN maybe we got it right? Let me profile first and make sure we're in OSR for all the variants.

@AndyAyersMS
Copy link
Member

Using BenchmarkSwitcher and

--runtimes net6.0 net8.0 net9.0 net10.0 --f * -p ETW --invocationCount 1 --iterationCount 15 --unrollFactor 1

Profiling shows more or less what I expected.

;; 10.0
99.50%   7.517E+07   OSR      [r114047]OddRegression.EqualsLoop(unsigned int8[],unsigned int8[])

;; 9.0
99.36%   7.723E+07   OSR      [r114047]OddRegression.EqualsLoop(unsigned int8[],unsigned int8[])

;; 8.0
99.48%   4.398E+07   OSR      [r114047]OddRegression.EqualsLoop(unsigned int8[],unsigned int8[])

;; 6.0
99.76%   5.421E+07   FullOpt  [r114047]OddRegression.EqualsLoop(unsigned int8[],unsigned int8[])

So OSR in 8.0 was able to do better, and indeed in 8.0 we clone and 10.0 we don't.

@AndyAyersMS
Copy link
Member

In 10 we have the following loop:

***************  Natural loop graph
L00 header: BB05
  Members (3): [BB05..BB07]
  Entry: BB10 -> BB05
  Exit: BB05 -> BB03; BB07 -> BB11
  Back: BB07 -> BB05

------------ BB10 [0010] [00E..???) -> BB05(1) (always), preds={BB01} succs={BB05}

------------ BB05 [0003] [00E..016) -> BB03(0.005),BB06(0.995) (cond), preds={BB07,BB10} succs={BB06,BB03}

***** BB05 [0003]
STMT00004 ( 0x00E[E-] ... 0x014 )
               [000024] ---XG+-----                         *  JTRUE     void  
               [000023] J--XG+-N---                         \--*  NE        int   
               [000042] ---XG+-----                            +--*  COMMA     ubyte 
               [000036] ---X-+-----                            |  +--*  BOUNDS_CHECK_Rng void  
               [000016] -----+-----                            |  |  +--*  LCL_VAR   int    V02 loc0         
               [000035] ---X-+-----                            |  |  \--*  ARR_LENGTH int   
               [000015] -----+-----                            |  |     \--*  LCL_VAR   ref    V00 arg0         
               [000043] n---G+-----                            |  \--*  IND       ubyte 
               [000041] -----+-----                            |     \--*  ARR_ADDR  byref ubyte[]
               [000040] -----+-----                            |        \--*  ADD       byref 
               [000033] -----+-----                            |           +--*  LCL_VAR   ref    V00 arg0         
               [000039] -----+-----                            |           \--*  ADD       long  
               [000037] -----+---U-                            |              +--*  CAST      long <- uint
               [000034] -----+-----                            |              |  \--*  LCL_VAR   int    V02 loc0         
               [000038] -----+-----                            |              \--*  CNS_INT   long   16
               [000053] ---XG+-----                            \--*  COMMA     ubyte 
               [000047] ---X-+-----                               +--*  BOUNDS_CHECK_Rng void  
               [000020] -----+-----                               |  +--*  LCL_VAR   int    V02 loc0         
               [000046] ---X-+-----                               |  \--*  ARR_LENGTH int   
               [000019] -----+-----                               |     \--*  LCL_VAR   ref    V01 arg1         
               [000054] n---G+-----                               \--*  IND       ubyte 
               [000052] -----+-----                                  \--*  ARR_ADDR  byref ubyte[]
               [000051] -----+-----                                     \--*  ADD       byref 
               [000044] -----+-----                                        +--*  LCL_VAR   ref    V01 arg1         
               [000050] -----+-----                                        \--*  ADD       long  
               [000048] -----+---U-                                           +--*  CAST      long <- uint
               [000045] -----+-----                                           |  \--*  LCL_VAR   int    V02 loc0         
               [000049] -----+-----                                           \--*  CNS_INT   long   16

------------ BB06 [0005] [018..01C) -> BB07(1) (always), preds={BB05} succs={BB07}

***** BB06 [0005]
STMT00005 ( 0x018[E-] ... 0x01B )
               [000028] DA---+-----                         *  STORE_LCL_VAR int    V02 loc0         
               [000027] -----+-----                         \--*  ADD       int   
               [000025] -----+-----                            +--*  LCL_VAR   int    V02 loc0         
               [000026] -----+-----                            \--*  CNS_INT   int    1

------------ BB07 [0006] [01C..022) -> BB05(0.99405),BB11(0.00595) (cond), preds={BB06} succs={BB11,BB05}

***** BB07 [0006]
STMT00003 ( 0x01C[E-] ... 0x020 )
     ( 11,  9) [000014] ---X-+-----                         *  JTRUE     void  
     (  9,  7) [000013] J--X-+-N---                         \--*  GT        int   
     (  5,  4) [000012] ---X-+-----                            +--*  ARR_LENGTH int   
     (  3,  2) [000011] -----+-----                            |  \--*  LCL_VAR   ref    V00 arg0         
     (  3,  2) [000010] -----+-----                            \--*  LCL_VAR   int    V02 loc0   

and we fail to realize V02 is an IV:

*************** In optCloneLoops()
Considering loop L00 to clone for optimizations.
Analyzing iteration for L00 with header BB05
  Preheader = BB10
  Checking exiting block BB05
    Could not extract an IV
  Checking exiting block BB07
    Could not extract an IV
  Could not find any IV
Loop cloning: rejecting loop L00. Could not analyze iteration.
------------------------------------------------------------

No sure why yet, I recall a while back we needed to find the IV initialization, which won't be there for OSR methods, but I thought we fixed that.

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Mar 31, 2025
@AndyAyersMS
Copy link
Member

Could be the result of #97122 or similar.

@JulieLeeMSFT
Copy link
Member

@jakobbotsch, PTAL.

@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Apr 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

7 participants