Performance regression looping over large arrays #114047
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
What are the BDN invocation and iteration counts for those runs? When the per-invocation time exceeds 100ms or so, BDN's default strategy may not end up measuring the most optimized code. |
OddRegression-20250330-174626.log @AndyAyersMS I've attached the detailed log file; I think the pertinent bit is:
and
I'd be happy to re-run with different strategy / job type to verify under better conditions. |
The key part is:
(from .NET 9; the others are similar) Here BDN invokes your benchmark method about 37 times. That is not enough to ensure it is fully tiered up. Try adding something like:
and it should invoke the benchmark enough times to reliably reach Tier1. That being said, there may be a regression here related to OSR codegen (which is likely what BDN is measuring by default). We have changed our minds on the OSR/PGO interaction over time. So I will also take a look. |
Thank you, that's useful to know. I set
|
Final results from that run:
So I'm pleased to report there's no meaningful performance regression after all. (Full log attached.) |
I wonder what are these 20 bytes of allocated memory. 🤔 |
Probably related to dotnet/BenchmarkDotNet#2562 |
@AndyAyersMS Here is the codegen diff between OSR and Tier1: https://www.diffchecker.com/aLqfTllg/ It looks like bounds checks are what is causing the regression. Tier1 clones the loop and then |
I'm curious if the OSR code has changed release to release... seems like in the "fast" results with default BDN maybe we got it right? Let me profile first and make sure we're in OSR for all the variants. |
Using
Profiling shows more or less what I expected.
So OSR in 8.0 was able to do better, and indeed in 8.0 we clone and 10.0 we don't. |
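Loop cloning, as mentioned above, can be pictured as two copies of the same loop: a fast copy that needs no per-iteration bounds checks, selected by a single up-front range test, and a slow copy that keeps them. The following is a hand-written approximation only. The class, the method names, and the `length` parameter are hypothetical, and the real transformation happens on the JIT's internal representation rather than in C# source.

```csharp
using System;

public static class LoopCloningSketch
{
    // Original shape: the loop bound is not tied to a.Length,
    // so every a[i] access needs its own bounds check.
    public static long SumChecked(int[] a, int length)
    {
        long sum = 0;
        for (int i = 0; i < length; i++)
        {
            sum += a[i];
        }
        return sum;
    }

    // Hand-cloned shape: one up-front test picks the fast or the slow copy.
    public static long SumCloned(int[] a, int length)
    {
        long sum = 0;
        if (a != null && (uint)length <= (uint)a.Length)
        {
            // Fast copy: every index in [0, length) is known to be in range.
            // Looping up to span.Length is a pattern for which the JIT can
            // usually drop the per-iteration bounds check.
            ReadOnlySpan<int> span = a.AsSpan(0, length);
            for (int i = 0; i < span.Length; i++)
            {
                sum += span[i];
            }
        }
        else
        {
            // Slow copy: identical to the original loop, checks included,
            // so an out-of-range index still throws as before.
            for (int i = 0; i < length; i++)
            {
                sum += a[i];
            }
        }
        return sum;
    }
}
```

Both methods return the same result for in-range input and throw the same exception for out-of-range input; the difference is only where the range test is paid for. When the clone is not produced, as reported for the OSR code in 10.0, every iteration pays for the check.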
In 10 we have the following loop:
and we fail to realize
Not sure why yet. I recall a while back we needed to find the IV initialization, which won't be there for OSR methods, but I thought we fixed that. |
Could be the result of #97122 or similar. |
@jakobbotsch, PTAL. |
Description
While performing a comparison of memcmp, ReadOnlySpan<T> SequenceEqual and a naive for loop, I noticed a weird performance regression between .NET 8 and .NET 9 / 10. The naive for loop of course supposed to be efficient, but it slowed down dramatically between .NET 8 and .NET 9.
It's possible that this is a BenchmarkDotNet issue rather than a dotnet runtime issue, but I'm not sure how to diagnose whether that is the case. I'd be happy to receive instructions on how to verify that.
The results are strange to say the least. Allocations are least in .NET 8.0 by a factor of 2, but runtime is minimised in both .NET 6 (despite more memory allocation) and .NET 8.
Source code to generate this is at https://github.com/richardcocks/dotnet8_9_10_regression.
The gist of it is:
Running command:
dotnet run -c:Release -f net10.0
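For readers without access to the repository, two of the compared approaches can be sketched roughly as follows. This is not the code from the linked repository: the class and method names are hypothetical, the `byte` element type is an assumption, and the memcmp variant is left out because calling it needs a platform-specific P/Invoke.

```csharp
using System;

public static class ArrayComparers
{
    // Naive element-by-element comparison with a plain for loop.
    public static bool NaiveEquals(byte[] a, byte[] b)
    {
        if (a.Length != b.Length)
        {
            return false;
        }
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i])
            {
                return false;
            }
        }
        return true;
    }

    // Span-based comparison via MemoryExtensions.SequenceEqual,
    // which the runtime typically vectorizes for primitive types.
    public static bool SpanEquals(byte[] a, byte[] b)
    {
        return a.AsSpan().SequenceEqual(b);
    }
}
```

In a benchmark over large arrays, each call to the naive method runs for a long time, which is the situation in which the number of invocations, and therefore the JIT tier being measured, matters.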