
[NativeAOT] ConcurrentDictionary is slower #68891


Closed
adamsitnik opened this issue May 5, 2022 · 7 comments · Fixed by #79519

@adamsitnik
Member

Most of the ConcurrentDictionary microbenchmarks are a few times slower under NativeAOT compared to .NET (CoreCLR).

Examples:

System.Collections.CreateAddAndClear.ConcurrentDictionary(Size: 512)

| Result | Base | Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name | Modality |
|--------|------|------|-------|-------------|------------------|-----|----------------|----------|
| Slower | 64614.59 | 254817.14 | 0.25 | +16385 | Windows 10 | Arm64 | Microsoft SQ1 3.0 GHz | |
| Slower | 67615.17 | 475852.15 | 0.14 | +49154 | Windows 10 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | |
| Slower | 38845.50 | 414977.53 | 0.09 | +49153 | Windows 11 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | bimodal |
| Slower | 78981.98 | 438806.96 | 0.18 | +49152 | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | |
| Slower | 79174.15 | 241955.10 | 0.33 | +16384 | macOS Monterey 12.2.1 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | |

System.Collections.CtorFromCollection.ConcurrentDictionary(Size: 512)

| Result | Base | Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name | Modality |
|--------|------|------|-------|-------------|------------------|-----|----------------|----------|
| Slower | 84175.23 | 240287.50 | 0.35 | +16385 | Windows 10 | Arm64 | Microsoft SQ1 3.0 GHz | |
| Slower | 55556.09 | 222005.69 | 0.25 | +24577 | Windows 10 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | bimodal |
| Slower | 50111.01 | 388393.75 | 0.13 | +49153 | Windows 11 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | |
| Slower | 62874.58 | 227398.68 | 0.28 | +24576 | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | |
| Slower | 69288.84 | 128176.10 | 0.54 | +8192 | macOS Monterey 12.2.1 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | |

Microsoft.Extensions.Caching.Memory.Tests.MemoryCacheTests.AddThenRemove_ExpirationTokens

| Result | Base | Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name | Modality |
|--------|------|------|-------|-------------|------------------|-----|----------------|----------|
| Slower | 77274.94 | 222586.06 | 0.35 | +11263 | Windows 10 | Arm64 | Microsoft SQ1 3.0 GHz | |
| Slower | 40825.59 | 158493.50 | 0.26 | +28294 | Windows 10 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | several? |
| Slower | 38473.82 | 166670.27 | 0.23 | -2699 | Windows 11 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | |
| Slower | 43664.37 | 180698.83 | 0.24 | +29846 | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | several? |
| Slower | 70694.75 | 160708.25 | 0.44 | +11849 | macOS Monterey 12.2.1 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | |

Microsoft.Extensions.Caching.Memory.Tests.MemoryCacheTests.AddThenRemove_AbsoluteExpiration

| Result | Base | Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name | Modality |
|--------|------|------|-------|-------------|------------------|-----|----------------|----------|
| Slower | 62336.17 | 141002.87 | 0.44 | +6193 | Windows 10 | Arm64 | Microsoft SQ1 3.0 GHz | |
| Slower | 38369.18 | 70234.58 | 0.55 | -7994 | Windows 10 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | |
| Slower | 30147.18 | 136249.59 | 0.22 | +32694 | Windows 11 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | |
| Slower | 44836.63 | 86477.55 | 0.52 | -9434 | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | |
| Slower | 58801.04 | 85427.09 | 0.69 | +14189 | macOS Monterey 12.2.1 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | |

Repro:

git clone https://github.com/dotnet/performance.git
cd performance
py .\scripts\benchmarks_ci.py -f net7.0 --filter "System.Collections.CreateAddAndClear<Int32>.ConcurrentDictionary" --bdn-arguments "--keepFiles true --runtimes net7.0 nativeaot7.0 --ilCompilerVersion 7.0.0-preview.5.22254.9 --invocationCount 3488"
BenchmarkDotNet=v0.13.1.1786-nightly, OS=Windows 10 (10.0.18363.2158/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-preview.5.22254.18
  [Host]     : .NET 7.0.0 (7.0.22.25401), X64 RyuJIT
  Job-ERDEPN : .NET 7.0.0 (7.0.22.25401), X64 RyuJIT                                                                                                                                                                                             
  Job-LHIWER : .NET 7.0.0-preview.5.22254.9, X64 NativeAOT  
| Method | Runtime | Mean | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated | Alloc Ratio |
|--------|---------|------|-------|-------|-------|-------|-----------|-------------|
| ConcurrentDictionary | .NET 7.0 | 71.56 us | 1.00 | 16.0550 | 4.0138 | - | 124.44 KB | 1.00 |
| ConcurrentDictionary | NativeAOT 7.0 | 468.09 us | 6.51 | 21.4286 | 19.6429 | 1.7857 | 172.44 KB | 1.39 |
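
For context, the CreateAddAndClear microbenchmark boils down to roughly the following shape (a simplified sketch of the dotnet/performance benchmark, not its exact code; the class and field names here are illustrative): a fresh ConcurrentDictionary is constructed on every invocation, `Size` items are added, and the dictionary is cleared, so per-instance setup costs such as internal lock creation sit directly on the measured path.

```csharp
using System.Collections.Concurrent;
using System.Linq;

// Simplified sketch of what System.Collections.CreateAddAndClear<Int32>.ConcurrentDictionary
// measures; the real benchmark lives in the dotnet/performance repository.
public class CreateAddAndClearSketch
{
    private const int Size = 512;
    private readonly int[] _keys = Enumerable.Range(0, Size).ToArray();

    public ConcurrentDictionary<int, int> ConcurrentDictionary()
    {
        // A brand-new dictionary every invocation: its internal lock objects are
        // created, locked for the first time, and then become garbage together
        // with the dictionary.
        var dict = new ConcurrentDictionary<int, int>();
        foreach (int key in _keys)
        {
            dict.TryAdd(key, key);
        }
        dict.Clear();
        return dict;
    }
}
```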

I took a quick look at the numbers reported by VTune and it seems that this might be caused by #67805, but I am not 100% sure, so I am reporting a new issue (locking itself might just be slower).

NativeAOT

[image: VTune hotspots for the NativeAOT run]

JIT

[image: VTune hotspots for the JIT (CoreCLR) run]

cc @jkotas @MichalStrehovsky

@ghost added the untriaged label May 5, 2022
@VSadov
Member

VSadov commented May 10, 2022

Yes, this looks very much like a result of #67805

I also wonder why ConcurrentDictionary is tested with workstation GC (WKS::). Is that intentional?

@agocke added this to the Future milestone May 10, 2022
@agocke removed the untriaged label May 10, 2022
@ghost removed the untriaged label May 10, 2022
@VSadov self-assigned this May 29, 2022
@adamsitnik
Member Author

> I also wonder why ConcurrentDictionary is tested with workstation GC (WKS::). Is that intentional?

By default BDN does not enforce any GC settings, so the defaults are used.
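
For anyone who wants to re-run the comparison under server GC (as suggested below), BenchmarkDotNet lets a benchmark opt in explicitly; a minimal sketch, assuming the stock [GcServer] config attribute (the benchmark body is just a placeholder):

```csharp
using System.Collections.Concurrent;
using BenchmarkDotNet.Attributes;

// Minimal sketch: force server GC for one benchmark class instead of relying on
// the host defaults. The benchmark body is a placeholder for the real workload.
[GcServer(true)]
[MemoryDiagnoser]
public class ConcurrentDictionaryWithServerGc
{
    [Params(512)]
    public int Size;

    [Benchmark]
    public ConcurrentDictionary<int, int> CreateAddAndClear()
    {
        var dict = new ConcurrentDictionary<int, int>();
        for (int i = 0; i < Size; i++)
        {
            dict.TryAdd(i, i);
        }
        dict.Clear();
        return dict;
    }
}
```

The same effect can also be had without touching the benchmark code by setting `<ServerGarbageCollection>true</ServerGarbageCollection>` in the project file.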

@VSadov
Member

VSadov commented Jun 23, 2022

It would be more realistic to test ConcurrentDictionary with server GC, so that concurrency is not limited by a single-threaded GC.

Anyways, the most likely problem for this regression has been fixed for Windows as of #70769
Not sure how soon it will get to the perf lab. I hope we will see improvements.

Thanks for raising this issue!!!

@MichalStrehovsky
Member

> Anyways, the most likely problem for this regression has been fixed for Windows as of #70769

It might have helped a little, but the NativeAOT profile shows that we're spending a lot of time in the infrastructure around Monitor.Enter/Exit (e.g. DeadEntryCollector.Finalize). Taking a lock on an object is a bit more expensive on NativeAOT than in CoreCLR, especially if we then quickly throw away the objects we were locking on and make new ones. The trace seems to be dominated by the cost of Monitor.Enter on an object for the first time, and of discarding the locking information we have for objects that were collected.
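
To make that concrete, the expensive pattern being described is essentially the following (a contrived sketch, not the actual benchmark code): every Monitor.Enter hits an object that has never been locked before, and the object becomes garbage immediately afterwards, so the per-object lock setup and the later cleanup of lock records for dead objects end up on the hot path.

```csharp
// Contrived sketch of the pattern described above: each lock targets a fresh
// object that has never been locked before and that dies right away.
static void LockOnShortLivedObjects(int iterations)
{
    for (int i = 0; i < iterations; i++)
    {
        var gate = new object();   // brand-new object, no lock state yet
        lock (gate)                // first Monitor.Enter on this object
        {
            // trivial critical section
        }
        // 'gate' is garbage as soon as the iteration ends, so whatever lock
        // bookkeeping the runtime created for it has to be discarded later.
    }
}
```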

The benchmark doesn't seem very real-world, in the sense that I wouldn't expect ConcurrentDictionaries to be used for such short periods of time that the cost of the first/last use dominates.

@adamsitnik
Member Author

> Anyways, the most likely problem for this regression has been fixed for Windows as of #70769

I've used the latest bits, and it turns out that it's now even slower.

The results I got for ILCompiler 7.0.0-preview.5.22254.9:

 -------------------- Histogram --------------------
 [290.537 us ; 300.306 us) | @
 [300.306 us ; 312.848 us) | @@@@
 [312.848 us ; 324.572 us) | @@@@@@@@
 [324.572 us ; 336.464 us) | @@@@@@
 [336.464 us ; 347.056 us) | @
 ---------------------------------------------------

 BenchmarkDotNet=v0.13.1.1786-nightly, OS=Windows 11 (10.0.22000.739/21H2)
 AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
 .NET SDK=7.0.100-preview.7.22323.2
   [Host]     : .NET 7.0.0 (7.0.22.32108), X64 RyuJIT
   Job-ELUXMM : .NET 7.0.0 (7.0.22.32108), X64 RyuJIT
   Job-EEOWYD : .NET 7.0.0-preview.5.22254.9, X64 NativeAOT
| Method | Runtime | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | Alloc Ratio |
|--------|---------|------|------|-------|--------|--------|-----|-----|-------|---------|-------|-------|-------|-----------|-------------|
| ConcurrentDictionary | .NET 7.0 | 512 | 40.50 us | 0.677 us | 0.600 us | 40.32 us | 39.81 us | 41.78 us | 1.00 | 0.00 | 11.7546 | 2.8670 | - | 98.27 KB | 1.00 |
| ConcurrentDictionary | NativeAOT 7.0 | 512 | 319.61 us | 10.527 us | 12.123 us | 318.31 us | 296.40 us | 341.19 us | 7.85 | 0.36 | 17.7752 | 17.4885 | 3.1537 | 146.27 KB | 1.49 |

Latest bits:

 -------------------- Histogram --------------------
 [317.286 us ; 387.411 us) | @@@
 [387.411 us ; 441.485 us) |
 [441.485 us ; 501.692 us) | @@
 [501.692 us ; 571.816 us) | @@@@@@@@@@@@@@@
 ---------------------------------------------------

 BenchmarkDotNet=v0.13.1.1799-nightly, OS=Windows 11 (10.0.22000.739/21H2)
 AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
 .NET SDK=7.0.100-preview.7.22323.19
   [Host]     : .NET 7.0.0 (7.0.22.32207), X64 RyuJIT
   Job-SXLBJF : .NET 7.0.0 (7.0.22.32207), X64 RyuJIT
   Job-NTQTOB : .NET 7.0.0-preview.6.22323.6, X64 NativeAOT
| Method | Runtime | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | Alloc Ratio |
|--------|---------|------|------|-------|--------|--------|-----|-----|-------|---------|-------|-------|-------|-----------|-------------|
| ConcurrentDictionary | .NET 7.0 | 512 | 41.43 us | 0.799 us | 0.708 us | 41.28 us | 40.69 us | 43.24 us | 1.00 | 0.00 | 11.7546 | 2.8670 | - | 98.27 KB | 1.00 |
| ConcurrentDictionary | NativeAOT 7.0 | 512 | 507.14 us | 62.967 us | 72.513 us | 530.69 us | 338.30 us | 569.98 us | 11.98 | 2.03 | 18.0619 | 16.6284 | 3.4404 | 146.27 KB | 1.49 |

@adamsitnik
Member Author

> Taking a lock on an object is a bit more expensive on NativeAOT than in CoreCLR

This would explain the slowness that I have observed in other, simpler benchmarks.

@VSadov
Member

VSadov commented Jun 24, 2022

The new round of tests is on a machine with 2x the cores compared to the original one. Maybe that made the situation with locks a bit worse.

Also, while GC suspension in NativeAOT on Windows should be functionally on par with CoreCLR (i.e. unlikely to pause/hang), it misses some optimizations. I wonder if that is important. We plan to implement those, but reliability is the first priority.

The scenario does look like it is impacted by locks a lot. ConcurrentDictionary uses locks internally and may dynamically add locks as needed.
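
For reference, the number of internal lock stripes can be fixed up front through the (concurrencyLevel, capacity) constructor; whether that also suppresses the dynamic lock growth mentioned above is an implementation detail of ConcurrentDictionary, so treat this sketch as illustrative only:

```csharp
using System;
using System.Collections.Concurrent;

// Default constructor: the dictionary chooses its own number of internal locks
// (and, per the comment above, may add more dynamically as it grows).
var defaultDict = new ConcurrentDictionary<int, string>();

// Explicit concurrencyLevel: the number of lock stripes is chosen at construction
// time (here one per core, with room for 512 entries up front).
var pinnedDict = new ConcurrentDictionary<int, string>(
    concurrencyLevel: Environment.ProcessorCount,
    capacity: 512);

pinnedDict.TryAdd(1, "one");
Console.WriteLine(defaultDict.IsEmpty);   // True
```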

I looked at the NativeAOT implementation of object locks and it is relatively heavy on the lock-creation path. I have some ideas for how that could be improved, but I'm not sure how much that would help.

@ghost added the in-pr label Dec 12, 2022
@ghost removed the in-pr label Dec 15, 2022
@ghost locked as resolved and limited conversation to collaborators Jan 14, 2023