
[NativeAOT] ConcurrentDictionary is slower #68891


Closed
adamsitnik opened this issue May 5, 2022 · 7 comments · Fixed by #79519

@adamsitnik
Member

Most of the ConcurrentDictionary microbenchmarks are a few times slower under NativeAOT compared to .NET (CoreCLR).

Examples:

System.Collections.CreateAddAndClear.ConcurrentDictionary(Size: 512)

| Result | Base | Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name | Modality |
|--------|------|------|-------|-------------|------------------|-----|----------------|----------|
| Slower | 64614.59 | 254817.14 | 0.25 | +16385 | Windows 10 | Arm64 | Microsoft SQ1 3.0 GHz | |
| Slower | 67615.17 | 475852.15 | 0.14 | +49154 | Windows 10 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | |
| Slower | 38845.50 | 414977.53 | 0.09 | +49153 | Windows 11 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | bimodal |
| Slower | 78981.98 | 438806.96 | 0.18 | +49152 | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | |
| Slower | 79174.15 | 241955.10 | 0.33 | +16384 | macOS Monterey 12.2.1 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | |

System.Collections.CtorFromCollection.ConcurrentDictionary(Size: 512)

| Result | Base | Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name | Modality |
|--------|------|------|-------|-------------|------------------|-----|----------------|----------|
| Slower | 84175.23 | 240287.50 | 0.35 | +16385 | Windows 10 | Arm64 | Microsoft SQ1 3.0 GHz | |
| Slower | 55556.09 | 222005.69 | 0.25 | +24577 | Windows 10 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | bimodal |
| Slower | 50111.01 | 388393.75 | 0.13 | +49153 | Windows 11 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | |
| Slower | 62874.58 | 227398.68 | 0.28 | +24576 | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | |
| Slower | 69288.84 | 128176.10 | 0.54 | +8192 | macOS Monterey 12.2.1 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | |

Microsoft.Extensions.Caching.Memory.Tests.MemoryCacheTests.AddThenRemove_ExpirationTokens

| Result | Base | Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name | Modality |
|--------|------|------|-------|-------------|------------------|-----|----------------|----------|
| Slower | 77274.94 | 222586.06 | 0.35 | +11263 | Windows 10 | Arm64 | Microsoft SQ1 3.0 GHz | |
| Slower | 40825.59 | 158493.50 | 0.26 | +28294 | Windows 10 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | several? |
| Slower | 38473.82 | 166670.27 | 0.23 | -2699 | Windows 11 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | |
| Slower | 43664.37 | 180698.83 | 0.24 | +29846 | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | several? |
| Slower | 70694.75 | 160708.25 | 0.44 | +11849 | macOS Monterey 12.2.1 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | |

Microsoft.Extensions.Caching.Memory.Tests.MemoryCacheTests.AddThenRemove_AbsoluteExpiration

| Result | Base | Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name | Modality |
|--------|------|------|-------|-------------|------------------|-----|----------------|----------|
| Slower | 62336.17 | 141002.87 | 0.44 | +6193 | Windows 10 | Arm64 | Microsoft SQ1 3.0 GHz | |
| Slower | 38369.18 | 70234.58 | 0.55 | -7994 | Windows 10 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | |
| Slower | 30147.18 | 136249.59 | 0.22 | +32694 | Windows 11 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | |
| Slower | 44836.63 | 86477.55 | 0.52 | -9434 | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | |
| Slower | 58801.04 | 85427.09 | 0.69 | +14189 | macOS Monterey 12.2.1 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | |

Repro:

git clone https://github.com/dotnet/performance.git
cd performance
py .\scripts\benchmarks_ci.py -f net7.0 --filter "System.Collections.CreateAddAndClear<Int32>.ConcurrentDictionary" --bdn-arguments "--keepFiles true --runtimes net7.0 nativeaot7.0 --ilCompilerVersion 7.0.0-preview.5.22254.9 --invocationCount 3488"
BenchmarkDotNet=v0.13.1.1786-nightly, OS=Windows 10 (10.0.18363.2158/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-preview.5.22254.18
  [Host]     : .NET 7.0.0 (7.0.22.25401), X64 RyuJIT
  Job-ERDEPN : .NET 7.0.0 (7.0.22.25401), X64 RyuJIT                                                                                                                                                                                             
  Job-LHIWER : .NET 7.0.0-preview.5.22254.9, X64 NativeAOT  
| Method | Runtime | Mean | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated | Alloc Ratio |
|--------|---------|------|-------|-------|-------|-------|-----------|-------------|
| ConcurrentDictionary | .NET 7.0 | 71.56 us | 1.00 | 16.0550 | 4.0138 | - | 124.44 KB | 1.00 |
| ConcurrentDictionary | NativeAOT 7.0 | 468.09 us | 6.51 | 21.4286 | 19.6429 | 1.7857 | 172.44 KB | 1.39 |
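
For context, the CreateAddAndClear microbenchmark boils down to roughly the following shape (a simplified sketch of the dotnet/performance benchmark, not its exact code; the class and field names here are illustrative): a fresh ConcurrentDictionary is constructed on every invocation, `Size` items are added, and the dictionary is cleared, so per-instance setup costs such as internal lock creation sit directly on the measured path.

```csharp
using System.Collections.Concurrent;
using System.Linq;

// Simplified sketch of what System.Collections.CreateAddAndClear<Int32>.ConcurrentDictionary
// measures; the real benchmark lives in the dotnet/performance repository.
public class CreateAddAndClearSketch
{
    private const int Size = 512;
    private readonly int[] _keys = Enumerable.Range(0, Size).ToArray();

    public ConcurrentDictionary<int, int> ConcurrentDictionary()
    {
        // A brand-new dictionary every invocation: its internal lock objects are
        // created, locked for the first time, and then become garbage together
        // with the dictionary.
        var dict = new ConcurrentDictionary<int, int>();
        foreach (int key in _keys)
        {
            dict.TryAdd(key, key);
        }
        dict.Clear();
        return dict;
    }
}
```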

I took a quick look at the numbers reported by VTune and it seems that this might be caused by #67805, but I am not 100% sure, so I am reporting a new issue (locking itself might just be slower).

NativeAOT

[image: VTune hotspots for the NativeAOT run]

JIT

[image: VTune hotspots for the JIT (CoreCLR) run]

cc @jkotas @MichalStrehovsky

@ghost added the untriaged label May 5, 2022
@VSadov
Member

VSadov commented May 10, 2022

Yes, this looks very much like a result of #67805

I also wonder why ConcurrentDictionary is tested with workstation GC (WKS::). Is that intentional?

@agocke added this to the Future milestone May 10, 2022
@agocke removed the untriaged label May 10, 2022
@ghost removed the untriaged label May 10, 2022
@VSadov self-assigned this May 29, 2022
@adamsitnik
Member Author

> I also wonder why ConcurrentDictionary is tested with workstation GC (WKS::). Is that intentional?

By default BDN does not enforce any GC settings, so the defaults are used.
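
For anyone who wants to re-run the comparison under server GC (as suggested below), BenchmarkDotNet lets a benchmark opt in explicitly; a minimal sketch, assuming the stock [GcServer] config attribute (the benchmark body is just a placeholder):

```csharp
using System.Collections.Concurrent;
using BenchmarkDotNet.Attributes;

// Minimal sketch: force server GC for one benchmark class instead of relying on
// the host defaults. The benchmark body is a placeholder for the real workload.
[GcServer(true)]
[MemoryDiagnoser]
public class ConcurrentDictionaryWithServerGc
{
    [Params(512)]
    public int Size;

    [Benchmark]
    public ConcurrentDictionary<int, int> CreateAddAndClear()
    {
        var dict = new ConcurrentDictionary<int, int>();
        for (int i = 0; i < Size; i++)
        {
            dict.TryAdd(i, i);
        }
        dict.Clear();
        return dict;
    }
}
```

The same effect can also be had without touching the benchmark code by setting `<ServerGarbageCollection>true</ServerGarbageCollection>` in the project file.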

@VSadov
Member

VSadov commented Jun 23, 2022

It would be more realistic to test ConcurrentDictionary with server GC, so that concurrency is not limited by a single-threaded GC.

Anyways, the most likely problem for this regression has been fixed for Windows as of #70769
Not sure how soon it will get to the perf lab. I hope we will see improvements.

Thanks for raising this issue!!!

@MichalStrehovsky
Member

> Anyways, the most likely problem for this regression has been fixed for Windows as of #70769

It might have helped a little, but the NativeAOT profile shows that we're spending a lot of time in the infrastructure around Monitor.Enter/Exit (e.g. DeadEntryCollector.Finalize). Taking a lock on an object is a bit more expensive on NativeAOT than in CoreCLR, especially if we then quickly throw away the objects we were locking on and make new ones. The trace seems to be dominated by the cost of Monitor.Enter on an object for the first time, and of discarding the locking information we have for objects that were collected.
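
To make that concrete, the expensive pattern being described is essentially the following (a contrived sketch, not the actual benchmark code): every Monitor.Enter hits an object that has never been locked before, and the object becomes garbage immediately afterwards, so the per-object lock setup and the later cleanup of lock records for dead objects end up on the hot path.

```csharp
// Contrived sketch of the pattern described above: each lock targets a fresh
// object that has never been locked before and that dies right away.
static void LockOnShortLivedObjects(int iterations)
{
    for (int i = 0; i < iterations; i++)
    {
        var gate = new object();   // brand-new object, no lock state yet
        lock (gate)                // first Monitor.Enter on this object
        {
            // trivial critical section
        }
        // 'gate' is garbage as soon as the iteration ends, so whatever lock
        // bookkeeping the runtime created for it has to be discarded later.
    }
}
```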

The benchmark doesn't seem very real-world, in the sense that I wouldn't expect ConcurrentDictionaries to be used for such short periods of time that the cost of the first/last use dominates.

@adamsitnik
Member Author

> Anyways, the most likely problem for this regression has been fixed for Windows as of #70769

I've used the latest bits, and it turns out that it's now even slower.

The results I got for ILCompiler 7.0.0-preview.5.22254.9:

 -------------------- Histogram --------------------
 [290.537 us ; 300.306 us) | @
 [300.306 us ; 312.848 us) | @@@@
 [312.848 us ; 324.572 us) | @@@@@@@@
 [324.572 us ; 336.464 us) | @@@@@@
 [336.464 us ; 347.056 us) | @
 ---------------------------------------------------

 BenchmarkDotNet=v0.13.1.1786-nightly, OS=Windows 11 (10.0.22000.739/21H2)
 AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
 .NET SDK=7.0.100-preview.7.22323.2
   [Host]     : .NET 7.0.0 (7.0.22.32108), X64 RyuJIT
   Job-ELUXMM : .NET 7.0.0 (7.0.22.32108), X64 RyuJIT
   Job-EEOWYD : .NET 7.0.0-preview.5.22254.9, X64 NativeAOT
| Method | Runtime | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | Alloc Ratio |
|--------|---------|------|------|-------|--------|--------|-----|-----|-------|---------|-------|-------|-------|-----------|-------------|
| ConcurrentDictionary | .NET 7.0 | 512 | 40.50 us | 0.677 us | 0.600 us | 40.32 us | 39.81 us | 41.78 us | 1.00 | 0.00 | 11.7546 | 2.8670 | - | 98.27 KB | 1.00 |
| ConcurrentDictionary | NativeAOT 7.0 | 512 | 319.61 us | 10.527 us | 12.123 us | 318.31 us | 296.40 us | 341.19 us | 7.85 | 0.36 | 17.7752 | 17.4885 | 3.1537 | 146.27 KB | 1.49 |

Latest bits:

 -------------------- Histogram --------------------
 [317.286 us ; 387.411 us) | @@@
 [387.411 us ; 441.485 us) |
 [441.485 us ; 501.692 us) | @@
 [501.692 us ; 571.816 us) | @@@@@@@@@@@@@@@
 ---------------------------------------------------

 BenchmarkDotNet=v0.13.1.1799-nightly, OS=Windows 11 (10.0.22000.739/21H2)
 AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
 .NET SDK=7.0.100-preview.7.22323.19
   [Host]     : .NET 7.0.0 (7.0.22.32207), X64 RyuJIT
   Job-SXLBJF : .NET 7.0.0 (7.0.22.32207), X64 RyuJIT
   Job-NTQTOB : .NET 7.0.0-preview.6.22323.6, X64 NativeAOT
| Method | Runtime | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | Alloc Ratio |
|--------|---------|------|------|-------|--------|--------|-----|-----|-------|---------|-------|-------|-------|-----------|-------------|
| ConcurrentDictionary | .NET 7.0 | 512 | 41.43 us | 0.799 us | 0.708 us | 41.28 us | 40.69 us | 43.24 us | 1.00 | 0.00 | 11.7546 | 2.8670 | - | 98.27 KB | 1.00 |
| ConcurrentDictionary | NativeAOT 7.0 | 512 | 507.14 us | 62.967 us | 72.513 us | 530.69 us | 338.30 us | 569.98 us | 11.98 | 2.03 | 18.0619 | 16.6284 | 3.4404 | 146.27 KB | 1.49 |

@adamsitnik
Member Author

> Taking a lock on an object is a bit more expensive on NativeAOT than in CoreCLR

This would explain the slowness that I have observed in other, simpler benchmarks.

@VSadov
Member

VSadov commented Jun 24, 2022

The new round of tests is on a machine with 2x the cores compared to the original one. Maybe that made the situation with locks a bit worse.

Also, while GC suspension in NativeAOT on Windows should be functionally on par with CoreCLR (i.e. unlikely to pause/hang), it misses some optimizations. I wonder if that is important. We plan to implement those, but reliability is the first priority.

The scenario does look like it is impacted by locks a lot. ConcurrentDictionary uses locks internally and may dynamically add locks as needed.
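
For reference, the number of internal lock stripes can be fixed up front through the (concurrencyLevel, capacity) constructor; whether that also suppresses the dynamic lock growth mentioned above is an implementation detail of ConcurrentDictionary, so treat this sketch as illustrative only:

```csharp
using System;
using System.Collections.Concurrent;

// Default constructor: the dictionary chooses its own number of internal locks
// (and, per the comment above, may add more dynamically as it grows).
var defaultDict = new ConcurrentDictionary<int, string>();

// Explicit concurrencyLevel: the number of lock stripes is chosen at construction
// time (here one per core, with room for 512 entries up front).
var pinnedDict = new ConcurrentDictionary<int, string>(
    concurrencyLevel: Environment.ProcessorCount,
    capacity: 512);

pinnedDict.TryAdd(1, "one");
Console.WriteLine(defaultDict.IsEmpty);   // True
```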

I looked at the NativeAOT implementation of object locks and it is relatively heavy on the lock-creation path. I have some ideas for how that could be improved, but I'm not sure how much that would help.

@ghost added the in-pr label Dec 12, 2022
@ghost removed the in-pr label Dec 15, 2022
@ghost locked as resolved and limited conversation to collaborators Jan 14, 2023