Workers being limited by RAM bandwidth #2077

Open
Viren6 opened this issue Jun 16, 2024 · 2 comments

Viren6 commented Jun 16, 2024

The script from https://github.com/official-monty/montytools/blob/main/BenchNormalization/benchNormToolSF.py was run with SF16.1 on two Ryzen 9 7950X (Eco mode off) systems. Both systems had DDR5-6000 RAM, but one had only 1x16GB (Ciekce) whereas the other had 2x16GB (Zuppa). The results (Final Average Benchmark NPS over a minute) are below:

Ciekce (1x16GB):
1 Process: 1869666
32 Processes: 332322

Zuppa (2x16GB):
1 Process: 1814631
32 Processes: 511087

The system with double the memory bandwidth achieved 54% higher nps when running 32 processes.
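
For context, the measurement boils down to launching many identical single-threaded benches at once and averaging the nps they each report. A minimal sketch of that idea (not the actual benchNormToolSF.py; the engine path and the stderr parsing are assumptions):

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor

ENGINE = "./stockfish"  # assumed path to the SF16.1 binary
N = 32                  # number of simultaneous bench processes

def one_bench(_) -> int:
    # SF prints its bench summary ("Nodes/second : ...") to stderr.
    out = subprocess.run([ENGINE, "bench"], capture_output=True, text=True).stderr
    return int(re.search(r"Nodes/second\s*:\s*(\d+)", out).group(1))

# Start all benches at once so they compete for memory bandwidth,
# then average the per-process nps they report.
with ThreadPoolExecutor(max_workers=N) as pool:
    per_process = list(pool.map(one_bench, range(N)))
print(f"average nps per process: {sum(per_process) // N}")
```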

The script from https://github.com/official-monty/montytools/blob/main/BenchNormalization/benchNormToolMonty.py was run with the monty chess engine on the same systems, providing a reference point that is free of RAM bandwidth limitation:

Ciekce (1x16GB):
1 Process: 685668
32 Processes: 365731

Zuppa (2x16GB):
1 Process: 688477
32 Processes: 370616

Applying the same 1-to-32-process scaling that monty shows (370616 / 688477 ≈ 0.54) to SF16.1's single-process speed predicts roughly 1M nps for 32 processes, yet Zuppa achieves only ~511k. It is therefore likely that the Zuppa system running SF is still limited by RAM bandwidth, even though 2x16GB dual channel represents the highest bandwidth possible on consumer motherboards.

Furthermore, the net currently in dev is 20% larger, and this issue is expected to become even more severe in the future as CPU speeds advance faster than RAM bandwidth.

The only solution I see is to raise the default threads for each test from 1 to 2. This may require the fastchess migration first, to prevent time losses as the TC is scaled down. The alternative of reducing the number of processes to the number of physical cores is still RAM bandwidth limited today, and reducing processes further results in poor CPU utilization.

Additionally, the method by which we measure the nps of a worker is invalid. Currently we run one process with a bench alongside one process searching with n-1 threads. This doesn't account for the RAM bandwidth limitations discussed above, and therefore the measured nps is far higher than the real nps.
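
A rough sketch of that current scheme as described, with the engine path and output parsing assumed:

```python
import re
import subprocess

ENGINE = "./stockfish"  # assumed binary path
CORES = 32              # total threads on the worker

# Occupy n-1 cores with a background search in a single process (UCI).
searcher = subprocess.Popen([ENGINE], stdin=subprocess.PIPE,
                            stdout=subprocess.DEVNULL, text=True)
searcher.stdin.write(f"setoption name Threads value {CORES - 1}\n"
                     "position startpos\n"
                     "go infinite\n")
searcher.stdin.flush()

# Measure nps with a single-threaded bench running alongside it;
# SF prints the bench summary to stderr.
out = subprocess.run([ENGINE, "bench"], capture_output=True, text=True).stderr
nps = int(re.search(r"Nodes/second\s*:\s*(\d+)", out).group(1))
print("measured per-core nps:", nps)

# Stop the background search.
searcher.stdin.write("stop\nquit\n")
searcher.stdin.flush()
searcher.wait()
```

Because the background searcher is a single process sharing one hash table, it plausibly generates far less memory traffic than 32 independent processes each with their own hash, so the bench sees more bandwidth headroom than a real worker does.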

vondele commented Jun 17, 2024

To verify that the "solution" works, one would also need the results of running 16 processes @ 2 threads (ideally also 8@4, 4@8, and so on) and see the nps. Would be great if you could collect some data for that.
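
Such a sweep could be scripted along these lines (a hypothetical sketch, keeping threads × concurrency constant; it assumes SF's `bench [hash] [threads] [depth]` argument order and that the summary goes to stderr):

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor

ENGINE = "./stockfish"  # assumed binary path
TOTAL = 32              # total hardware threads on the machine

def bench_nps(threads: int) -> int:
    # "bench 64 <threads> 16" = 64MB hash, <threads> threads, depth 16;
    # the summary line "Nodes/second : ..." goes to stderr.
    out = subprocess.run([ENGINE, "bench", "64", str(threads), "16"],
                         capture_output=True, text=True).stderr
    return int(re.search(r"Nodes/second\s*:\s*(\d+)", out).group(1))

for threads in (1, 2, 4, 8, 16, 32):
    concurrency = TOTAL // threads
    # Run `concurrency` benches of `threads` threads each, all at once.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(bench_nps, [threads] * concurrency))
    total = sum(results)
    print(f"{concurrency:>2}@{threads:<2}  total nps: {total:>9}  "
          f"per thread: {total // TOTAL}")
```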

On the other hand, we should also think about whether there is a way to reduce SF's memory bandwidth needs.

Viren6 commented Jun 17, 2024

Results of benches (Final Average Benchmark NPS over a minute) with 64MB hash and depth 16 on the Ciekce system (1x16GB):

Concurrency  Threads  NPS (per process)  NPS/Thread
          1        1            1816044     1816044
         32        1             316559      316559
         16        2             893445      446723
          8        4            2559356      639839
          4        8            7272827      909103
          2       16           17664663     1104040
          1       32           40885107     1277660

Sharing does help a lot. However, the curve isn't steep enough for changing the thread count alone to solve the problem; we will need to find some other way to achieve it.
