Sub-optimal multithreading performance on windows. #81

Open
balintlaczko opened this issue Apr 27, 2021 · 11 comments

Comments

@balintlaczko

Again, thanks for the great package! I am still walking through the tutorials, and I thought I would report this one too. In the multithreading tutorial:
[image]
when I switch on multithreading here:
[image]
I get almost no change in the "Median CPU" meter (it is consistently 3-4 points higher with multithreading).
If I am using WASAPI audio drivers, the CPU level of Max.exe in the Task Manager stays roughly the same. However, if I am on ASIO, the CPU level of Max.exe jumps to 4-5x the level (in my case from around 4% to around 20%) when I turn on multithreading. The "Median CPU" meter again stays more or less the same, 3-4 points higher, as with WASAPI. This surge seems independent of the I/O or signal vector size.
Audio still comes out of the patch unchanged with or without multithreading (and with no crash or error message).

@AlexHarker
Owner

Thanks - depending on the scenario you may or may not see a CPU benefit from multithreading. However, what you are describing doesn't sound ideal. The threading primitives on Windows are a bit different to those on Mac, as are the thread priority settings, and it may be that these can be tweaked to improve the situation. I will aim to take a look when I can.

@balintlaczko
Author

Thanks a lot! I suspected that this might be an issue. I remember I also had problems with a beta build of mubu a while ago when I tried multithreading on it: it worked as expected on Mac, and drove Max to a complete halt on Windows.

@AlexHarker
Owner

AlexHarker commented Apr 28, 2021

Can you increase the value of the length of the ramp from 1024 to 8192 and check again? The wins are always likely to be better when the computer is working harder for longer periods, so that may give different results and some info on whether you ever get an improvement.

For reference my results are:

On Mac
Default settings I get 24% and 15% (multithreading off and on)
For 8192 I get 100+% and 35% (multithreading off and on)

Windows (on Mac hardware)
Default settings I get 23% and 23% (multithreading off and on)
For 8192 I get 100+% and 60% (multithreading off and on)

So - this suggests that the threading overheads are higher on Windows, which I'd be keen to reduce if possible, but the basic functionality seems to work...

@AlexHarker
Owner

Also - it is not so unexpected that the CPU measured in Task Manager would increase. If CPU is being measured across cores then the usage would go up, whereas Max measures CPU in terms of time only, so more cores doing the same work in the same time will look the same. I suspect there is a limit to how far the threading overheads can be reduced, and the result might not match the Mac implementation, but if I can do better I will.
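
To make the difference between the two meters concrete, here is a minimal sketch of the arithmetic. Both helper functions are hypothetical and are not taken from Max or Task Manager; they only assume that one meter divides the callback's wall-clock time by the time available, while the other divides CPU time summed over all cores by the total capacity of the machine.

```cpp
#include <cassert>

// Hypothetical helper: load as a time-based meter might report it, i.e. the
// wall-clock time spent in one audio callback divided by the time available
// for that callback. Spreading the work over more cores shortens the wall
// time, so this figure can fall even when the total work rises.
double wallClockLoadPercent(double callbackWallMs, double periodMs)
{
    assert(periodMs > 0.0);
    return 100.0 * callbackWallMs / periodMs;
}

// Hypothetical helper: load as a system-wide meter might report it, i.e. the
// CPU time summed over all cores divided by the capacity of all cores.
// Threading overheads add to the summed CPU time, so this figure can rise.
double systemLoadPercent(double totalCpuMs, double periodMs, int cores)
{
    assert(periodMs > 0.0 && cores > 0);
    return 100.0 * totalCpuMs / (periodMs * cores);
}
```

Under these assumptions, a single thread that needs the whole callback period on a 6-core machine would read as 100% on the time-based meter but only about 17% system-wide, so the two meters need not move in the same direction when multithreading is switched on.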

@balintlaczko
Author

balintlaczko commented Apr 28, 2021

Aha! It woooorks! Tested only on ASIO at the moment, but my results with 8192 samples ramp (..and with 1024 i/o and sigvs if that matters):

CPU in Max
No MT: constant 100% | MT: 27-28%

CPU in Task Mgr:
No MT: around 16% (which on the 6-core machine means 100% in "Mac terms") | MT: 27-28%

I also noticed (following the fan noise ramps) that with no MT one of the cores is always near 100 degrees (and the load hops from core to core), and of course the fans ramp up desperately, while with MT all cores stay firmly at around 60 degrees and the fans calm down too.

So it seems like my original report was a false alarm, everything seems to work as intended, it's just the OS difference.

It is also interesting that with MT the Max CPU meter and the Max.exe in the Task Manager lined up (coincidence?).

Thanks a lot for the help!

@AlexHarker
Owner

I'd still like to improve things further if I can, as on the same hardware here the speedup is not as good, but glad to hear that it is at least working...

@AlexHarker changed the title from "Multithreading does not seem to work as intended on Windows" to "Sub-optimal multithreading performance on windows." on Mar 16, 2022

@balintlaczko
Author

Hey there! Just testing the multithreading performance again on Windows with the prerelease.

On Windows 10, I get median CPU of 11-12 with multithreading OFF, and 69-70 (plus audible crackle) with multithreading ON. This is with WASAPI drivers, and NOT in exclusive mode (which is totally good almost always). If I use a (still WASAPI-based) ASIO driver IN exclusive mode, then the crackle goes away, but the huge difference in Median CPU remains. The other weird thing is that if I look at CPU in the Task Manager, with multithreading OFF I get around 0.7-1.0% CPU, which consistently drops(!?) to 0.1-0.8% (yes, more variance), mostly hovering around 0.3. I/O and signal vectors both at 128. What sense does this make?

...and then some more tests:

All tests are made in WASAPI-based ASIO (FlexAsio) in exclusive mode, 128 samples I/O and signal vector sizes, 44100Hz.

Params: streams=100, interval=512, length=1024
ST: "Median CPU"=12, Tskmgr=0.6-1%
MT: "Median CPU"=61, Tskmgr=0.1-0.7%

Params: streams=100, interval=100, length=1024
ST: "Median CPU"=47, Tskmgr=3.0-3.6%
MT: "Median CPU"=100, Tskmgr=0.1-0.9%, unusable, constant dropouts

Params: streams=100, interval=512, length=10000
ST: "Median CPU"=79, Tskmgr=5.6-5.9%
MT: "Median CPU"=45-49, Tskmgr=6.3-7.5%

Params: streams=1000, interval=512, length=1024
ST: "Median CPU"=100, Tskmgr=6.9-7.4%, unusable, constant dropouts
MT: "Median CPU"=88-95, Tskmgr=18.6-20.6%

Params: streams=100, interval=100, length=10
ST: "Median CPU"=10, Tskmgr=0.5-0.7%
MT: "Median CPU"=100, Tskmgr=0.1-1.0%, unusable, constant dropouts

@AlexHarker
Owner

Thanks

For my Mac running windows I now get:

Default settings I get 17% and 17% (multithreading off and on)
For 8192 I get 100+% and 45% (multithreading off and on)

So it looks like potentially the multithreading fixes might make things a bit worse on windows. I will try to attempt some improvement here if I can.

@AlexHarker
Owner

I've tried a few things, but none of them have significantly improved the situation. I'm keeping notes here for future reference.

Things tried:

  • Replacing the direct Windows semaphore call with C++20 std::counting_semaphore
  • Grouping changes to atomic counters, rather than making them individually as increments/decrements in a loop
  • Using std::memory_order_relaxed on all counter operations (on the assumption that fences might be needed later)
  • Removing memory fences temporarily to assess the impact on performance
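
For readers unfamiliar with the second and third items, the following is a minimal sketch of what grouping counter updates and using relaxed ordering can look like. It is not FrameLib code: the counter `pendingCount` and both function names are hypothetical, chosen only to illustrate the idea.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical counter, standing in for whatever a scheduler might track
std::atomic<std::int32_t> pendingCount{0};

// Ungrouped version: one atomic read-modify-write per item in the loop
void addIndividually(std::int32_t n)
{
    assert(n >= 0);
    for (std::int32_t i = 0; i < n; i++)
        pendingCount.fetch_add(1, std::memory_order_relaxed);
}

// Grouped version: a single atomic read-modify-write for the whole batch.
// Relaxed ordering keeps the update atomic but imposes no ordering on the
// surrounding memory accesses, so separate fences may still be required.
void addGrouped(std::int32_t n)
{
    assert(n >= 0);
    pendingCount.fetch_add(n, std::memory_order_relaxed);
}
```

Both functions leave the counter with the same value; the grouped form simply touches the shared cache line once rather than n times, which is why it is a natural thing to try when contention is suspected.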

Sadly, given that none of this has worked, at the moment there are no obvious routes to improvement that don't involve significantly rethinking the multithreading approach for Windows, and that is not guaranteed to end up with a performance win.

@AlexHarker
Owner

I've just tried one last thing which is in this build:

https://drive.google.com/file/d/1juxiO7XXsnkZFVSW3KGruRxuEvVG0IFf/view?usp=sharing

@balintlaczko - could you try this build at your end and report on the scenarios you outlined above?

@AlexHarker
Owner

AlexHarker commented Apr 17, 2022

[Edited after more investigation]

Updates. With vcredist updated on the i9 it would seem that results are much improved (but still below the Mac side speedups). The build above is also improved, which calls into question the use of thread sleeping wherever it appears in framelib.

Observations / things to note for now. I aim to fix as much as I can before release and return to this over time:

  • The main slowdown is the sleep on the main (calling) audio thread, which was designed to reduce contention when the stack was empty. This can be omitted or replaced with a busy wait (potentially hand-rolled).
  • The other use of sleep is in the lock.
  • The behaviour of short sleeps on Mac is unclear (it relies on nanosleep), but it does seem to solve contention issues.
  • Using yield() seemingly produces interactions with lower-priority processing (although this seems to contradict the documentation).
  • Atomic wait from C++20 might be a good way to go eventually, but the project is C++11 and these methods are unavailable in clang on my machine, so I can't test them. The underlying calls are also not supported, so this is not worth pursuing for now.
  • It is probably/possibly worth treating the two sleeping scenarios differently. One (in the processing queue) is simply to reduce contention and can be a busy wait. In the lock we must yield to lower-priority threads, because that could be what we are waiting for, and that (afaik) means sleeping the thread if we are unsuccessful at a certain point (and then waking).
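
The two scenarios in the last item could be sketched roughly as below. This is an illustrative C++11 sketch only, not the FrameLib implementation: the function names, the spin limit and the sleep duration are all assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Scenario 1 (hypothetical): a bounded busy wait, used purely to reduce
// contention. Returns true if the flag was seen set within maxSpins polls.
bool spinUntilSet(const std::atomic<bool>& flag, int maxSpins)
{
    assert(maxSpins >= 0);
    for (int i = 0; i < maxSpins; i++)
        if (flag.load(std::memory_order_acquire))
            return true;
    return false;
}

// Scenario 2 (hypothetical): a lock that spins briefly and then sleeps, so
// that a lower-priority thread holding the lock gets a chance to run.
void acquire(std::atomic_flag& lock, int maxSpins, std::chrono::microseconds sleepTime)
{
    int spins = 0;
    while (lock.test_and_set(std::memory_order_acquire))
    {
        if (++spins >= maxSpins)
        {
            // Give up the core for a while rather than spinning indefinitely
            std::this_thread::sleep_for(sleepTime);
            spins = 0;
        }
    }
}

// Releases a lock taken with acquire()
void release(std::atomic_flag& lock)
{
    lock.clear(std::memory_order_release);
}
```

The trade-off is that the busy wait never gives up the core, so it is only suitable where the wait is expected to be short, whereas the lock falls back to sleeping at the cost of whatever wake-up latency the platform imposes.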

At some point a good goal would still be to reduce the use of locks, particularly in relation to the memory allocator, although at present a fully lock free memory allocator is probably out of scope for quite some time.
