MultithreadedExecutor bottlenecking at 1000+ Systems #11378
Comments
Using a headless application is where this becomes really obvious, FWIW; I noticed the same issues as soon as schedule v3 was merged. See: https://discord.com/channels/691052431525675048/692572690833473578/1115422818012762274
Here are FPS comparisons between Bevy versions 0.9.1 and 0.12.1. I was told in the Discord that I should test with LTO enabled, but I cannot test 0.12.1 with LTO enabled due to an issue.
At 500 groups (= 1500 systems), all 3 are at 60+ FPS. Interestingly, enabling LTO reduces FPS for 0.9.1; not sure why.
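For context, a minimal sketch of how LTO is typically enabled for such a comparison — this is the standard Cargo release-profile setting, not anything specific to the repository in question:

```toml
# Cargo.toml — standard Cargo release-profile settings, not Bevy-specific
[profile.release]
lto = "fat"        # full cross-crate LTO; "thin" is a cheaper alternative
codegen-units = 1  # commonly paired with LTO to maximize optimization
```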
I'm very curious what use case you have where you need that many systems, but this makes plenty of sense given that the executor cannot schedule systems fast enough when they all terminate quickly. There are options like #8304 that have been thrown around, but I'm pretty sure the contention it introduces would be on par with, if not worse than, what we see here.
He was running a reinforcement learning simulation and used const generic systems as group markers. His use case would be solved by bevyengine/rfcs#16.
#12990 should reduce the overhead by a large amount. Could you test out that PR and see if it works out for you? |
With that said, I just opened the provided trace and noticed that the bottleneck may actually be running the run conditions, which are all run inline in the multithreaded executor, plus the cost of creating new spans for them while profiling. In this particular case, where the cost of running a system and its run condition are both very small, it may actually be better to embed an early return in the system than to add a run condition.
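A minimal sketch of the trade-off being described, assuming a hypothetical `GroupActive` resource standing in for whatever the real run condition checks (Bevy 0.12-era API):

```rust
use bevy::prelude::*;

// Hypothetical resource standing in for the real condition being checked.
#[derive(Resource)]
struct GroupActive(bool);

fn group_is_active(active: Res<GroupActive>) -> bool {
    active.0
}

// Variant A: gated by a run condition. The multithreaded executor
// evaluates `group_is_active` inline on its coordinating thread.
fn gated_system() {
    // ... actual work ...
}

// Variant B: the same check embedded as an early return, so the cost
// moves onto the worker thread that runs the system.
fn early_return_system(active: Res<GroupActive>) {
    if !active.0 {
        return;
    }
    // ... actual work ...
}

fn main() {
    App::new()
        .add_plugins(MinimalPlugins)
        .insert_resource(GroupActive(true))
        .add_systems(Update, gated_system.run_if(group_is_active))
        .add_systems(Update, early_return_system)
        .run();
}
```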
I tested removing the run conditions. I also made a fork, upgraded to the latest Bevy, and tried to use #12990, but …
Bevy version
0.12.1
Relevant system information
What you did
Hello! I have a use case that essentially involves separating identical groups of entities. Since Bevy's subworld support is not complete and Bevy does not have shared components (like Unity DOTS), I opted for a solution where I use Rust generics to "duplicate" my systems for every group, with a `SparseSet` marker component. So there are `Marker::<0>, Marker::<1>, ...` components and `SystemA::<0>, SystemA::<1>, ...` systems. The idea was that the separate systems/marker components would allow Bevy to properly parallelize logic across groups, since there are no cross-group dependencies. The sketch after this paragraph illustrates the pattern.
What went wrong
It seems Bevy is bottlenecked by the number of systems for my use case. Attempting 6000 systems (2000 groups, 3 systems/group) results in 7% CPU utilization at 12 FPS. A Tracy capture indicates that 80+% of the CPU time is spent in the `multithreaded executor` before tasks are sent to my thread pool. I have created a GitHub repository with the capture and code: https://github.com/UsaidPro/BevyLotsOfSystems
I was hoping Bevy would distribute the systems across the full thread pool provided by my 32-core CPU. Instead, 1 core gets consumed by the `multithreaded executor`, which does distribute the tasks across all threads (I see 55+ thread pools in Tracy), but only after taking ~60+ ms (80+% of compute time). The multithreaded executor has a mean time per call (MTPC) of 470 µs, but it is called 17k times compared to 129 Update calls, resulting in 83% of the time being spent on that single thread.
Here is a table of systems vs. FPS. All of these runs used only 7% of my CPU, with the same bottleneck. I have 3 systems, 1 of which only runs if `run_if()` returned true.
Additional information
Tracy screenshot:
