-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Radically simplify scheduler and implement new scheduling strategy for short running actions #751
base: master
Are you sure you want to change the base?
Conversation
The hlint suggestion
won't actually compile because of mainline GHC's lackluster support for impredicativity and the rank-2 types involved. |
estimates to guide scheduling of short actions on a single core.
Many thanks for the patch, and sorry it's taken so long to get around to a discussion (family all caught COVID, and been a slow recovery). I love the idea of improving the scheduler, and I appreciate the work you have put in. Having read the code, a few questions on the conceptual stuff.| Performance (After reading, this section looks like a wall of questions designed to be off-putting - but that's really not the intention! I'm keen to understand more, so please share what you already know, I don't expect answers to everything - just trying to find a thread to follow.) You seem to suggest most of the performance benefits come from using forkOn for small things? Is that accurate? If so, why? Is the cause that GHC stops doing something? What does it stop doing? Are there any other ways to stop it doing that? Does reusing already forked jobs help? Does only using forkOn help? Can we generalise these benefits for longer-running tasks, to regain uniformity? What is the percentage of short-lived tasks? Do we know why there are so many? What if a thing estimated to be quick takes a long time - do we end up stalling? What if the user has already pinned something long-running to capability 0? Is it possible to come up with a benchmark that demonstrates this policy has a significant advantage? What are the currently understood performance benefits in ghcide and Hadrian? Last reported numbers on the thread are 0.2s before vs 0.02s after. That's amazing. Is it still that, or has it got better/worse? Is the Hadrian difference measurable, or in the realm of noise? Complexity Simple code would be great, and since the scheduler was written, Haskell has improved and the job the scheduler does has changed radically. A few things I'm worried we might have lost:
|
Given the benchmark:
Run at I'm going to look into optimisations for avoiding spawning threads, as they should be fairly localised, preserve all the benefits, not require estimation, and might get to < 0.4s. With that done, we can revisit the things in this chunk. |
I think a simple way to keep the original scheduler and retain the performance improvements of this patch would be to just use |
What makes you believe the issue is all due to GHC moving between threads? In some microbenchmarks I've written, |
How do we find an unused capability? How do we be sure no one else is going to queue up for that capability? If we're wrong about our determination something is short what are the consequences? |
I benchmarked just a simple loop spawning threads. I got weird behaviour, so I shared it on StackOverflow: https://stackoverflow.com/questions/61971292/ghc-forkio-bimodal-performance Interestingly, if I replace that benchmark with forkOn performance crashes, taking vastly longer, and most scarily, quadratically long. I raised a bug at https://gitlab.haskell.org/ghc/ghc/issues/18221. I'd love to improve the performance of the pool, but am super worried about running into O(n^2) behaviour and taking out the whole IDE if things go wrong! |
Following the remarks added, and the information on the StackOverflow thread:
So it seems there's definitely something to be done, but I'm not sure what, so will wait for more information to appear on the two threads above. |
In case it's relevant, the source code to atomicModifyIORef is at https://gitlab.haskell.org/ghc/ghc/-/blob/8916e64e5437c99b82d5610286430328af1d86bc/rts/PrimOps.cmm#L634-702. |
More discussion at https://gitlab.haskell.org/ghc/ghc/issues/18224 |
I wrote up my investigations in a blog post, which might be easier to follow (nothing not already in the thread above) https://neilmitchell.blogspot.com/2020/05/fixing-space-leaks-in-ghcide.html. |
The goal of this patch was to reduce the overhead of Shake tasks in multicore systems due to context switches induced by forking trivial tasks (e.g. map lookups) off to a new thread. The core of the proposal was to use At the risk of sounding naive, an alternative would be to not fork at all, and run the trivial computation on the spot if all the dependencies are available. Is something like that possible? |
I tried a naive approach to not forking short running actions and it quickly lead to deadlocks. |
@pepeiborra - a sensible thing to try. I did so in https://github.com/digital-asset/ghcide/issues/503#issuecomment-633642426 - it gave a 25% speedup. Nice, but not amazing, and if you hypothesise wrong about what will be cheap, you loose potentially huge amounts of parallelism. Given the fact that forkIO seems to have a recognised GHC bug (it can often switch out immediately after it starts running, which is not a good idea) I think that is the best direction to push. However, I also intend to revisit benchmarking Ghcide and Shake at some point in the future. There were costs going to things like HashMap, which I probably reduced significantly by reducing collisions, so the Shake costs are likely to creep up over time as everything else gets optimised. |
I have been benchmarking ghcide recently and found the Shake overhead to be around 30% of the cost of a I agree that improving Alternatively, this could be exposed to the user. I would like to tag my rules as "cheap", "pure", etc, so that Shake has more info on how to schedule them |
If it was a pure win, I'd definitely commit that. But it's 25% faster for small values runs, and potentially 200% slower if the dynamic nature of how long certain operations take. Tagging to Shake that certain rules are cheap is one way to go, but it's both bothersome and potentially wrong. Can you share the benchmark? Perhaps as a separate ticket. And if you have it, the profile output? Things have shifted a lot, so it might even be the case that the scheduler is no longer the bottleneck. However, do be aware that the GHC profiler tends to over-attribute things to Shake because of the Monad instance it uses. |
Where |
My guess/understanding is that the big
|
I ran with 100 samples, which give a startup of 15s, a runtime of 29-35s, and 337M resident. I built for profiling by inlining Shake into Ghcide with:
That shows all the pieces of Shake broken down. If I naively add up all the Development.Shake.* bits, that gives me 19.2%. However, we know that the monad interpretation under profiling is very expensive, but under normal execution is relatively cheap - that accounts for 6.7%, leaving about 12.5% to Shake (plus there will be some cost that comes from the monad). Given how profiling works, it sometimes misses the cost of spawning threads. I see 2302 threads per test (measured by profiling 10 samples, then profiling 100, and dividing the excess threads by 90). That might take as high as 0.1s, based on benchmarks. However, from comments from SPJ, it seems like that's a bit of a bug, and we might be able to radically reduce that in future GHC's. I don't see any low hanging fruit, unfortunately. |
As noted by @mpickering and @pepeiborra in #503, the default behavior of the shake scheduler incurs a large overhead on ghcide response times on multicore systems.
On @mpickering's advice, first I tried to remove all the custom thread scheduling code in shake and delegated to ghc's native scheduling system. This didn't yield any consistent benefits, but also did not incur any noticeable penalty, both when using ghcide or when compiling GHC using hadrian.
We then noticed that shake stores the previous run time for actions, and decided to use this as a guide to scheduling short running action in such a way as to avoid the overhead of context switching and concurrency.
This fork now uses
forkOn
to schedule actions that we think will be short(< 0.05s currently) on the zeroth capability. This brings hover response times in ghcide running with multiple threads in line with those observed when running ghcide on a single thread.TODO:
PoolPriority
sSee also https://github.com/digital-asset/ghcide/issues/503#issuecomment-605666014