
Async runtime with graph-based dependency cycle checks #844

Merged · 5 commits · May 30, 2021

Conversation

@LaurentRDC (Contributor) commented Apr 22, 2021

Hello,

This is a PR to make hakyll's runtime concurrent.

The general idea is based on #725. That implementation incorrectly reported a dependency cycle on my personal website where none existed. Hence, this PR includes more robust dependency-cycle checking based on Data.Graph. A test case has been added to ensure this works.
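For readers unfamiliar with Data.Graph, a cycle check along these lines can be sketched with strongly connected components: any SCC with more than one item (or a self-loop) is a dependency cycle. The `dependencyCycles` helper below is illustrative only, not Hakyll's actual code:

```haskell
import Data.Graph (SCC (..), stronglyConnComp)

-- Illustrative sketch (not Hakyll's actual code): given each item and the
-- items it depends on, report every group of items that form a cycle.
-- stronglyConnComp tags each strongly connected component as cyclic or not.
dependencyCycles :: Ord k => [(k, [k])] -> [[k]]
dependencyCycles deps =
    [ items
    | CyclicSCC items <- stronglyConnComp
        [ (item, item, ds) | (item, ds) <- deps ]
    ]

main :: IO ()
main = do
    -- "a" and "b" depend on each other (a cycle); "c" merely depends on "a".
    -- Prints one cycle containing "a" and "b" (order within it may vary).
    print (dependencyCycles [("a", ["b"]), ("b", ["a"]), ("c", ["a"])])
```

An item that depends on itself is also reported as a (one-element) cycle, which matches the intuition that such an item can never be compiled.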

The concurrency is implemented as follows.

  1. Identify all items that need to be compiled;
  2. Compile all items concurrently. Items that compiled successfully are removed from the todo list, while items still waiting on dependencies are kept;
  3. The remaining todo items and their dependencies are checked for cycles in a single thread;
  4. Go back to step 1 until all items have been compiled.
pickAndChase :: Runtime ()
pickAndChase = do
    todo <- fmap runtimeTodo . liftIO . readTVarIO =<< get
    unless (null todo) $ do
        checkForDependencyCycle
        forConcurrently_ (M.keys todo) chase
        pickAndChase

The advantage of this implementation is that it is easy to switch between concurrent and serial execution by simply swapping forConcurrently_ for forM_. Maybe we could offer this option as a compile-time flag?

I have tested this version of hakyll on my personal page and I'm very satisfied with the results.

closes #714

@LaurentRDC (Contributor, Author) commented Apr 23, 2021

Here is the benchmark on a site with quite a few JPGs to compress with hakyll-images:

Pre-PR 6148 ms
This PR +RTS -N1 -RTS 6179 ms
This PR +RTS -N4 -RTS 4366 ms

@Minoru (Collaborator) left a comment

I know next to nothing about async, so this PR will have to wait until @jaspervdj has time to review it. I left a few stylistic comments though :)

Review comments on lib/Hakyll/Core/Runtime.hs (outdated, resolved)
@Minoru (Collaborator) commented Apr 23, 2021

This shows noticeable improvements on my own site:

$ hyperfine \
    --prepare './debiania-old clean' './debiania-old build' \
    --prepare './debiania-new clean' './debiania-new build +RTS -N4'

Benchmark #1: ./debiania-old build
  Time (mean ± σ):     15.345 s ±  0.153 s    [User: 40.101 s, System: 15.224 s]
  Range (min … max):   15.168 s … 15.628 s    10 runs

Benchmark #2: ./debiania-new build +RTS -N4
  Time (mean ± σ):     10.556 s ±  0.219 s    [User: 22.514 s, System: 4.638 s]
  Range (min … max):   10.096 s … 10.848 s    10 runs

There is some scalability weirdness though. I have an Intel Core i7-7700HQ, which is a mobile quad-core with HyperThreading. Theoretically, an 8-thread workload should slightly outperform a 4-thread one, but that's not the case:

$ hyperfine \
    --prepare './debiania-new clean' './debiania-new build +RTS -N4' \
    --prepare './debiania-new clean' './debiania-new build +RTS -N8' \
    --prepare './debiania-new clean' './debiania-new build +RTS -N'

Benchmark #1: ./debiania-new build +RTS -N4
  Time (mean ± σ):     10.556 s ±  0.219 s    [User: 22.514 s, System: 4.638 s]
  Range (min … max):   10.096 s … 10.848 s    10 runs

Benchmark #2: ./debiania-new build +RTS -N8
  Time (mean ± σ):     12.546 s ±  0.224 s    [User: 48.112 s, System: 12.646 s]
  Range (min … max):   12.122 s … 12.871 s    10 runs

Benchmark #3: ./debiania-new build +RTS -N
  Time (mean ± σ):     12.766 s ±  0.243 s    [User: 48.934 s, System: 13.672 s]
  Range (min … max):   12.200 s … 13.071 s    10 runs

It's still faster than a single-thread, so not a blocker for this PR; just curious if anyone has any insight into what might be causing this.

(The numbers in two benchmarks match exactly because it's actually a single benchmark whose report I split in two.)

@LaurentRDC (Contributor, Author)

I had this problem with pandoc-plot where if the number of async jobs was unlimited, resource contention would slow down the overall running time.

Let me try to limit the number of concurrent jobs to the number of cores and see if scaling makes more sense.
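The idea of capping in-flight jobs at the core count can be sketched with a counting semaphore from base. This is a hand-rolled illustration under assumed names (`boundedForConcurrently_` is made up, and error handling is elided); the PR itself builds on the async package:

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, newMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_)
import Control.Monad (forM, forM_)

-- Illustrative sketch (not the PR's code): run the action on every item,
-- but never with more than getNumCapabilities items in flight at once.
-- NOTE: if the action throws, its "done" MVar is never filled; a real
-- implementation needs exception handling around the worker.
boundedForConcurrently_ :: [a] -> (a -> IO ()) -> IO ()
boundedForConcurrently_ xs f = do
    caps <- getNumCapabilities
    sem  <- newQSem caps
    dones <- forM xs $ \x -> do
        done <- newEmptyMVar
        _ <- forkIO $ do
            -- Acquire a slot, run the job, release the slot.
            bracket_ (waitQSem sem) (signalQSem sem) (f x)
            putMVar done ()
        pure done
    forM_ dones takeMVar  -- wait for every job to finish

main :: IO ()
main = do
    acc <- newMVar []
    boundedForConcurrently_ [1 .. 10 :: Int] $ \i -> do
        xs <- takeMVar acc
        putMVar acc (i : xs)
    xs <- takeMVar acc
    print (length xs)  -- prints 10
```

The async package's `mapConcurrently` with a semaphore (or a worker-pool helper) achieves the same effect with proper exception safety.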

@Minoru (Collaborator) commented Apr 23, 2021

@LaurentRDC Thanks for looking into that! I thought it might be related to my gzipFileCompiler, which compresses every page I have, but removing it didn't fix the scalability issue.

@LaurentRDC (Contributor, Author)

Indeed, limiting the number of concurrent tasks helps with scaling. After the most recent patch, building my site with 1–4 cores yields the following results:

-N1 153s
-N2 80s
-N3 63s
-N4 55s

@Minoru (Collaborator) commented Apr 23, 2021

I think you're looking at a different problem. Your results look like you have four physical cores or more, and you didn't go past their count. The problem I describe happens when you have SMT (e.g. HyperThreading) and you go past the number of physical cores (e.g. use 8 threads on a quad-core processor). The latest patch still exhibits this problem. Furthermore, it runs slower than the previous commit.

# Commit 06b1f0aad652d0541062af91b337deb42d220bc6
$ hyperfine \
    --prepare './debiania-06b1f0aa clean' \
    './debiania-06b1f0aa build +RTS -N1' \
    './debiania-06b1f0aa build +RTS -N2' \
    './debiania-06b1f0aa build +RTS -N3' \
    './debiania-06b1f0aa build +RTS -N4' \
    './debiania-06b1f0aa build +RTS -N5' \
    './debiania-06b1f0aa build +RTS -N6' \
    './debiania-06b1f0aa build +RTS -N7' \
    './debiania-06b1f0aa build +RTS -N8' \
    './debiania-06b1f0aa build +RTS -N9' \
    './debiania-06b1f0aa build +RTS -N10' 

Benchmark #1: ./debiania-06b1f0aa build +RTS -N1
  Time (mean ± σ):     14.656 s ±  0.125 s    [User: 14.520 s, System: 0.640 s]
  Range (min … max):   14.423 s … 14.864 s    10 runs
 
Benchmark #2: ./debiania-06b1f0aa build +RTS -N2
  Time (mean ± σ):     11.574 s ±  0.178 s    [User: 16.515 s, System: 2.050 s]
  Range (min … max):   11.376 s … 11.906 s    10 runs
 
Benchmark #3: ./debiania-06b1f0aa build +RTS -N3
  Time (mean ± σ):     10.665 s ±  0.217 s    [User: 19.085 s, System: 3.259 s]
  Range (min … max):   10.366 s … 11.039 s    10 runs
 
Benchmark #4: ./debiania-06b1f0aa build +RTS -N4
  Time (mean ± σ):     10.460 s ±  0.345 s    [User: 22.292 s, System: 4.528 s]
  Range (min … max):    9.761 s … 10.989 s    10 runs
 
Benchmark #5: ./debiania-06b1f0aa build +RTS -N5
  Time (mean ± σ):     11.027 s ±  0.262 s    [User: 27.804 s, System: 6.831 s]
  Range (min … max):   10.637 s … 11.416 s    10 runs
 
Benchmark #6: ./debiania-06b1f0aa build +RTS -N6
  Time (mean ± σ):     11.580 s ±  0.446 s    [User: 33.748 s, System: 9.530 s]
  Range (min … max):   10.877 s … 12.167 s    10 runs
 
Benchmark #7: ./debiania-06b1f0aa build +RTS -N7
  Time (mean ± σ):     11.770 s ±  0.429 s    [User: 38.785 s, System: 11.418 s]
  Range (min … max):   11.249 s … 12.290 s    10 runs
 
Benchmark #8: ./debiania-06b1f0aa build +RTS -N8
  Time (mean ± σ):     12.668 s ±  0.223 s    [User: 48.431 s, System: 13.382 s]
  Range (min … max):   12.275 s … 12.989 s    10 runs
 
Benchmark #9: ./debiania-06b1f0aa build +RTS -N9
  Time (mean ± σ):     13.864 s ±  0.274 s    [User: 54.508 s, System: 15.581 s]
  Range (min … max):   13.557 s … 14.497 s    10 runs
 
Benchmark #10: ./debiania-06b1f0aa build +RTS -N10
  Time (mean ± σ):     14.383 s ±  0.197 s    [User: 58.288 s, System: 15.287 s]
  Range (min … max):   14.137 s … 14.788 s    10 runs

# Commit 38984f6f5332632be8c4cab3e29d37e318492d70
$ hyperfine \
    --prepare './debiania clean' \
    './debiania build +RTS -N1' \
    './debiania build +RTS -N2' \
    './debiania build +RTS -N3' \
    './debiania build +RTS -N4' \
    './debiania build +RTS -N5' \
    './debiania build +RTS -N6' \
    './debiania build +RTS -N7' \
    './debiania build +RTS -N8' \
    './debiania build +RTS -N9' \
    './debiania build +RTS -N10'

Benchmark #1: ./debiania build +RTS -N1
  Time (mean ± σ):     15.840 s ±  0.214 s    [User: 15.525 s, System: 0.560 s]
  Range (min … max):   15.510 s … 16.160 s    10 runs
 
Benchmark #2: ./debiania build +RTS -N2
  Time (mean ± σ):     12.169 s ±  0.336 s    [User: 17.452 s, System: 2.202 s]
  Range (min … max):   11.510 s … 12.662 s    10 runs
 
Benchmark #3: ./debiania build +RTS -N3
  Time (mean ± σ):     11.016 s ±  0.202 s    [User: 19.760 s, System: 3.573 s]
  Range (min … max):   10.823 s … 11.384 s    10 runs
 
Benchmark #4: ./debiania build +RTS -N4
  Time (mean ± σ):     10.724 s ±  0.272 s    [User: 23.134 s, System: 4.808 s]
  Range (min … max):   10.366 s … 11.168 s    10 runs
 
Benchmark #5: ./debiania build +RTS -N5
  Time (mean ± σ):     11.477 s ±  0.471 s    [User: 29.296 s, System: 7.709 s]
  Range (min … max):   10.715 s … 12.333 s    10 runs
 
Benchmark #6: ./debiania build +RTS -N6
  Time (mean ± σ):     11.528 s ±  0.404 s    [User: 33.400 s, System: 9.823 s]
  Range (min … max):   10.784 s … 12.054 s    10 runs
 
Benchmark #7: ./debiania build +RTS -N7
  Time (mean ± σ):     11.834 s ±  0.341 s    [User: 39.003 s, System: 11.881 s]
  Range (min … max):   11.306 s … 12.276 s    10 runs
 
Benchmark #8: ./debiania build +RTS -N8
  Time (mean ± σ):     12.565 s ±  0.208 s    [User: 47.932 s, System: 13.619 s]
  Range (min … max):   12.197 s … 12.910 s    10 runs
 
Benchmark #9: ./debiania build +RTS -N9
  Time (mean ± σ):     13.834 s ±  0.310 s    [User: 54.218 s, System: 15.965 s]
  Range (min … max):   13.154 s … 14.173 s    10 runs
 
Benchmark #10: ./debiania build +RTS -N10
  Time (mean ± σ):     14.420 s ±  0.147 s    [User: 58.184 s, System: 15.948 s]
  Range (min … max):   14.183 s … 14.605 s    10 runs

Note how runtime starts to climb past 4 threads (i.e. once we got into HyperThreading territory), and the overall times are higher than they were on the previous commit. I included 9- and 10-thread cases just to see the penalty for overloading the machine.

It might be something specific to my setup though. I'm running Linux. What do you run, @LaurentRDC?

As a workaround, I'd suggest using something like cpuinfo to figure out the number of physical cores and capping the number of threads at that. I didn't check what it does on non-Linux platforms though.

@LaurentRDC (Contributor, Author)

I run Windows, and I don't have access to a Hyperthreading machine.

Given that commit 38984f6 is not unequivocally better, I'll revert it. It appears cpuinfo only runs on Linux, so that's a no-go.

Maybe we can leave further optimizations until more people have had their hands on it?

@Minoru (Collaborator) commented Apr 23, 2021

I run Windows, and I don't have access to a Hyperthreading machine.

Out of my personal curiosity, what processor do you have? Four cores, runs Windows, but no SMT — that's either ARM, or an SMT-capable chip with SMT disabled. I'm really curious! (Um, just realized I was calling SMT "SMP" throughout the thread =\ )

Given that commit 38984f6 is not unequivocally better, I'll revert it.

You could just force-push the previous commit, keeping the history tidy :P

It appears cpuinfo only runs on Linux, so that's a no-go.

Well, we could also fall back to getNumCapabilities when cpuinfo returns zero, but since you aren't running Linux and can't test it, it's definitely not on you to develop this.
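The fallback being discussed could look roughly like the sketch below. `physicalCores` here is a hypothetical stub standing in for a platform-specific query (e.g. via the cpuinfo package on Linux); `Nothing` models an unsupported platform:

```haskell
import Control.Concurrent (getNumCapabilities)

-- Hypothetical stand-in for a physical-core query; a real version would
-- call into a platform-specific source such as the cpuinfo package.
physicalCores :: IO (Maybe Int)
physicalCores = pure Nothing  -- stub for illustration

-- Prefer the physical-core count when it is known and positive; otherwise
-- fall back to the RTS capability count (set by +RTS -N).
workerCount :: IO Int
workerCount = do
    phys <- physicalCores
    caps <- getNumCapabilities
    pure $ case phys of
        Just n | n > 0 -> min n caps
        _              -> caps

main :: IO ()
main = workerCount >>= print
```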

Maybe we can leave further optimizations until more people have had their hands on it?

Yeah, it's still better than what we have at the moment, so I'm in favour of merging even if it doesn't scale as well as I wish it did. But the scalability issue should probably be mentioned in the changelog, so people can experiment with -N and see what kind of performance they get.

@LaurentRDC (Contributor, Author)

I'm using an Intel Core i5-6500: 4 cores, no HyperThreading.

As for the revert commit, we can squash everything at merge time.

@LaurentRDC (Contributor, Author)

@jaspervdj Would it be possible for you to review this? Thanks!

@LaurentRDC (Contributor, Author)

@Minoru Given that this is a clear performance win, can you merge this? I've been testing this branch daily in CI for nearly a month without hiccups. Thanks!

@Minoru (Collaborator) commented May 30, 2021

Okay. I still don't know async Haskell and can't properly review this, but it doesn't look actively malicious either :) And it'll get more testing in master, I hope.

Note that I don't have credentials to release to Hackage. An actual release will have to wait for Jasper.

@Minoru merged commit 6e77b4e into jaspervdj:master on May 30, 2021
@Minoru (Collaborator) commented May 30, 2021

The scalability problem is now tracked in #850.

@LaurentRDC (Contributor, Author)

Thank you very much for your help!

@Minoru (Collaborator) commented May 30, 2021

Thank you for doing the actual work :)

@vaibhavsagar (Contributor) commented Jul 17, 2021

Hey @LaurentRDC, silly question: why did you end up using TVar instead of something like MVar, given that we move between STM and IO anyway, which loses the benefit of STM? I tried changing the TVars to MVars and ended up with #863.

@LaurentRDC (Contributor, Author)

@vaibhavsagar Hmm, good question. I can't really recall. Maybe it's an artifact of using more STM in an earlier prototype?

One advantage is that a TVar is never empty, but that might not be important.
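To illustrate the "never empty" point: a TVar always holds a value, so a read never blocks, and modifyTVar' gives an atomic read-modify-write that composes with other STM actions. A minimal sketch (not Hakyll code):

```haskell
import Control.Concurrent.STM (atomically, modifyTVar', newTVarIO, readTVarIO)

main :: IO ()
main = do
    v <- newTVarIO (0 :: Int)
    -- Each modifyTVar' is an atomic read-modify-write; there is no window
    -- where the variable is "empty", unlike an MVar between take and put.
    atomically $ modifyTVar' v (+ 1)
    atomically $ modifyTVar' v (+ 1)
    readTVarIO v >>= print  -- prints 2
```

With an MVar, the equivalent update is `takeMVar` followed by `putMVar`, and a thread that dies between the two leaves the MVar empty forever, blocking every other reader.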

@vaibhavsagar (Contributor)

Thanks for clarifying! If you have time, I would definitely appreciate your review on my PR. I don't know how to test performance, but I would expect it to be on par with or better than the TVar approach because of the reduced overhead.

@LaurentRDC (Contributor, Author)

I'm on mobile for a few days, unfortunately. I just tried two different builds of Hakyll on my personal website and timed them; nothing very scientific. @Minoru may be able to help more with this.

Successfully merging this pull request may close these issues: Parallel Compilation Support