
Async runtime with graph-based dependency cycle checks #844

Merged · 5 commits · May 30, 2021

Conversation

@LaurentRDC (Contributor) commented Apr 22, 2021

Hello,

This is a PR to make hakyll's runtime concurrent.

The general idea is based on #725. That implementation incorrectly reported a dependency cycle on my personal website where none existed. Hence, this PR includes more robust dependency-cycle checking based on Data.Graph. A test case has been added to ensure this works.
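For readers unfamiliar with Data.Graph, a cycle check along these lines can be sketched with strongly connected components: any SCC with more than one item (or a self-loop) is a dependency cycle. The `dependencyCycles` helper below is illustrative only, not Hakyll's actual code:

```haskell
import Data.Graph (SCC (..), stronglyConnComp)

-- Illustrative sketch (not Hakyll's actual code): given each item and the
-- items it depends on, report every group of items that form a cycle.
-- stronglyConnComp tags each strongly connected component as cyclic or not.
dependencyCycles :: Ord k => [(k, [k])] -> [[k]]
dependencyCycles deps =
    [ items
    | CyclicSCC items <- stronglyConnComp
        [ (item, item, ds) | (item, ds) <- deps ]
    ]

main :: IO ()
main = do
    -- "a" and "b" depend on each other (a cycle); "c" merely depends on "a".
    -- Prints one cycle containing "a" and "b" (order within it may vary).
    print (dependencyCycles [("a", ["b"]), ("b", ["a"]), ("c", ["a"])])
```

An item that depends on itself is also reported as a (one-element) cycle, which matches the intuition that such an item can never be compiled.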

The concurrency is implemented as follows.

  1. Identify all items that need to be compiled;
  2. Compile all items concurrently. Items that compiled successfully are removed from the todo list, while items still waiting on dependencies are kept;
  3. The remaining todo items and their dependencies are checked for cycles in a single thread;
  4. Go back to step 1 until all items have been compiled.
pickAndChase :: Runtime ()
pickAndChase = do
    todo <- fmap runtimeTodo . liftIO . readTVarIO =<< get
    unless (null todo) $ do
        checkForDependencyCycle
        forConcurrently_ (M.keys todo) chase
        pickAndChase

The advantage of this implementation is that it is easy to switch between concurrent and serial execution by simply swapping forConcurrently_ for forM_. Maybe we could offer this option as a compile-time flag?

I have tested this version of hakyll on my personal page and I'm very satisfied with the results.

closes #714

@LaurentRDC (Contributor, Author) commented Apr 23, 2021

Here is the benchmark on a site with quite a few JPGs to compress with hakyll-images:

Pre-PR 6148 ms
This PR +RTS -N1 -RTS 6179 ms
This PR +RTS -N4 -RTS 4366 ms

@Minoru (Collaborator) left a comment

I know next to nothing about async, so this PR will have to wait until @jaspervdj has time to review it. I left a few stylistic comments though :)

Review comments on lib/Hakyll/Core/Runtime.hs (outdated, resolved)
@Minoru (Collaborator) commented Apr 23, 2021

This shows noticeable improvements on my own site:

$ hyperfine \
    --prepare './debiania-old clean' './debiania-old build' \
    --prepare './debiania-new clean' './debiania-new build +RTS -N4'

Benchmark #1: ./debiania-old build
  Time (mean ± σ):     15.345 s ±  0.153 s    [User: 40.101 s, System: 15.224 s]
  Range (min … max):   15.168 s … 15.628 s    10 runs

Benchmark #2: ./debiania-new build +RTS -N4
  Time (mean ± σ):     10.556 s ±  0.219 s    [User: 22.514 s, System: 4.638 s]
  Range (min … max):   10.096 s … 10.848 s    10 runs

There is some scalability weirdness though. I have an Intel Core i7-7700HQ, which is a mobile quad-core with HyperThreading. Theoretically, an 8-thread workload should slightly outperform a 4-thread one, but that's not the case:

$ hyperfine \
    --prepare './debiania-new clean' './debiania-new build +RTS -N4' \
    --prepare './debiania-new clean' './debiania-new build +RTS -N8' \
    --prepare './debiania-new clean' './debiania-new build +RTS -N'

Benchmark #1: ./debiania-new build +RTS -N4
  Time (mean ± σ):     10.556 s ±  0.219 s    [User: 22.514 s, System: 4.638 s]
  Range (min … max):   10.096 s … 10.848 s    10 runs

Benchmark #2: ./debiania-new build +RTS -N8
  Time (mean ± σ):     12.546 s ±  0.224 s    [User: 48.112 s, System: 12.646 s]
  Range (min … max):   12.122 s … 12.871 s    10 runs

Benchmark #3: ./debiania-new build +RTS -N
  Time (mean ± σ):     12.766 s ±  0.243 s    [User: 48.934 s, System: 13.672 s]
  Range (min … max):   12.200 s … 13.071 s    10 runs

It's still faster than a single-thread, so not a blocker for this PR; just curious if anyone has any insight into what might be causing this.

(The numbers in two benchmarks match exactly because it's actually a single benchmark whose report I split in two.)

@LaurentRDC (Contributor, Author)

I had this problem with pandoc-plot where if the number of async jobs was unlimited, resource contention would slow down the overall running time.

Let me try to limit the number of concurrent jobs to the number of cores and see if scaling makes more sense.
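The idea of capping in-flight jobs at the core count can be sketched with a counting semaphore from base. This is a hand-rolled illustration under assumed names (`boundedForConcurrently_` is made up, and error handling is elided); the PR itself builds on the async package:

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, newMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_)
import Control.Monad (forM, forM_)

-- Illustrative sketch (not the PR's code): run the action on every item,
-- but never with more than getNumCapabilities items in flight at once.
-- NOTE: if the action throws, its "done" MVar is never filled; a real
-- implementation needs exception handling around the worker.
boundedForConcurrently_ :: [a] -> (a -> IO ()) -> IO ()
boundedForConcurrently_ xs f = do
    caps <- getNumCapabilities
    sem  <- newQSem caps
    dones <- forM xs $ \x -> do
        done <- newEmptyMVar
        _ <- forkIO $ do
            -- Acquire a slot, run the job, release the slot.
            bracket_ (waitQSem sem) (signalQSem sem) (f x)
            putMVar done ()
        pure done
    forM_ dones takeMVar  -- wait for every job to finish

main :: IO ()
main = do
    acc <- newMVar []
    boundedForConcurrently_ [1 .. 10 :: Int] $ \i -> do
        xs <- takeMVar acc
        putMVar acc (i : xs)
    xs <- takeMVar acc
    print (length xs)  -- prints 10
```

The async package's `mapConcurrently` with a semaphore (or a worker-pool helper) achieves the same effect with proper exception safety.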

@Minoru (Collaborator) commented Apr 23, 2021

@LaurentRDC Thanks for looking into that! I thought it might be related to my gzipFileCompiler, which compresses every page I have, but removing it didn't fix the scalability issue.

@LaurentRDC (Contributor, Author)

Indeed, limiting the number of concurrent tasks helps with scaling. After the most recent patch, building my site with 1–4 cores yields the following results:

-N1 153s
-N2 80s
-N3 63s
-N4 55s

@Minoru (Collaborator) commented Apr 23, 2021

I think you're looking at a different problem. Your results look like you have four physical cores or more, and you didn't go past their count. The problem I describe happens when you have SMT (e.g. HyperThreading) and you go past the number of physical cores (e.g. use 8 threads on a quad-core processor). The latest patch still exhibits this problem. Furthermore, it runs slower than the previous commit.

# Commit 06b1f0aad652d0541062af91b337deb42d220bc6
$ hyperfine \
    --prepare './debiania-06b1f0aa clean' \
    './debiania-06b1f0aa build +RTS -N1' \
    './debiania-06b1f0aa build +RTS -N2' \
    './debiania-06b1f0aa build +RTS -N3' \
    './debiania-06b1f0aa build +RTS -N4' \
    './debiania-06b1f0aa build +RTS -N5' \
    './debiania-06b1f0aa build +RTS -N6' \
    './debiania-06b1f0aa build +RTS -N7' \
    './debiania-06b1f0aa build +RTS -N8' \
    './debiania-06b1f0aa build +RTS -N9' \
    './debiania-06b1f0aa build +RTS -N10' 

Benchmark #1: ./debiania-06b1f0aa build +RTS -N1
  Time (mean ± σ):     14.656 s ±  0.125 s    [User: 14.520 s, System: 0.640 s]
  Range (min … max):   14.423 s … 14.864 s    10 runs
 
Benchmark #2: ./debiania-06b1f0aa build +RTS -N2
  Time (mean ± σ):     11.574 s ±  0.178 s    [User: 16.515 s, System: 2.050 s]
  Range (min … max):   11.376 s … 11.906 s    10 runs
 
Benchmark #3: ./debiania-06b1f0aa build +RTS -N3
  Time (mean ± σ):     10.665 s ±  0.217 s    [User: 19.085 s, System: 3.259 s]
  Range (min … max):   10.366 s … 11.039 s    10 runs
 
Benchmark #4: ./debiania-06b1f0aa build +RTS -N4
  Time (mean ± σ):     10.460 s ±  0.345 s    [User: 22.292 s, System: 4.528 s]
  Range (min … max):    9.761 s … 10.989 s    10 runs
 
Benchmark #5: ./debiania-06b1f0aa build +RTS -N5
  Time (mean ± σ):     11.027 s ±  0.262 s    [User: 27.804 s, System: 6.831 s]
  Range (min … max):   10.637 s … 11.416 s    10 runs
 
Benchmark #6: ./debiania-06b1f0aa build +RTS -N6
  Time (mean ± σ):     11.580 s ±  0.446 s    [User: 33.748 s, System: 9.530 s]
  Range (min … max):   10.877 s … 12.167 s    10 runs
 
Benchmark #7: ./debiania-06b1f0aa build +RTS -N7
  Time (mean ± σ):     11.770 s ±  0.429 s    [User: 38.785 s, System: 11.418 s]
  Range (min … max):   11.249 s … 12.290 s    10 runs
 
Benchmark #8: ./debiania-06b1f0aa build +RTS -N8
  Time (mean ± σ):     12.668 s ±  0.223 s    [User: 48.431 s, System: 13.382 s]
  Range (min … max):   12.275 s … 12.989 s    10 runs
 
Benchmark #9: ./debiania-06b1f0aa build +RTS -N9
  Time (mean ± σ):     13.864 s ±  0.274 s    [User: 54.508 s, System: 15.581 s]
  Range (min … max):   13.557 s … 14.497 s    10 runs
 
Benchmark #10: ./debiania-06b1f0aa build +RTS -N10
  Time (mean ± σ):     14.383 s ±  0.197 s    [User: 58.288 s, System: 15.287 s]
  Range (min … max):   14.137 s … 14.788 s    10 runs

# Commit 38984f6f5332632be8c4cab3e29d37e318492d70
$ hyperfine \
    --prepare './debiania clean' \
    './debiania build +RTS -N1' \
    './debiania build +RTS -N2' \
    './debiania build +RTS -N3' \
    './debiania build +RTS -N4' \
    './debiania build +RTS -N5' \
    './debiania build +RTS -N6' \
    './debiania build +RTS -N7' \
    './debiania build +RTS -N8' \
    './debiania build +RTS -N9' \
    './debiania build +RTS -N10'

Benchmark #1: ./debiania build +RTS -N1
  Time (mean ± σ):     15.840 s ±  0.214 s    [User: 15.525 s, System: 0.560 s]
  Range (min … max):   15.510 s … 16.160 s    10 runs
 
Benchmark #2: ./debiania build +RTS -N2
  Time (mean ± σ):     12.169 s ±  0.336 s    [User: 17.452 s, System: 2.202 s]
  Range (min … max):   11.510 s … 12.662 s    10 runs
 
Benchmark #3: ./debiania build +RTS -N3
  Time (mean ± σ):     11.016 s ±  0.202 s    [User: 19.760 s, System: 3.573 s]
  Range (min … max):   10.823 s … 11.384 s    10 runs
 
Benchmark #4: ./debiania build +RTS -N4
  Time (mean ± σ):     10.724 s ±  0.272 s    [User: 23.134 s, System: 4.808 s]
  Range (min … max):   10.366 s … 11.168 s    10 runs
 
Benchmark #5: ./debiania build +RTS -N5
  Time (mean ± σ):     11.477 s ±  0.471 s    [User: 29.296 s, System: 7.709 s]
  Range (min … max):   10.715 s … 12.333 s    10 runs
 
Benchmark #6: ./debiania build +RTS -N6
  Time (mean ± σ):     11.528 s ±  0.404 s    [User: 33.400 s, System: 9.823 s]
  Range (min … max):   10.784 s … 12.054 s    10 runs
 
Benchmark #7: ./debiania build +RTS -N7
  Time (mean ± σ):     11.834 s ±  0.341 s    [User: 39.003 s, System: 11.881 s]
  Range (min … max):   11.306 s … 12.276 s    10 runs
 
Benchmark #8: ./debiania build +RTS -N8
  Time (mean ± σ):     12.565 s ±  0.208 s    [User: 47.932 s, System: 13.619 s]
  Range (min … max):   12.197 s … 12.910 s    10 runs
 
Benchmark #9: ./debiania build +RTS -N9
  Time (mean ± σ):     13.834 s ±  0.310 s    [User: 54.218 s, System: 15.965 s]
  Range (min … max):   13.154 s … 14.173 s    10 runs
 
Benchmark #10: ./debiania build +RTS -N10
  Time (mean ± σ):     14.420 s ±  0.147 s    [User: 58.184 s, System: 15.948 s]
  Range (min … max):   14.183 s … 14.605 s    10 runs

Note how runtime starts to climb past 4 threads (i.e. once we got into HyperThreading territory), and the overall times are higher than they were on the previous commit. I included 9- and 10-thread cases just to see the penalty for overloading the machine.

It might be something specific to my setup though. I'm running Linux. What do you run, @LaurentRDC?

As a workaround, I'd suggest using something like cpuinfo to figure out the number of physical cores and capping the number of threads at that. I didn't check what it does on non-Linux platforms though.

@LaurentRDC (Contributor, Author)

I run Windows, and I don't have access to a Hyperthreading machine.

Given that commit 38984f6 is not unequivocally better, I'll revert it. It appears cpuinfo only runs on Linux, so that's a no-go.

Maybe we can leave further optimizations until more people have had their hands on it?

@Minoru (Collaborator) commented Apr 23, 2021

I run Windows, and I don't have access to a Hyperthreading machine.

Out of my personal curiosity, what processor do you have? Four cores, runs Windows, but no SMT — that's either ARM, or an SMT-capable chip with SMT disabled. I'm really curious! (Um, just realized I was calling SMT "SMP" throughout the thread =\ )

Given that commit 38984f6 is not unequivocally better, I'll revert it.

You could just force-push the previous commit, keeping the history tidy :P

It appears cpuinfo only runs on Linux, so that's a no-go.

Well, we could also fall back to getNumCapabilities when cpuinfo returns zero, but since you aren't running Linux and can't test it, it's definitely not on you to develop this.
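The fallback being discussed could look roughly like the sketch below. `physicalCores` here is a hypothetical stub standing in for a platform-specific query (e.g. via the cpuinfo package on Linux); `Nothing` models an unsupported platform:

```haskell
import Control.Concurrent (getNumCapabilities)

-- Hypothetical stand-in for a physical-core query; a real version would
-- call into a platform-specific source such as the cpuinfo package.
physicalCores :: IO (Maybe Int)
physicalCores = pure Nothing  -- stub for illustration

-- Prefer the physical-core count when it is known and positive; otherwise
-- fall back to the RTS capability count (set by +RTS -N).
workerCount :: IO Int
workerCount = do
    phys <- physicalCores
    caps <- getNumCapabilities
    pure $ case phys of
        Just n | n > 0 -> min n caps
        _              -> caps

main :: IO ()
main = workerCount >>= print
```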

Maybe we can leave further optimizations until more people have had their hands on it?

Yeah, it's still better than what we have at the moment, so I'm in favour of merging even if it doesn't scale as well as I wish it did. But the scalability issue should probably be mentioned in the changelog, so people can experiment with -N and see what kind of performance they get.

@LaurentRDC (Contributor, Author)

I'm using an Intel Core i5-6500: 4 cores, no HyperThreading.

As for the revert commit, we can squash everything at merge time.

@LaurentRDC (Contributor, Author)

@jaspervdj Would it be possible for you to review this? Thanks!

@LaurentRDC (Contributor, Author)

@Minoru Given that this is a clear performance win, can you merge this? I've been testing this branch daily in CI for nearly a month without hiccups. Thanks!

@Minoru (Collaborator) commented May 30, 2021

Okay. I still don't know async Haskell and can't properly review this, but it doesn't look actively malicious either :) And it'll get more testing in master, I hope.

Note that I don't have credentials to release to Hackage. An actual release will have to wait for Jasper.

@Minoru merged commit 6e77b4e into jaspervdj:master on May 30, 2021
@Minoru (Collaborator) commented May 30, 2021

The scalability problem is now tracked in #850.

@LaurentRDC (Contributor, Author)

Thank you very much for your help!

@Minoru (Collaborator) commented May 30, 2021

Thank you for doing the actual work :)

@vaibhavsagar (Contributor) commented Jul 17, 2021

Hey @LaurentRDC, silly question: why did you end up using TVar instead of something like MVar, given that we move between STM and IO anyway, which loses the benefit of STM? I tried changing the TVars to MVars and ended up with #863.

@LaurentRDC (Contributor, Author)

@vaibhavsagar Hmm, good question. I can't really recall. Maybe it's an artifact of using more STM in an earlier prototype?

One advantage is that a TVar is never empty, but that might not be important.
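To illustrate the "never empty" point: a TVar always holds a value, so a read never blocks, and modifyTVar' gives an atomic read-modify-write that composes with other STM actions. A minimal sketch (not Hakyll code):

```haskell
import Control.Concurrent.STM (atomically, modifyTVar', newTVarIO, readTVarIO)

main :: IO ()
main = do
    v <- newTVarIO (0 :: Int)
    -- Each modifyTVar' is an atomic read-modify-write; there is no window
    -- where the variable is "empty", unlike an MVar between take and put.
    atomically $ modifyTVar' v (+ 1)
    atomically $ modifyTVar' v (+ 1)
    readTVarIO v >>= print  -- prints 2
```

With an MVar, the equivalent update is `takeMVar` followed by `putMVar`, and a thread that dies between the two leaves the MVar empty forever, blocking every other reader.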

@vaibhavsagar (Contributor)

Thanks for clarifying! If you have time, I would definitely appreciate your review on my PR. I don't know how to test performance, but I would expect it to be on par with or better than the TVar approach because of the reduced overhead.

@LaurentRDC (Contributor, Author)

I'm on mobile for a few days, unfortunately. I just tried two different builds of Hakyll on my personal website and timed them; nothing very scientific. @Minoru may be able to help more with this.

Successfully merging this pull request may close these issues: Parallel Compilation Support