Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: build rule memoization #530

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

takano-akio
Copy link
Contributor

This branch implements 'memoized rules', which use a new kind of caches to re-use results from not only the last build, but also from builds further back in history, or builds from different build trees.

In principle it's also possible to share caches between multiple computers, but I left out this feature because it seemed to have a lot of trade-offs to be made, whose right choice depends on users' environments. Instead I made the backend pluggable by allowing the user to override some fields in ShakeOptions.

I have been using this at work for about two weeks, and it has been working well so far.

I expect that more work will have to be done on this branch before it can get merged, but I thought it would be good to ask for comments about it at this point.

What do you think?

@ndmitchell
Copy link
Owner

First reaction is "cool" 😎. I'll go through the code and say something slightly more refined in a few days.

What is the practical impact at work? Does it always save a bit? Sometimes save a lot?

@takano-akio
Copy link
Contributor Author

What is the practical impact at work? Does it always save a bit? Sometimes save a lot?

It sometimes saves a lot. When I build a branch that I have previously built, everything is taken from the cache, and it can save about 80-90% of the build time (relative to a normal incremental build). It also helps when the new branch is similar but not identical to some branch I have built, e.g. when they are both based on a third branch.

I find myself less reluctant to switching between branches now, because the build is quick when I come back.

On the other hand, it doesn't help while I'm writing code, because in this case it doesn't add anything over what Shake's dependency checker already does.

@ndmitchell
Copy link
Owner

Sorry for the disastrously long time to respond....

This code is very interesting. I'll attempt to summarise, so you can correct my mistakes! It seems that when a command line runs (or arbitrarily any action - but command line seems the one you're aiming at) you define which things you output, and it captures them along with all dependencies thusfar and stores them in a separate directory. It also uses this point to reinject previously saved items.

My immediate thought is wondering if this approach would work better at the level of whole rules, rather than just actions. You could memoise tuples of input Key, output Value, all the dependency Keys and all the "disk state" required. I think that gives more flexibility to later on only copy over the final end point, rather than having to recursively build all intermediate rules (which will be essential when you are on a network). It also works everywhere, without requiring users to specify which outputs come from which command lines. It does fundamentally assume that rules don't create stuff "off to the side", but that's probably not a terrible restriction.

Thinking further, you could imagine for each completed rule having:

  • Key/Value - the key and value that got stored
  • [[Key]] - the dependencies in need ordering
  • IO () - an action to inflate the result.

You could then run the entire algorithm at the level of Key/Value without inflating the results unless they are top-level wants or dependencies of things that have rebuilt. Each key type would then have to have a way to store/unstore the IO actions.

cc @snowleopard who has been thinking about similar problems.

@snowleopard
Copy link
Contributor

I think this is a great idea. The design space seems to be pretty large, so I'm not sure the current approach is best, but it looks easy to use -- just modify some cmds and you immediately start getting benefits.

Should work great for my use-case by the way -- I'm just rebuilding GHC over and over again every day, without changing a line in it :))

@takano-akio
Copy link
Contributor Author

@ndmitchell

Thank you for looking at the code!

I'll attempt to summarise, so you can correct my mistakes! It seems that when a command line runs (or arbitrarily any action - but command line seems the one you're aiming at) you define which things you output, and it captures them along with all dependencies thusfar and stores them in a separate directory. It also uses this point to reinject previously saved items.

Yes, that's right.

My immediate thought is wondering if this approach would work better at the level of whole rules, rather than just actions.

I thought about this possibility as well. I can't remember all the reasons why I didn't take this approach, but I think one reason was that I wanted the flexibility of not having to precisely specify the output of all rules I define. For example, I have a rule for registering a Cabal package into a local package database, and it's not immediately clear how this desired output ("having the package A registered at the package database") can be precisely described in a way that Shake can save it elsewhere and then later restore it.

I think that gives more flexibility to later on only copy over the final end point, rather than having to recursively build all intermediate rules (which will be essential when you are on a network).

This is true, but in practice (at least in my use case) the impact may be small. When I use my implementation of memoization, very often the cached results are not complete enough to deliver the final result (in my case an executable) by themselves. They can still be very helpful by providing a large number of pre-built object files.

@mpickering
Copy link
Contributor

mpickering commented Feb 14, 2018

I rebased this branch - https://github.com/mpickering/shake/tree/memo

The tests fail but I don't think they ever passed?

I was also quite confused about the difference between BinaryEx and Binary especially as Rules/Directory uses a different one to Rules/Oracle.

@shmish111
Copy link

shmish111 commented Mar 5, 2018

@mpickering is your branch in place of this PR or the same? @ndmitchell @takano-akio wouldn't it make sense to have caching on both cmd and rule levels, you would probably only look at rules over the network but locally cmds would be useful (I'm assuming this is because a Rule might have many cmds and some of them are in the cache and some not)? If both were going to be implemented then could this one be merged pretty soon, it would really help our devs a lot, they have the same problem of being scared of switching branches as they'll probably have to wait a few hours for a rebuild.

@ndmitchell
Copy link
Owner

For info, myself and @snowleopard are working on solving the Shake in the Cloud issue at the moment. I'm going to try and look at this branch in the context of that work soon. Certainly Shake with caching/cloud support is high on the agenda.

@ndmitchell
Copy link
Owner

I just pushed a version of Shake that has a --shared flag. Pass that and build rules will run and dump their input hashes and outputs in the directory. It should mean that changing branches takes a few seconds of copying files over. You may have to look at provides (extra files that the rule generates but are not tracked), but everything else should work pretty smoothly. Feedback most welcome!

Thanks to @takano-akio, whose ideas I leaned on quite heavily! I'm keen to understand if this rule level caching is good enough to replace the command level caching, or if we need both?

@mpickering
Copy link
Contributor

@shmish111 Sorry for the late reply but my branch was just Akio's patch but rebased. It seems that Neil has picked this up anyway.

@takano-akio
Copy link
Contributor Author

@ndmitchell Thank you for working on this! I'll try to adapt our system to the new version and report back how it goes.

@takano-akio
Copy link
Contributor Author

I encountered two issues:

First, I'm unable to figure out how to rewrite certain rules in a way it can use histories. Consider this case: A depends on B, and B depends on C. Both rules are deterministic, but it's hard to tell Shake how to compare states of B, because it's not a simple file. For example, it might be a state of a particular Haskell package in a GHC pakcage datagase. I'd still like to re-use build results for A from histories.

When not using histories, this situation can be modeled by having a timestamp file as a proxy for B. Whenever B is rebuilt, the timestamp file is updated, causing A to be rebuilt again.

With histories, this doens't work very well, because when C is changed, A is rebuilt, and C is changed back to the original state, the timestamp of B won't go back to the original, meaning the rule for C cannot re-use the result of the first build.

In my implementation, I could work around this difficulty by using the stateString function in the rule for B. Instead of using the timestamp, the rule for B would generate a text file containing the state string. I don't see a way to do a similar trick with Shake HEAD.

Second, I see that Shake HEAD uses a 32-bit hash as an index to the history. I worry that this might be too small, because you'd only need about 2^16 random files to create a collision, even if we assume an ideal hash function.

@ndmitchell
Copy link
Owner

Thanks for the feedback @takano-akio!

Second issue is not really a problem - it caches them per key, so you actually need 2^^16 different files for a given key, so vastly less chance of hitting a problem. That said, it's likely I'll switch to something more robust at some point in the future - I don't imagine it's the most pressing issue right now. If Haskell had a good, high performance, crypto hash without a boat load of dependencies I'd use that...

For the package file problem, it sounds like you want an oracle that pulls the relevant information out of the package database, and depend on the results of the oracle. When you say the ghc-pkg database, is that what you are literally talking about, or just an example? I am wondering if the ghc-pkg database should be modelled as a builtin rule type - it will likely be necessary for Hadrian.

@takano-akio
Copy link
Contributor Author

Second issue is not really a problem - it caches them per key, so you actually need 2^^16 different files for a given key, so vastly less chance of hitting a problem.

I see, thank you for the clarification. I'll try to actually understand the code now.

For the package file problem, it sounds like you want an oracle that pulls the relevant information out of the package database, and depend on the results of the oracle.

Thank you for the advice, I will try this.

When you say the ghc-pkg database, is that what you are literally talking about, or just an example?

Yes, my issue is actually about a ghc package database, although I think it might still be useful to have a general way to sidestep the issue of telling Shake about complex states.

@takano-akio
Copy link
Contributor Author

@ndmitchell

I managed to get our modified build system to working, although it still has a few bugs. Here are what I found so far:

  • The history mechanism in Shake HEAD is much faster than my branch, which is very nice.
  • The fact that caching is on by default for all rules (once shakeShare is set) made it easy to create buggy rules. Whenever I forget to add necessary calls historyDisable or produces, the rule just behaves incorrectly. Once this happens, adding the correct calls is not sufficient to fix the issue. I need to either clear the cache directory or bump the rule version. A large number of versioned rules makes the code harder to read, in my opinion.
  • I'd still like some function like stateString.
  • Sometimes I want to cache only a part of a rule. For example, I have a rule for installing a Cabal package, and I want to cache the result of Setup.hs build, but not configure, copy or register. Enconding this using rule-level caching is rather awkward.

I think I'd prefer a system with action-level caching because:

  • It would be an opt-in feature, and it'd be able to be introduced to a build system without risking breakage of existing rules.
  • It would reduce the need of versioning the rules. Each cached actions could be individually versioned, rather than getting the rules themselves versioned. I think this leads to less clutter in the code.
  • Some rules can be written more naturally this way.

@ndmitchell
Copy link
Owner

We gave an attempt at getting Hadrian working with the Shared mechanism and succeeded at https://gitlab.haskell.org/ghc/ghc/merge_requests/317. It also quickly became clear we need a more principled approach for finding the missing produces or historyDisable, which we've done at https://github.com/ndmitchell/shake/blob/master/docs/Cloud.md#modifying-your-build-system.

I've overcome the timestamp problem you mention by making Shake enable hashes, so that works fine.

Generally, if different parts of your rule have different dependencies, they should be different rules. I suggest separate configure/build/install rules, because they have different dependencies. That way you get caching of all but the install rule quite nicely.

To partially replace stateString I added versioned, https://hackage.haskell.org/package/shake-0.17.5/docs/Development-Shake.html#v:versioned, which lets you version a rule individually. It's not dynamic like stateString was, but if you have rule level caching it can't be dynamic.

My feel after the experiments is that Shake should probably have rule level caching, but not action level caching, at least not in the Core (it's possible to add action level caching on the outside). The reason is partly because as you say, if you want a cloud build experience, and are doing at the action level, you have to modify everything in your build system - whereas the rule-level can be more automatic.

I've written up all the notes in https://github.com/ndmitchell/shake/blob/master/docs/Cloud.md and would welcome your feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants