RFC: build rule memoization #530
First reaction is "cool" 😎. I'll go through the code and say something slightly more refined in a few days. What is the practical impact at work? Does it always save a bit? Sometimes save a lot? |
It sometimes saves a lot. When I build a branch that I have previously built, everything is taken from the cache, and it can save about 80-90% of the build time (relative to a normal incremental build). It also helps when the new branch is similar but not identical to some branch I have built, e.g. when they are both based on a third branch. I find myself less reluctant to switch between branches now, because the build is quick when I come back. On the other hand, it doesn't help while I'm writing code, because in that case it doesn't add anything over what Shake's dependency checker already does. |
Sorry for the disastrously long time to respond... This code is very interesting. I'll attempt to summarise, so you can correct my mistakes! It seems that when a command line runs (or arbitrarily any action, but the command line seems to be the one you're aiming at) you define which things you output, and it captures them along with all dependencies thus far and stores them in a separate directory. It also uses this point to reinject previously saved items. My immediate thought is wondering if this approach would work better at the level of whole rules, rather than just actions. You could memoise tuples of input Thinking further, you could imagine for each completed rule having:
You could then run the entire algorithm at the level of cc @snowleopard who has been thinking about similar problems. |
I think this is a great idea. The design space seems to be pretty large, so I'm not sure the current approach is best, but it looks easy to use -- just modify some Should work great for my use-case by the way -- I'm just rebuilding GHC over and over again every day, without changing a line in it :)) |
Thank you for looking at the code!
Yes, that's right.
I thought about this possibility as well. I can't remember all the reasons why I didn't take this approach, but I think one reason was that I wanted the flexibility of not having to precisely specify the output of all rules I define. For example, I have a rule for registering a Cabal package into a local package database, and it's not immediately clear how this desired output ("having the package A registered at the package database") can be precisely described in a way that Shake can save it elsewhere and then later restore it.
This is true, but in practice (at least in my use case) the impact may be small. When I use my implementation of memoization, very often the cached results are not complete enough to deliver the final result (in my case an executable) by themselves. They can still be very helpful by providing a large number of pre-built object files. |
I rebased this branch - https://github.com/mpickering/shake/tree/memo The tests fail but I don't think they ever passed? I was also quite confused about the difference between |
@mpickering is your branch in place of this PR or the same? @ndmitchell @takano-akio wouldn't it make sense to have caching on both cmd and rule levels? You would probably only look at rules over the network, but locally cmds would be useful (I'm assuming this is because a Rule might have many cmds and some of them are in the cache and some not). If both were going to be implemented, could this one be merged pretty soon? It would really help our devs a lot. They have the same problem of being scared of switching branches, as they'll probably have to wait a few hours for a rebuild. |
For info, @snowleopard and I are working on solving the Shake in the Cloud issue at the moment. I'm going to try and look at this branch in the context of that work soon. Certainly Shake with caching/cloud support is high on the agenda. |
I just pushed a version of Shake that has a Thanks to @takano-akio, whose ideas I leaned on quite heavily! I'm keen to understand if this rule level caching is good enough to replace the command level caching, or if we need both? |
@shmish111 Sorry for the late reply but my branch was just Akio's patch but rebased. It seems that Neil has picked this up anyway. |
@ndmitchell Thank you for working on this! I'll try to adapt our system to the new version and report back how it goes. |
I encountered two issues: First, I'm unable to figure out how to rewrite certain rules in a way that can use histories. Consider this case: A depends on B, and B depends on C. Both rules are deterministic, but it's hard to tell Shake how to compare states of B, because it's not a simple file. For example, it might be the state of a particular Haskell package in a GHC package database. I'd still like to re-use build results for A from histories. When not using histories, this situation can be modeled by having a timestamp file as a proxy for B. Whenever B is rebuilt, the timestamp file is updated, causing A to be rebuilt again. With histories, this doesn't work very well, because when C is changed, A is rebuilt, and C is changed back to the original state, the timestamp of B won't go back to the original, meaning the rule for A cannot re-use the result of the first build. In my implementation, I could work around this difficulty by using the Second, I see that Shake HEAD uses a 32-bit hash as an index to the history. I worry that this might be too small, because you'd only need about |
Thanks for the feedback @takano-akio! The second issue is not really a problem: it caches them per key, so you actually need 2^16 different files for a given key, so there is vastly less chance of hitting a problem. That said, it's likely I'll switch to something more robust at some point in the future. I don't imagine it's the most pressing issue right now. If Haskell had a good, high performance, crypto hash without a boatload of dependencies I'd use that... For the package file problem, it sounds like you want an oracle that pulls the relevant information out of the package database, and to depend on the results of the oracle. When you say the ghc-pkg database, is that what you are literally talking about, or just an example? I am wondering if the ghc-pkg database should be modelled as a builtin rule type, since it will likely be necessary for Hadrian. |
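The 2^16 figure above follows from the birthday approximation. The sketch below is only an illustration of that arithmetic, assuming hashes are uniformly distributed; `collisionProbability` is a hypothetical helper and not part of Shake.

```haskell
-- Birthday approximation: probability that at least two of n uniformly
-- random hashes of the given bit width collide.
--   p ~ 1 - exp (-n (n - 1) / (2 * 2^bits))
-- Assumption: hashes behave like independent uniform samples.
collisionProbability :: Int -> Double -> Double
collisionProbability bits n = 1 - exp (negate (n * (n - 1)) / (2 * space))
  where
    -- Size of the hash space, e.g. 2^32 for a 32-bit hash.
    space = 2 ^ bits :: Double
```

In GHCi, `collisionProbability 32 65536` evaluates to roughly 0.39, which is why about 2^16 entries is the point where a 32-bit index starts to look risky. Because the history is stored per key, `n` here would be the number of distinct entries recorded for a single key, not the total number of files in the build.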
I see, thank you for the clarification. I'll try to actually understand the code now.
Thank you for the advice, I will try this.
Yes, my issue is actually about a ghc package database, although I think it might still be useful to have a general way to sidestep the issue of telling Shake about complex states. |
I managed to get our modified build system working, although it still has a few bugs. Here is what I found so far:
I think I'd prefer a system with action-level caching because:
|
We attempted to get Hadrian working with the Shared mechanism and succeeded at https://gitlab.haskell.org/ghc/ghc/merge_requests/317. It also quickly became clear we need a more principled approach for finding the missing I've overcome the timestamp problem you mention by making Shake enable hashes, so that works fine. Generally, if different parts of your rule have different dependencies, they should be different rules. I suggest separate configure/build/install rules, because they have different dependencies. That way you get caching of all but the install rule quite nicely. To partially replace My feeling after the experiments is that Shake should probably have rule level caching, but not action level caching, at least not in the Core (it's possible to add action level caching on the outside). The reason is partly that, as you say, if you want a cloud build experience and are doing it at the action level, you have to modify everything in your build system, whereas the rule level can be more automatic. I've written up all the notes in https://github.com/ndmitchell/shake/blob/master/docs/Cloud.md and would welcome your feedback. |
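The timestamp problem discussed above can be illustrated with a small model. This is a sketch under stated assumptions, not Shake's implementation: `File`, `write`, `byTimestamp`, `byContents`, `build` and `revertHits` are all hypothetical names, and the file contents stand in for a real content hash.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical model of a dependency: its contents plus a modification time.
data File = File { contents :: String, mtime :: Int } deriving (Eq, Show)

-- Writing always produces a fresh modification time, even when the
-- contents are identical to an earlier version.
write :: Int -> String -> File
write now new = File new now

-- Two ways to derive a history key from a dependency.
byTimestamp, byContents :: File -> String
byTimestamp = show . mtime
byContents  = contents  -- stand-in for a hash of the contents

-- History: dependency stamp -> stored build result.
type History = Map.Map String String

-- Reuse a stored result when the stamp is known; otherwise "run" the
-- rule and record its result. The Bool reports whether the cache was hit.
build :: (File -> String) -> File -> History -> (Bool, History)
build stamp dep hist =
  case Map.lookup (stamp dep) hist of
    Just _  -> (True, hist)
    Nothing -> (False, Map.insert (stamp dep) ("built from " ++ contents dep) hist)

-- Scenario from the thread: change the dependency, then change it back.
revertHits :: (File -> String) -> Bool
revertHits stamp =
  let (_, h1)  = build stamp (write 1 "v1") Map.empty
      (_, h2)  = build stamp (write 2 "v2") h1
      (hit, _) = build stamp (write 3 "v1") h2
  in hit
```

Here `revertHits byTimestamp` is `False`, because the reverted file carries a new modification time, while `revertHits byContents` is `True`, because the reverted contents match an entry already in the history.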
This branch implements 'memoized rules', which use a new kind of cache to re-use results not only from the last build, but also from builds further back in history, or builds from different build trees.
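The difference between remembering only the last build and keeping a history can be sketched as follows. This is a minimal illustration and not the data structure used in this branch or in Shake: `Key`, `DepHash`, `Result`, `LastBuild`, `History` and the functions below are hypothetical.

```haskell
import qualified Data.Map.Strict as Map

type Key     = String  -- e.g. an output path (hypothetical)
type DepHash = Int     -- stand-in for the hash of one dependency
type Result  = String  -- stand-in for the stored output

-- Last-build store: one remembered (dependencies, result) pair per key.
type LastBuild = Map.Map Key ([DepHash], Result)

-- History store: every (dependencies, result) pair recorded per key.
type History = Map.Map Key (Map.Map [DepHash] Result)

-- Recording in the last-build store overwrites the previous entry.
recordLast :: Key -> [DepHash] -> Result -> LastBuild -> LastBuild
recordLast k deps r = Map.insert k (deps, r)

lookupLast :: Key -> [DepHash] -> LastBuild -> Maybe Result
lookupLast k deps db =
  case Map.lookup k db of
    Just (deps', r) | deps' == deps -> Just r
    _                               -> Nothing

-- Recording in the history keeps older entries alongside the new one.
recordHistory :: Key -> [DepHash] -> Result -> History -> History
recordHistory k deps r = Map.insertWith Map.union k (Map.singleton deps r)

lookupHistory :: Key -> [DepHash] -> History -> Maybe Result
lookupHistory k deps db = Map.lookup k db >>= Map.lookup deps
```

After building a key with dependencies `[1,2]` and then with `[1,3]`, `lookupLast` for `[1,2]` returns `Nothing` while `lookupHistory` still returns the first result, which corresponds to switching back to a previously built branch.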
In principle it's also possible to share caches between multiple computers, but I left out this feature because it seemed to have a lot of trade-offs to be made, whose right choice depends on users' environments. Instead I made the backend pluggable by allowing the user to override some fields in ShakeOptions.
I have been using this at work for about two weeks, and it has been working well so far.
I expect that more work will have to be done on this branch before it can get merged, but I thought it would be good to ask for comments about it at this point.
What do you think?