Notebook execution engine and caching #4
What about the interaction with git? I imagine including notebooks with outputs in git is impractical, doubly so if they are evaluated automatically. Using Sphinx's caching system would be another option, but I'm not sure how to make it play nice with notebooks, in particular as far as keying the cache by inputs rather than raw text.
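For concreteness, a minimal sketch of what "keying the cache by inputs" could mean: hash only the code-cell sources, so prose-only edits don't invalidate cached outputs. This is illustrative, not from the thread; the function name is made up.

```python
import hashlib

import nbformat

def execution_cache_key(path):
    """Return a hash of only the code-cell sources of a notebook.

    Editing markdown cells leaves the key unchanged, so cached
    outputs stay valid; editing any code cell changes the key.
    """
    nb = nbformat.read(path, as_version=4)
    digest = hashlib.sha256()
    for cell in nb.cells:
        if cell.cell_type == "code":
            digest.update(cell.source.encode("utf-8"))
    return digest.hexdigest()
```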
Agreed that notebooks are terrible w/ git :-) Our thinking was to have a two-way sync between text files and ipynb notebooks.
Ah, so the ipynb files would not be exposed to the user? What is the advantage of the ipynb format then, as opposed to e.g. Sphinx caching?
More like: users have the option of using either ipynb or text files. We want to treat people writing text files and people writing notebooks in an interface as co-equal citizens. We'd still use Sphinx to cache content at the parsing level, but not at the execution level.
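A minimal sketch of what that two-way sync could look like, using jupytext's public read/write API (the file names are illustrative):

```python
import jupytext

# Text file -> notebook: pick up edits made in a plain-text editor.
nb = jupytext.read("page.md")
jupytext.write(nb, "page.ipynb")

# Notebook -> text file: pick up edits made in the Jupyter interface.
nb = jupytext.read("page.ipynb")
jupytext.write(nb, "page.md", fmt="md:myst")
```

In practice the `jupytext --sync` command handles the which-file-is-newer logic; the calls above only show the round trip between the two representations.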
Thanks, I now understand the plan better. Can you also explain what you mean by "Generation of some kind of report that can be fed into Sphinx, or read on its own"? Further:
One requirement that is quite important, and that isn't in the original list, is transparent and robust cache invalidation: imagine the user updating an external dependency or a Python module, or checking out a different git branch. The git branch case is a rather annoying scenario: if the user is advanced, then all notebooks are in .gitignore. Further, as far as I can see, they must be automatically force-overwritten on checkout, or otherwise the two-way sync breaks. This, in turn, creates a setup that is tough to maintain even for advanced users, since it would require commit hooks; hooks that merely read/write a bunch of notebooks may take minutes even when there are no cache misses. On the other hand, a cache that is fully wiped after checking out a different branch isn't really useful, given how often branch switches happen.
Also, some background here: I used nbconvert + git-hook-based notebook output caching in a course a while ago. It turned out to be more trouble than it was worth: explaining to users how to work with it, waiting for hooks to run, and the repository breaking in unpredictable ways.
This is because of a nifty feature in jupinx that we'd like to keep. It generates an HTML report of the "status" of each runnable content page in the book (things like CI badges, etc.).
I agree - I think we'll need to strike a balance between robustness and how complex we want the build system to be (and how edge-case-proof it is). To me, the obvious minimal things that would invalidate the cache would be:
Beyond that, I'd want to discuss the pros and cons. I think that for many of these things we should design best practices and documentation around them, but not necessarily try to design technical solutions for them.
I didn't mean that caching should automatically be smarter than just checking whether the code fed to the kernel has changed. Trying to go beyond that is most likely diving into a rabbit hole. Instead, I meant that the existence of the cache must never interfere with user actions. That guarantee seems fragile if the build process supports bidirectional sync.
More inspiration (thanks @akhmerov :-) ): https://github.com/minrk/delft-visit/blob/master/cachedoutput.py |
I'd like to provide some background about the design of that module: @minrk built it following a discussion of how to deal with the long computations in the MOOC materials.
Right now we aren't using any of this anymore. Instead, all the files are jupytext markdown, and the outputs are provided by CI. Being able to edit plain-text files via the web interface is a killer feature of such a workflow.
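A rough sketch of what that CI execution step could look like, using jupytext plus nbclient; the directory layout and timeout are assumptions, and error handling is omitted:

```python
from pathlib import Path

import jupytext
from nbclient import NotebookClient

# Execute every jupytext markdown page and write the populated
# notebook next to it, so the site build can pick up the outputs.
# "content/" is an illustrative source directory.
for md in sorted(Path("content").glob("**/*.md")):
    nb = jupytext.read(md)
    NotebookClient(nb, timeout=600).execute()
    jupytext.write(nb, md.with_suffix(".ipynb"))
```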
Hey @gregcaporaso - I spoke with @chrisjsewell, who mentioned that you're certainly welcome to start trying out the code here, and contributions are welcome as well! There may be some things that will change, and a few of the major issues are still in discussion mode, but your and your team's input would be appreciated! Just FYI. Check out the issues in this repo to see where conversations / to-dos / etc. are right now.
Yeh, and obviously have a read through the documentation first: https://jupyter-cache.readthedocs.io, where hopefully I've given a decent explanation of the basic structure of things (particularly look at the Python API section).
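For reference, the basic flow from that Python API section looked roughly like this; names may have changed between jupyter-cache versions, so treat it as a sketch and defer to the docs:

```python
from jupyter_cache import get_cache

# Open (or create) a cache in a local directory.
cache = get_cache(".jupyter_cache")

# Store an already-executed notebook in the cache.
cache.cache_notebook_file(path="example.ipynb")

# Look up a notebook by its (hashed) content rather than its path.
match = cache.match_cache_file(path="example.ipynb")

# Inspect what has been cached so far.
print(cache.list_cache_records())
```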
This is an issue to discuss the notebook execution and caching step of the build process.
As people write computational content that potentially takes a significant amount of time to run, we need the ability to efficiently run notebooks only when needed. Here are some things that I think we'd need:
What else am I missing?