Notebook execution engine and cacheing #4

Open · choldgraf opened this issue Feb 11, 2020 · 12 comments

@choldgraf (Member)

This is an issue to discuss the notebook execution and caching step of the build process.

As people write computational content that potentially takes a significant amount of time to run, we need the ability to efficiently re-run notebooks only when needed. Here are some things I think we'd need:

  • Separate "edits to content" from "edits to code cells". Cell rearrangements and code-cell changes should trigger re-execution; content changes should not (a rough sketch follows this list).
  • Parallel execution of notebooks
  • Generation of some kind of report that can be fed into Sphinx, or read on its own
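
To make the first bullet concrete, here's a rough, purely illustrative sketch of keying execution on the ordered code cells only, so that prose edits leave the key unchanged while code edits or cell reordering change it. It uses nbformat; the function name is made up and nothing here is a committed design:

```python
import hashlib

import nbformat

def execution_key(path: str) -> str:
    """Hash the ordered code-cell sources of a notebook (illustrative only)."""
    nb = nbformat.read(path, as_version=4)
    digest = hashlib.sha256()
    for cell in nb.cells:
        if cell.cell_type == "code":
            digest.update(cell.source.encode("utf-8"))
            digest.update(b"\x00")  # separator, so merging/splitting cells changes the key
    return digest.hexdigest()
```

A notebook would then be re-executed only when this key differs from the one stored alongside the cached outputs.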

What else am I missing?

@akhmerov (Contributor)

What about the interaction with git? I imagine including notebooks with outputs in git is impractical, doubly so if they are evaluated automatically.

Using Sphinx's caching system would be another option, but I'm not sure how to make it play nicely with notebooks, in particular when it comes to keying the cache by inputs rather than raw text.

@choldgraf (Member, Author)

Agreed that notebooks are terrible w/ git :-)

Our thinking was to have a two-way sync between ipynb files and a text-based version of their content: use the text-based version for version control etc., and use the ipynb file to store outputs and extra metadata. We'd maybe handle this two-way sync with Jupytext.
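
For illustration, here's a hedged sketch of that round-trip using Jupytext's Python API. The file names are placeholders, and in practice Jupytext's paired-notebook sync (`jupytext --sync`) also takes care of preserving existing outputs:

```python
import jupytext

# The text file is the canonical, version-controlled source...
nb = jupytext.read("page.md")        # parse the Markdown representation
jupytext.write(nb, "page.ipynb")     # materialise an ipynb that holds outputs and metadata

# ...and notebook edits can be written back to text (outputs are not stored in .md).
nb = jupytext.read("page.ipynb")
jupytext.write(nb, "page.md", fmt="md")
```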

@akhmerov (Contributor)

Ah, so the ipynb files would not be exposed to the user? What is the advantage of the ipynb format then, as opposed to e.g. sphinx caching?

@choldgraf (Member, Author)

More like: users would have the option of using either ipynb or text files. We want to treat people writing text files and people writing notebooks in an interface as co-equal citizens.

We'd still use Sphinx to cache content at the parsing level, but not at the execution level.

@akhmerov (Contributor)

Thanks, I now understand the plan better.

Can you also explain what you mean by "Generation of some kind of report that can be fed into Sphinx, or read on its own"?

Further:

> What else am I missing?

One requirement that is quite important, and that isn't in the original list, is transparent and robust cache invalidation: imagine the user updating an external dependency or a Python module, or checking out a different git branch.


The git branch case is a rather annoying scenario: say the user is advanced, so all notebooks are gitignored. Further, as far as I can see, they must be automatically force-overwritten on checkout, or otherwise the two-way sync breaks.

This, in turn, creates a setup that is tough to maintain even for advanced users, since it would require commit hooks. Commit hooks that merely read/write a bunch of notebooks can take minutes even if there are no cache misses.

On the other hand, a cache that's fully wiped on checking out a different branch isn't really useful, because of how often that would happen.

@akhmerov (Contributor)

Also some background here: I used nbconvert + git-hooks-based notebook output caching in a course a while ago. It turned out to be more trouble than it was worth: explaining to users how to work with it, waiting for hooks to run, and the repository breaking in unpredictable ways.

@choldgraf (Member, Author)

> Can you also explain what you mean by "Generation of some kind of report that can be fed into Sphinx, or read on its own"?

This is because of a nifty feature in jupinx that we'd like to keep: it generates an HTML report of the "status" of each of the runnable content pages in the book, with things like CI badges etc.
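
Purely as a toy illustration of the kind of output (not jupinx's actual report format, which carries more detail):

```python
from html import escape
from typing import Dict

def render_status_report(statuses: Dict[str, str]) -> str:
    """Render a toy HTML table of per-page execution status."""
    rows = "".join(
        f"<tr><td>{escape(page)}</td><td>{escape(status)}</td></tr>"
        for page, status in statuses.items()
    )
    return f"<table><tr><th>Page</th><th>Status</th></tr>{rows}</table>"

print(render_status_report({
    "intro.ipynb": "executed (from cache)",
    "analysis.ipynb": "execution failed",
}))
```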

> One requirement that is quite important, and that isn't in the original list, is transparent and robust cache invalidation: imagine the user updating an external dependency or a Python module, or checking out a different git branch.

I agree - I think we'll need to strike a balance between robustness and how complex we want the build system to be (and how edge-case-proof it is). To me, the obvious minimal things that should invalidate the cache are:

  • Editing runnable code content
  • Rearranging runnable code cells

Beyond that, I'd want to discuss the pros and cons. For many of these things, I think we should design best practices and documentation around them, but not necessarily try to design technical solutions for them.

@akhmerov (Contributor)

I didn't mean that caching should automatically be smarter than just checking whether the code fed to the kernel has changed. Trying to go beyond that is most likely diving into a rabbit hole.

Instead, I meant that the existence of the cache must never interfere with user actions. That seems fragile if the build process supports bidirectional sync.

@choldgraf (Member, Author)

More inspiration (thanks @akhmerov :-) ): https://github.com/minrk/delft-visit/blob/master/cachedoutput.py

@akhmerov (Contributor)

I'd like to provide some background about the design of that module: @minrk built it following a discussion of how to deal with long computations in the MOOC materials.

  • A lot of the complexity stems from a design decision we made: each cell within a notebook would depend only on the contents of the very first cell and itself. In hindsight, while this is a reasonable design decision, making the cache depend on it is too much work.
  • The course was developed by many people, with frequent switching across branches. We didn't want the cache to be invalidated by checking out a different branch; this is why the cache storage is separated from the sources (see the sketch after this list).
  • We set up git filters to execute all outputs on checkout and clean the notebooks on commit. This caused an intolerable slowdown when working with the repo, and eventually we abandoned the idea.
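
To illustrate the second point, here's a minimal sketch, assuming the cache is keyed by a content hash of the code cells (like the `execution_key` sketch earlier in the thread): executed notebooks live outside the working tree, addressed by that key, so a branch switch with identical code stays a cache hit. All names here are made up.

```python
import shutil
from pathlib import Path
from typing import Optional

# Assumed location outside the git working tree, so checkouts never touch it.
CACHE_DIR = Path.home() / ".cache" / "nb-exec"

def lookup(key: str) -> Optional[Path]:
    """Return the cached executed notebook for this execution key, if any."""
    hit = CACHE_DIR / f"{key}.ipynb"
    return hit if hit.exists() else None

def store(executed: Path, key: str) -> None:
    """Copy an executed notebook into the out-of-tree cache."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    shutil.copy(executed, CACHE_DIR / f"{key}.ipynb")
```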

Right now we aren't using any of this anymore. Instead, all the files are Jupytext Markdown, and the outputs are provided by CI. Being able to edit plaintext files via the web interface is a killer feature of such a workflow.

@choldgraf choldgraf transferred this issue from executablebooks/meta Feb 19, 2020
@choldgraf (Member, Author)

choldgraf commented Mar 13, 2020

hey @gregcaporaso - I spoke with @chrisjsewell, who mentioned that you're certainly welcome to start trying out the code here, and contributions are welcome as well! There may be some things that will change, and a few of the major issues are still in discussion mode, but input from you and your team would be appreciated! Just FYI.

Check out the issues in this repo to see where conversations / to-dos / etc. are right now.

@chrisjsewell (Member)

> hey @gregcaporaso - I spoke with @chrisjsewell, who mentioned that you're certainly welcome to start trying out the code here, and contributions are welcome as well! There may be some things that will change, and a few of the major issues are still in discussion mode, but input from you and your team would be appreciated! Just FYI.
>
> Check out the issues in this repo to see where conversations / to-dos / etc. are right now.

Yeh, and obviously have a read through the documentation first: https://jupyter-cache.readthedocs.io, where hopefully I've given a decent explanation of the basic structure of things (particularly look at the Python API section).
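
For a quick taste, here's a sketch along the lines of the documented Python API (paraphrased from those docs; exact signatures may differ between versions):

```python
from jupyter_cache import get_cache

cache = get_cache(".jupyter_cache")  # create or open a cache directory

# Hash the notebook and store it (and its outputs) in the cache;
# check_validity=False skips the check that outputs are present and up to date.
record = cache.cache_notebook_file("example.ipynb", check_validity=False)

print(cache.list_cache_records())  # inspect what has been cached
```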
