
How should/would the cache be used remotely? #9

Open
chrisjsewell opened this issue Feb 24, 2020 · 8 comments

Comments

@chrisjsewell
Member

Originally posted by @choldgraf in #6 (comment)

Maybe a use-case to consider here.

A team has a really big book that takes 2 hours to build. An author forks the book, clones it locally, and edits one page. They want to contribute the page back. A few questions:

  • Do they need to run the entire 2-hour build process locally before seeing what the page looks like? --> this seems like it could be handled by letting the cache-execution step be configured for specific files only (see the sketch after this list)
  • When they make a PR, does the entire book need to be re-built top to bottom in the CI/CD job? --> here the cache could probably be stored as a build artifact in a CI/CD job, independent of the .git repository
  • Is there any way for a "master cache" to be bundled with the book?
    • If so, then is that a pattern we want to encourage?
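
A minimal sketch of the first bullet's idea, assuming a simple include-pattern config; the names here (`EXECUTE_PATTERNS`, `should_execute`) are hypothetical illustrations, not an existing jupyter-cache API:

```python
# Hypothetical include-pattern filter: execute only the page being
# edited, and fall back to cached outputs for everything else.
from fnmatch import fnmatch
from pathlib import Path

EXECUTE_PATTERNS = ["edited-page*.ipynb"]  # opt-in list (hypothetical)

def should_execute(path: Path) -> bool:
    return any(fnmatch(path.name, pattern) for pattern in EXECUTE_PATTERNS)

for nb_path in Path("book").rglob("*.ipynb"):
    if should_execute(nb_path):
        print(f"executing {nb_path}")           # run it and refresh the cache
    else:
        print(f"cached outputs for {nb_path}")  # skip the 2-hour rebuild
```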

I could see a benefit of committing the cache, in the sense that then git would keep track of changes to the cache and diffs to the pages would propagate through github, clones, etc. However, I worry about a few things:

  • The cache would probably become gigantic for non-trivial projects, unless it could be incrementally-updated and have some kind of "shallow clone" behavior.
  • It would require sub-moduling the book repository, so I think it would only work for fairly advanced power-users.
  • The cache diffs themselves would be binary (I think?), so they wouldn't make any sense on GitHub, which would make it hard to know what has changed in the cache.
@jstac
Member

jstac commented Feb 24, 2020

Our Python lectures take around 1.5 hours to build from scratch, so this is our scenario.

For 99% of our PRs, we just make the edits in RST, generate the ipynb for that one page and then run it manually to see if it looks OK. This is fine for most edits, which typically adjust language or tweak code.

If we're concerned about how this looks in the PDF, say, we generate that one page locally. Sometimes RAs will include an image in the PR to show that the PDF looks fine.

These are imperfect systems but they work OK for the most part. So my vote would be for us to favor simplicity, at least initially, and not commit the cache. (Plus, I'm a reasonably sophisticated user, but submodules still confuse me. My instinct is to fear and distrust them.)

@choldgraf
Member

@jstac you can never trust two things: politicians, and sub-modules.

I wonder if one potential way to address this would involve meeting another use-case: building single-page documents. If we make it easy to build the HTML or PDF of one page from the CLI, letting users quickly preview what it looks like, the same machinery could be re-used by people who only want to build a single page and not an entire book...
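
As a hedged sketch of that machinery, assuming plain nbclient + nbconvert rather than any existing jupyter-book command (`preview_page` is a hypothetical name):

```python
# Execute a single notebook and export it to HTML, skipping the full
# book build entirely.
from pathlib import Path

import nbformat
from nbclient import NotebookClient
from nbconvert import HTMLExporter

def preview_page(path: str) -> Path:
    nb = nbformat.read(path, as_version=4)
    NotebookClient(nb).execute()                       # run only this page
    body, _resources = HTMLExporter().from_notebook_node(nb)
    out = Path(path).with_suffix(".html")
    out.write_text(body, encoding="utf-8")
    return out

# e.g. preview_page("chapters/my-page.ipynb") -> chapters/my-page.html
```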

@jstac
Member

jstac commented Feb 24, 2020

Yep, that seems like a good idea. Two birds with one stone, etc. And the single-page use case is certainly important.

Such tools are available in jupinx for reviewing edits to QE lectures. I suppose cross-references involving other pages won't work, but for 99% of cases it's perfectly fine.

@mmcky
Member

mmcky commented Feb 25, 2020

Glad you have mentioned this @jstac. It will be really important to support rendering of single pages for usability. We currently do this using the environment variable FILES= and passing that through to Sphinx. I agree the CLI tool needs to cater to this and make it easier :-)

@choldgraf
Member

An approach I was playing around with for the jupyter book CLI was to use jupyter-book page: https://jupyterbook.org/features/page.html

perhaps we could use the same pattern, but also allow for PDF output with a kwarg or something?

@chrisjsewell
Member Author

As discussed with @mmcky, jupinx currently uses a static cache, housed in the Sphinx _build folder on an Amazon server. The build is persisted for all execution triggers (a cron job every hour), each of which runs a git pull then a sphinx-build. For this use case, the (just merged) hash implementation of jupyter-cache should work fine.

@mmcky also noted that their current (Sphinx-based) cache implementation doesn't work on Travis CI, presumably because the cache is compressed/un-compressed, changing the file mtimes that Sphinx uses to determine re-builds (matching against a dictionary stored in the pickled environment object). This wouldn't be an issue for jupyter-cache, since it is hash based.
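
A minimal sketch of why hash-based keys sidestep this, assuming the key is derived from the notebook's executable content (an illustration of the idea, not jupyter-cache's actual implementation):

```python
# The cache key depends only on what would be executed (kernel + code
# cells), so compressing/un-compressing the cache on CI -- which
# changes mtimes -- cannot invalidate it.
import hashlib

import nbformat

def notebook_cache_key(path: str) -> str:
    nb = nbformat.read(path, as_version=4)
    hasher = hashlib.sha256()
    hasher.update(nb.metadata.get("kernelspec", {}).get("name", "").encode())
    for cell in nb.cells:
        if cell.cell_type == "code":
            hasher.update(cell.source.encode())
    return hasher.hexdigest()

# A cached result is reused iff notebook_cache_key(path) matches the
# stored key, regardless of file timestamps.
```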

It would also be interesting to think how it might work with GitHub actions, CircleCI and ReadTheDocs builds.

@chrisjsewell
Member Author

Just a note to self, in case this issue is encountered (SQLite on NFS): jupyter/notebook#1782

@choldgraf
Member

Another related note: for jupyter book I was starting to collect a repository with several CI/CD patterns that could be used to deploy books: https://github.com/choldgraf/jupyter-book-deploy-demo

I think it'd be helpful if we replicated that repository for the new build system, ideally with the multiple levels of complexity that users may want (e.g. vanilla build w/o execute then host online, execute and build, and execute + cache and build).
