A new notebook document format for improved workflow integration #4
Hey Konrad. One of the things I would like to distinguish more clearly in the Jupyter notebook is the in-memory format versus the on-disk format. There are certainly things you can keep in memory that give you more information, such as which cells have not yet been run, or whether the kernel has restarted and cells are out of sync with it, that (obviously) do not belong on disk. I'll read your proposal more attentively later. Thanks!
## Problem

Jupyter notebooks do not integrate well with other tools supporting complex workflows in computational science. Version control systems require a clear separation of human-edited content and computed content. The current notebook file format mixes them. Workflow managers and provenance trackers require that all computations be replicable. For interactive computations, replicability requires storing a full log of user actions. The current notebook file format does not preserve this information, although it is available at execution time.
I am not sure I agree with this statement about the separation of human and compute content in VCS. Also, I think your working definition of replicability is subtle enough that many folks in the community will disagree with your statement about it requiring a full log of user actions. More background on your definitions would be helpful. To make it more clear, we regularly speak of the notebook as offering reproducibility for computations.
For my definitions of replicability and reproducibility see my blog post. This specific use of the terms is quite common by now, but not yet universal. In short, replication refers to repeating a calculation identically for verification, whereas reproduction is about re-doing a computational experiment using different tools. Replication is a purely technical step that requires no understanding of the scientific content, whereas reproduction implies understanding a method and implementing it differently.
I agree with @khinsen on `replicable`; I try to use `replicable` with notebooks, even if the habit of saying `reproducible` is hard to get rid of. Nothing prevents us from linking to content that describes the difference between replicable and reproducible more precisely. Also, the people who will read this document are most likely already aware of the difference.
Some general comments...

I think some of the ideas you have here are very interesting. The main point for me is that it would be useful to have a full record of code blocks that a kernel runs and a clear link between those code blocks+output and the ones that appear in a notebook. That idea is worth thinking about and is mostly independent of the broader version control issues.

At the same time, given the large number of users we currently have (and their millions of notebooks), there is no way we can completely break the existing notebook format. I am not at all convinced that breaking the existing notebook format is required to address the main point above. It would not be difficult to write a kernel session monitor that records the full record of the cells and their output in a way that is linkable to the same cells in a current format notebook. With a small amount of changes to the notebook format (hashes of code cells and/or cell uuids) the relationship between the kernel record and the notebook document could be strengthened even further.

If you can come up with concrete proposals that address the questions here without requiring any changes to the notebook format, there is a chance that the community could become interested. Most importantly, in order to justify even small breakages to the notebook format, we would need to see that prototypes of the ideas here, that leveraged the existing notebook format, were actually solving users' problems in significant ways.
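To make the "kernel session monitor" idea above more concrete, here is a minimal sketch in plain Python. It is not an existing Jupyter component: `SessionLog`, `record`, and `runs_of` are hypothetical names, and the log is fed by hand rather than by a real kernel. It only illustrates how a hash of the code could link notebook cells to an append-only execution record.

```python
import hashlib
from dataclasses import dataclass, field


def source_hash(source: str) -> str:
    """SHA-1 of a code block, used as the link between notebook and log."""
    return hashlib.sha1(source.encode("utf-8")).hexdigest()


@dataclass
class SessionLog:
    """Hypothetical append-only record of everything a kernel executed."""
    entries: list = field(default_factory=list)

    def record(self, source: str, output: str) -> None:
        # Each entry keeps its position in the session, so the full
        # execution order can be recovered later.
        self.entries.append({
            "seq": len(self.entries) + 1,
            "hash": source_hash(source),
            "source": source,
            "output": output,
        })

    def runs_of(self, cell_source: str) -> list:
        """All log entries whose code matches the given notebook cell."""
        wanted = source_hash(cell_source)
        return [e for e in self.entries if e["hash"] == wanted]


# Simulated session: the same cell is executed twice with different state.
log = SessionLog()
log.record("x = 1", "")
log.record("x + 1", "2")
log.record("x = 5", "")
log.record("x + 1", "6")

matches = log.runs_of("x + 1")
seqs = [e["seq"] for e in matches]
outputs = [e["output"] for e in matches]
```

The notebook itself would only need to expose the cell source (or a cell uuid) for this lookup, which is why such a log could, in principle, live next to a current-format notebook.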
Sorry, no, I cannot do that. I am not sufficiently familiar with the internals of Jupyter to make such a proposal. The notebook format definition is not sufficient, as it doesn't specify what is and isn't a correct notebook file. For example, if I add a file to the "code cell" structure, is that a change to the notebook format or not? As for solving user's problems, I am mainly interested in solving non-user's problems, i.e. the problems that prevent people like me from using Jupyter. It is unlikely that there is much demand for those in the existing community. My proposal is about extending the community. |
@khinsen some of the statements you are making just aren't true. For example, we have a JSON schema for the notebook format and we validate notebooks against that schema. Here is that schema: https://github.com/jupyter/nbformat/blob/master/nbformat/v4/nbformat.v4.schema.json If a notebook doesn't validate against that schema, then it is not a valid notebook. If it does, it is.
@ellisonbg Thanks for the pointer to the schema! There is no reference to this in the notebook format documentation, so it's a bit hard to find and it's not clear whether it is part of the format definition or just a useful tool. But the main information that's missing from my point of view is a definition of notebook semantics. I have added an example to the repository which is syntactically valid but semantically invalid: the output doesn't match the source code. My tiny example is obviously wrong, so it's not a real problem. But for more complex computations it is not obvious which relations between source code and output are supposed to hold inside a notebook file. This is a core issue for replicability. It is also an issue for version control, because merge operations can easily lead to syntactically correct but semantically invalid files. There is no way to validate semantics with reasonable effort, so notebook files that have been tampered with (such as my example) are not easy to detect. But a good notebook format should allow detection of accidentally introduced semantic inconsistencies. This is why my proposal includes SHA-1 hashes. Could such hashes be added to the current notebook format? Syntactically, this looks difficult: if I understand the schema correctly, there is no room for adding fields. Perhaps one could figure out a way to squeeze this information into existing fields somehow. But the first question is: does the notebook format make any promises about consistency at all? |
Good point, we can try to fix that. About CRCs and cryptographic checksums that ensure consistency: I (personally) think it will be a tough sell to make them mandatory, and tools would have to implement them correctly to guarantee consistency. A tool can perfectly well save a 3+1 = 7 notebook with valid hashes. We had a discussion about marking "dirty" cells in the UI, which turned out to be more complicated than we thought. One of the problems with the way the notebook currently works is that the kernel can get disconnected, so some decisions about how to persist what, and where, are a bit weird.
Yes,
No, the current schema does support adding keys. In general, extra fields in other places leave the notebook technically valid, but the cells become unrecognized and implementations are allowed to ignore them. This would allow us to make a minor revision, adding fields, that would not be backward incompatible. Still, before committing to, for example, a sha1 key at the top level, nothing prevents us or anyone else from playing with the idea. Jhamrick had a prototype of that for grading notebooks with nbgrader, in order to check that the test-case cells were not tampered with by students (in the end the hash was moved to SQLite for other reasons), but the metadata does contain other info which is nbgrader specific.
In the format itself, no. There used to be an optional signature to be sure the notebook was actually generated by the current machine (for security). Does that make sense and answer some of your questions? I can try to see whether I can come up with an nbconvert plugin that hashes all cells, stores the hashes, and allows you to check them. Would that help?
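As a rough illustration of "hash all cells, store the hash, and check the hash", the sketch below works on a notebook loaded as a plain dictionary. It is not an nbconvert plugin and not the nbgrader prototype mentioned above. The metadata key `source_sha1` and the functions `stamp` and `check` are assumptions made up for this example; the only format facts relied on are that v4 cells carry a `metadata` object and that `source` may be a string or a list of strings.

```python
import copy
import hashlib

HASH_KEY = "source_sha1"  # hypothetical metadata key, not defined by nbformat


def cell_source(cell: dict) -> str:
    """nbformat v4 stores 'source' as a string or as a list of strings."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def digest_of(cell: dict) -> str:
    return hashlib.sha1(cell_source(cell).encode("utf-8")).hexdigest()


def stamp(nb: dict) -> dict:
    """Return a copy of the notebook with each code cell's source hash
    stored in that cell's metadata."""
    nb = copy.deepcopy(nb)
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            cell.setdefault("metadata", {})[HASH_KEY] = digest_of(cell)
    return nb


def check(nb: dict) -> list:
    """Indices of code cells whose source no longer matches the stored hash,
    or that carry no hash at all."""
    return [
        i for i, cell in enumerate(nb["cells"])
        if cell["cell_type"] == "code"
        and cell.get("metadata", {}).get(HASH_KEY) != digest_of(cell)
    ]


notebook = {
    "nbformat": 4, "nbformat_minor": 0, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Demo"},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": ["3 + 1"], "outputs": []},
    ],
}

stamped = stamp(notebook)
clean = check(stamped)                     # nothing changed since stamping
stamped["cells"][1]["source"] = ["3 + 4"]  # edit the code after stamping
tampered = check(stamped)
unstamped = check(notebook)                # a hash-less notebook is flagged too
```

As noted in the comment above, this only detects edits made after stamping; a tool that writes a wrong output together with a matching hash would pass the check.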
Making hashes optional sounds fine, as long as it is straightforward for users to produce notebooks that do contain them. Any tool attempting validation would flag a hash-less notebook as "dubious". The point of hashes is not to prevent buggy software from producing wrong notebooks; there is no way to prevent that in general. The point is to allow merging of independent changes to a notebook and recognize output data that has become invalidated in the course of the merge. However, I am not convinced that the addition of hashes is of much interest in itself. To make notebooks good citizens of version controlled repositories, I think it is also necessary to separate human input from computational output as I explain in my proposal. The reason is that merging differences in the computational output will most likely lead to a complete mess, including syntactically wrong MIME data and other unpleasant things. I looked at the discussion about "dirty" cells and it seems to me that the difficulties with that idea are ultimately the same as the problems I am trying to solve with this proposal: the current notebook data model has no clear notion of dependencies between its data items. My "stale output" cell type addresses the same issue as those "dirty" cells but does so on the basis of real computational dependency information. I don't quite understand the issue of making the requirements for creating a valid notebook too difficult. None of what I propose requires any user intervention. It's the Jupyter notebook tool that should do all the work behind the scenes. |
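One way to picture "stale output" detection after a merge is sketched below. The dependency model is an assumption made for the example, not the one defined in the proposal: cells are taken to run top to bottom, so each output depends on its own cell and on every cell before it. The function names are hypothetical.

```python
import hashlib


def chain_hashes(sources: list) -> list:
    """Hash each cell together with everything executed before it.

    Assumption: strictly linear execution order, so cell i depends on
    cells 0..i. Real dependency tracking could be finer grained.
    """
    running = hashlib.sha1()
    digests = []
    for src in sources:
        running.update(src.encode("utf-8"))
        running.update(b"\0")  # separator so cell boundaries matter
        digests.append(running.hexdigest())
    return digests


def stale_outputs(sources: list, recorded: list) -> list:
    """Indices of cells whose recorded hash no longer matches the inputs."""
    current = chain_hashes(sources)
    return [i for i, (now, then) in enumerate(zip(current, recorded))
            if now != then]


original = ["a = 1", "b = 2", "a + b"]
recorded = chain_hashes(original)      # stored with the outputs at run time

merged = ["a = 1", "b = 20", "a + b"]  # a merge changed only the second cell
stale = stale_outputs(merged, recorded)
unchanged = stale_outputs(original, recorded)
```

Here the third cell is flagged even though its own text did not change, because the output it carries was computed from a different `b`. No user intervention is involved; the hashes would be written by the tool at execution time.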
You can't verify the accuracy of all computations with hashes alone. You can't even fully verify with certifying algorithms. Trivial ones certainly, but you're still also at the behest of the operating environment (versions of software, hardware, etc.) That's not to say that it shouldn't be done or isn't a plausible goal, just that it is a way larger scope than can be dictated in this proposal. |
If the primary goal is separating input from output for version control, this can be done relatively simply, and there are a variety of ways to go about it (ipymd does it, nbexplode does it, etc.). Hashes are one possible implementation detail for locating output with its matching input, and since those hashes would reside exclusively in the not-always-tracked output file / directory / database / whatever, they wouldn't be polluting anything. We've talked about the 'output sidecar' file before, and could consider adopting one such implementation as an optional, official way to split the notebook storage. |
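A minimal sketch of the "output sidecar" split, under stated assumptions: it is not how ipymd or nbexplode work, the `split` and `join` functions are hypothetical, and outputs are keyed only by the SHA-1 of the cell source, so two cells with identical source would share one sidecar entry.

```python
import copy
import hashlib


def _key(cell: dict) -> str:
    """SHA-1 of the cell source (string or list of strings in nbformat v4)."""
    src = cell.get("source", "")
    text = "".join(src) if isinstance(src, list) else src
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def split(nb: dict):
    """Separate a notebook into an input-only notebook and an output sidecar."""
    inputs = copy.deepcopy(nb)
    sidecar = {}
    for cell in inputs["cells"]:
        if cell["cell_type"] == "code":
            sidecar[_key(cell)] = {
                "outputs": cell["outputs"],
                "execution_count": cell["execution_count"],
            }
            cell["outputs"] = []
            cell["execution_count"] = None
    return inputs, sidecar


def join(inputs: dict, sidecar: dict) -> dict:
    """Reattach outputs whose source hash still matches; others stay empty."""
    nb = copy.deepcopy(inputs)
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            saved = sidecar.get(_key(cell))
            if saved is not None:
                cell["outputs"] = copy.deepcopy(saved["outputs"])
                cell["execution_count"] = saved["execution_count"]
    return nb


nb = {
    "nbformat": 4, "nbformat_minor": 0, "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": "1 + 1",
         "outputs": [{"output_type": "execute_result", "execution_count": 1,
                      "data": {"text/plain": "2"}, "metadata": {}}]},
    ],
}

inputs, sidecar = split(nb)
stripped = inputs["cells"][0]["outputs"]   # what version control would see
roundtrip = join(inputs, sidecar)

inputs["cells"][0]["source"] = "1 + 2"     # the code is edited afterwards
edited = join(inputs, sidecar)             # the old output no longer attaches
```

Only the input-only notebook would need to be tracked; the sidecar can be ignored, regenerated, or stored elsewhere.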
Making a field optional when its semantics are hard to get right is a recipe for something that goes unused or is used badly. We can do it right in the notebook, but people rely. I don't want to end up with something like Windows Vista UAC, where everybody clicks without reading.
@Carreau Which programs other than Jupyter actually create notebook files from scratch? I have tried to find some but so far without success. |
PyCharm, off the top of my head.
Sphinx Gallery from Gael Varoquaux wants to auto-generate notebooks from Sphinx docs, so that you can write docs as rst and offer a "download as notebook" option to users. It is in progress and may not be finished yet. ipymd has to generate one at least in memory; runipy likely does too, as they have templated variables. I don't know how much they rely on nbformat to do so, though.
I saw a presentation this morning at the Saclay Open Software Day on Sphinx Gallery and also another project that generates notebooks as a documentation of a computation. I think they actually illustrate the problem I am trying to solve, because they use notebooks not as a storage and exchange format, but for output only - it's strictly one-way. A bit like generating PDF, with some obvious added value. The goal of my proposal is that such tools could read and write notebooks. |
Do you know whether these presentations were recorded? I saw Gael give a 5-minute lightning talk on Sphinx Gallery, but would like to know more. I'm not sure why Sphinx Gallery couldn't read notebooks; IIRC Gael was complaining about manual editing, not the format. Also, @fperez is likely to be around Saclay these days, so you might be able to have a back-and-forth with him in person, which might be much more productive than discussing by mail.
There's a camera next to me, so I suppose the sessions were recorded. I'll post a link when I know more. And yes, @fperez is here as well, he gave the opening keynote. |
OK, great! Say hi! (And looking forward to the video.)
@Carreau The videos are up! Unfortunately I didn't find an occasion to talk to @fperez about anything technical such as this issue. |
@khinsen have you seen org-markup? Bonus: it's a native GitHub format too [https://github.com/fniessen/refcard-org-mode/blob/master/README.org]. @ellisonbg org-markup (specifically org-babel) already has mechanisms to separate the source from the result of any given embedded calculation. Even better, it has tagging support. Tagging means you can tag parts of the document to assign a completeness status (e.g. TODO) or to control what is executed at publishing time (e.g. noexport). I use this feature myself as part of my test-driven document development process and my literate programming development process. In addition:
@timoc Yes, I know org-markup, I use it all the time for lots of things. And yes, it is one step up from Jupyter's format in terms of managing the ingredients of a notebook. But it doesn't keep a trace of the computation either, so in my view it is not sufficient. |
@khinsen, maybe I'm missing the point of this feature request. I am completely new to Jupyter, and I come from an Emacs background using org mode. I posted to this feature request explicitly because I saw the overlap. What I understand of this feature request comes more from the comments than from the premise, but if I understand the premise of your original feature, it is to separate these concerns. I agree. The concerns are those of the (org/Jupyter) document as a source artefact, of the 'computation' as one or more compilation artefact(s), and of the result, which is the final set of result artefact(s) based on the 'compilation' artefacts. Even in a distributed computation environment, this would seem to be the case. This seems to be the same process you find in any sufficiently mature continuous build, test and delivery infrastructure, if you separate the concerns as you outline. I think org-markup is the choice for the source document format, because with tags you can encode the code to test and validate the outcome in the org document. I would suggest a pre, post and final tagset, so that computational code fragments can be used to validate (possibly with a hash?) the computation and result artefacts as part of a traditional build approach. I have yet to look at any of the videos, so maybe I am being naive about the challenges you face that org does not address. In case my assumptions are not correct, can you suggest an English presentation that will give better context on this problem?
This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/jupyter-and-github-alternative-file-formant/4972/38 |
This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/jupyter-and-github-alternative-file-formant/4972/41 |
This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/jupyter-and-github-alternative-file-formant/4972/51 |
Hi @khinsen, this is Zach from the @jupyter/software-steering-council. We're working through old JEPs and closing proposals that are no longer active or may not be relevant anymore. Under Jupyter's new governance model, we have an active Software Steering Council that reviews JEPs weekly. We are catching up on the backlog now. Since there has been no active discussion on this JEP in a while, I'd propose we close it here (we'll leave it open for two more weeks in case you'd like to revive the conversation). If you would like to re-open the discussion after we close it, you are welcome to do that too. I'd like to mention that this proposal could be replaced by #103, which proposes a Markdown-based notebook format, in case you might be interested in joining that conversation.
For the background, see this blog post.