Optional stable cell IDs #3177
Comments
FWIW, https://github.com/dbos-inc/dbos-transact-py doesn't seem to be any smarter than this: it also requires putting a stable workflow ID (the equivalent of cell_id for Marimo) right in the code; see https://docs.dbos.dev/python/tutorials/idempotency-tutorial. By the way, for a potential integration with DBOS, which might be helpful in productionising Marimo notebooks, the fact that DBOS's workflow IDs are UUIDs (by default) is another reason to make cell IDs UUIDs, as they could be matched: cell = workflow, cell ID = workflow ID. Another alternative would be to identify cells with DBOS's steps.
Stable IDs are also required for UI elements to be properly re-used; @mscolnick did something towards this here: #3061. But aside from that, I'm not sure why changing the ID system is needed (especially on a user-exposed level). The ID is an internal mechanism for the runtime, but it isn't saved anywhere. In cached execution, the cell is hashed just as if it were wrapped in a persistent_cache block.
This makes it robust to:
If the issue is tracking changes, cells can be given names. This was tangentially suggested as a solution for #3129 (which could have used execution-based hashing). But hashes and IDs aren't super readable. They might be good for detecting downstream notebook changes, but they keep marimo notebooks from being written in any editor. Maybe HTML exports could be tagged with an execution hash on a per-cell level?
Without stable cell IDs, there is no way to know whether some module_hashes in the cache dir correspond to stale or entirely deleted cells. Therefore, cache cleaning couldn't be made automatic and relatively fast, as it would require computing the entire module_hash graph for the entire workspace, which can in general require arbitrarily expensive (both in terms of time and $$$, due to LLM calls) computations (for example, if the user removed parts of the cache manually, for whatever reason, or if they off-loaded it with git-annex to another device and didn't re-download on the machine where the hypothetical Whereas, with And, the alternative to "targeted cleanup" would be removing the entire cache only, which, again, is problematic because it will potentially force very expensive re-computations. I really see a DX/UX gap that may grow here between Marimo and Jupyter (because with all its problems, Jupyter generally doesn't lose your cell results), and a potential deal-breaker for scaling Marimo workspaces beyond "I work alone by myself or within a small team" scenarios. Cf. also this comment by Leo Meyerov, I think it's a relevant bit of practical insight about notebook DX/UX.
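To make the "targeted cleanup" idea concrete, here is a minimal sketch. It assumes a hypothetical cache layout in which every cell's results live in a subdirectory named after its stable cell ID; that layout, and the helper names `find_stale_cache_dirs` and `remove_stale_cache_dirs`, are illustrative assumptions and not marimo's actual cache format or API.

```python
import shutil
import tempfile
from pathlib import Path


def find_stale_cache_dirs(cache_root: Path, live_cell_ids: set[str]) -> list[Path]:
    """Return cache subdirectories whose name is not a live cell ID.

    Assumption (hypothetical layout): cache_root/<cell_id>/... holds the
    cached results of the cell with that stable ID.
    """
    return sorted(
        entry
        for entry in cache_root.iterdir()
        if entry.is_dir() and entry.name not in live_cell_ids
    )


def remove_stale_cache_dirs(cache_root: Path, live_cell_ids: set[str]) -> list[str]:
    """Delete stale cache subdirectories and return the removed cell IDs."""
    removed = []
    for stale_dir in find_stale_cache_dirs(cache_root, live_cell_ids):
        shutil.rmtree(stale_dir)
        removed.append(stale_dir.name)
    return removed
```

The point of the sketch is that cleanup only needs the set of IDs that still appear in the workspace's source files; no cell has to be re-executed and no hash graph has to be recomputed.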
Re: editing notebooks in any editors,
When the user edits the notebook in the most bare bones editor (vi, nano, Notepad, whatever), and they know they really want to cache this cell, opening https://www.uuidgenerator.net/ or running To increase the legibility of the fact that these IDs are required for caching, probably the optional param should be called |
The execution path hashing is done statically and is very cheap.
since the diff of what's in the notebook and what's in the cache is easy to find If it is useful, then maybe #3129 with the execution path hashing might be worth pursuing. Opt-in or opt-out cache should be as easy as edit: whoops, tagged the wrong issue initially, meant the unique hash name one
See the part of this comment that is relevant to this feature request:
Description
Stable (and per-workspace-unique) cell IDs are absolutely needed for a scalable, on-by-default persistent cache of cell results, see #3176. Stable cell IDs can even enable "moving the cell" to another notebook while preserving its cache.
Stable cell IDs would also be very helpful at the intersection of Marimo and Git. At a minimum, they would help track changes per cell when Git itself fumbles with its regular line tracking (e.g., when files are renamed or the cell order is changed).
Suggested solution
The only solution that I can think of at the moment is as follows:
- An optional `id` parameter in the cell decorator: `@cell(id="uuid")`. Marimo tries to make sure IDs are not duplicated on the code level (if the user copy-pastes cell code and doesn't change the id, for example).
- If `id` is not specified, it's generated by the marimo app internally (and thus will not be stable). This means the cell will be excluded from the on-by-default caching, and writing an explicit `with persistent_cache()` block within such a cell will fail.
Alternative
Two main problems with the solution above:
However, I don't see alternatives. Using hashes of a cell's code + hashes of all its code dependencies will lead to a lot of thrashing of cell_ids and thus will subvert the proposed solution for persistent caching of results at #3176.
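The thrashing concern can be illustrated with a small sketch. The `content_key` function below is a simplified, hypothetical stand-in for "hash of the cell's code plus hashes of its code dependencies"; it is not how marimo computes its module_hash. It only shows that any edit to an upstream cell, even a comment, changes a content-derived identifier, whereas an explicit ID written into the source does not change.

```python
import hashlib
import uuid


def content_key(cell_code: str, dependency_codes: list[str]) -> str:
    """Hypothetical identifier derived from a cell's code and its dependencies' code."""
    digest = hashlib.sha256()
    for source in [cell_code, *dependency_codes]:
        digest.update(source.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent sources cannot run together
    return digest.hexdigest()


cell_code = "result = summarize(df)"

# The cell itself is unchanged; only an upstream cell gains a comment.
key_before = content_key(cell_code, ["df = load('data.csv')"])
key_after = content_key(cell_code, ["df = load('data.csv')  # reload daily"])

# An explicit ID (hypothetical) is generated once and then kept in the source,
# so edits to the cell or to its dependencies leave it untouched.
stable_id = str(uuid.uuid4())

print(key_before == key_after)
```

Under this scheme the same logical cell gets a new identifier after every upstream edit, so anything keyed on that identifier (cache directories, Git-level tracking) loses the link to the cell's history.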
Additional context
No response