Optional stable cell IDs #3177

leventov · 2024-12-15T18:46:01Z

Description

Stable (and per-workspace-unique) cell IDs are absolutely needed for scalable, on-by-default cell results persistent cache, see #3176. Stable cell IDs can even enable "moving the cell" to another notebook while preserving its cache.

Stable cell IDs would also be very helpful at the interaction of Marimo and Git. As a minimum, to better track changes per cell when Git itself fumbles with its regular line tracking feature (e.g., when files are renamed or cell order is changed).

Alternative

Two main problems with the solution above:

UUIDs in the code are ugly and clutter. Not as much accidental complexity as requiring users to think about caching themselves, but still not ideal.
When the user manipulates the code of the notebook outside the Marimo UI, and they need to create a new cell, they need to generate UUID themselves and insert it in code, which is a speed/flow bump.

However, I don't see alternatives. Using hashes of cell's code + hashes of all its code dependencies will lead to a lot of trashing of cell_ids and thus will subvert the proposed solution for persistent caching of results at #3176.

Additional context

No response

The text was updated successfully, but these errors were encountered:

leventov · 2024-12-16T07:36:00Z

FWIW, https://github.com/dbos-inc/dbos-transact-py doesn't seem to be any smarter than this, they also require putting stable workflow ID (equivalent of cell_id for Marimo) right in the code: see https://docs.dbos.dev/python/tutorials/idempotency-tutorial.

By the way, for potential integration with DBOS which might be helpful in productionising Marimo notebooks, the fact that DBOS's workflow IDs are UUIDs (by default) is another reason to make cell IDs uuids: as they could be matched: cell = workflow, cell ID = workflow ID. Although, another alternative would be to identify cells with a DBOS's steps.

dmadisetti · 2024-12-16T10:04:34Z

Stable IDs are also required for UI-Elements to be properly re-used

@mscolnick did something towards this here: #3061
which could be made more deterministic, but seems OK for the time.

but aside from that, not sure why changing the IDs system is needed (esp, on a user exposed level)

The ID is an internal mechanism for runtime, but isn't saved anywhere.
The current ID system does have an issue: #1962 but there's trivial fixes for this

In cached execution, the cell is hashed just like it is wrapped in a persistent_cache block.
There's 2 steps to hashing:

Execution path- hashing the relevant executing cells into a given path
Content Addressing- additional hashing content is added to the cell based on certain inputs, helping mitigate state and UI based changes.

This makes it robust to:

Ordering (hashing occurs on a graph level
Accidental state changes producing an inconsistent caching state

If the issue is tracking changes, cells can be given names. This was tangentially suggested as a solution for #3129 (which could have used execution based hashing).

But hashes and IDs aren't super readable. Might be good for detecting downstream notebook changes, but limits marimo notebooks from being written in any editor.

Maybe HTML exports could be tagged with an execution hash on a per cell level?
Thoughts on this?

leventov · 2024-12-16T10:22:02Z

Without stable cell IDs, there is no way to know if some module_hashes in the cache dir correspond to some stale or entirely deleted cells. And therefore, cache cleaning couldn't be made automatic and relatively fast, as it would require computing the entire module_hash graph for the entire workspace, which can in general require arbitrarily expensive (both in terms of time and $$$, due to LLM calls) computations (for example, if the user removed parts of the cache manually, for whatever reason, or if they off-loaded it with git-annex to another device and didn't re-download no the machine where the hypothetical marimo cache tidy command is executed).

Whereas, with .py-level cell ids, this is a very straightforward python code scanning excercise. The same complexity as pytest does when it scans directories for test files and functions, for example.

And, the alternative to "targeted cleanup" would be removing the entire cache only, which again, is problematic because it will potentially force very expensive re-computations.

I really see the DX/UX gap that may grow here between Marimo and Jupyter here (because with all its problems, Jupyter generally doesn't lose your cell results), and a potential deal-breaker for scaling Marimo workspaces beyond "I work alone by myself or within a small team" scenarios.

Cf. also this comment by Leo Meyerov, I think it's a relevant bit of practical insight about notebook DX/UX.

leventov · 2024-12-16T10:31:44Z

Re: editing notebooks in any editors,

IDs are still optional. The user may write @app.cell() def __(): ... by hand. In many cases it will be fine because actually in some cases caching cell results is superfluous (e.g., in unit tests). If the "cache cell results by default" setting is ON, next time marimo runtime "lays its hands" on the cell code (either via UI opening or command-line processing), it may supplant missing UUIDs itself unless the cell is also tagged with "test" or some other tag which excludes the cells from automatic result caching (which could also be configured).
In popular IDEs, generating UUID with a shortcut is not that hard: VS Code, IntelliJ.

When the user edits the notebook in the most bare bones editor (vi, nano, Notepad, whatever), and they know they really want to cache this cell, opening https://www.uuidgenerator.net/ or running uuidgen from command line and copying the UUID over from there is also not that hard.

To increase the legibility of the fact that these IDs are required for caching, probably the optional param should be called cache_id, i.e., @app.cell(cache_id=...). In this way, the users who are used to seeing bare cells code will learn quickly and remember firmly that if they want the cell to be auto-cached, they need to add cache_id.

dmadisetti · 2024-12-16T10:50:37Z

entire module_hash graph for the entire workspace, which can in general require arbitrarily expensive (both in terms of time and $$$, due to LLM calls) computations

The execution path hashing is done statically and is very cheap.
So this doesn't really follow either:

Without stable cell IDs, there is no way to know if some module_hashes in the cache dir correspond to some stale or entirely deleted cells.

since the diff of what's in the notebook and what's in the cache is easy to find

If it is useful, then maybe #3129 with the execution path hashing might be worth pursuing.

Opt in or opt out cache should be as easy as cache = True | False, with the reference mechanism hidden, but cell name or persistent_cache name should be leveraged where possible

edit: woops, tagged the wrong issue initally, meant the unique hash name one

leventov · 2024-12-16T18:36:05Z

See relevant to this feature request part of this comment:

Note important thing: whereas you probably most think about two scenarios:

Editing cells through Marimo Web UI (and VS Code extension provides the same, or its closer to scenario two?) where the JS side can "remember" the cell_id the user is editing to ensure "short history" continuity

Editing cells as code in simple text editors, by a human, where UUID proposal is ugly and makes somewhat bumpy user experience with need to generate UUIDs by hand or figure out how to do it in IDE,

They main scenario as I see using Marimo is a third one: the majority of cells are created and edited by AIs, e.g. within Co-STORM. The wrapper code can add UUIDs trivially, so the clunkiness of (2) above is not an issue. But also unless that wrapping logic is also knee-deep in Marimo internals (calls to module_code_hash directly before it instructs the AI to re-write the cell, for instance, and then manually tells Marimo somehow to look up potentially duplicate results there), the recent history dedupliication wouldn't work.

The abiliity to provide explicit cell_ids (which my wrapper logic can generate easily) would enable much better separation of concerns between AI workflow wrappers and Marimo layer. And also, those UUIDs needed to "correlate cells across Git branches" and moving cells from one notebook to another without "losing the context" is also directly motivated by those AI-workflow-ish scenarios.

leventov added the enhancement New feature or request label Dec 15, 2024

leventov mentioned this issue Dec 15, 2024

Persistent cache with auto-cleanup #3176

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optional stable cell IDs #3177

Optional stable cell IDs #3177

leventov commented Dec 15, 2024

leventov commented Dec 16, 2024 •

edited

Loading

dmadisetti commented Dec 16, 2024

leventov commented Dec 16, 2024 •

edited

Loading

leventov commented Dec 16, 2024 •

edited

Loading

dmadisetti commented Dec 16, 2024 •

edited

Loading

leventov commented Dec 16, 2024

Optional stable cell IDs #3177

Optional stable cell IDs #3177

Comments

leventov commented Dec 15, 2024

Description

Suggested solution

Alternative

Additional context

leventov commented Dec 16, 2024 • edited Loading

dmadisetti commented Dec 16, 2024

leventov commented Dec 16, 2024 • edited Loading

leventov commented Dec 16, 2024 • edited Loading

dmadisetti commented Dec 16, 2024 • edited Loading

leventov commented Dec 16, 2024

leventov commented Dec 16, 2024 •

edited

Loading

leventov commented Dec 16, 2024 •

edited

Loading

leventov commented Dec 16, 2024 •

edited

Loading

dmadisetti commented Dec 16, 2024 •

edited

Loading