Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional stable cell IDs #3177

Open
leventov opened this issue Dec 15, 2024 · 6 comments
Open

Optional stable cell IDs #3177

leventov opened this issue Dec 15, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@leventov
Copy link

Description

Stable (and per-workspace-unique) cell IDs are absolutely needed for scalable, on-by-default cell results persistent cache, see #3176. Stable cell IDs can even enable "moving the cell" to another notebook while preserving its cache.

Stable cell IDs would also be very helpful at the interaction of Marimo and Git. As a minimum, to better track changes per cell when Git itself fumbles with its regular line tracking feature (e.g., when files are renamed or cell order is changed).

Suggested solution

The only solution that I can think of at the moment is as follows:

  1. Marimo app uses uuids as cell-ids internally instead of what it's using at the moment (four letters whose uniqueness doesn't seem to be enforced even within the App?)
  2. Optional id parameter in cell decorator: @cell(id="uuid"). Marimo tries to make sure they are not duplicated on the code level (if the user copy-pastes cell code and doesn't change the id, for example).
  3. If id is not specified, it's generated by marimo app internally (and thus, will not be stable), and this means the cell will be excluded from the on-by-default caching, and writing an explicit with persistent_cache() block within such a cell will fail.
  4. When editing through Marimo UI, and when cell results cache is on-by-default, Marimo app will generate uuid for the cell itself and will add it to the code (thankfully the user will not see this code most of the time in the app).

Alternative

Two main problems with the solution above:

  • UUIDs in the code are ugly and clutter. Not as much accidental complexity as requiring users to think about caching themselves, but still not ideal.
  • When the user manipulates the code of the notebook outside the Marimo UI, and they need to create a new cell, they need to generate UUID themselves and insert it in code, which is a speed/flow bump.

However, I don't see alternatives. Using hashes of cell's code + hashes of all its code dependencies will lead to a lot of trashing of cell_ids and thus will subvert the proposed solution for persistent caching of results at #3176.

Additional context

No response

@leventov leventov added the enhancement New feature or request label Dec 15, 2024
@leventov
Copy link
Author

leventov commented Dec 16, 2024

FWIW, https://github.com/dbos-inc/dbos-transact-py doesn't seem to be any smarter than this, they also require putting stable workflow ID (equivalent of cell_id for Marimo) right in the code: see https://docs.dbos.dev/python/tutorials/idempotency-tutorial.

By the way, for potential integration with DBOS which might be helpful in productionising Marimo notebooks, the fact that DBOS's workflow IDs are UUIDs (by default) is another reason to make cell IDs uuids: as they could be matched: cell = workflow, cell ID = workflow ID. Although, another alternative would be to identify cells with a DBOS's steps.

@dmadisetti
Copy link
Collaborator

Stable IDs are also required for UI-Elements to be properly re-used

@mscolnick did something towards this here: #3061
which could be made more deterministic, but seems OK for the time.

but aside from that, not sure why changing the IDs system is needed (esp, on a user exposed level)

The ID is an internal mechanism for runtime, but isn't saved anywhere.
The current ID system does have an issue: #1962 but there's trivial fixes for this


In cached execution, the cell is hashed just like it is wrapped in a persistent_cache block.
There's 2 steps to hashing:

  1. Execution path- hashing the relevant executing cells into a given path
  2. Content Addressing- additional hashing content is added to the cell based on certain inputs, helping mitigate state and UI based changes.

This makes it robust to:

  1. Ordering (hashing occurs on a graph level
  2. Accidental state changes producing an inconsistent caching state

If the issue is tracking changes, cells can be given names. This was tangentially suggested as a solution for #3129 (which could have used execution based hashing).

But hashes and IDs aren't super readable. Might be good for detecting downstream notebook changes, but limits marimo notebooks from being written in any editor.


Maybe HTML exports could be tagged with an execution hash on a per cell level?
Thoughts on this?

@leventov
Copy link
Author

leventov commented Dec 16, 2024

Without stable cell IDs, there is no way to know if some module_hashes in the cache dir correspond to some stale or entirely deleted cells. And therefore, cache cleaning couldn't be made automatic and relatively fast, as it would require computing the entire module_hash graph for the entire workspace, which can in general require arbitrarily expensive (both in terms of time and $$$, due to LLM calls) computations (for example, if the user removed parts of the cache manually, for whatever reason, or if they off-loaded it with git-annex to another device and didn't re-download no the machine where the hypothetical marimo cache tidy command is executed).

Whereas, with .py-level cell ids, this is a very straightforward python code scanning excercise. The same complexity as pytest does when it scans directories for test files and functions, for example.

And, the alternative to "targeted cleanup" would be removing the entire cache only, which again, is problematic because it will potentially force very expensive re-computations.

I really see the DX/UX gap that may grow here between Marimo and Jupyter here (because with all its problems, Jupyter generally doesn't lose your cell results), and a potential deal-breaker for scaling Marimo workspaces beyond "I work alone by myself or within a small team" scenarios.

Cf. also this comment by Leo Meyerov, I think it's a relevant bit of practical insight about notebook DX/UX.

@leventov
Copy link
Author

leventov commented Dec 16, 2024

Re: editing notebooks in any editors,

  1. IDs are still optional. The user may write @app.cell() def __(): ... by hand. In many cases it will be fine because actually in some cases caching cell results is superfluous (e.g., in unit tests). If the "cache cell results by default" setting is ON, next time marimo runtime "lays its hands" on the cell code (either via UI opening or command-line processing), it may supplant missing UUIDs itself unless the cell is also tagged with "test" or some other tag which excludes the cells from automatic result caching (which could also be configured).
  2. In popular IDEs, generating UUID with a shortcut is not that hard: VS Code, IntelliJ.

When the user edits the notebook in the most bare bones editor (vi, nano, Notepad, whatever), and they know they really want to cache this cell, opening https://www.uuidgenerator.net/ or running uuidgen from command line and copying the UUID over from there is also not that hard.

To increase the legibility of the fact that these IDs are required for caching, probably the optional param should be called cache_id, i.e., @app.cell(cache_id=...). In this way, the users who are used to seeing bare cells code will learn quickly and remember firmly that if they want the cell to be auto-cached, they need to add cache_id.

@dmadisetti
Copy link
Collaborator

dmadisetti commented Dec 16, 2024

entire module_hash graph for the entire workspace, which can in general require arbitrarily expensive (both in terms of time and $$$, due to LLM calls) computations

The execution path hashing is done statically and is very cheap.
So this doesn't really follow either:

Without stable cell IDs, there is no way to know if some module_hashes in the cache dir correspond to some stale or entirely deleted cells.

since the diff of what's in the notebook and what's in the cache is easy to find

If it is useful, then maybe #3129 with the execution path hashing might be worth pursuing.

Opt in or opt out cache should be as easy as cache = True | False, with the reference mechanism hidden, but cell name or persistent_cache name should be leveraged where possible


edit: woops, tagged the wrong issue initally, meant the unique hash name one

@leventov
Copy link
Author

See relevant to this feature request part of this comment:

Note important thing: whereas you probably most think about two scenarios:

  1. Editing cells through Marimo Web UI (and VS Code extension provides the same, or its closer to scenario two?) where the JS side can "remember" the cell_id the user is editing to ensure "short history" continuity
  2. Editing cells as code in simple text editors, by a human, where UUID proposal is ugly and makes somewhat bumpy user experience with need to generate UUIDs by hand or figure out how to do it in IDE,

They main scenario as I see using Marimo is a third one: the majority of cells are created and edited by AIs, e.g. within Co-STORM. The wrapper code can add UUIDs trivially, so the clunkiness of (2) above is not an issue. But also unless that wrapping logic is also knee-deep in Marimo internals (calls to module_code_hash directly before it instructs the AI to re-write the cell, for instance, and then manually tells Marimo somehow to look up potentially duplicate results there), the recent history dedupliication wouldn't work.

The abiliity to provide explicit cell_ids (which my wrapper logic can generate easily) would enable much better separation of concerns between AI workflow wrappers and Marimo layer. And also, those UUIDs needed to "correlate cells across Git branches" and moving cells from one notebook to another without "losing the context" is also directly motivated by those AI-workflow-ish scenarios.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants