
Reactively respond to non-python file changes (e.g. parquet, csv, json) #3258

Open
leventov opened this issue Dec 20, 2024 · 3 comments
Labels: enhancement (New feature or request)

@leventov

Description

Part of Roadmap: #2899

Related to refresh of the app itself on file changes: #1511, #2675

(Creating this issue for now so that I can refer and link to it from another issue.)

Suggested solution

Perhaps not a universal solution, but in some use cases, when these files are tracked in Git, piggy-backing on Git's own file change tracking (which is itself highly optimised) would be nice, e.g. via pygit2.Diff.
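A minimal sketch of that idea, assuming the watched files live in a Git repository and pygit2 is installed (the watched path below is hypothetical):

import pygit2

repo = pygit2.Repository(".")  # assumes the notebook's data files live in a repo at cwd

# Diff the index against the working directory; this reuses Git's own
# optimised change detection instead of hashing every file ourselves.
diff = repo.diff()
changed = {delta.new_file.path for delta in diff.deltas}

if "data/table.parquet" in changed:  # hypothetical watched file
    ...  # trigger re-run of the cells that read this file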

Alternative

No response

Additional context

No response

@leventov leventov added the enhancement New feature or request label Dec 20, 2024
@leventov leventov changed the title Reactively respond to non-python file changes Reactively respond to non-python file changes (e.g. parquet, csv, json) Dec 20, 2024
@leventov
Author

Initial thoughts, high uncertainty:

I think this should be a context manager API, rather than something like automatically picking up builtin open() or other "known" file-opening library calls in the cell code. For example:

with mo.watched_file("mymodel.pt") as f:
    # note: we don't try to "pick up" torch.load() calls automatically,
    # relying on the user to do the more explicit thing, i.e., use this context block.
    m = torch.load(f)
    ...

The file name should probably(?) be a string literal like "mymodel.pt" rather than dynamic (e.g., watched_file(filename_coming_from_another_cell_as_a_variable)), so that the need to re-run the cell can be determined by a static parsing pass over the code.
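To illustrate why the literal matters: a static pass over the cell's source can then discover watched files without executing the cell. A rough sketch (mo.watched_file is the hypothetical API from above):

import ast

src = 'with mo.watched_file("mymodel.pt") as f:\n    m = None'
watched = [
    node.args[0].value
    for node in ast.walk(ast.parse(src))
    if isinstance(node, ast.Call)
    and isinstance(node.func, ast.Attribute)
    and node.func.attr == "watched_file"
    and node.args
    and isinstance(node.args[0], ast.Constant)  # only string literals are discoverable
]
print(watched)  # ['mymodel.pt']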

Note: these watched_file() contexts could sit inside a persistent_cache context manager, optionally within the same with statement, like with mo.persistent_cache(...) as _cache, mo.watched_file(...) as f:, or within a cached cell (whether persistently cached or not); then changes to the file should feed into cache invalidation: #3271.
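For instance (mo.persistent_cache is existing marimo API; mo.watched_file is still the hypothetical API proposed here):

import marimo as mo
import pandas as pd

with mo.persistent_cache("features") as _cache, mo.watched_file("data.parquet") as f:
    # a change to data.parquet should invalidate the cached result
    df = pd.read_parquet(f)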

It seems that making these files writable within the same context block, even if theoretically possible, would make the whole thing way too hard to reason about, even from the user's perspective. So it should probably be prohibited, and not just at the syntactic level: we should probably use https://github.com/samuelcolvin/watchfiles and raise an exception on exiting the mo.watched_file() context if there were changes to the file between context entry and exit.
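A rough sketch of that enforcement, assuming watchfiles is available (watched_file_guard is a made-up name, not a marimo API):

import threading
from contextlib import contextmanager

from watchfiles import watch

@contextmanager
def watched_file_guard(path):
    changed = threading.Event()
    stop = threading.Event()

    def _watch():
        # watchfiles.watch blocks, yielding batches of (Change, path) tuples;
        # stop_event lets us shut the watcher down on context exit.
        for _batch in watch(path, stop_event=stop):
            changed.set()

    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    try:
        yield path
    finally:
        stop.set()
        thread.join()
        if changed.is_set():
            raise RuntimeError(f"{path} was modified inside the watched block")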

Not sure: what to do about file writes within other cells? Should they also be explicitly marked with a context block? Perhaps that would be necessary to build a static DAG where "cells connected via file writes/reads" semantically coincides with "cells connected in the 'normal' way via variables" (otherwise we risk circular DAGs). But to avoid user omissions of that, which would be way too easy, there should perhaps be a global watch on all the watched files within the notebook; if a change comes from the notebook process itself, it would raise an error. But how to reliably detect all file openings within the process? I don't know yet.

To explore: whether using https://github.com/samuelcolvin/watchfiles within the context manager alone would be sufficient to prevent races like https://git-scm.com/docs/racy-git, or whether "extra tricks" along the lines described in that document would also be needed (at least in the case where we care not just about the watched file's mtime, but about its contents as well).

Content-addressable

The "cached watch" may need to care about the contents rather than mtime alone (the API for indicating this in the user's cell code is TBD): e.g., there may be a noisy external process that the user doesn't control and cannot easily patch, and that frequently "touches" the file without changing its contents. FWIW, when another cell within the notebook writes to the file, we could provide an extra wrapper that avoids touching the file: first write to a temp file, then replace the original only if the content hash differs, along the lines of how it's already done in the cache or in Git itself.
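A sketch of that wrapper (write_if_changed is a hypothetical helper, not an existing marimo API):

import hashlib
import os
import tempfile

def write_if_changed(path: str, data: bytes) -> bool:
    # Write `data` to `path` atomically, skipping the write (and thus the
    # mtime bump) when the content hash is unchanged. Returns True if written.
    new_hash = hashlib.sha256(data).hexdigest()
    try:
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() == new_hash:
                return False  # identical content: don't touch the file
    except FileNotFoundError:
        pass
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows
    return True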

Note: this is orthogonal to #3271: the cache may or may not be fine with mtimes alone; a "simple runtime watch" (with the cell/block simply @cache-d, not persistently, or not cached at all) may likewise be fine, or not, with mtimes alone.

I don't currently see a way to avoid races without doing one of the following upon entering the watched_file context manager:

  • Read the file into memory in full, along with computing its hash, and then serve the file "from memory", i.e., as just a single bytearray.
  • Add the file to a Git index: either a "native" one, if we discover the user already tracks this file in one of their own Git repositories, or, if not, a special-purpose submodule that marimo would maintain specifically for this purpose.

The second option is needed if the file is too big to read into memory in full at once. However, it has the downside that the file content will be copied in full. Also, if the file is bigger than core.bigFileThreshold (default is 512 MiB), my understanding is (not 100% sure, need to double-check) that Git itself will basically ignore its own stored hash and degrade to "if mtime is newer, the file is considered updated", which would render the whole exercise pointless.
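A sketch of the first option (snapshot_file is a hypothetical name): the cell reads from an in-memory snapshot, so the bytes it consumes and the hash used for invalidation are guaranteed to be consistent regardless of concurrent on-disk writes:

import hashlib
import io
from contextlib import contextmanager

@contextmanager
def snapshot_file(path):
    with open(path, "rb") as f:
        data = f.read()  # full read: only viable for reasonably small files
    digest = hashlib.sha256(data).hexdigest()
    # The cell reads from this BytesIO, never from disk, so a racy on-disk
    # write after context entry cannot desynchronise contents and hash.
    yield io.BytesIO(data), digest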

@leventov
Author

TODO: at least check that the chosen approach will be forward-compatible with the following systems for data sharing/management, in the order of priority as I see it:

Note: I tried to choose data management systems that are simple and local-files-first (even if only for metadata, not the actual beefy files, which are downloaded later). For the plethora of non-files-first systems, such as Data Catalogs, Hugging Face, etc., the alternatives are either:

  1. Write a dedicated cell that fetches the latest data(set) version via the API. In effect, this cell would necessarily be non-deterministic, but the user should still be able to "force rerun" it even if it is cached.
  2. Create a separate mechanism/function within Marimo that would be familiar with certain data catalog / remote dataset versioning metadata APIs and, if configured a la with mo.watched_hf_dataset("https://huggingface.co/datasets/datonic/world_development_indicators") as _x: ..., would do a metadata fetch and trigger the run on update if needed. But that would be a separate feature.

@davidgasquez commented Dec 24, 2024

Heya @leventov! Sharing a bit of context around Datadex in case it helps. Datadex itself is not a data management system, but "opinionated glue" around some of the tools used in data these days. I don't think it fits alongside the DVCs or Git Annexes/LFSs out there.

Over the years I compiled a list of "Data Package Managers" which might be helpful in your search!

PS: Making a small (not very informed) suggestion.

Create a separate mechanism/function within Marimo that would be familiar with certain data catalog / remote dataset versioning metadata APIs, and if configured a la with mo.watched_hf_dataset("https://huggingface.co/datasets/datonic/world_development_indicators") as _x: ... and would do a metadata fetch and trigger the run on update if needed. But, that would be a separate feature.

Since HF has a working filesystem URI in Python, it would be great to make the API something like watched_dataset("hf://...").
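For what it's worth, a sketch of the metadata-fetch idea behind such a URI, using huggingface_hub (the watched_dataset wrapper itself remains hypothetical):

from huggingface_hub import HfApi

def dataset_revision(repo_id: str) -> str:
    # Commit sha of the dataset repo's default branch; a changed sha means
    # the dataset was updated and dependent cells should be re-run.
    return HfApi().dataset_info(repo_id).sha

rev = dataset_revision("datonic/world_development_indicators")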
