
Reactively respond to non-python file changes (e.g. parquet, csv, json) #3258

Open
leventov opened this issue Dec 20, 2024 · 3 comments
Labels: enhancement (New feature or request)

@leventov

Description

Part of Roadmap: #2899

Related to refresh of the app itself on file changes: #1511, #2675

(Creating this issue for now so that I can refer and link to it from another issue.)

Suggested solution

Perhaps not a universal solution, but in some use cases, when these files are tracked in Git, piggy-backing on Git's own file change tracking (which is itself highly optimised) would be nice, e.g. via pygit2.Diff.
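A minimal sketch of that idea, assuming the watched files live in a Git repository and pygit2 is installed (the watched path below is hypothetical):

import pygit2

repo = pygit2.Repository(".")  # assumes the notebook's data files live in a repo at cwd

# Diff the index against the working directory; this reuses Git's own
# optimised change detection instead of hashing every file ourselves.
diff = repo.diff()
changed = {delta.new_file.path for delta in diff.deltas}

if "data/table.parquet" in changed:  # hypothetical watched file
    ...  # trigger re-run of the cells that read this file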

Alternative

No response

Additional context

No response

@leventov leventov added the enhancement New feature or request label Dec 20, 2024
@leventov leventov changed the title Reactively respond to non-python file changes Reactively respond to non-python file changes (e.g. parquet, csv, json) Dec 20, 2024
@leventov
Author

Initial thoughts, high uncertainty:

I think this should be a context manager API, rather than something like automatically picking up builtin open() or other "known" file-opening library calls in the cell code. For example:

with mo.watched_file("mymodel.pt") as f:
    # note: we don't try to "pick up" torch.load() calls automatically,
    # relying on the user to do the more explicit thing, i.e., use this context block.
    m = torch.load(f)
    ...

The file name should probably(?) be a string literal like "mymodel.pt" rather than dynamic (e.g., watched_file(filename_coming_from_another_cell_as_a_variable)), so that the need to re-run the cell can be determined by a static parsing pass over the code.
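To illustrate why the literal matters: a static pass over the cell's source can then discover watched files without executing the cell. A rough sketch (mo.watched_file is the hypothetical API from above):

import ast

src = 'with mo.watched_file("mymodel.pt") as f:\n    m = None'
watched = [
    node.args[0].value
    for node in ast.walk(ast.parse(src))
    if isinstance(node, ast.Call)
    and isinstance(node.func, ast.Attribute)
    and node.func.attr == "watched_file"
    and node.args
    and isinstance(node.args[0], ast.Constant)  # only string literals are discoverable
]
print(watched)  # ['mymodel.pt']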

Note: these watched_file() contexts could sit inside a persistent_cache context manager, optionally within the same with statement, like with mo.persistent_cache(...) as _cache, mo.watched_file(...) as f:, or within a cached cell (whether persistently cached or not); then changes to the file should feed into cache invalidation: #3271.
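For instance (mo.persistent_cache is existing marimo API; mo.watched_file is still the hypothetical API proposed here):

import marimo as mo
import pandas as pd

with mo.persistent_cache("features") as _cache, mo.watched_file("data.parquet") as f:
    # a change to data.parquet should invalidate the cached result
    df = pd.read_parquet(f)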

It seems that making these files writable within the same context block, even if theoretically possible, would make the whole thing way too hard to reason about, even from the user's perspective. So it should probably be prohibited, and not just at the syntactic level: we should probably use https://github.com/samuelcolvin/watchfiles and raise an exception on exiting the mo.watched_file() context if there were changes to the file between context entry and exit.
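A rough sketch of that enforcement, assuming watchfiles is available (watched_file_guard is a made-up name, not a marimo API):

import threading
from contextlib import contextmanager

from watchfiles import watch

@contextmanager
def watched_file_guard(path):
    changed = threading.Event()
    stop = threading.Event()

    def _watch():
        # watchfiles.watch blocks, yielding batches of (Change, path) tuples;
        # stop_event lets us shut the watcher down on context exit.
        for _batch in watch(path, stop_event=stop):
            changed.set()

    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    try:
        yield path
    finally:
        stop.set()
        thread.join()
        if changed.is_set():
            raise RuntimeError(f"{path} was modified inside the watched block")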

Not sure: what to do about file writes within other cells? Should they also be explicitly marked with a context block? Perhaps that would be necessary to build a static DAG where "cells connected via file writes/reads" semantically coincides with "cells connected in the 'normal' way via variables" (otherwise we risk circular DAGs). But to avoid user omissions of that, which would be way too easy, there should perhaps be a global watch on all the watched files within the notebook; if a change comes from the notebook process itself, it would raise an error. But how to reliably detect all file openings within the process? I don't know yet.

To explore: whether using https://github.com/samuelcolvin/watchfiles within the context manager alone would be sufficient to prevent races like https://git-scm.com/docs/racy-git, or whether "extra tricks" along the lines described in that document would also be needed (at least in the case where we care not just about the watched file's mtime, but about its contents as well).

Content-addressable

The "cached watch" may need to care about the contents rather than mtime alone (the API for indicating this in the user's cell code is TBD): e.g., there may be a noisy external process that the user doesn't control and cannot easily patch, and that frequently "touches" the file without changing its contents. FWIW, when another cell within the notebook writes to the file, we could provide an extra wrapper that avoids touching the file: first write to a temp file, then replace the original only if the content hash differs, along the lines of how it's already done in the cache or in Git itself.
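A sketch of that wrapper (write_if_changed is a hypothetical helper, not an existing marimo API):

import hashlib
import os
import tempfile

def write_if_changed(path: str, data: bytes) -> bool:
    # Write `data` to `path` atomically, skipping the write (and thus the
    # mtime bump) when the content hash is unchanged. Returns True if written.
    new_hash = hashlib.sha256(data).hexdigest()
    try:
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() == new_hash:
                return False  # identical content: don't touch the file
    except FileNotFoundError:
        pass
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows
    return True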

Note: this is orthogonal to #3271: the cache may or may not be fine with mtimes alone; a "simple runtime watch" (with the cell/block simply @cache-d, not persistently, or not cached at all) may likewise be fine, or not, with mtimes alone.

I don't currently see a way to avoid races without doing one of the following upon entering the watched_file context manager:

  • Read the file into memory in full, along with computing its hash, and then serve the file "from memory", i.e., as just a single bytearray.
  • Add the file to a Git index: either a "native" one, if we discover the user already tracks this file in one of their own Git repositories, or, if not, a special-purpose submodule that marimo would maintain specifically for this purpose.

The second option is needed if the file is too big to read into memory in full at once. However, it has the downside that the file content will be copied in full. Also, if the file is bigger than core.bigFileThreshold (default is 512 MiB), my understanding is (not 100% sure, need to double-check) that Git itself will basically ignore its own stored hash and degrade to "if mtime is newer, the file is considered updated", which would render the whole exercise pointless.
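A sketch of the first option (snapshot_file is a hypothetical name): the cell reads from an in-memory snapshot, so the bytes it consumes and the hash used for invalidation are guaranteed to be consistent regardless of concurrent on-disk writes:

import hashlib
import io
from contextlib import contextmanager

@contextmanager
def snapshot_file(path):
    with open(path, "rb") as f:
        data = f.read()  # full read: only viable for reasonably small files
    digest = hashlib.sha256(data).hexdigest()
    # The cell reads from this BytesIO, never from disk, so a racy on-disk
    # write after context entry cannot desynchronise contents and hash.
    yield io.BytesIO(data), digest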

@leventov
Author

TODO: at least check that the chosen approach will be forward-compatible with the following systems for data sharing/management, in the order of priority as I see it:

Note: I tried to choose data management systems that are simple and local-files-first (even if only for metadata, not the actual beefy files, which are downloaded later). For the plethora of non-files-first systems, such as Data Catalogs, Hugging Face, etc., the alternatives are either:

  1. Write a dedicated cell that fetches the latest data(set) version via the API. In effect, this cell would necessarily be non-deterministic, but the user should still be able to "force rerun" it even if it is cached.
  2. Create a separate mechanism/function within Marimo that would be familiar with certain data catalog / remote dataset versioning metadata APIs and, if configured a la with mo.watched_hf_dataset("https://huggingface.co/datasets/datonic/world_development_indicators") as _x: ..., would do a metadata fetch and trigger the run on update if needed. But that would be a separate feature.

@davidgasquez commented Dec 24, 2024

Heya @leventov! Sharing a bit of context around Datadex in case it helps. Datadex itself is not a data management system, but "opinionated glue" around some of the tools used in data these days. I don't think it fits alongside the DVCs or Git Annexes/LFSs out there.

Over the years I compiled a list of "Data Package Managers" which might be helpful in your search!

PS: Making a small (not very informed) suggestion.

Create a separate mechanism/function within Marimo that would be familiar with certain data catalog / remote dataset versioning metadata APIs, and if configured a la with mo.watched_hf_dataset("https://huggingface.co/datasets/datonic/world_development_indicators") as _x: ... and would do a metadata fetch and trigger the run on update if needed. But, that would be a separate feature.

Since HF has a working filesystem URI in Python, it would be great to make the API something like watched_dataset("hf://...").
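For what it's worth, a sketch of the metadata-fetch idea behind such a URI, using huggingface_hub (the watched_dataset wrapper itself remains hypothetical):

from huggingface_hub import HfApi

def dataset_revision(repo_id: str) -> str:
    # Commit sha of the dataset repo's default branch; a changed sha means
    # the dataset was updated and dependent cells should be re-run.
    return HfApi().dataset_info(repo_id).sha

rev = dataset_revision("datonic/world_development_indicators")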
