-
Notifications
You must be signed in to change notification settings - Fork 329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reactively respond to non-python file changes (e.g. parquet, csv, json) #3258
Comments
Initial thoughts, high uncertainty: I think this should be a context manager API rather something like automatic picking up of builtin with mo.watched_file("mymodel.pt") as f:
# note: we don't try to "pick up" torch.load() calls automatically,
# relying on the user to do more explicit thing, i.e., this context block.
m = torch.load(f)
.... The file name should probably(?) be a string literal like "mymodel.pt" rather than dynamic ( Note: these It seems that making these files writable within the same context block, even if theoretically possible, makes the whole thing way too hard to reason about even from the user perspective. So, it should probably be prohibited, and not just on syntactic level: should probably use https://github.com/samuelcolvin/watchfiles and raise an exception on existing from the Note sure: what to do about file writes within other cells? should they also be explicitly marked with a context block? Perhaps, that would be necessary to make a static DAG where "cell connected via file writes/reads" semantically completely coincides with "cells connected in a "normal" way via variables (otherwise, we risk circular DAGs). But to avoid user omissions of that, which would be way too easy, there perhaps should be a global watch on all the watched files within the notebook, and if there is a change from the notebook process itself, it would raise an error. But how to reliably detect all file openings within the process? I don't know yet. To explore: whether using https://github.com/samuelcolvin/watchfiles within the context manager alone would be sufficient to prevent races like https://git-scm.com/docs/racy-git, or "extra tricks" along the lines of https://git-scm.com/docs/racy-git would also be needed (at least, in the case where we care not just about watched file's mtime, but its contents as well). Content-addressibleWhen the "cached watch" does care about the contents rather than mtime alone (the API how to indicate that in user's cell code is TBD), e.g., if there is a noisy external process that the user doesn't have control over and cannot easily patch, and that process frequently "touches" the file without changing its contents. FWIW: when another cell within the notebook writes to the file, we may provide an extra wrapper that avoids touching the file via first writing to a temp file and then replacing it if it has a different content_hash, along the lines of how it's already done in cache or in Git itself. Note: this is orthogonal with #3271: cache may or may not be fine with mtimes alone; "simple runtime watch" with cell/blocked simply I don't currently see a way to avoid races without, upon entering the
The second option is needed if the file is too big to read in memory in full at once. However, the second option has a downside because the file content will be copied in full. Also, if file is bigger than core.bigFileThreshold (default is 500 MB), my understanding is that (but not 100% sure, need to double check) Git itself will basically ignore its own stored hash and will degrade to "if mtime is higher than file is considered updated", which draws the whole exercise pointless. |
TODO: at least check that the chosen approach will be forward-compatible with the following systems for data sharing/management, in the order of priority as I see it:
Note: I tried to choose data management systems that are simple local files-first (even if for metadata only, not the actual beefy files for downloading later). For the plethora of non-files-first systems, such as Data Catalogs, Hugging Face, etc., the alternatives are either:
|
Heya @leventov! Sharing a bit of context around Datadex in case it helps. Datadex itself is not a data management system, but an "opinionated glue" around some of the tools used in data these days. I don't think it fits with the DVCs or Git Annexes/LFSs around. Over the years I compiled a list of "Data Package Managers" which might be helpful in your search! PS: Making a small (not very informed) suggestion.
Since |
Description
Part of Roadmap: #2899
Related to refresh of the app itself on file changes: #1511, #2675
(Creating this issue for now so that I can refer and link to it from another issue.)
Suggested solution
Perhaps not a universal solution, but in some use cases when these files are tracked in Git, piggy-backing Git's own file change tracking (which itself is highly optimised) would be nice. Such as using pygit2.Diff.
Alternative
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: