Chunk dependency tracing: Re-compute only necessary chunks in Cubed plan #645
Comments
Hehe, funnily enough it seems like you could reverse the graph so inputs become outputs, apply a selection to the inputs to isolate the chunks that have changed, call cull, then reverse the graph back.
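A minimal sketch of that reverse-then-cull idea, using networkx and made-up chunk keys (Cubed's real plan is a DAG of operations rather than of individual chunk keys, so this only shows the shape of the trick, not its actual data structures):

```python
# Toy sketch of "reverse, cull, reverse" using networkx.
import networkx as nx

def cull(graph: nx.DiGraph, targets: set) -> nx.DiGraph:
    """Keep only the targets and everything needed to compute them."""
    keep = set(targets)
    for t in targets:
        keep |= nx.ancestors(graph, t)
    return graph.subgraph(keep).copy()

# chunk-level dependencies: edges point from producer chunk to consumer chunk
g = nx.DiGraph()
g.add_edges_from([
    ("input-0", "mean-0"), ("input-1", "mean-0"),   # mean-0 reads input chunks 0 and 1
    ("input-2", "mean-1"), ("input-3", "mean-1"),   # mean-1 reads input chunks 2 and 3
])

changed_inputs = {"input-2"}

# reverse so inputs become outputs, cull down to what the changed chunks feed,
# then reverse back to recover a forward graph of only the affected work
affected = cull(g.reverse(), changed_inputs).reverse()
print(sorted(affected.nodes))  # ['input-2', 'mean-1'] -- mean-0 is untouched
```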
I'm still struggling to imagine what the API for using this idea in Cubed would look like. Icechunk has the concept of a changeset, but how do you pass it in? Xarray doesn't have a concept of like "re-run this". It's really a culling optimization that should be applied at compute time, something like

```python
latest_changeset = icechunkstore.changesets[-1]
derived_ds.compute(changeset=latest_changeset)
```

assuming the store the changeset comes from is the same as the one that it read the data from at the start (`ds = xr.open_zarr(icechunkstore)`).
I think there are a couple of cases here:

1. You already know the region of the output that needs to be recomputed.
2. You only know which input chunks have changed, and want to find the output chunks that depend on them.

Case 2 can be implemented by tracing all the output keys transitively from the input to the output, and then using Case 1.

I wonder if Case 1 is sufficient for many needs? For example, if you want to update a derived dataset with the latest day's data then you already know the region you want to update. It would be good to have a concrete example to talk about, perhaps with some code to sketch the idea.
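A rough sketch of Case 2 along those lines, with hypothetical per-operation block dependency functions standing in for Cubed's real key relationships: propagate the changed input block indices forward through each operation, and the result is the output region to hand to Case 1.

```python
# Sketch of Case 2: propagate changed *input* block indices forward through a
# pipeline of per-operation block dependency functions, ending with the set of
# *output* blocks to recompute. The dependency functions are illustrative only.

def elementwise(block, nblocks):
    # an elementwise op: output block i depends only on input block i
    return {block}

def overlap_1(block, nblocks):
    # a windowed op with one block of overlap on each side
    return {b for b in (block - 1, block, block + 1) if 0 <= b < nblocks}

def trace_forward(changed_blocks, ops, nblocks):
    affected = set(changed_blocks)
    for op in ops:
        affected = {out for b in affected for out in op(b, nblocks)}
    return affected

ops = [elementwise, overlap_1]  # e.g. an arithmetic op followed by a rolling mean
print(trace_forward({7}, ops, nblocks=10))  # {6, 7, 8}: only three output blocks to recompute
```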
I think you're right @tomwhite. Note there could be multiple input datasets in general.
One simple case is recomputing some aggregated quantity every time new daily data is appended to the end. This is not changing an existing region but adding a new one. The simplest case might look something like a 1D chunked timeseries, where you want to know the windowed running mean of the signal. We should be able to efficiently compute the new running mean for the cases where:

- new data has been appended to the end of the timeseries, or
- values inside an existing chunk have been changed.
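To make the running-mean case concrete, here is a plain numpy illustration (no Cubed or Icechunk involved, chunk sizes made up) of why appending a chunk only invalidates the trailing windows:

```python
import numpy as np

window = 5
old = np.arange(20, dtype=float)                               # existing timeseries, say 2 chunks of 10
new = np.concatenate([old, np.arange(20, 30, dtype=float)])    # one appended chunk of daily data

def running_mean(x, w):
    return np.convolve(x, np.ones(w) / w, mode="valid")

rm_old = running_mean(old, window)
rm_new = running_mean(new, window)

# all previously computed windows are unchanged...
assert np.allclose(rm_new[: len(rm_old)], rm_old)
# ...only the windows touching the appended chunk need computing
print(f"{len(rm_new) - len(rm_old)} new windows to compute out of {len(rm_new)}")
```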
If you wanted a real-world example then the ENSO state (i.e. whether or not we're in El Niño or La Niña) is interesting. IIUC you basically just average the SST over a particular region of the Pacific Ocean, and the deviation of that average from the running mean is the Niño SST index. Could make for a cool example notebook. (I'm sure @dcherian can chime in and tell me I've butchered that explanation though 😁)

https://climatedataguide.ucar.edu/climate-data/nino-sst-indices-nino-12-3-34-4-oni-and-tni
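A rough xarray sketch of that index calculation; the store URL, variable name (`sst`), coordinate names, and window length are assumptions, and the anomaly is taken against a running mean as described above rather than the official climatology-based definition:

```python
import xarray as xr

ds = xr.open_zarr("s3://hypothetical-bucket/sst.zarr")  # hypothetical store

# average SST over the Niño 3.4 box (5S-5N, 170W-120W)
nino34_sst = (
    ds.sst
    .sel(lat=slice(-5, 5), lon=slice(190, 240))  # assumes ascending 0-360 longitudes
    .mean(dim=["lat", "lon"])
)

# deviation of that box average from a running mean gives the index
# (window length is arbitrary here, purely for illustration)
index = nino34_sst - nino34_sst.rolling(time=150, center=True).mean()

# appending a day of SST data would only touch the trailing end of `index`,
# which is exactly the kind of update that chunk tracing could keep cheap
index.compute()
```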
@TomNicholas that would make a great example. Who could turn it into a notebook? 😄
I wonder if we could build an executor for cubed which translated the bounded-memory primitive array operations into these differential operators under the hood, then re-trigger it: https://timelydataflow.github.io/differential-dataflow/chapter_2/chapter_2.html

EDIT: Or prototype with this: https://github.com/brurucy/pydbsp
BTW I sketched what block tracing might look like (for #642 - from output to input) here: https://github.com/cubed-dev/cubed/tree/region. (That doesn't cover any of the differential stuff.)
Concept
Icechunk solves the problem of handling incremental updates to Zarr stores, meaning that users can transparently track changes to datasets at the chunk level. Real-world data pipelines often involve computing some aggregate result from one or more entire input datasets, but currently if you change just one chunk in a store then to get the new result you likely have to recompute using the entire dataset. This is potentially massively wasteful if only part of the new result actually depends on chunks that were changed since the last version of the input dataset.
Can we use Cubed to automatically re-compute only the output chunks that actually depend on the updated input chunks?
This would be an extremely powerful optimization - in the pathological case the differences in the re-computed result might only depend on 1 or 2 updated chunks in the original dataset, so only 1 or 2 chunks need to be re-computed instead of re-computing the entire thing.
Cubed potentially has enough information in the plan to trace back up from the desired result all the way to which input chunks are actually necessary.
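As an illustration of that backward trace, here is a toy sketch over a hypothetical chunk-level plan (a plain dict mapping each chunk key to the chunk keys it reads), walking from the desired output chunk back to the input chunks it actually needs:

```python
# Key names and the plan structure are illustrative only, not Cubed's plan.
plan = {
    ("sum", 0): [("mul", 0), ("mul", 1)],
    ("mul", 0): [("input", 0)],
    ("mul", 1): [("input", 1)],
    ("mul", 2): [("input", 2)],
}

def required_chunks(targets):
    """Walk the plan backwards from the target chunks to everything they need."""
    needed, stack = set(), list(targets)
    while stack:
        key = stack.pop()
        if key in needed:
            continue
        needed.add(key)
        stack.extend(plan.get(key, []))
    return needed

# to (re)compute ("sum", 0) we only need input chunks 0 and 1;
# input chunk 2 and ("mul", 2) can be skipped entirely
print(required_chunks([("sum", 0)]))
```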
cc @rabernat (whose idea this was) @tomwhite @sharkinsspatial