
Chunk dependency tracing: Re-compute only necessary chunks in Cubed plan #645

Open
TomNicholas opened this issue Dec 13, 2024 · 8 comments
Labels
enhancement New feature or request icechunk 🧊

Comments

@TomNicholas
Member

Concept

Icechunk solves the problem of handling incremental updates to Zarr stores: users can transparently track changes to datasets at the chunk level. Real-world data pipelines often compute some aggregate result from one or more entire input datasets, but currently, if you change just one chunk in a store, getting the new result likely means recomputing over the entire dataset. This is potentially massively wasteful if only part of the new result actually depends on chunks that changed since the last version of the input dataset.

Can we use Cubed to automatically re-compute only the output chunks that actually depend on the updated input chunks?

This would be an extremely powerful optimization - in the extreme case the re-computed result might only depend on 1 or 2 updated chunks in the original dataset, so only 1 or 2 chunks need to be re-computed instead of re-computing the entire thing.

Cubed potentially has enough information in the plan to trace back up from the desired result all the way to which input chunks are actually necessary.

cc @rabernat (whose idea this was) @tomwhite @sharkinsspatial

@TomNicholas TomNicholas added enhancement New feature or request icechunk 🧊 labels Dec 13, 2024
@dcherian

Hehe, funnily enough, it seems like you could reverse the graph so inputs become outputs, apply a selection to the inputs to isolate the chunks that have changed, call cull, then reverse the graph back.
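The reverse-select-cull-reverse idea above can be sketched on a toy task graph. This is a hedged illustration with made-up node names, not Cubed's actual plan representation; on a DAG, "reverse, select the changed inputs, cull, reverse back" collapses to "everything reachable downstream of the changed inputs is potentially stale":

```python
# Toy stand-in for a chunk-level plan DAG: each task maps to the tasks
# that consume its output. All node names here are hypothetical.
deps = {
    "in-0": ["mid-0"], "in-1": ["mid-0"],
    "in-2": ["mid-1"], "in-3": ["mid-1"],
    "mid-0": ["out-0"], "mid-1": ["out-1"],
    "out-0": [], "out-1": [],
}

def downstream(graph, start):
    """All tasks reachable from `start`, i.e. potentially stale tasks."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

changed = "in-2"  # the one input chunk the commit touched
stale = downstream(deps, changed)
# Output chunks are the sinks of the DAG; only stale ones need recomputing.
stale_outputs = {n for n in stale if not deps[n]}
print(sorted(stale_outputs))  # ['out-1']
```

Here `out-0` is untouched, so only `out-1` would be re-computed.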

@TomNicholas
Member Author

TomNicholas commented Feb 1, 2025

I'm still struggling to imagine what the API for using this idea in Cubed would look like.

Icechunk has the concept of a ChangeSet, which contains more than enough information to know which input chunks have been updated by a commit. But there's no Python exposure of this Rust struct.

And how do you pass it in? Xarray doesn't have a concept of "re-run this".

It's really a culling optimization that should be applied to the Plan just before execution, so it would make sense to pass it into the .compute() method. But at that point in the user code, xarray is no longer aware of where it got the lazy arrays that it originally opened, so I don't see how it can know that the icechunkstore it needs at the end here

latest_changeset = icechunkstore.changesets[-1]
derived_ds.compute(changeset=latest_changeset)

is the same as the one that it read the data from at the start

ds = xr.open_zarr(icechunkstore)

@tomwhite
Member

tomwhite commented Feb 3, 2025

I think there are a couple of cases here:

  1. Computing a given region of a derived dataset. (This is Support region(s) in to_zarr and store #642)
  2. Updating a derived dataset given a changeset for the input dataset. (This issue.)

Case 2 can be implemented by tracing all the output keys transitively from the input to the output, and then using Case 1. Cubed has a key_function for blockwise operations that maps an output key to a set of input keys, so to do this we'd also have to provide an inverse key function that can go in the other direction (input to output).
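The forward/inverse key function pair can be sketched for a 1D windowed operation where each output block reads its own input block plus one neighbour on each side. The function names and signatures below are illustrative assumptions, not Cubed's actual API:

```python
NBLOCKS = 5  # hypothetical number of blocks along the axis

def key_function(out_key):
    """Blockwise-style forward map: output block -> input blocks it reads.
    Here output block i reads input blocks i-1, i, i+1 (clipped at edges)."""
    (i,) = out_key
    return {(j,) for j in (i - 1, i, i + 1) if 0 <= j < NBLOCKS}

def inverse_key_function(in_key):
    """The inverse direction needed for Case 2: input block -> the output
    blocks whose values it influences. Brute-forced here for clarity."""
    return {(i,) for i in range(NBLOCKS) if in_key in key_function((i,))}

# If input block 2 changed, only these output blocks must be recomputed:
print(sorted(inverse_key_function((2,))))  # [(1,), (2,), (3,)]
```

For simple blockwise patterns like this the inverse could often be written in closed form rather than by scanning all output keys.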

I wonder if Case 1 is sufficient for many needs? For example, if you want to update a derived dataset with the latest day's data, then you already know the region you want to update. It would be good to have a concrete example to talk about, perhaps with some code to sketch the idea.

@TomNicholas
Member Author

I think you're right @tomwhite. The ChangeSet tells you the chunks that changed, from which you can derive the region that changed.

Note there could be multiple input datasets in general.

It would be good to have a concrete example to talk about, perhaps with some code to sketch the idea.

One simple case is recomputing some aggregated quantity every time new daily data is appended to the end. This is not changing an existing region but adding a new one.

The simplest case might look something like a 1D chunked timeseries, where you want to know the windowed running mean of the signal. We should be able to efficiently compute the new running mean for the cases:

  • that any region of the timeseries is replaced with corrected historical data,
  • that new daily data is appended on the end.
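The first case (a replaced historical region) can be sketched with NumPy. This is a minimal illustration, assuming a centred 3-point running mean over a series stored in chunks of 4: replacing one chunk only perturbs the output within that chunk plus a halo of `window // 2` points on either side, which is exactly the set of output chunks a dependency trace would select.

```python
import numpy as np

chunk, window = 4, 3
x = np.arange(16, dtype=float)  # toy 1D timeseries, 4 chunks of 4

def running_mean(a):
    # Centred 3-point running mean (edge values use zero-padding).
    return np.convolve(a, np.ones(window) / window, mode="same")

full = running_mean(x)

# Replace chunk 1 (elements 4..7) with "corrected historical data".
x2 = x.copy()
x2[4:8] += 100.0
new_full = running_mean(x2)

# Only indices in [chunk_start - halo, chunk_end + halo) can differ.
halo = window // 2
lo, hi = 4 - halo, 8 + halo
changed = np.flatnonzero(full != new_full)
print(changed)  # indices 3..8, all within [lo, hi)
```

The appending case is similar: the new output chunks are the appended region widened by the same halo.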

@TomNicholas
Member Author

TomNicholas commented Feb 4, 2025

If you wanted a real-world example then the ENSO state (i.e. whether or not we're in El Niño or La Niña) is interesting. IIUC you basically just average the SST over a particular region of the Pacific Ocean, and the deviation of that average from the running mean is the Niño SST index. Could make for a cool example notebook. (I'm sure @dcherian can chime in and tell me I've butchered that explanation though 😁)

https://climatedataguide.ucar.edu/climate-data/nino-sst-indices-nino-12-3-34-4-oni-and-tni
https://iri.columbia.edu/our-expertise/climate/forecasts/enso/current/
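The index calculation described above could be sketched on synthetic data. This is a rough, hedged sketch only: the box indices, smoothing window, and anomaly definition are placeholders for illustration, not a validated Niño 3.4 recipe (see the UCAR link for the real definitions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic monthly SST field: (time, lat, lon), 10 years of data.
sst = rng.normal(27.0, 0.5, size=(120, 10, 20))

# Area-average SST over a hypothetical Nino 3.4 box.
regional_mean = sst[:, 3:7, 5:15].mean(axis=(1, 2))

# Anomaly relative to a 5-month running mean of the regional average.
kernel = np.ones(5) / 5
running = np.convolve(regional_mean, kernel, mode="same")
index = regional_mean - running
print(index.shape)  # one index value per month
```

With the data in Icechunk, appending a new month (or correcting a historical region) would only dirty the output chunks whose windows overlap the changed input chunks.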

@tomwhite
Member

tomwhite commented Feb 4, 2025

@TomNicholas that would make a great example. Who could turn it into a notebook? 😄

@TomNicholas
Member Author

TomNicholas commented Feb 12, 2025

I wonder if we could build an executor for Cubed which translates the bounded-memory primitive array operations into these differential operators under the hood, then re-triggers the computation:

https://timelydataflow.github.io/differential-dataflow/chapter_2/chapter_2.html

EDIT: Or prototype with this https://github.com/brurucy/pydbsp

@tomwhite
Member

BTW I sketched what block tracing might look like (for #642 - from output to input) here: https://github.com/cubed-dev/cubed/tree/region. (That doesn't cover any of the differential stuff.)
