
Chunk dependency tracing: Re-compute only necessary chunks in Cubed plan #645

Open
TomNicholas opened this issue Dec 13, 2024 · 8 comments
Labels
enhancement New feature or request icechunk 🧊

Comments

@TomNicholas
Member

Concept

Icechunk solves the problem of handling incremental updates to Zarr stores: users can transparently track changes to datasets at the chunk level. Real-world data pipelines often compute some aggregate result from one or more entire input datasets, but currently, if you change just one chunk in a store, getting the new result likely means recomputing over the entire dataset. This is potentially massively wasteful if only part of the new result actually depends on chunks that changed since the last version of the input dataset.

Can we use Cubed to automatically re-compute only the output chunks that actually depend on the updated input chunks?

This would be an extremely powerful optimization - in the extreme case the re-computed result might only depend on 1 or 2 updated chunks in the original dataset, so only 1 or 2 chunks need to be re-computed instead of re-computing the entire thing.

Cubed potentially has enough information in the plan to trace back up from the desired result all the way to which input chunks are actually necessary.

cc @rabernat (whose idea this was) @tomwhite @sharkinsspatial

@TomNicholas TomNicholas added enhancement New feature or request icechunk 🧊 labels Dec 13, 2024
@dcherian

Hehe, funnily enough, it seems like you could reverse the graph so inputs become outputs, apply a selection to the inputs to isolate the chunks that have changed, call cull, then reverse the graph back.
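The reverse-select-cull-reverse idea above can be sketched on a toy task graph. This is a hedged illustration with made-up node names, not Cubed's actual plan representation; on a DAG, "reverse, select the changed inputs, cull, reverse back" collapses to "everything reachable downstream of the changed inputs is potentially stale":

```python
# Toy stand-in for a chunk-level plan DAG: each task maps to the tasks
# that consume its output. All node names here are hypothetical.
deps = {
    "in-0": ["mid-0"], "in-1": ["mid-0"],
    "in-2": ["mid-1"], "in-3": ["mid-1"],
    "mid-0": ["out-0"], "mid-1": ["out-1"],
    "out-0": [], "out-1": [],
}

def downstream(graph, start):
    """All tasks reachable from `start`, i.e. potentially stale tasks."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

changed = "in-2"  # the one input chunk the commit touched
stale = downstream(deps, changed)
# Output chunks are the sinks of the DAG; only stale ones need recomputing.
stale_outputs = {n for n in stale if not deps[n]}
print(sorted(stale_outputs))  # ['out-1']
```

Here `out-0` is untouched, so only `out-1` would be re-computed.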

@TomNicholas
Member Author

TomNicholas commented Feb 1, 2025

I'm still struggling to imagine what the API for using this idea in Cubed would look like.

Icechunk has the concept of a ChangeSet, which contains more than enough information to know which input chunks have been updated by a commit. But there's no Python exposure of this Rust struct.

And how do you pass it in? Xarray doesn't have a concept of "re-run this".

It's really a culling optimization that should be applied to the Plan just before execution, so it would make sense to pass it into the .compute() method. But at that point in the user code, xarray is no longer aware of where it got the lazy arrays that it originally opened, so I don't see how it can know that the icechunkstore it needs at the end here

latest_changeset = icechunkstore.changesets[-1]
derived_ds.compute(changeset=latest_changeset)

is the same as the one that it read the data from at the start

ds = xr.open_zarr(icechunkstore)

@tomwhite
Member

tomwhite commented Feb 3, 2025

I think there are a couple of cases here:

  1. Computing a given region of a derived dataset. (This is Support region(s) in to_zarr and store #642)
  2. Updating a derived dataset given a changeset for the input dataset. (This issue.)

Case 2 can be implemented by tracing all the output keys transitively from the input to the output, and then using Case 1. Cubed has a key_function for blockwise operations that maps an output key to a set of input keys, so to do this we'd also have to provide an inverse key function that can go in the other direction (input to output).
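The forward/inverse key function pair can be sketched for a 1D windowed operation where each output block reads its own input block plus one neighbour on each side. The function names and signatures below are illustrative assumptions, not Cubed's actual API:

```python
NBLOCKS = 5  # hypothetical number of blocks along the axis

def key_function(out_key):
    """Blockwise-style forward map: output block -> input blocks it reads.
    Here output block i reads input blocks i-1, i, i+1 (clipped at edges)."""
    (i,) = out_key
    return {(j,) for j in (i - 1, i, i + 1) if 0 <= j < NBLOCKS}

def inverse_key_function(in_key):
    """The inverse direction needed for Case 2: input block -> the output
    blocks whose values it influences. Brute-forced here for clarity."""
    return {(i,) for i in range(NBLOCKS) if in_key in key_function((i,))}

# If input block 2 changed, only these output blocks must be recomputed:
print(sorted(inverse_key_function((2,))))  # [(1,), (2,), (3,)]
```

For simple blockwise patterns like this the inverse could often be written in closed form rather than by scanning all output keys.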

I wonder if Case 1 is sufficient for many needs? For example, if you want to update a derived dataset with the latest day's data, then you already know the region you want to update. It would be good to have a concrete example to talk about, perhaps with some code to sketch the idea.

@TomNicholas
Member Author

I think you're right @tomwhite. The ChangeSet tells you the chunks that changed, from which you can derive the region that changed.

Note there could be multiple input datasets in general.

It would be good to have a concrete example to talk about, perhaps with some code to sketch the idea.

One simple case is recomputing some aggregated quantity every time new daily data is appended to the end. This is not changing an existing region but adding a new one.

The simplest case might look something like a 1D chunked timeseries, where you want to know the windowed running mean of the signal. We should be able to efficiently compute the new running mean for the cases:

  • that any region of the timeseries is replaced with corrected historical data,
  • that new daily data is appended on the end.
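The first case (a replaced historical region) can be sketched with NumPy. This is a minimal illustration, assuming a centred 3-point running mean over a series stored in chunks of 4: replacing one chunk only perturbs the output within that chunk plus a halo of `window // 2` points on either side, which is exactly the set of output chunks a dependency trace would select.

```python
import numpy as np

chunk, window = 4, 3
x = np.arange(16, dtype=float)  # toy 1D timeseries, 4 chunks of 4

def running_mean(a):
    # Centred 3-point running mean (edge values use zero-padding).
    return np.convolve(a, np.ones(window) / window, mode="same")

full = running_mean(x)

# Replace chunk 1 (elements 4..7) with "corrected historical data".
x2 = x.copy()
x2[4:8] += 100.0
new_full = running_mean(x2)

# Only indices in [chunk_start - halo, chunk_end + halo) can differ.
halo = window // 2
lo, hi = 4 - halo, 8 + halo
changed = np.flatnonzero(full != new_full)
print(changed)  # indices 3..8, all within [lo, hi)
```

The appending case is similar: the new output chunks are the appended region widened by the same halo.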

@TomNicholas
Member Author

TomNicholas commented Feb 4, 2025

If you wanted a real-world example then the ENSO state (i.e. whether or not we're in El Niño or La Niña) is interesting. IIUC you basically just average the SST over a particular region of the Pacific Ocean, and the deviation of that average from the running mean is the Niño SST index. Could make for a cool example notebook. (I'm sure @dcherian can chime in and tell me I've butchered that explanation though 😁)

https://climatedataguide.ucar.edu/climate-data/nino-sst-indices-nino-12-3-34-4-oni-and-tni
https://iri.columbia.edu/our-expertise/climate/forecasts/enso/current/
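The index calculation described above could be sketched on synthetic data. This is a rough, hedged sketch only: the box indices, smoothing window, and anomaly definition are placeholders for illustration, not a validated Niño 3.4 recipe (see the UCAR link for the real definitions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic monthly SST field: (time, lat, lon), 10 years of data.
sst = rng.normal(27.0, 0.5, size=(120, 10, 20))

# Area-average SST over a hypothetical Nino 3.4 box.
regional_mean = sst[:, 3:7, 5:15].mean(axis=(1, 2))

# Anomaly relative to a 5-month running mean of the regional average.
kernel = np.ones(5) / 5
running = np.convolve(regional_mean, kernel, mode="same")
index = regional_mean - running
print(index.shape)  # one index value per month
```

With the data in Icechunk, appending a new month (or correcting a historical region) would only dirty the output chunks whose windows overlap the changed input chunks.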

@tomwhite
Member

tomwhite commented Feb 4, 2025

@TomNicholas that would make a great example. Who could turn it into a notebook? 😄

@TomNicholas
Member Author

TomNicholas commented Feb 12, 2025

I wonder if we could build an executor for Cubed which translates the bounded-memory primitive array operations into these differential operators under the hood, then re-triggers the computation:

https://timelydataflow.github.io/differential-dataflow/chapter_2/chapter_2.html

EDIT: Or prototype with this https://github.com/brurucy/pydbsp

@tomwhite
Member

BTW I sketched what block tracing might look like (for #642 - from output to input) here: https://github.com/cubed-dev/cubed/tree/region. (That doesn't cover any of the differential stuff.)
