Add `from_pickle` option to `FilePattern` #205

Conversation
I appreciate everyone's work on this. I think that @jbusecke's CMIP6 preprocessing work is an interesting possibility for Pangeo Forge. However, I don't think this is the right approach: we don't want to be dumping dask graphs to pickles and then reading those as "inputs", because doing so loses the provenance chain between the true inputs (i.e. the CMIP6 files in the cloud) and the datasets that Pangeo Forge will produce. Ideally we would want those inputs exposed via a `FilePattern`.

In order to move forward with this, it would be very valuable if @jbusecke could explain what happens inside his preprocessing step.
I think that is a good idea. I will try to come up with a schematic that also illustrates our thinking here.
Just text would also be fine. I'm thinking something like …, etc.
Ah ok, that is indeed faster. So my basic workflow has two stages:

**1. Reading, preprocessing, and combining datasets**

A very basic example would be globally averaged SST. Everything up until here incorporates all the heavy lifting with cmip6_pp, but at the end you end up with a dict (or list) of clean and homogeneous datasets that can be independently processed, e.g. with:

```python
ds.weighted(ds.areacello).mean(['x', 'y'])
```

**2. Execute the built-up computation for each dataset (in parallel?)**

Of course we could execute this with a loop over the dictionary (that is in fact what I used to do up until now), but that would be slow. Instead I was hoping that we can somehow write a recipe that does Step 1, and is then somehow able to 'dispatch' each of these datasets (with all their dask graphs and modifications applied) to a recipe similar to `XarrayZarrRecipe`. @cisaacstern's and my idea was initially that this would all be written into a single recipe (which writes pickled representations to a temp space) and then returns a bunch of recipes. I still think that, if possible, this would be worth pursuing.
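To make the two stages concrete, here is a minimal sketch assuming intake-esm and `combined_preprocessing` from the cmip6_preprocessing package; the catalog URL, the query, and the presence of `areacello` in each dataset are assumptions for illustration, not taken from this thread:

```python
# Stage 1: discover, preprocess, and combine datasets into a dict
import intake
from cmip6_preprocessing.preprocessing import combined_preprocessing

col = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)
cat = col.search(variable_id="tos", table_id="Omon", experiment_id="historical")

# One clean, homogeneous (still lazy) dataset per model/member key
dataset_dict = cat.to_dataset_dict(preprocess=combined_preprocessing)

# Stage 2: build the delayed computation for each dataset
results = {
    key: ds.weighted(ds.areacello).mean(["x", "y"])  # assumes areacello is present
    for key, ds in dataset_dict.items()
}
```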
@jbusecke thanks for this helpful context.

Pangeo Forge already supports definition of the recipe as a dictionary of recipes. The real work will be to create an intermediate layer (possibly through a plugin) which allows you to experiment with dataset dictionaries in a notebook in a way that's familiar and comfortable to you, but which translates that process into FilePatterns and other Pangeo Forge types when you're ready to move to executing your recipe.
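For reference, a minimal sketch of such a dictionary of recipes; the keys and the `pattern_*` variables are hypothetical:

```python
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# One recipe per CMIP6 source, keyed by a dataset identifier (keys invented)
recipes = {
    "CMIP6.CMIP.ModelA.historical.Omon.tos": XarrayZarrRecipe(pattern_model_a),
    "CMIP6.CMIP.ModelB.historical.Omon.tos": XarrayZarrRecipe(pattern_model_b),
}
```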
Thanks so much for taking the time to write that all up! I have a much better idea of this use case now.

I think it would be a useful exercise to first recode a simplified version of this workflow from scratch using existing pangeo-forge-recipes features. The existing CMIP6 recipe tutorial would be a good place to start. (It should be no problem to use Zarr rather than netCDF inputs.)

That will reveal a crux issue: Pangeo Forge as currently constituted simply does not allow arbitrary computations to be run on a dask distributed cluster. Instead, we enforce a fairly limited model of parallel execution. At its core, Pangeo Forge requires that datasets be written in parallel in "chunks" (which are not the same as dask chunks; see #182). For our purposes a "chunk" is an individual Xarray dataset that can fit entirely into memory. A chunk can be a combination (concatenation / merge) of multiple inputs or, alternatively, a strided subset of one input. This requires rethinking the whole approach to how to produce a derived dataset.

Fortunately, for this application of global mean SST, we could easily accomplish that via a `process_chunk` function:

```python
def global_mean(ds):
    return ds.weighted(ds.areacello).mean(['x', 'y'])

recipe = XarrayZarrRecipe(..., process_chunk=global_mean)
```

Then it basically comes down to the challenge of defining the appropriate FilePattern object for a particular model. Other preprocessing, such as augmenting the coordinates, could be accomplished either at the `process_input` or the `process_chunk` stage.

One problem that I can't immediately see how to resolve is how to bring in an extra coordinate file, separate from the FilePattern.
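For illustration, a sketch of what such a `FilePattern` might look like for one model; the URL template and the time keys are invented for the example:

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

def make_url(time):
    # Hypothetical URL template for one model's monthly SST files
    return f"https://example.org/cmip6/ModelA/historical/Omon/tos/tos_{time}.nc"

# Illustrative date-range keys for the concatenation dimension
time_keys = ["185001-185912", "186001-186912", "187001-187912"]
pattern = FilePattern(make_url, ConcatDim("time", keys=time_keys))
```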
We might be able to close this in favor of #242? If that avenue is successful it seems like a more general solution.
@jbusecke, @dgergel, and I are continuing to explore ways to support real-world CMIP6 workflows in Pangeo Forge.
https://github.com/jbusecke/cmip6_derived_cloud_datasets uses Prefect and Coiled to build derived CMIP6 Zarr datasets without Pangeo Forge. The first step of that process is `discovery_and_preprocessing`, which returns a dictionary of preprocessed `xarray.Dataset`s.

Julius thought of a possible shortcut for plugging his dictionary into Pangeo Forge: if the objects in the CMIP6 dataset dictionary were written out to local pickle files, then they might be able to be arbitrarily executed with his Prefect/Coiled infrastructure or, alternatively, instantiated as an `XarrayZarrRecipe` and executed with Pangeo Forge infrastructure. This turns out to be true (at least for manual execution), as proven by a notebook he screen-shared with me, and requires surprisingly minimal changes to the codebase.

This PR (again, h/t Julius for the concept) makes the necessary changes to `pangeo-forge-recipes` so that Julius can continue his experiments by installing `pangeo-forge-recipes` from:

```
!pip install git+https://github.com/cisaacstern/pangeo-forge-recipes@from-pickle
```

This might be part of an eventual solution for #176 and the various CMIP6-related issues linked therein. Here is a pseudo-code example of how this feature would be used in combination with Julius's dataset dictionary:
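(Sketched below; the `from_pickle` keyword and the use of `pattern_from_file_sequence` are assumptions based on the PR title, and the exact API in this PR may differ.)

```python
import pickle
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Dump each preprocessed (lazy) dataset in the dictionary to a local pickle file
paths = []
for key, ds in dataset_dict.items():
    path = f"{key}.pkl"
    with open(path, "wb") as f:
        pickle.dump(ds, f)
    paths.append(path)

# Build a FilePattern over the pickles; `from_pickle=True` is the option
# this PR proposes (keyword name assumed from the PR title)
pattern = pattern_from_file_sequence(paths, concat_dim="time", from_pickle=True)
recipe = XarrayZarrRecipe(pattern)
```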
Among other things, security is a consideration if we ever choose to support pickled inputs in production.

The feature of `pickle` which does not appear to be easily replicated with other intermediate caching methods is its ability (in theory, at least?) to encode the entire `xarray.Dataset`, including combinations of arbitrary numbers of remote sources and delayed Dask operations, in a small, portable file that doesn't require `.load()`ing. Currently, the tests I've included with this PR don't exercise that full range of potential, but we can certainly expand them to confirm.
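As a minimal illustration of that property (the store path and variables are invented for the example):

```python
import pickle
import xarray as xr

# Open a remote Zarr store lazily: the data stays in the cloud,
# only the metadata is read (hypothetical path)
ds = xr.open_zarr("gs://cmip6/some/model/store.zarr")

# Build up delayed Dask operations; nothing is computed yet
ds_mean = ds.weighted(ds.areacello).mean(["x", "y"])

# The pickle holds the dask graph and references to the remote sources,
# not the data itself, so the file stays small
with open("ds_mean.pkl", "wb") as f:
    pickle.dump(ds_mean, f)

# Reloading restores the lazy dataset; no .load() needed until compute time
with open("ds_mean.pkl", "rb") as f:
    restored = pickle.load(f)
```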