Make "inputs" concept more general in order to accommodate data access from APIs #242

rabernat · 2021-11-19T18:57:39Z

Our discussion today with ECMWF about ERA5 (see pangeo-forge/staged-recipes#92) surfaced the need for Pangeo Forge to be able to extract data from APIs.

Current Situation in pangeo-forge-recipes

We assume that the recipe "inputs" are either:

Files that can be opened with fsspec
OpenDAP endpoints

These inputs are generated by a FilePattern object. The file pattern must contain a function which translates "input keys" into a string. This string is then passed either to fsspec or directly to xarray.

The FilePattern interfaces directly with the recipe as follows:

fp = FilePattern(...)
rec = Recipe(fp, ...)

The fact that the FilePattern returns a string is a limitation.

Examples of data sources that don't fit with this model

Data from the ECMWF CDS API -> The API ultimately gives you a netCDF file based on various query parameters
MITgcm data opened with xmitgcm, e.g. the ECCO data portal -> returns an Xarray Dataset directly

One Solution: An additional "Opener" layer

Rather than passing the FilePattern directly to the Recipe and hoping that the Recipe knows what to do with it, we could imagine the following sort of pattern:

fp = FilePattern(...)
opener = XarrayFsspecOpener()
inputs = Inputs(fp, opener)
recipe = Recipe(inputs, ...)

Basically the "opener" would be guaranteed to return certain things that the recipe could use, for example:

File-like objects that could be cached
Xarray datasets that could be used in the recipe

The Inputs interface would be pretty light, something like:

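from dataclasses import dataclass

@dataclass
class Inputs:
    file_pattern: FilePattern
    opener: Opener

    def __getitem__(self, input_key):
        # Look up whatever the pattern yields for this key, then hand it to the opener
        thing_to_open = self.file_pattern[input_key]
        return self.opener.open(thing_to_open)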

The FilePattern could then return whatever it wants, provided that the opener knows what to do with it. For example, FilePattern could return a dict of parameters to pass to an API, or any other Python object that is useful to the opener.
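As a rough illustration of the dict-of-parameters case, here is a minimal sketch that runs without pangeo-forge-recipes. `ParamPattern` and `EchoApiOpener` are hypothetical stand-ins (not existing classes), and the query parameters are placeholder values; a real opener would submit the request to the API and return a file-like object or a dataset rather than echoing the request back.

```python
from dataclasses import dataclass
from typing import Any, Callable, Protocol


class Opener(Protocol):
    """Anything with an `open` method that accepts what the pattern returns."""

    def open(self, thing_to_open: Any) -> Any: ...


@dataclass
class ParamPattern:
    """Hypothetical stand-in for FilePattern: maps an input key to query parameters."""

    make_params: Callable[[int], dict]

    def __getitem__(self, input_key: int) -> dict:
        return self.make_params(input_key)


@dataclass
class EchoApiOpener:
    """Hypothetical opener: echoes the request instead of calling a real API."""

    dataset_name: str

    def open(self, thing_to_open: dict) -> dict:
        return {"dataset": self.dataset_name, "request": thing_to_open}


@dataclass
class Inputs:
    file_pattern: Any
    opener: Opener

    def __getitem__(self, input_key):
        # The pattern decides what to open; the opener decides how to open it.
        return self.opener.open(self.file_pattern[input_key])


# Placeholder parameters, for illustration only.
pattern = ParamPattern(lambda year: {"variable": "example_variable", "year": str(year)})
inputs = Inputs(pattern, EchoApiOpener("example-dataset"))
result = inputs[1999]
```

The point of the sketch is that `Inputs` never inspects what the pattern returns, so a string, a dict, or any other object works as long as the paired opener accepts it.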

We could imagine providing some default openers

XarrayFsspecOpener
XarrayOpendapOpener
PandasOpener
etc.

... plus also allowing users to define their own custom openers to handle more specialized situations.

We could also consider making some of these mixins or using inheritance, such that we could do something like:

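from contextlib import contextmanager
import xarray as xr

class FsspecOpener:
    @contextmanager
    def open_file(self, key):
        ...

class XarrayFsspecOpener(FsspecOpener):
    @contextmanager
    def open_dataset(self, key):
        # Keep the underlying file open for as long as the dataset is in use
        with self.open_file(key) as f:
            yield xr.open_dataset(f)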

etc.
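To check that the layering behaves as intended, the same inheritance structure can be exercised with standard-library stand-ins. `MemoryOpener` and `JsonMemoryOpener` are hypothetical names: the first plays the role of the fsspec layer by serving bytes from a dict, and the second plays the role of the xarray layer by parsing JSON. This is only a sketch of the pattern, not a proposal for the actual classes.

```python
import io
import json
from contextlib import contextmanager


class MemoryOpener:
    """Hypothetical stand-in for FsspecOpener: serves bytes from an in-memory dict."""

    def __init__(self, store):
        self.store = store

    @contextmanager
    def open_file(self, key):
        f = io.BytesIO(self.store[key])
        try:
            yield f
        finally:
            # The file is closed once the caller leaves the `with` block.
            f.close()


class JsonMemoryOpener(MemoryOpener):
    """Hypothetical stand-in for XarrayFsspecOpener: parses the open file."""

    @contextmanager
    def open_dataset(self, key):
        # Reuse the parent's file handling and add only the parsing step.
        with self.open_file(key) as f:
            yield json.load(f)


opener = JsonMemoryOpener({"a.json": b'{"temperature": [1, 2, 3]}'})
with opener.open_dataset("a.json") as ds:
    values = ds["temperature"]
```

Each subclass adds one step on top of its parent's context manager, so file handling and dataset parsing stay in separate classes.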

Pros and Cons

This would be a significant refactor. Code would move out of XarrayZarrRecipe and into a new class (XarrayFsspecOpener). Overall I think this would make for better separation of concerns and more reusable code.

I am concerned that this would create yet another layer of complexity for the users. This could be mitigated by creating some convenience functions or adaptors that would reduce the amount of boilerplate that would need to be written.
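One possible shape for such a convenience function, sketched with stand-ins so that it runs on its own: `make_inputs` and `PassthroughOpener` are hypothetical names, and a plain dict stands in for a FilePattern. The idea is that users who are happy with the default never have to mention an opener.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Inputs:
    file_pattern: Any
    opener: Any

    def __getitem__(self, input_key):
        return self.opener.open(self.file_pattern[input_key])


class PassthroughOpener:
    """Hypothetical default opener: hands back whatever the pattern produced."""

    def open(self, thing_to_open):
        return thing_to_open


def make_inputs(file_pattern, opener: Optional[Any] = None) -> Inputs:
    """Hypothetical helper: fall back to a default opener when none is given."""
    return Inputs(file_pattern, opener if opener is not None else PassthroughOpener())


# A plain dict stands in for a FilePattern here.
pattern = {0: "data_0.nc", 1: "data_1.nc"}
inputs = make_inputs(pattern)
first = inputs[0]
```

With a helper like this, the common case stays a one-liner, while a custom opener can still be passed explicitly when needed.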


cisaacstern · 2021-11-22T20:27:30Z

Notes from the coordination meeting today:

@martindurant mentioned that the Intake community is considering separating the backend reader component (perhaps there is a better descriptor for this) from that project into its own package. If so, some of this may be able to rely on that.
For the CDS/MARS case specifically: @alxmrs has experience with formatting efficient CDS requests and may be able to offer relevant insights (and possibly code, tbd).

jbusecke · 2021-11-24T00:46:22Z

I just had a very nice chat with @cisaacstern and he pointed me to this issue. From a brief reading I think this would fit our use case of processing CMIP6 data to e.g. remove control run drift.

If I understand this correctly, I could see the Opener containing logic to load/filter datasets from a catalog (using cmip6_preprocessing) and return an xarray dataset to the actual recipe?

We set some time aside next week to hack about with this, and will report back.

rabernat mentioned this issue Nov 22, 2021

Redo: readers and base package intake/intake#627

Open

jbusecke mentioned this issue Nov 24, 2021

Add from_pickle option to FilePattern #205

Closed

rabernat mentioned this issue Nov 29, 2021

Opener refactor #245

Closed

jbusecke mentioned this issue Dec 3, 2021

Derived CMIP6 data recipe builder (WIP) #252

Closed

rabernat mentioned this issue Jan 31, 2022

Pangeo Forge regridding recipe? ocean-transport/scale-aware-air-sea#13

Closed
