Make "inputs" concept more general in order to accommodate data access from APIs #242

Open
rabernat opened this issue Nov 19, 2021 · 2 comments

Comments

@rabernat
Contributor

Our discussion today with ECMWF about ERA5 (see pangeo-forge/staged-recipes#92) surfaced the need for Pangeo Forge to be able to extract data from APIs.

Current Situation in pangeo-forge-recipes

We assume that the recipe "inputs" are either

  • Files that can be opened with fsspec
  • OPeNDAP endpoints

These inputs are generated by a FilePattern object. The file pattern must contain a function which translates "input keys" into a string. This string is then passed either to fsspec or directly to xarray.

The FilePattern interfaces directly with the recipe as follows:

fp = FilePattern(...)
rec = Recipe(fp, ...)

The fact that the FilePattern returns a string is a limitation.
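To make the limitation concrete, here is a minimal sketch of the string-only model. It uses plain Python stand-ins rather than the real pangeo-forge-recipes API; the function name, the key name `time`, and the example.com URL are all placeholders.

```python
# Hypothetical stand-in for the string-only model; not the pangeo-forge-recipes API.
def format_function(time: int) -> str:
    """Translate an input key into a URL string (example.com is a placeholder)."""
    return f"https://example.com/data/file_{time:03d}.nc"


input_keys = range(3)
urls = [format_function(k) for k in input_keys]

# Every entry is a plain string, so a source that is not addressable by a
# single path or URL (e.g. a dict of API query parameters) cannot be expressed.
```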

Examples of data sources that don't fit with this model

  • Data from the ECMWF CDS API -> The API ultimately gives you a netCDF file based on various query parameters
  • MITgcm data opened with xmitgcm, e.g., from the ECCO data portal -> returns an Xarray Dataset directly

One Solution: An additional "Opener" layer

Rather than passing the FilePattern directly to the Recipe and hoping that the Recipe knows what to do with it, we could imagine the following sort of pattern

fp = FilePattern(...)
opener = XarrayFsspecOpener()
inputs = Inputs(fp, opener)
recipe = Recipe(inputs, ...)

Basically the "opener" would be guaranteed to return certain things that the recipe could use, for example:

  • File-like objects that could be cached
  • Xarray datasets that could be used in the recipe

The Inputs interface would be pretty light, something like

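from dataclasses import dataclass

@dataclass
class Inputs:
    file_pattern: FilePattern
    opener: Opener

    def __getitem__(self, input_key):
        # Look up whatever the pattern returns for this key, then hand it to the opener
        thing_to_open = self.file_pattern[input_key]
        return self.opener.open(thing_to_open)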

The FilePattern could then return whatever it wants, provided that the opener knows what to do with it. For example, FilePattern could return a dict of parameters to pass to an API, or any other Python object that is useful to the opener.
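As a rough illustration of the dict-returning case, the sketch below wires a pattern that yields request parameters into an opener that consumes them. `DictPattern` and `FakeApiOpener` are hypothetical stand-ins invented for this example, and the request keys are illustrative; a real opener would pass the dict to an API client instead of formatting a string.

```python
from dataclasses import dataclass
from typing import Any, Callable, Protocol


class Opener(Protocol):
    def open(self, thing_to_open: Any) -> Any: ...


@dataclass
class DictPattern:
    """Hypothetical stand-in for a FilePattern that returns dicts, not strings."""

    make_request: Callable[[int], dict]

    def __getitem__(self, input_key: int) -> dict:
        return self.make_request(input_key)


class FakeApiOpener:
    """Hypothetical opener; a real one would send the request to an API client."""

    def open(self, request: dict) -> str:
        return f"opened {request['variable']} for {request['year']}"


@dataclass
class Inputs:
    file_pattern: Any
    opener: Opener

    def __getitem__(self, input_key):
        thing_to_open = self.file_pattern[input_key]
        return self.opener.open(thing_to_open)


# The request keys below are illustrative, not a documented API schema.
pattern = DictPattern(lambda year: {"variable": "2m_temperature", "year": year})
inputs = Inputs(pattern, FakeApiOpener())
result = inputs[2020]
```

The recipe only ever indexes `inputs`, so it does not need to know whether the pattern produced a URL or a set of query parameters.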

We could imagine providing some default openers

  • XarrayFsspecOpener
  • XarrayOpendapOpener
  • PandasOpener
  • etc.

... plus also allowing users to define their own custom openers to handle more specialized situations.

We could also consider making some of these mixins or using inheritance, such that we could do something like

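from contextlib import contextmanager

class FsspecOpener:
    def open_file(self, key):
        ...

class XarrayFsspecOpener(FsspecOpener):
    @contextmanager
    def open_dataset(self, key):
        # Keep the file open for as long as the caller is using the dataset
        with self.open_file(key) as f:
            yield xr.open_dataset(f)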

etc.
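One detail worth spelling out in the inheritance idea: a method that yields from inside a `with` block has to be a context manager itself, so the underlying file stays open while the caller works with the result. Below is a dependency-free sketch of that layering. `InMemoryOpener` and `TextOpener` are hypothetical stand-ins for `FsspecOpener` and `XarrayFsspecOpener`; they read bytes from a dict and decode text instead of calling fsspec or `xr.open_dataset`.

```python
import io
from contextlib import contextmanager


class InMemoryOpener:
    """Hypothetical stand-in for FsspecOpener, backed by a dict of bytes."""

    def __init__(self, store: dict):
        self.store = store

    @contextmanager
    def open_file(self, key):
        f = io.BytesIO(self.store[key])
        try:
            yield f
        finally:
            f.close()  # runs when the caller leaves its `with` block


class TextOpener(InMemoryOpener):
    """Hypothetical stand-in for XarrayFsspecOpener; decodes text instead of opening a dataset."""

    @contextmanager
    def open_dataset(self, key):
        # Reuses the parent's file handling and adds one decoding step on top
        with self.open_file(key) as f:
            yield f.read().decode()


opener = TextOpener({"a": b"hello"})
with opener.open_dataset("a") as ds:
    contents = ds
```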

Pros and Cons

This would be a significant refactor. In the end, code would leave XarrayZarrRecipe and move to a new class (XarrayFsspecOpener). Overall I think this would make for better separation of concerns and more reusable code.

I am concerned that this would create yet another layer of complexity for the users. This could be mitigated by creating some convenience functions or adaptors that would reduce the amount of boilerplate that would need to be written.
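A convenience function of the kind mentioned here might look like the sketch below. Everything in it is hypothetical: `make_inputs` and `PassthroughOpener` are names invented for illustration, and a plain dict stands in for the file pattern.

```python
from dataclasses import dataclass
from typing import Any


class PassthroughOpener:
    """Hypothetical default: hand the pattern's value back unchanged."""

    def open(self, thing_to_open: Any) -> Any:
        return thing_to_open


@dataclass
class Inputs:
    file_pattern: Any
    opener: Any

    def __getitem__(self, input_key):
        return self.opener.open(self.file_pattern[input_key])


def make_inputs(file_pattern, opener=None) -> Inputs:
    """Hypothetical helper: fall back to a default opener when none is given."""
    return Inputs(file_pattern, opener if opener is not None else PassthroughOpener())


# A plain dict stands in for a file pattern; example.com is a placeholder.
inputs = make_inputs({"jan": "https://example.com/jan.nc"})
first = inputs["jan"]
```

With a helper like this, users who are happy with the default behavior would write one line and never touch the opener layer directly.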

@cisaacstern
Member

Notes from the coordination meeting today:

  • @martindurant mentioned that the Intake community is considering separating the backend reader component (perhaps there is a better descriptor for this) from that project into its own package. If so, some of this may be able to rely on that.
  • For the CDS/MARS case specifically: @alxmrs has experience with formatting efficient CDS requests and may be able to offer relevant insights (and possibly code, tbd).

@jbusecke
Contributor

I just had a very nice chat with @cisaacstern and he pointed me to this issue. From a brief reading, I think this would fit our use case of processing CMIP6 data to e.g. remove control run drift.

If I understand this correctly, I could see the Opener containing logic to load and filter datasets from a catalog (using cmip6_preprocessing) and return an Xarray dataset to the actual recipe. Is that right?

We set some time aside next week to hack about with this, and will report back.
