
Convenience methods for converting STAC objects / linked assets to data containers #846

@TomAugspurger

Hi all,

I'm tentatively making a pitch to add convenience methods for converting pystac objects (Asset, Item, ItemCollection, ...) and their linked assets to commonly used data containers (xarray.Dataset, geopandas.GeoDataFrame, pandas.DataFrame, etc.).

I'm opening this in pystac since this is primarily for convenience, so that users can method-chain their way from STAC Catalog to data container, and pystac owns the namespaces I care about. You can already do everything I'm showing today without any changes to pystac but it feels less nice. I really think that pd.read_csv is part of why Python is where it is today for data analytics; I want using STAC from Python to be as easy to use as pd.read_csv.

Secondarily, it can elevate the best-practice way to go from STAC to data containers, by providing a top-level method similar to to_dict().

As a couple hypothetical examples, to give an idea:

ds = (
    catalog
        .get_collection("sentinel-2-l2a")
        .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
        .assets["B03"]
        .to_xarray()
)
ds

Or building a datacube from a pystac-client search (which subclasses pystac).

ds = (
    catalog
        .search(collections="sentinel-2-l2a", bbox=bbox)
        .get_all_items()  # ItemCollection
        .to_xarray()
)
ds

Implementation details

This would be optional. pystac would not add required dependencies on pandas, xarray, etc. It would merely provide the methods Item.to_xarray, Asset.to_xarray, ... Internally, those methods would try to import the implementation and raise an informative ImportError at runtime if the optional dependencies aren't installed.
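The optional-import shape could look something like this minimal sketch (the Asset stand-in and the error message are illustrative, not pystac's actual code):

```python
import importlib


class Asset:
    """Minimal stand-in for pystac.Asset, for illustration only."""

    def __init__(self, href: str):
        self.href = href

    def to_xarray(self, **kwargs):
        # Import the optional implementation lazily, so pystac itself
        # keeps no hard dependency on xarray.
        try:
            xr = importlib.import_module("xarray")
        except ImportError as err:
            # The suggested install command is hypothetical.
            raise ImportError(
                "Asset.to_xarray requires the optional dependency 'xarray'; "
                "install it with e.g. 'pip install xarray'."
            ) from err
        return xr.open_dataset(self.href, **kwargs)
```

Nothing is imported until the method is actually called, so plain `import pystac` stays dependency-free.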

Speaking of the implementations, there are a few things to figure out. Some relatively complicated conversions (like ItemCollection -> xarray) are already implemented multiple times (https://stackstac.readthedocs.io/, https://odc-stac.readthedocs.io/en/latest/examples.html). pystac certainly wouldn't want to re-implement that conversion; it would dispatch to one or the other of those libraries (perhaps letting users decide with an engine argument).
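The engine argument could dispatch along these lines. This is a hypothetical sketch, not proposed pystac code; `stackstac.stack` and `odc.stac.load` are the public entry points of those two libraries as I understand them:

```python
def to_xarray(items, engine: str = "stackstac", **kwargs):
    """Hypothetical dispatch from an ItemCollection to an xarray datacube."""
    if engine == "stackstac":
        import stackstac  # optional dependency, imported lazily

        return stackstac.stack(items, **kwargs)
    elif engine == "odc":
        import odc.stac  # optional dependency, imported lazily

        return odc.stac.load(items, **kwargs)
    raise ValueError(
        f"Unknown engine {engine!r}; expected 'stackstac' or 'odc'."
    )
```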

Other conversions, like Asset -> Zarr, are so straightforward that they haven't really been codified in a library yet (though I have a prototype at https://github.com/TomAugspurger/staccontainers/blob/086c2a7d46520ca5213d70716726b28ba6f36ba5/staccontainers/_xarray.py#L61-L63).
Maybe those could live in pystac; I'd be happy to maintain them.
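For concreteness, the Asset -> Zarr case could be roughly this. It's a sketch assuming the xarray-assets extension keys (xarray:storage_options, xarray:open_kwargs) and an asset with href / extra_fields attributes like pystac.Asset:

```python
def zarr_open_kwargs(extra_fields: dict) -> dict:
    """Collect xarray-assets extension fields into open_zarr keyword arguments."""
    return {
        # fsspec options for reaching the store (credentials, account, ...).
        "storage_options": extra_fields.get("xarray:storage_options", {}),
        # Any extra arguments to pass straight through to open_zarr.
        **extra_fields.get("xarray:open_kwargs", {}),
    }


def asset_to_zarr(asset):
    """Open a Zarr-store asset as an xarray.Dataset (sketch)."""
    import xarray as xr  # optional dependency, imported lazily

    return xr.open_zarr(asset.href, **zarr_open_kwargs(asset.extra_fields))
```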

https://github.com/TomAugspurger/staccontainers might serve as an idea of what some of these implementations would look like. It isn't too complicated.

Problems

A non-exhaustive list of reasons not to do this:

  • It's not strictly necessary: You can do all this today, with some effort.
  • It's a can of worms: Why to_xarray and not to_numpy, to_PIL, ...? Why to_pandas and not to_spark, to_modin, ...?

Alternatives

Alternatively, we could recommend using intake, along with intake-stac, which would wrap pystac-client and pystac. That would be the primary "user-facing" catalog people actually interface with. It already has a rich ecosystem of drivers that convert from files to data containers. I've hit some issues with trying to use intake-stac, but those could presumably be fixed with some effort.

Or more generally, another library could wrap pystac / pystac-client and provide these convenience methods. But (to me) that feels needlessly complicated.

Examples

A whole bunch of examples follow, to give some ideas of the various conversions. You can view the outputs at https://notebooksharing.space/view/db3c2096bf7e9c212425d00746bb17232e6b26cdc63731022fc2697c636ca4ed#displayOptions=.

catalog -> collection -> item -> asset -> xarray (raster)

ds = (
    catalog
        .get_collection("sentinel-2-l2a")
        .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
        .assets["B03"]
        .to_xarray()
)

catalog -> collection -> item -> asset -> xarray (zarr)

ds = (
    catalog
        .get_collection("cil-gdpcir-cc0")
        .get_item("cil-gdpcir-INM-INM-CM5-0-ssp585-r1i1p1f1-day")
        .assets["pr"]
        .to_xarray()
)

catalog -> collection -> item -> asset -> xarray (references)

ds = (
    catalog
        .get_collection("deltares-floods")
        .get_item("NASADEM-90m-2050-0010")
        .assets["index"]
        .to_xarray()
)

catalog -> collection -> item -> asset -> geodataframe

df = (
    catalog
        .get_collection("us-census")
        .get_item("2020-cb_2020_us_tbg_500k")
        .assets["data"]
        .to_geopandas()
)
df.head()

search / ItemCollection -> geopandas

df = catalog.search(collections=["sentinel-2-l2a"], bbox=[9.4, 0, 9.5, 1]).to_geopandas()
df

Proposed Methods

We should figure out what the "target" in each of these to_ methods is. I think there are a couple things to figure out:

  • Do we target container types ("to_dataframe", "to_dataarray") or libraries ("to_pandas", "to_geopandas", "to_xarray", ...)?
  • Do we support larger-than-memory results through an argument (to_dataframe(..., engine="dask") or to_dataframe(..., npartitions=...)) or through alternative methods (to_dask_dataframe())?

But I would loosely propose

  • Asset.to_xarray -> xarray.Dataset
  • {Item,ItemCollection}.to_xarray -> xarray.Dataset
  • Asset.to_geopandas -> geopandas.GeoDataFrame
  • Asset.to_pandas -> pandas.DataFrame
  • Asset.to_dask_geopandas -> dask_geopandas.GeoDataFrame
  • Asset.to_dask_dataframe -> dask.dataframe.DataFrame
  • ItemCollection.to_geopandas -> geopandas.GeoDataFrame

There's a bunch more to figure out, but hopefully that's enough to get started.
