Description
Hi all,
I'm tentatively making a pitch to add convenience methods for converting pystac objects (`Asset`, `Item`, `ItemCollection`, ...) and their linked assets to commonly used data containers (`xarray.Dataset`, `geopandas.GeoDataFrame`, `pandas.DataFrame`, etc.).
I'm opening this in `pystac` since this is primarily for convenience, so that users can method-chain their way from STAC Catalog to data container, and `pystac` owns the namespaces I care about. You can already do everything I'm showing today without any changes to `pystac`, but it feels less nice. I really think that `pd.read_csv` is part of why Python is where it is today for data analytics; I want using STAC from Python to be as easy to use as `pd.read_csv`.
Secondarily, it can elevate the best-practice way to go from STAC to data containers, by providing a top-level method similar to `to_dict()`.
As a couple hypothetical examples, to give an idea:
```python
ds = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
    .to_xarray()
)
ds
```
Or building a datacube from a `pystac-client` search (`pystac-client` subclasses `pystac`).
```python
ds = (
    catalog
    .search(collections="sentinel-2-l2a", bbox=bbox)
    .get_all_items()  # ItemCollection
    .to_xarray()
)
ds
```
Implementation details
This would be optional. `pystac` would not add required dependencies on `pandas`, `xarray`, etc. It would merely provide the methods `Item.to_xarray`, `Asset.to_xarray`, ... Internally, those methods would try to import the implementation and raise an `ImportError` if the optional dependencies aren't met at runtime.
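To make that concrete, here's a minimal sketch of the lazy-import pattern; the module name `pystac_to_xarray` and the error message are hypothetical, not an existing package:

```python
class Asset:
    """Stand-in for pystac.Asset, reduced to what the sketch needs."""

    def __init__(self, href: str):
        self.href = href

    def to_xarray(self):
        # Only import the heavy optional dependencies when the method is
        # actually called, so `import pystac` itself stays lightweight.
        try:
            import pystac_to_xarray  # hypothetical optional package
        except ImportError as err:
            raise ImportError(
                "Asset.to_xarray() requires the optional dependency "
                "'pystac_to_xarray'; install it to use this method."
            ) from err
        return pystac_to_xarray.open_asset(self)
```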
Speaking of the implementations, there are a few things to figure out. Some relatively complicated conversions (like ItemCollection -> xarray) are implemented multiple times (https://stackstac.readthedocs.io/, https://odc-stac.readthedocs.io/en/latest/examples.html). `pystac` certainly wouldn't want to re-implement that conversion and would dispatch to one or the other of those libraries (perhaps letting users decide with an `engine` argument).
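A dispatching implementation might look roughly like this; stackstac and odc-stac are the real libraries mentioned above, but the `engine` names and the exact call signatures here are assumptions:

```python
def item_collection_to_xarray(items, engine: str = "stackstac", **kwargs):
    """Dispatch ItemCollection -> xarray to an optional backend library."""
    if engine == "stackstac":
        import stackstac
        return stackstac.stack(items, **kwargs)
    if engine == "odc-stac":
        import odc.stac
        return odc.stac.load(items, **kwargs)
    raise ValueError(
        f"Unknown engine {engine!r}; expected 'stackstac' or 'odc-stac'."
    )
```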
Other conversions, like Asset -> Zarr, are so straightforward that they haven't really been codified in a library yet (though I have a prototype at https://github.com/TomAugspurger/staccontainers/blob/086c2a7d46520ca5213d70716726b28ba6f36ba5/staccontainers/_xarray.py#L61-L63).
Maybe those could live in pystac; I'd be happy to maintain them.
https://github.com/TomAugspurger/staccontainers might serve as an idea of what some of these implementations would look like. It isn't too complicated.
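For flavor, a hedged sketch of the Asset -> Zarr case (in the spirit of the prototype linked above, not a copy of it). It assumes xarray with a Zarr-capable backend is installed, and that open options ride along on the asset, e.g. via the xarray-assets extension's `xarray:open_kwargs` field:

```python
def asset_to_xarray_zarr(asset, **kwargs):
    """Open a STAC Asset pointing at a Zarr store as an xarray.Dataset."""
    import xarray as xr

    # Pick up any open options stored on the asset itself (an assumption:
    # here we read the xarray-assets extension's "xarray:open_kwargs").
    open_kwargs = dict(asset.extra_fields.get("xarray:open_kwargs", {}))
    open_kwargs.update(kwargs)
    return xr.open_zarr(asset.href, **open_kwargs)
```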
Problems
A non-exhaustive list of reasons not to do this:
- It's not strictly necessary: You can do all this today, with some effort.
- It's a can of worms: Why `to_xarray` and not `to_numpy()`, `to_PIL`, ...? Why `to_pandas()` and not `to_spark()`, `to_modin`, ...?
Alternatives
Alternatively, we could recommend using intake, along with intake-stac, which would wrap `pystac-client` and `pystac`. That would be the primary "user-facing" catalog people actually interface with. It already has a rich ecosystem of drivers that convert from files to data containers. I've hit some issues with trying to use intake-stac, but those could presumably be fixed with some effort.
Or more generally, another library could wrap pystac / pystac-client and provide these convenience methods. But (to me) that feels needlessly complicated.
Examples
A whole bunch of examples, under the `<details>`, to give some ideas of the various conversions. You can view the outputs at https://notebooksharing.space/view/db3c2096bf7e9c212425d00746bb17232e6b26cdc63731022fc2697c636ca4ed#displayOptions=.
catalog -> collection -> item -> asset -> xarray (raster)
```python
ds = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
    .to_xarray()
)
```
catalog -> collection -> item -> asset -> xarray (zarr)
```python
ds = (
    catalog
    .get_collection("cil-gdpcir-cc0")
    .get_item("cil-gdpcir-INM-INM-CM5-0-ssp585-r1i1p1f1-day")
    .assets["pr"]
    .to_xarray()
)
```
catalog -> collection -> item -> asset -> xarray (references)
```python
ds = (
    catalog
    .get_collection("deltares-floods")
    .get_item("NASADEM-90m-2050-0010")
    .assets["index"]
    .to_xarray()
)
```
catalog -> collection -> item -> asset -> geodataframe
```python
df = (
    catalog
    .get_collection("us-census")
    .get_item("2020-cb_2020_us_tbg_500k")
    .assets["data"]
    .to_geopandas()
)
df.head()
```
search / ItemCollection -> geopandas
```python
df = catalog.search(collections=["sentinel-2-l2a"], bbox=[9.4, 0, 9.5, 1]).to_geopandas()
df
```
Proposed Methods
We should figure out what the "target" in each of these `to_` methods is. I think there are a couple things to figure out:

- Do we target container types (`to_dataframe`, `to_dataarray`) or libraries (`to_pandas`, `to_geopandas`, `to_xarray`, ...)?
- Do we support larger-than-memory results through an argument (`to_dataframe(..., engine="dask")` or `to_dataframe(..., npartitions=...)`) or alternative methods (`.to_dask_dataframe()`)?
But I would loosely propose:

- `Asset.to_xarray` -> `xarray.Dataset`
- `{Item,ItemCollection}.to_xarray` -> `xarray.Dataset`
- `Asset.to_geopandas` -> `geopandas.GeoDataFrame`
- `Asset.to_pandas` -> `pandas.DataFrame`
- `Asset.to_dask_geopandas` -> `dask_geopandas.GeoDataFrame`
- `Asset.to_dask_dataframe` -> `dask.dataframe.DataFrame`
- `ItemCollection.to_geopandas` -> `geopandas.GeoDataFrame`
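Of these, `ItemCollection.to_geopandas` is probably the simplest: STAC Items serialize to GeoJSON Features, which geopandas can consume directly. A sketch, assuming geopandas is installed (the default CRS here is an assumption):

```python
def item_collection_to_geopandas(item_collection, crs: str = "EPSG:4326"):
    """Build a GeoDataFrame from an ItemCollection's GeoJSON features."""
    import geopandas as gpd

    # Each STAC Item serializes to a GeoJSON Feature, which
    # GeoDataFrame.from_features understands directly (properties
    # become columns, geometry becomes the geometry column).
    features = [item.to_dict() for item in item_collection]
    return gpd.GeoDataFrame.from_features(features, crs=crs)
```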
There's a bunch more to figure out, but hopefully that's enough to get started.