Description
Hi all,
I'm tentatively making a pitch to add convenience methods for converting pystac objects (`Asset`, `Item`, `ItemCollection`, ...) and their linked assets to commonly used data containers (`xarray.Dataset`, `geopandas.GeoDataFrame`, `pandas.DataFrame`, etc.).
I'm opening this in `pystac` since this is primarily for convenience, so that users can method-chain their way from STAC Catalog to data container, and `pystac` owns the namespaces I care about. You can already do everything I'm showing today without any changes to `pystac`, but it feels less nice. I really think that `pd.read_csv` is part of why Python is where it is today for data analytics; I want using STAC from Python to be as easy to use as `pd.read_csv`.
Secondarily, it can elevate the best-practice way to go from STAC to data containers, by providing a top-level method similar to `to_dict()`.
As a couple hypothetical examples, to give an idea:
```python
ds = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
    .to_xarray()
)
ds
```
Or building a datacube from a `pystac-client` search (`pystac-client` subclasses `pystac`).
```python
ds = (
    catalog
    .search(collections="sentinel-2-l2a", bbox=bbox)
    .get_all_items()  # ItemCollection
    .to_xarray()
)
ds
```
Implementation details
This would be optional. `pystac` would not add required dependencies on `pandas`, `xarray`, etc. It would merely provide the methods `Item.to_xarray`, `Asset.to_xarray`, ... Internally, those methods would try to import the implementation and raise an `ImportError` if the optional dependencies aren't met at runtime.
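To make that concrete, here's a minimal sketch of the lazy-import pattern; the module name `pystac_to_xarray` and the error message are hypothetical, not an existing package:

```python
class Asset:
    """Stand-in for pystac.Asset, reduced to what the sketch needs."""

    def __init__(self, href: str):
        self.href = href

    def to_xarray(self):
        # Only import the heavy optional dependencies when the method is
        # actually called, so `import pystac` itself stays lightweight.
        try:
            import pystac_to_xarray  # hypothetical optional package
        except ImportError as err:
            raise ImportError(
                "Asset.to_xarray() requires the optional dependency "
                "'pystac_to_xarray'; install it to use this method."
            ) from err
        return pystac_to_xarray.open_asset(self)
```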
Speaking of the implementations, there are a few things to figure out. Some relatively complicated conversions (like ItemCollection -> xarray) are implemented multiple times (https://stackstac.readthedocs.io/, https://odc-stac.readthedocs.io/en/latest/examples.html). `pystac` certainly wouldn't want to re-implement that conversion and would dispatch to one or the other of those libraries (perhaps letting users decide with an `engine` argument).
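A dispatching implementation might look roughly like this; stackstac and odc-stac are the real libraries mentioned above, but the `engine` names and the exact call signatures here are assumptions:

```python
def item_collection_to_xarray(items, engine: str = "stackstac", **kwargs):
    """Dispatch ItemCollection -> xarray to an optional backend library."""
    if engine == "stackstac":
        import stackstac
        return stackstac.stack(items, **kwargs)
    if engine == "odc-stac":
        import odc.stac
        return odc.stac.load(items, **kwargs)
    raise ValueError(
        f"Unknown engine {engine!r}; expected 'stackstac' or 'odc-stac'."
    )
```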
Other conversions, like Asset -> Zarr, are so straightforward that they haven't really been codified in a library yet (though I have a prototype at https://github.com/TomAugspurger/staccontainers/blob/086c2a7d46520ca5213d70716726b28ba6f36ba5/staccontainers/_xarray.py#L61-L63).
Maybe those could live in pystac; I'd be happy to maintain them.
https://github.com/TomAugspurger/staccontainers might serve as an idea of what some of these implementations would look like. It isn't too complicated.
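For flavor, a hedged sketch of the Asset -> Zarr case (in the spirit of the prototype linked above, not a copy of it). It assumes xarray with a Zarr-capable backend is installed, and that open options ride along on the asset, e.g. via the xarray-assets extension's `xarray:open_kwargs` field:

```python
def asset_to_xarray_zarr(asset, **kwargs):
    """Open a STAC Asset pointing at a Zarr store as an xarray.Dataset."""
    import xarray as xr

    # Pick up any open options stored on the asset itself (an assumption:
    # here we read the xarray-assets extension's "xarray:open_kwargs").
    open_kwargs = dict(asset.extra_fields.get("xarray:open_kwargs", {}))
    open_kwargs.update(kwargs)
    return xr.open_zarr(asset.href, **open_kwargs)
```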
Problems
A non-exhaustive list of reasons not to do this:
- It's not strictly necessary: You can do all this today, with some effort.
- It's a can of worms: Why `to_xarray` and not `to_numpy()`, `to_PIL`, ...? Why `to_pandas()` and not `to_spark()`, `to_modin`, ...?
Alternatives
Alternatively, we could recommend using intake, along with intake-stac, which would wrap `pystac-client` and `pystac`. That would be the primary "user-facing" catalog people actually interface with. It already has a rich ecosystem of drivers that convert from files to data containers. I've hit some issues with trying to use intake-stac, but those could presumably be fixed with some effort.
Or more generally, another library could wrap pystac / pystac-client and provide these convenience methods. But (to me) that feels needlessly complicated.
Examples
A whole bunch of examples, under the `<details>`, to give some ideas of the various conversions. You can view the outputs at https://notebooksharing.space/view/db3c2096bf7e9c212425d00746bb17232e6b26cdc63731022fc2697c636ca4ed#displayOptions=.
catalog -> collection -> item -> asset -> xarray (raster)
```python
ds = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
    .to_xarray()
)
```
catalog -> collection -> item -> asset -> xarray (zarr)
```python
ds = (
    catalog
    .get_collection("cil-gdpcir-cc0")
    .get_item("cil-gdpcir-INM-INM-CM5-0-ssp585-r1i1p1f1-day")
    .assets["pr"]
    .to_xarray()
)
```
catalog -> collection -> item -> asset -> xarray (references)
```python
ds = (
    catalog
    .get_collection("deltares-floods")
    .get_item("NASADEM-90m-2050-0010")
    .assets["index"]
    .to_xarray()
)
```
catalog -> collection -> item -> asset -> geodataframe
```python
df = (
    catalog
    .get_collection("us-census")
    .get_item("2020-cb_2020_us_tbg_500k")
    .assets["data"]
    .to_geopandas()
)
df.head()
```
search / ItemCollection -> geopandas
```python
df = catalog.search(collections=["sentinel-2-l2a"], bbox=[9.4, 0, 9.5, 1]).to_geopandas()
df
```
Proposed Methods
We should figure out what the "target" in each of these `to_` methods is. I think there are a couple things to figure out:

- Do we target container types (`to_dataframe`, `to_dataarray`) or libraries (`to_pandas`, `to_geopandas`, `to_xarray`, ...)?
- Do we support larger-than-memory results through an argument (`to_dataframe(..., engine="dask")` or `to_dataframe(..., npartitions=...)`) or alternative methods (`.to_dask_dataframe()`)?
But I would loosely propose:

- `Asset.to_xarray` -> `xarray.Dataset`
- `{Item,ItemCollection}.to_xarray` -> `xarray.Dataset`
- `Asset.to_geopandas` -> `geopandas.GeoDataFrame`
- `Asset.to_pandas` -> `pandas.DataFrame`
- `Asset.to_dask_geopandas` -> `dask_geopandas.GeoDataFrame`
- `Asset.to_dask_dataframe` -> `dask.dataframe.DataFrame`
- `ItemCollection.to_geopandas` -> `geopandas.GeoDataFrame`
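Of these, `ItemCollection.to_geopandas` is probably the simplest: STAC Items serialize to GeoJSON Features, which geopandas can consume directly. A sketch, assuming geopandas is installed (the default CRS here is an assumption):

```python
def item_collection_to_geopandas(item_collection, crs: str = "EPSG:4326"):
    """Build a GeoDataFrame from an ItemCollection's GeoJSON features."""
    import geopandas as gpd

    # Each STAC Item serializes to a GeoJSON Feature, which
    # GeoDataFrame.from_features understands directly (properties
    # become columns, geometry becomes the geometry column).
    features = [item.to_dict() for item in item_collection]
    return gpd.GeoDataFrame.from_features(features, crs=crs)
```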
There's a bunch more to figure out, but hopefully that's enough to get started.