Single vs. multi CRS datasets #2

Open
benbovy opened this issue Dec 4, 2024 · 9 comments

Comments

@benbovy
Owner

benbovy commented Dec 4, 2024

This is a big question that may generate lots of discussion: should we allow only one CRS per xarray.Dataset / xarray.DataArray here, or should we support multiple CRSs?

Why support multiple CRSs?

Xproj relies on scalar Xarray coordinates with a CRSIndex. This is inspired by the CF-conventions, and this kind of coordinate is already used in tools like rioxarray and odc-geo. AFAIK the CF-conventions don't seem to restrict the number of grid mapping coordinate variables (see CF-conventions 1.10, section 5.6, although I might have overlooked it; EDIT: example 5.13 has both lat-lon geographic and x-y projected coordinate systems in the same dataset). There's no real technical barrier either to supporting multi-CRS with the current Xarray data model (coordinates + custom indexes).

Other Xarray extensions like xvec technically support multiple CRS (xvec currently encapsulates the CRS into the geometry coordinate / index). Although I’m not sure if multi-CRS vector data cubes exist and/or make sense, in theory there will be some friction in adopting xproj if the latter only works with single-CRS (breaking changes).

Single-CRS is easy to enforce in 3rd-party extensions. #1 provides a convenient API to work with single-CRS datasets or dataarrays, while still supporting the multi-CRS case.

Why might this be a bad idea?

Supporting multi-CRS datasets possibly opens a can of worms?

This is potentially confusing. I cannot imagine a DataArray representing a single raster (or a mosaic / stack of rasters) having multiple CRSs defined. I haven't checked but I guess that rioxarray and odc-geo won't work with multi-CRS (EDIT: I checked, and neither library supports multiple grid mapping attribute values). Although here again, single-CRS may be enforced in those libraries, and it is no big deal if xproj provides a user-friendly, single-CRS API (#1)?


Any thoughts on this?

@scottyhq
Collaborator

scottyhq commented Dec 4, 2024

I think a single CRS is the simpler conceptual data model and worth enforcing. While it may be possible to represent coordinates in multiple CRSs in a single object, I've yet to see a compelling use-case compared to the alternative of reprojecting coordinates to a new CRS during/after loading.

For vectors, the number of geometries doesn't change between CRSs (I think), but even so, libraries like geopandas operate under the assumption of a single 'active CRS' at a time.

For rectilinear rasters, the number of coordinates will likely change between CRSs (under the assumption of maintaining the same dx,dy resolution), so I don't see that fitting well with the Xarray data model.

> EDIT: example 5.13 has both lat-lon geographic and x-y projected coordinate systems in the same dataset

I've never actually come across a netCDF file that does this! Would be interested to know if others have? Currently, I think such a dataset would need to be opened with Xarray DataTree since the grid sizes are different.

@benbovy
Owner Author

benbovy commented Dec 4, 2024

> since the grid sizes are different

Hmm, I thought example 5.13 in the CF-conventions (version 1.11) would look like the following when loaded into an Xarray Dataset?

>>> ds
<xarray.Dataset>
Dimensions:      (x: 18, y: 36)
Coordinates:
  * x           (x) float64 ...
  * y           (y) float64 ...
    lat         (y, x) float64 ...
    lon         (y, x) float64 ...
    crs_wgs84   int64 0
  * crs_osgb    int64 0
Data variables:
    temp         (y, x) float64 ...
Indexes:
    x            PandasIndex
    y            PandasIndex
    crs_osgb     CRSIndex (crs=EPSG:27700)

This assumes a single-CRS model (using the projected CRS as the "active" one).
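To illustrate why lat and lon carry both dimensions (y, x) in the repr above while x and y stay 1-D, here is a minimal pure-Python sketch. It swaps in the spherical sinusoidal projection instead of the OSGB Transverse Mercator (an assumption made only to keep the inverse formula to two lines); the radius `R`, the grid values, and the helper name `sinusoidal_inverse` are all illustrative.

```python
import math

R = 6371000.0  # sphere radius in metres (assumed, for illustration only)


def sinusoidal_inverse(x, y):
    """Return (lat, lon) in degrees for spherical sinusoidal x/y in metres."""
    lat = y / R
    lon = x / (R * math.cos(lat))
    return math.degrees(lat), math.degrees(lon)


# A small rectilinear grid in the projected CRS: 1-D x and 1-D y.
xs = [0.0, 100_000.0, 200_000.0]
ys = [5_000_000.0, 5_100_000.0]

# lat/lon evaluated at every grid node, indexed [y][x] like lat(y, x), lon(y, x).
lat2d = [[sinusoidal_inverse(x, y)[0] for x in xs] for y in ys]
lon2d = [[sinusoidal_inverse(x, y)[1] for x in xs] for y in ys]

# The longitude at a fixed x changes from one row to the next, so it cannot
# be stored as a 1-D coordinate along x.
lon_varies_with_y = lon2d[0][2] != lon2d[1][2]
print(lon_varies_with_y)
```

With this particular projection only lon depends on both axes; with Transverse Mercator both lat and lon would, which is why both are declared as (y, x).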

Now assuming a multi-CRS model and a geographic Xarray index set for the lat/lon coordinates, it may look like:

>>> ds
<xarray.Dataset>
Dimensions:      (x: 18, y: 36)
Coordinates:
  * x           (x) float64 ...
  * y           (y) float64 ...
  * lat         (y, x) float64 ...
  * lon         (y, x) float64 ...
  * crs_wgs84   int64 0
  * crs_osgb    int64 0
Data variables:
    temp         (y, x) float64 ...
Indexes:
    x            PandasIndex
    y            PandasIndex
  ┌ lat          GeographyIndex
  └ lon
    crs_wgs84    CRSIndex (crs=EPSG:4326)
    crs_osgb     CRSIndex (crs=EPSG:27700)

(note: I plan to refactor xoak so that it provides such an Xarray index built from 2D spatial coordinates, see also xarray-contrib/xoak#34)

I find the latter example pretty illustrative and useful, actually (i.e., it allows selecting data either based on lat/lon or x/y coordinates). On a related note, I'm wondering if the reason why rioxarray doesn't support multiple grid mappings is because so far we haven't been able to do much with 2D spatial coordinates in Xarray?

I could imagine similar examples of vector data cubes where we can select data based on either planar or spherical geometries.

I agree that single-CRS is a simpler conceptual data model, and perhaps we could imagine the concept of "active" CRS for the example above? That said, as far as I understand the concept of "active" geometry column (and its CRS) in GeoPandas is rather specific to the dataframe model (often based on one index), which is different from the Xarray data model. The code below (API and behavior) looks pretty clear to me but sadly wouldn't be possible to write if we enforce single-CRS.

>>> ds.proj("crs_wgs84").crs
<Geographic 2D CRS: EPSG:4326>
...

>>> ds.proj("crs_osgb").crs
<Projected CRS: EPSG:27700>
...

>>> ds.proj.crs
ValueError: multiple CRSs found

>>> ds_latlon = ds.drop_vars(["x", "y", "crs_osgb"])

>>> ds_latlon.proj.crs
<Geographic 2D CRS: EPSG:4326>
...

>>>  # ... continue using `ds_latlon` in CRS-aware operations without any extra-step required

I'm sure a multi-CRS model would introduce some extra complexity for other things (re-projection API?), but I wonder if we can keep it under control.
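The accessor behaviour shown above can be mimicked with a toy class, just to make the lookup rules explicit. This is a rough sketch and not the xproj implementation: `ProjSketch`, its `drop` method, the use of plain strings as CRS values, and returning `None` when no CRS is present are all assumptions.

```python
class ProjSketch:
    """Toy stand-in for a `.proj` accessor (hypothetical, not xproj itself)."""

    def __init__(self, crs_coords):
        # Maps a spatial reference coordinate name to a CRS identifier string.
        self._crs_coords = dict(crs_coords)

    def __call__(self, name):
        # Explicit lookup by coordinate name, like ds.proj("crs_wgs84").
        if name not in self._crs_coords:
            raise KeyError(f"no CRS coordinate named {name!r}")
        return ProjSketch({name: self._crs_coords[name]})

    @property
    def crs(self):
        # Implicit lookup, only unambiguous when at most one CRS is present.
        if not self._crs_coords:
            return None  # assumption: the real accessor may behave differently
        if len(self._crs_coords) > 1:
            raise ValueError("multiple CRSs found")
        return next(iter(self._crs_coords.values()))

    def drop(self, *names):
        # Mimics dropping CRS coordinates from the dataset.
        kept = {k: v for k, v in self._crs_coords.items() if k not in names}
        return ProjSketch(kept)


proj = ProjSketch({"crs_wgs84": "EPSG:4326", "crs_osgb": "EPSG:27700"})
print(proj("crs_osgb").crs)       # explicit selection by name
print(proj.drop("crs_osgb").crs)  # a single CRS is left, so no name is needed
```

The point of the design is that multi-CRS objects never fail silently: code that assumes one CRS gets an error, and code that knows the coordinate name keeps working.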

@scottyhq
Collaborator

scottyhq commented Dec 5, 2024

Interesting! Thanks for clarifying with the example - I was coming at this from the more narrow view of rasters being represented by an affine and 1D coordinates. I also didn't look closely at the CF example and thought it was for storing 2 different data arrays in a single file (e.g. as different HDF groups of different sizes 18x36 and 648x648) 🤦...

> so far we haven't been able to do much with 2D spatial coordinates in Xarray. I could imagine similar examples of vector data cubes where we can select data based on either planar or spherical geometries.

This does seem neat, and I'm not opposed to leaving the door open and running with multi-CRS! Of course, if the data is not stored w/ multiCRS to begin with, there is the alternative approach of re-projecting the values/geometry used for selection. It would be good to identify some existing datasets that are stored like the example above - maybe climate model output or something in the xvec or xdggs realms?

@benbovy benbovy mentioned this issue Dec 5, 2024
@benbovy
Owner Author

benbovy commented Dec 5, 2024

> I also didn't look closely at the CF example and thought it was for storing 2 different data arrays in a single file (e.g. as different HDF groups of different sizes 18x36 and 648x648) 🤦...

Yes I agree the CF examples are a little bit confusing. It is mentioned that example 5.13 results from examples 5.11 and 5.12 combined together but in example 5.12 we have this:

dimensions:
  lat = 648 ;
  lon = 648 ;
  y = 18 ;
  x = 36 ;

I've relied on the dimensions of the coordinates x(x), y(y), lat(y, x) and lon(y, x) in example 5.13.

@benbovy
Owner Author

benbovy commented Dec 13, 2024

Interesting note I've found in geoxarray/geoxarray#21 (comment):

GDAL does not support grid_mapping defined as: "CRS: x y"
GDAL does not support grid_mapping with 2 CRS: "CRS: x y CRSWGS84: lat lon"

So in short GDAL's NetCDF driver doesn't seem to support multiple CRSs. I didn't find any reference about that in GDAL's documentation and repository, though.
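For reference, the two grid_mapping forms mentioned above (a bare variable name, and the extended "variable: coord coord ..." form from CF section 5.6) can be told apart with a few lines of parsing. The helper below is a minimal sketch; `parse_grid_mapping` is a hypothetical name and does not correspond to any GDAL, rioxarray, or xproj function.

```python
def parse_grid_mapping(value):
    """Parse a CF grid_mapping attribute into {grid mapping variable: [coords]}.

    The simple form ("crs") maps the variable name to an empty list.
    """
    tokens = value.split()
    if not tokens:
        raise ValueError("empty grid_mapping attribute")

    if not any(tok.endswith(":") for tok in tokens):
        # Simple form: a single grid mapping variable name.
        if len(tokens) != 1:
            raise ValueError(f"cannot parse grid_mapping {value!r}")
        return {tokens[0]: []}

    # Extended form: "var1: coord coord var2: coord coord".
    mapping = {}
    current = None
    for tok in tokens:
        if tok.endswith(":"):
            current = tok[:-1]
            mapping[current] = []
        elif current is None:
            raise ValueError(f"coordinate {tok!r} precedes any grid mapping name")
        else:
            mapping[current].append(tok)
    return mapping


parsed = parse_grid_mapping("crs_osgb: x y crs_wgs84: lat lon")
print(parsed)
print(len(parsed) > 1)  # True means the variable references more than one CRS
```

A single-CRS reader could use the length check at the end to reject or warn about such variables instead of silently picking one grid mapping.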

@benbovy
Owner Author

benbovy commented Dec 16, 2024

Brief summary on the current state of things:

  • CF-conventions allow multiple CRS per dataset, e.g., using a grid_mapping: crs_osgb: x y crs_wgs84: lat lon variable attribute (see also an example in Single vs. multi CRS datasets #2 (comment)). It is not clear whether this is often used by data providers, though.
  • "raster" Xarray extensions like rioxarray and odc-geo do not allow multiple CRS per Dataset or DataArray. GDAL's NetCDF driver doesn't seem to support that either.
  • The xvec "vector" Xarray extension in theory supports multiple CRS per Dataset or DataArray since the CRS is bound to a geometry coordinate. @martinfleis are there many cases where a vector data cube has multiple geometry coordinates (or a GeoDataFrame has multiple geometry columns) with different CRSs?

I'm still unsure whether xproj should adopt the single-CRS or the multi-CRS model. Why not the multi-CRS model, if it does not make things too complicated on the xproj side?

In either case, xproj might still be useful assuming that it provides a flexible, opt-in solution for both the maintainers of Xarray geospatial extensions and the end-users:

  • Dataset.proj.crs raises an error if multiple spatial reference coordinates are found. rioxarray and odc-geo could rely on this (they do similar checks when scanning the Dataset for grid_mapping encoding or attribute entries)
  • Maybe it is okay if xvec still maintains the CRS given per GeometryIndex?
    • Users are free to use xproj if they want to handle CRS using scalar coordinate variables, but it is their responsibility to ensure that all CRS are kept in sync across the dataset (via, e.g., Dataset.proj.map_crs() added in Interoperability with Xarray custom indexes #9).
    • We might want to have a specific Index.__proj_get_crs__() interface such that Dataset.proj.crs also looks at the CRS of Xarray custom indexes to check for unique CRS across the whole Dataset.
    • Dataset.proj.crs could be used as a fallback by xvec if no CRS is defined for any of the geometry coordinates. This is also useful for scalar coordinates (Don't set crs in attrs. xarray-contrib/xvec#71).
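The uniqueness check described in the list above could look roughly like the following. This is only a sketch of the idea: the signature of `__proj_get_crs__` (no arguments, returning a CRS or `None`) is an assumption, the two index classes and `unique_crs` are hypothetical, and CRSs are represented as plain strings rather than pyproj objects.

```python
class GeometryIndexSketch:
    """Hypothetical custom index that carries its own CRS."""

    def __init__(self, crs):
        self.crs = crs

    def __proj_get_crs__(self):
        # Assumed hook: expose the index CRS to the dataset-level check.
        return self.crs


class PlainIndexSketch:
    """Hypothetical index with no CRS and no hook (e.g. a time index)."""


def unique_crs(crs_coords, indexes):
    """Return the single CRS shared by CRS coordinates and CRS-aware indexes.

    crs_coords: {coordinate name: CRS} for scalar spatial reference coordinates.
    indexes: {coordinate name: index object}, scanned for `__proj_get_crs__`.
    """
    found = set(crs_coords.values())
    for index in indexes.values():
        getter = getattr(index, "__proj_get_crs__", None)
        if getter is None:
            continue  # the index is not CRS-aware, ignore it
        crs = getter()
        if crs is not None:
            found.add(crs)
    if len(found) > 1:
        raise ValueError(f"multiple CRSs found: {sorted(found)}")
    return next(iter(found), None)


indexes = {"geom": GeometryIndexSketch("EPSG:4326"), "time": PlainIndexSketch()}
print(unique_crs({"spatial_ref": "EPSG:4326"}, indexes))
```

Because the hook is looked up by duck typing, an extension such as xvec would not need to depend on xproj to take part in the check.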

@martinfleis

> @martinfleis are there many cases where a vector data cube has multiple geometry coordinates (or a GeoDataFrame has multiple geometry columns) with different CRSs?

I would not say it is too common, but there is certainly a pattern where you keep multiple geometry columns in a GeoDataFrame, each representing the same geometries in a different CRS. I've been doing that to run analysis in a projected CRS while keeping a 4326 or 3857 copy for visualisation with lonboard or folium, so that I don't have to reproject repeatedly. Not sure how common this would be in the vector data cube world, but I can imagine a similar pattern there.
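As a rough, library-free illustration of that pattern, the sketch below keeps the same points in two "columns", each tagged with its own CRS, so the reprojection happens once. It uses the spherical Web Mercator formulas for EPSG:3857; the names `to_web_mercator`, `table` and `crs_of` are hypothetical, and a real workflow would use GeoPandas rather than plain tuples.

```python
import math

R = 6378137.0  # WGS 84 semi-major axis, used as the sphere radius by EPSG:3857


def to_web_mercator(lon, lat):
    """Project lon/lat in degrees (EPSG:4326) to x/y in metres (EPSG:3857)."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


# The same three points stored twice, once per CRS.
points_4326 = [(0.0, 0.0), (4.35, 50.85), (-122.33, 47.61)]
table = {
    "geometry_4326": points_4326,
    "geometry_3857": [to_web_mercator(lon, lat) for lon, lat in points_4326],
}

# Each column is bound to its own CRS, which is the multi-CRS situation
# discussed in this thread.
crs_of = {"geometry_4326": "EPSG:4326", "geometry_3857": "EPSG:3857"}
print(crs_of)
```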

@benbovy
Owner Author

benbovy commented Dec 16, 2024

> We might want to have a specific Index.__proj_get_crs__() interface such that Dataset.proj.crs also looks at the CRS of Xarray custom indexes to check for unique CRS across the whole Dataset.

This has been implemented in #10.

@benbovy
Owner Author

benbovy commented Jan 8, 2025

I relaxed single-CRS enforcement in #18. We can still change our minds later.
