(Geo)Zarr sub-resources in /coverage, and CIS JSON #175

Open

jerstlouis opened this issue Sep 14, 2023 · 13 comments

jerstlouis commented Sep 14, 2023

Zarr is typically organized in the cloud as a directory structure of resource files, as opposed to a single file.

What does this mean for implementing this as a representation of /coverage?
Would the content then be sub-directories of /coverage?

What does an application/x-zarr negotiated media type typically return?

Does this sub-resource pattern mean that we should resurrect sub-resources in the context of CIS JSON, to require in addition to /coverage:

/coverage/domainset, /coverage/rangetype, /coverage/rangeset, and /coverage/metadata instead of the current profile= query parameter in that requirement class, but ONLY when CIS JSON support is declared in /conformance?
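
For illustration, a minimal sketch contrasting the two request styles (the server URL and collection id are hypothetical, not part of the specification):

```python
import requests

BASE = "https://example.com/ogcapi/collections/my-coverage"  # hypothetical endpoint

# Current approach: single /coverage resource with a profile= query parameter
r_profile = requests.get(f"{BASE}/coverage",
                         params={"f": "json", "profile": "domainset,rangetype"})

# Sub-resource approach under discussion: dedicated paths for CIS JSON components
r_subres = requests.get(f"{BASE}/coverage/domainset",
                        headers={"Accept": "application/json"})

print(r_profile.status_code, r_subres.status_code)
```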

@joanma747

How Zarr is internally organized should not be important. We should use the common parameters to extract information from the coverage, which are agnostic of the file format.

I'm not able to understand how the first sentence relates to the rest of the discussion and the resuscitation of "/coverage/".

If we request Zarr as a format, could we only use a MIME multipart response or a zip file (with or without compression) to support retrieving several files in one response? Both support file folders and byte arrays.


jerstlouis commented Sep 14, 2023

@joanma747 In general, I think multipart responses are a pain to deal with. Zip could be one option (though we would need a Zarr+zip media type), but I also think the whole idea of Zarr being cloud-friendly normally means mapping it to separate files (resources) that allow it to be served efficiently from object storage and individual parts of it to be accessed (somewhat similar to COG range requests). This might also be necessary to access it with e.g. Python xarray.
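
As a minimal sketch of what that would enable (hypothetical endpoint; assumes the server exposes the Zarr key/value entries as individual web resources with consolidated metadata, and that fsspec/aiohttp are installed):

```python
import fsspec
import xarray as xr

# Hypothetical /coverage resource exposed as a Zarr directory of web resources
store = fsspec.get_mapper(
    "https://example.com/ogcapi/collections/my-coverage/coverage")

# consolidated=True assumes a .zmetadata key is served alongside the chunk files
ds = xr.open_zarr(store, consolidated=True)
print(ds)
```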

Although the standard itself seems to say that the key/value store (file name / file content) can be implemented any which way, from https://wiki.earthdata.nasa.gov/display/ESO/Zarr+Format:

Zarr can be viewed as the cloud based version of HDF5/NetCDF files as it follows a similar data model. Zarr does not come in a single file as NetCDF or HDF5 does but as a directory with chunks of data in compressed binary files and metadata describing the binary content in external JSON files.

So based on all of this, I am wondering whether our (Geo)Zarr conformance class in Coverages should not be a single file, unlike other representations such as GeoTIFF or netCDF, but individual files (as defined by Zarr) inside a /coverage/ directory.

And I am drawing the parallel to CIS JSON: if we already introduce encoding-specific sub-resources, the desire to access only the "domainset" or "rangetype" property, which we had before and which I currently changed to a /coverage?f=json&profile=domainset,rangetype request, comes back into play.

(of course it could get messy in terms of describing those encoding-specific sub-resources in an OpenAPI definition, especially if there ends up being conflicting paths for different encodings).

@joanma747

I see.
I still think that:

We should use common parameters to extract information from the coverage that are agnostic of the file format.
So, requesting "files in a folder" is not the way WCS worked, and it should not be the way OGC API - Coverages works.

If you want to implement Zarr as "files in a folder", then you should define an OGC Zarr API (which is not OGC API - Coverages) and do

/collections/{collectionId}/Zarr/file

I have the same opinion about COG. We should not serve a COG in an OGC API - Coverages. COG uses HTTP range requests and the client is in control of the byte traffic. It does not require an API, and adding an API only gets in the way of the original idea.

@jerstlouis

@joanma747 About COG see #93 (comment) :)

For COG, it's rather easy to add support for HTTP range requests to a /coverage GeoTIFF representation (and it cannot conflict with other encoding resources, since it does not define new resources).
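
As a rough sketch of what such a client interaction looks like (hypothetical URL; the 206 response assumes the server supports range requests on /coverage):

```python
import requests

COVERAGE_URL = "https://example.com/ogcapi/collections/my-coverage/coverage"  # hypothetical

# Fetch only the first 64 KiB (headers/IFDs), as a COG-aware client would
# before issuing further targeted range requests for the tiles it needs.
resp = requests.get(COVERAGE_URL,
                    headers={"Accept": "image/tiff; application=geotiff",
                             "Range": "bytes=0-65535"})
print(resp.status_code)  # 206 Partial Content if ranges are supported
```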

For Zarr as files in a folder, I am not convinced one way or the other.

If this is what could allow pointing a Python xarray client to an OGC API - Coverages /coverage resource, I think it is a very big argument "for" (while avoiding having to define a competing separate API), but there are certainly valid arguments against in terms of "messing up" the clean resource representation defined by OGC API - Coverages.

@joanma747

it's rather easy to add support for HTTP range to a /coverage
That is true, but does this make sense?

The whole purpose of OGC API - Coverages is to forget about the internal structure of the coverage and request the data based on geospatial and other filters.

If I have to build a client that transforms all of that into byte positions before I do an HTTP range request, where is the value of OGC API - Coverages? It is simply better to forget about it and consider that HTTP range is your protocol. No API needed.


jerstlouis commented Sep 14, 2023

@joanma747 The value / sense in that is being able to support, at the same OGC API endpoint, both typical OGC API - Coverages clients that implement parameters like subset and scale-factor (and/or coverage tiles) and clients that just understand COG. In my opinion, that makes data easier to find by having everything in one place, one API that supports both use cases, and is somewhat of an answer to justify OGC APIs to those asking "What's the point of OGC APIs? We don't need an API, we have COG".

It would be the same idea for supporting a Zarr directory to which you can point a Python xarray client at /collections/{collectionId}/coverage.
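
For example, the same hypothetical /coverage endpoint used in the range-request sketch above could also serve a typical Coverages client (the axis names in the subset parameter are assumptions for this sketch):

```python
import requests

COVERAGE_URL = "https://example.com/ogcapi/collections/my-coverage/coverage"  # hypothetical

# A typical OGC API - Coverages request: trim to a bounding box and downsample
r = requests.get(COVERAGE_URL,
                 params={"subset": "Lat(40:50),Lon(-80:-70)", "scale-factor": "2"},
                 headers={"Accept": "image/tiff; application=geotiff"})
print(r.status_code)
```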

@cnreediii

@joanma747 I am not sure having separate APIs for specific encodings/formats is a good idea. For example, a CDB data store could use a UNIX file system (the traditional CDB approach) in which content is structured in a hierarchy of file folders (based on tiling/LoD rules). A variety of formats/encodings are used and more will be added. As a developer writing clients that access a CDB data store, I want to simply have one API that accesses any vector data type and one API that accesses any coverage type.

Actually, I would really like one API that rules them all :-) so that I can simply say "Give me content in this geographic area with appropriate metadata so I can then process it further". Unfortunately, the OGC API design/architecture does not support this!

@chris-little

Actually, I would really like one API that rules them all :-) so that I can simply say "Give me content in this geographic area with appropriate metadata so I can then process it further"
@cnreediii Doesn't API-EDR do that?


chris-little commented Sep 15, 2023

@jerstlouis, @joanma747, @cnreediii The key issue should not be the detailed format and structure in the cloud or on disks, but the metadata that is exposed for processing - this should be the same and consistent across OGC APIs.

PS: NetCDF3 and NetCDF4/HDF5 have completely different internal structures - the first is a multidimensional array, the latter a hierarchy of objects (which may be multidimensional arrays). The OGC APIs should hide this.

@joanma747

What is the mechanism to know the internal file structure in the first place? Without knowing it, you cannot request individual chunk files by name.

And then the client would have to be "aware" of this structure and of which "chunk" corresponds to which spatial area. The Zarr structure should not change when you do subsetting (a new Zarr should not be created), so in practice, when requesting individual sub-files you would not use subsetting or scaling parameters.
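
To illustrate the point, a minimal sketch of what a client would first have to do with (hypothetical) Zarr v2 .zarray metadata before it could name the chunk files it wants:

```python
import json

# Hypothetical .zarray metadata the client would have to fetch and parse first
zarray = json.loads("""{
  "zarr_format": 2,
  "shape": [3600, 7200],
  "chunks": [600, 600],
  "dtype": "<f4",
  "order": "C",
  "compressor": {"id": "zlib", "level": 5},
  "fill_value": null,
  "filters": null
}""")

def chunk_key(index):
    """Map an array index (row, col) to the Zarr v2 chunk file name."""
    return ".".join(str(i // c) for i, c in zip(index, zarray["chunks"]))

# Which chunk file holds element (1000, 3000)?
print(chunk_key((1000, 3000)))  # -> "1.5"
```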

@cnreediii

@joanma747 Indeed, that is the question! Back in the day, when I helped design and implement a GIS API, we had a simple call to the server that asked what formats/encodings the server supported for output. We used a controlled vocabulary, so the server provided a list of one or more formats and the client then "knew" what could be returned. Sort of sounds like some of the W*S Standards :-) Whether the returned format/encoding conformed to the rules of a given format, such as IGES or SIF, was a different question :-)

@jerstlouis

SWG 2024-01-24: Since we do not have enough feedback / experience on how best to implement Zarr as a coverage representation in OGC API - Coverages, and given the uncertainty about the usefulness of either internal or multi-web-resource space-partitioning blocks, which could potentially conflict or be redundant with the subsetting mechanisms, I propose to remove the Zarr requirement class from Part 1, with the option to add it back later if we receive more feedback.

Alternatively, if anyone has input on Zarr and would like to propose how best to go about it, please discuss it in this issue or at the upcoming Code Sprint, February 13-15: https://github.com/opengeospatial/ogcapi-coverages/wiki/February-2024-OGC-API-%E2%80%90-Coverages-Virtual-Code-Sprint

@jerstlouis

SWG 2024-08-21: We now have a draft Zarr requirement class using a Zip container file for the /coverage response, with the internal Zarr structure contained within it.
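
A minimal sketch of how a client might consume such a response (hypothetical endpoint; the application/zarr+zip media type and the zarr-python 2.x ZipStore API are assumptions here, not what the draft specifies):

```python
import requests
import xarray as xr
import zarr

COVERAGE_URL = "https://example.com/ogcapi/collections/my-coverage/coverage"  # hypothetical

# Download the zipped Zarr representation (media type is an assumption)
resp = requests.get(COVERAGE_URL, headers={"Accept": "application/zarr+zip"})
with open("coverage.zarr.zip", "wb") as f:
    f.write(resp.content)

# Open the Zip container directly as a Zarr store (zarr-python 2.x API)
store = zarr.ZipStore("coverage.zarr.zip", mode="r")
ds = xr.open_zarr(store, consolidated=False)
print(ds)
```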

Please provide feedback on this approach if you plan on implementing Zarr in your Coverages implementation.
