Advertising data completeness #359

Closed
tpowellmeto opened this issue Mar 25, 2022 · 12 comments
Labels
API-EDR V1.3 enhancement New feature or request

Comments

@tpowellmeto

In many circumstances data can be incomplete. Users’ definition of completeness is based upon their specific use case and is therefore unlikely to be common.

EDR uses an extent object to describe the available axes from which a user must formulate their subset (cube) request. In scenarios where there are multiple axes described by interval arrays, e.g. vertical & temporal, it is impossible to determine where data is available and where it isn’t.

For example, consider how to accurately represent the following data availability whilst constrained to using only two lists:

            temporal_1  temporal_2  temporal_3
vertical_1  1           1           1
vertical_2  1           0           0
vertical_3  1           0           1
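
A minimal sketch of the problem, using the availability values from the table above: two independent axis lists can only advertise the full cross product of their values, so the gaps cannot be expressed. The variable names here are illustrative only and are not part of the EDR spec.

```python
from itertools import product

# The two lists an extent object can advertise (labels taken from the table above).
vertical = ["vertical_1", "vertical_2", "vertical_3"]
temporal = ["temporal_1", "temporal_2", "temporal_3"]

# Actual availability: rows are vertical levels, columns are time steps.
available = [
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 1],
]

# Two independent lists imply every combination is available.
advertised = set(product(vertical, temporal))

# The combinations that really hold data.
actual = {
    (v, t)
    for i, v in enumerate(vertical)
    for j, t in enumerate(temporal)
    if available[i][j] == 1
}

# Combinations a client would believe are available but are not.
missing = sorted(advertised - actual)
print(len(advertised), len(actual), missing)
```

Dropping vertical_2 or temporal_2 from the lists would hide data that does exist, so neither list can be trimmed to match reality.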

Given the above, there is a burden on data publishers to ensure data is ‘complete’ prior to publishing ‘interval’ labels. This forces data publishers to impose their own view of completeness on users, thus limiting timely access to data. In the case of fault/error there is a risk that a ‘complete’ state may never be achieved.

Suggested Resolution
Extend the EDR spec to introduce a data_mask object.
The object itself is an optional property belonging to the parameter object.
The object has two required sub properties:
• order - the order of the axes as they appear in the mask
• mask - a multidimensional array describing where data is available (1) and where it is not (0).
A convention could be established whereby if no data_mask is provided then it is assumed all data is available (as is the case now).
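
As a rough illustration of what this could look like, here is a sketch of a parameter object carrying a data_mask, plus a lookup that follows the suggested convention. Only the names data_mask, order and mask come from the suggestion above; the parameter ids, the axis names "vertical" and "temporal", and the helper is_available are assumptions made for the example.

```python
from typing import Any, Mapping

# Hypothetical parameter object; "id" and the axis names are assumptions.
parameter = {
    "id": "air_temperature",
    "data_mask": {
        # Order of the axes as they appear in the mask (outermost first).
        "order": ["vertical", "temporal"],
        # 1 = data available, 0 = data not available.
        "mask": [
            [1, 1, 1],
            [1, 0, 0],
            [1, 0, 1],
        ],
    },
}


def is_available(parameter: Mapping[str, Any], position: Mapping[str, int]) -> bool:
    """Return True if data is advertised at the given index on each axis."""
    data_mask = parameter.get("data_mask")
    if data_mask is None:
        # Suggested convention: no data_mask means all data is assumed available.
        return True
    cell = data_mask["mask"]
    # Walk into the nested array one axis at a time, in the declared order.
    for axis in data_mask["order"]:
        cell = cell[position[axis]]
    return cell == 1


print(is_available(parameter, {"vertical": 1, "temporal": 0}))
print(is_available(parameter, {"vertical": 1, "temporal": 2}))
print(is_available({"id": "wind_speed"}, {"vertical": 1, "temporal": 2}))
```

Because the lookup is driven by order, the same helper would work for masks with more than two axes.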

I'd be happy to draft a proposal for this if this is the correct next step. :)

@m-burgoyne m-burgoyne added the enhancement New feature or request label Mar 30, 2022
@m-burgoyne
Collaborator

A data mask might also be useful when presenting information about data archives. There may be time intervals for which a subset of the parameters in a collection is intentionally missing or was unavailable, and a data mask would give publishers a mechanism to advertise the completeness of the archive.

@chris-little
Contributor

chris-little commented Apr 14, 2022

@tpowellmeto @m-burgoyne Do not forget that the presumption in the API-EDR is that the use case is for data that are generally dense, not sparse, and a query is most likely to return data rather than an HTTP error code.
I question whether it is a sensible choice by the data service provider to expose a collection where the data is often absent. Why not use an async mechanism to say "wait a bit then try again"?

@tpowellmeto
Author

@chris-little

"I question whether it is a sensible choice by the data service provider to expose a collection where the data is often absent."
We should be empowering service providers to group data in meaningful ways, not constraining them with arbitrary rules on data density.

Whilst data may be dense when 'complete', in many situations it arrives piecemeal. Being able to accurately describe what is present, and when, is a mechanism for allowing users to access just the data they need in a timely way, whilst telling other users to "wait a bit then try again" without responding with an HTTP error code.

@m-burgoyne
Collaborator

@chris-little As long as the data_mask object is optional and only published at the /collections/{collection_id} level, it could reduce the overhead on a data publishing server by reducing the number of requests clients make to it. Ideally, a server would only return error messages when the user makes an invalid request. In the example that @tpowellmeto gives, the user would not be making an invalid request based on the information provided by the server.
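
To make the request-saving argument concrete, here is a sketch of how a client might use a mask it has already read from the collection metadata to decide what to ask for now and what to retry later. The function plan_requests and the index-pair representation are hypothetical, and the mask values are the ones from the example table earlier in this issue.

```python
from typing import Iterable, List, Tuple

Cell = Tuple[int, int]  # (vertical index, temporal index)


def plan_requests(
    wanted: Iterable[Cell], mask: List[List[int]]
) -> Tuple[List[Cell], List[Cell]]:
    """Split wanted cells into those advertised as available and those to retry."""
    request_now: List[Cell] = []
    retry_later: List[Cell] = []
    for v, t in wanted:
        if mask[v][t] == 1:
            request_now.append((v, t))
        else:
            retry_later.append((v, t))
    return request_now, retry_later


# Mask as it might be published at the collection level (values from the example table).
mask = [
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 1],
]

# A client interested in every vertical level at the last time step.
wanted = [(0, 2), (1, 2), (2, 2)]
request_now, retry_later = plan_requests(wanted, mask)
print(request_now)
print(retry_later)
```

Only the cells in request_now would be sent to the server, so no request is made for data that is already known to be absent.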

@chris-little
Contributor

@m-burgoyne @tpowellmeto Well, let's try it to see how effective it is for the described use cases. I worry that we are encouraging over-complicated systems and carrying forward undesirable legacies of WMO GRIB and BUFR and creating unnecessary future technical debt. If it works well in practice, we can then standardise it. Meanwhile, it is a good candidate to add to the Best Practice for API-EDR for Meteorology.

@chris-little chris-little added the API EDR V1.2 Non-breaking change for Version 1.2 label Jul 28, 2022
@chris-little
Contributor

EDR API SWG 81 encourages implementors to build a proof-of-concept. Provisionally tagged for V1.2.

@iandruska-ibl
Collaborator

iandruska-ibl commented Aug 17, 2022

If we decide to add data masks, I believe we should remove the extent property at the parameter level. With data masks it becomes unnecessary and would only cause confusion about whether the data mask applies to the extent at the collection level or the one at the parameter level.

@chris-little
Contributor

This may cause an incompatibility with API Coverages as they have put a lot of effort into extents of various kinds.
@jerstlouis could you comment please?

@jerstlouis
Member

@chris-little I don't think this use case of different domain / extent / envelope per field is covered in Coverages / CIS, but it is somewhat related to a suggestion to be able to return a different domain (different resolution and/or envelope) when requesting specific fields (range subsetting).

As a general point, I still hope that we can eventually harmonize parameter names / Features properties schemas / CIS range type; and collection extents / CIS DomainSet :)

@chris-little
Contributor

Discussed at the EDR API SWG on 2023-11-23: this needs wider review from implementers and users.

@m-burgoyne m-burgoyne linked a pull request Mar 21, 2024 that will close this issue
@chris-little chris-little added API-EDR V1.3 and removed API EDR V1.2 Non-breaking change for Version 1.2 labels Jun 19, 2024
@chris-little
Contributor

Current thinking is that this should be addressed by API-EDR Part 2: PubSub. Anything finer grained, such as this suggestion, is an unnecessary complication. There may be a problem with downstream legacy systems.

@chris-little
Contributor

After the EDR API SWG 123 meeting on 31 Oct 2024, the publication of OGC API-EDR Part 2: Publish-Subscribe Workflow on 2024-09-23, and pending any practical implementations and experience of data masks and granularity of data resources, I propose to close this issue and associated PR #470.

5 participants