Question about FORMAT specifier's purpose and usefulness in a DataLink-dominated world #7

Open
gpdf opened this issue May 21, 2024 · 7 comments

Comments

@gpdf

gpdf commented May 21, 2024

Colloquially, the FORMAT parameter is intended to apply a constraint to the persistence format of the data product associated with a row in a conceptual underlying ObsCore table. (Obviously there doesn't have to be such a table in order to implement SIAv2/DAP, but the standard is written with reference to a table model.)

From a user perspective, this is clearly meant to enable, for instance, limiting a query to data in FITS format. In the DAP era where tabular datasets may be available from a service, a user might say "I'm only interested in Parquet".

The way the standard is written, though, it's clear that FORMAT is meant to be evaluated against the value of access_format in the query response. In many archives (CADC, Rubin, probably at least parts of IRSA in the future), however, we have adopted the "DataLink model" for providing ObsTAP/SIAv2 dataset access, in which the actual access_format value is always the DataLink MIME type.

The standard even acknowledges this:

This column describes the format of the response from the access_url (see 3.1.3) so the values could be data file types (e.g. application/fits) or they could be the DataLink MIME type.

It seems like we've accidentally shot ourselves in the foot here. No non-IVOA-aware science user would be expected to know about the "DataLink model" -- they go to Firefly, say, do a query, and, if possible, Firefly will show them the #this target from the DataLink links service, rather than making them navigate the indirection on their own. Unless they deliberately click on the part of the UI that lets them see the links response and any associated additional datasets, they won't be aware of DataLink at all. That's a good thing.

So in this situation if someone does FORMAT=fits they are likely to be very surprised by the results.
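To make the surprise concrete, here is a minimal sketch of what such a query sees on a "DataLink model" archive. The rows, the `matches_format` helper, and the case-insensitive substring matching are all assumptions for illustration; the exact matching rule in the standard may differ.

```python
# DataLink MIME type, as used in access_format by "DataLink model" archives.
DATALINK_MIME = "application/x-votable+xml;content=datalink"

# Hypothetical ObsCore-like rows: every row points at a links service.
rows = [
    {"obs_id": "a", "access_format": DATALINK_MIME},
    {"obs_id": "b", "access_format": DATALINK_MIME},
]

def matches_format(row, fmt):
    # Assumed matching rule: case-insensitive substring of access_format.
    return fmt.lower() in row["access_format"].lower()

# FORMAT=fits is evaluated against access_format, so nothing matches,
# even if every #this link behind these rows is a FITS file.
fits_rows = [r for r in rows if matches_format(r, "fits")]
print(len(fits_rows))
```

Under these assumptions the user gets an empty result, although the archive may hold nothing but FITS images.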

I realize the difficulties involved in potentially prying FORMAT off its mandatory link to access_format, but I think it would be worth our having a conversation about whether an interpretation "if access_format is the DataLink MIME type, then evaluate the restriction against the content_type for the #this entry in the resulting links table" would be sustainable, at least as an option.

I recognize that this might require data publishers to add a column to the underlying table to make sure that the "real" data type is efficiently queryable.
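A rough sketch of that alternative interpretation follows. The `this_content_type` column is a hypothetical name for the extra column a publisher might add, and the substring matching is again an assumption rather than anything the standard specifies.

```python
DATALINK_MIME = "application/x-votable+xml;content=datalink"

def effective_format(row):
    # If the row points at a links service, use the content_type of its
    # #this entry (stored here in a hypothetical extra column); otherwise
    # fall back to access_format itself.
    if row["access_format"] == DATALINK_MIME:
        return row.get("this_content_type") or row["access_format"]
    return row["access_format"]

def select_by_format(rows, fmt):
    # Assumed matching rule: case-insensitive substring match.
    fmt = fmt.lower()
    return [r for r in rows if fmt in effective_format(r).lower()]

# Illustrative rows only; the obs_id values and MIME types are examples.
rows = [
    {"obs_id": "a", "access_format": DATALINK_MIME,
     "this_content_type": "application/fits"},
    {"obs_id": "b", "access_format": DATALINK_MIME,
     "this_content_type": "application/vnd.apache.parquet"},
    {"obs_id": "c", "access_format": "image/fits"},
]

selected = select_by_format(rows, "fits")
print([r["obs_id"] for r in selected])
```

Keeping the "real" type in the table itself means the service does not have to call the links service for every candidate row just to evaluate the constraint.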

FORMAT seems sufficiently useful that it's a shame to, in effect, be forced to lose its usefulness in exchange for all the other big advantages of the "DataLink model". For Rubin (IMO) it's still a worthwhile tradeoff if we can't fix this, but... let's try to think this through and fix it.

My guess is that this must have been discussed before, but I haven't found the trail yet.

@msdemlei

msdemlei commented May 22, 2024 via email

@pdowler
Collaborator

pdowler commented Oct 9, 2024

I would support dropping it for same reasons expressed above: not actually useful and leads to surprising results.

@gpdf
Author

gpdf commented Oct 9, 2024

OK, I'm actually pretty glad to hear you both say this.

The recommendation to developers of interactive clients, then, might be "if #this, or any other entry in a DataLink table, appears in multiple rows differing only by content-type, the client may wish to display to the user that they have a choice of available data formats for retrieving the data".

(Noting that this is not about "science FITS vs. preview JPEG" but more about "science FITS vs. science HDF5" or, say, "VOTable-TABLEDATA vs. Parquet" -- choices of content-equivalent representations.)
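A client could detect such choices roughly as follows. This is only a sketch: links are modelled as plain dicts with the DataLink `ID`, `semantics`, and `content_type` fields, the sample values are invented, and grouping on `(ID, semantics)` is my assumption about what "differing only by content-type" would mean in practice, since the access_url will normally differ too.

```python
from collections import defaultdict

def format_choices(links):
    # Collect the distinct content types offered for each (ID, semantics)
    # pair, preserving the order in which they appear in the links table.
    by_target = defaultdict(list)
    for link in links:
        key = (link["ID"], link["semantics"])
        if link["content_type"] not in by_target[key]:
            by_target[key].append(link["content_type"])
    # Only targets with more than one format give the user a real choice.
    return {k: v for k, v in by_target.items() if len(v) > 1}

# Hypothetical links response for a single dataset.
links = [
    {"ID": "ivo://example/obs1", "semantics": "#this",
     "content_type": "application/fits"},
    {"ID": "ivo://example/obs1", "semantics": "#this",
     "content_type": "application/x-hdf5"},
    {"ID": "ivo://example/obs1", "semantics": "#preview",
     "content_type": "image/jpeg"},
]

choices = format_choices(links)
print(choices)
```

Here only the #this entry is reported, with its two formats; the preview has a single representation and is left alone.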

@gpdf
Author

gpdf commented Oct 9, 2024

How do we get this rolling? Is this a deprecation warning to be added to the next SIA 2.x, along with an explanatory sentence or three?

Or do we view FORMAT as permanently part of SIAv2, and we just start DAP without it from the very beginning?

@gmantele

I have never used or implemented SIA, so my opinion may not be the best here. But I would tend to say, let's start DAP without the FORMAT constraint. Services are still free to implement it if they want to, but it won't be standard. If at some point it seems important to have it (which does not seem to be the case from what all of you say), then you'll be able to add it to DAP. In standards, it is often easier to add than to delete.

@pdowler
Collaborator

pdowler commented Oct 10, 2024

Yes, I think we would just drop FORMAT from DAP entirely. Maybe we don't need to do anything with SIAv2... I can't really think of an erratum that would be at all helpful.

@gpdf
Author

gpdf commented Oct 10, 2024

That sounds reasonable to me.

Regarding SIAv2: The message one would like to convey is "On some archives, FORMAT is unlikely to do what you think it should do" but it almost seems like that's something that should be in client documentation rather than in the standard. If we were going to issue a new SIAv2.* for some reason, I can think of some better wording, but I don't think issuing an erratum (which we don't fold back into the source document anyway) is going to get the message out effectively.


4 participants