I'm conscious that this is not exclusively a Giftless issue, but it is at least in part related to it.
Context
I'm trying to get CKAN view plugins to play nice with blob_storage + Giftless hosted data files. These are basically JS clients that download the data and parse it to render it in the browser (e.g. a PDF viewer, or a Leaflet map for GeoJSON files). In order for these to work properly we need:
- That the Content-Disposition header is not set, or is not set to attachment
- That the Content-Type header is properly set to the actual MIME type of the file
The way we've approached this in other, simpler cloud-based storage backends like ckanext-s3filestore is:
- The view plugin requests the data file (the CKAN download endpoint) with a preview=1 query parameter. Based on this, the storage plugin decides whether or not to add the Content-Disposition header
- During upload, the storage plugin stores the content type from the resource metadata as metadata on the blob
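The two steps above can be sketched roughly as follows. This is a hypothetical illustration, not actual ckanext-s3filestore code; the function name and parameters are my own:

```python
def download_headers(filename, content_type, preview=False):
    """Build response headers for a data file request.

    When preview=1 was passed (the view plugin reading the file), no
    Content-Disposition is set, so the browser renders the data inline.
    Otherwise we force a download dialog with the original filename.
    """
    headers = {'Content-Type': content_type or 'application/octet-stream'}
    if not preview:
        headers['Content-Disposition'] = f'attachment; filename="{filename}"'
    return headers
```

The content type passed in would come from the resource metadata stored on the blob at upload time.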
Content-disposition
In the Giftless case, it is Giftless that adds the Content-Disposition header (when streaming, or when getting the download URL from Azure) whenever a filename is provided. This makes sense, so I'll try to make changes at the ckanext-blob-storage level so that the filename is not sent to Giftless when we want to read the data file rather than download it.
Content-type
@shevron In #80 (comment) you mention:
> Supporting this in ExternalStorage depends on the storage backend, because it is the backend that sends the HTTP reply.
> I know we implemented support for content-type preservation / pre-setting in Azure / GCP but not all backends may support this.
I assume you mean something like setting a property on the remote blob that stores the Content-Type, as described before for S3. Looking at the azure module I don't see anywhere that setting extra properties when uploading the blob is supported. Is this implemented somewhere else, or was it not implemented for Azure?
For local storage, would something like this be acceptable?
```diff
diff --git a/giftless/transfer/basic_streaming.py b/giftless/transfer/basic_streaming.py
index 0557f88..5946a35 100644
--- a/giftless/transfer/basic_streaming.py
+++ b/giftless/transfer/basic_streaming.py
@@ -12,6 +12,7 @@ from typing import Any, Dict, Optional
 from flask import Response, request, url_for
 from flask_classful import route
 from webargs.flaskparser import parser  # type: ignore
+import magic
 
 from giftless.auth.identity import Permission
 from giftless.exc import InvalidPayload, NotFound
@@ -87,6 +88,8 @@ class ObjectsView(BaseView):
         if self.storage.exists(path, oid):
             file = self.storage.get(path, oid)
+            media_type = magic.from_buffer(file.read(2048), mime=True)
+            headers['Content-Type'] = media_type
             return Response(file, direct_passthrough=True, status=200, headers=headers)
         else:
             raise NotFound("The object was not found")
```
Or would you prefer that the client somehow provided (via a query param?) the preferred content type?
I agree with the approach for Content-disposition.
As for Content-Type, I think the approach you mention (for streaming storage backends) is generally correct, but here are some additional thoughts:
We should probably add a method in StreamingStorage that gives us the file's content type, and call that from the code you suggested in basic_streaming. While this complicates the interface a bit, it allows us to vary the behavior per storage backend: for local files we can implement magic-based detection, while for files streamed off Azure / GCP / AWS we can grab the Content-Type from the cloud provider, if we have it, and perhaps fall back to either a static default (application/octet-stream?) or magic-based detection if the cloud provider doesn't know the content type.
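As a rough sketch of that interface idea — the class and method names here are my guesses, not Giftless' actual API, and the hard-coded PDF sniffing merely stands in for real magic-based detection:

```python
class StreamingStorage:
    """Simplified stand-in for Giftless' streaming storage base class."""

    def get_mime_type(self, prefix: str, oid: str) -> str:
        # Backends that don't know better fall back to a static default.
        return "application/octet-stream"


class LocalStorage(StreamingStorage):
    """Local backend: detect the type from the file's leading bytes.

    A real implementation would call magic.from_buffer() on the first
    couple of KB; here we sniff one magic number by hand to stay
    dependency-free.
    """

    def __init__(self, files: dict):
        self._files = files  # oid -> bytes, standing in for disk access

    def get_mime_type(self, prefix: str, oid: str) -> str:
        head = self._files.get(oid, b"")[:2048]
        if head.startswith(b"%PDF-"):
            return "application/pdf"
        return super().get_mime_type(prefix, oid)
```

A cloud-backed subclass would instead return the Content-Type stored on the blob, falling back to the base class default when the provider doesn't have one.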
If we do it the way you suggested, take special care not to read the first 2k bytes from file and then pass it as-is to the response — in some situations this can cause the response to be missing those first 2k bytes. And not all storage adapters return a seekable stream, so doing file.seek(0) before returning may also not work... We need to be more clever here. But maybe we don't need this at all if we go with the approach mentioned in (1).
I'm trying to think if this should be configurable or not, and if there should be a configurable fallback value. In the transfer adapter, we can do something like "if the storage backend doesn't give us content-type, use the configured default" or maybe we can just safely always fall back to application/octet-stream. Maybe having this as a configurable option is overkill.
Another thought: maybe it's worthwhile to split providing the file name from another parameter that says "no content-disposition please". This would allow Giftless to know the file name, and use it for content-type detection via the built-in mimetypes module. Not as good as libmagic I suppose, but it also means fewer dependencies, so maybe there is an advantage. Just a thought.
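Filename-based detection with the stdlib is indeed just a couple of lines — a minimal sketch, with the function name and default being my own choices:

```python
import mimetypes


def guess_content_type(filename, default="application/octet-stream"):
    """Guess a Content-Type from the file name alone, with no file access,
    using only the stdlib mimetypes module."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default
```

Unlike libmagic this only looks at the extension, so unknown or missing extensions fall through to the default.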
@shevron On 1 and 2 let me know if I captured what you suggest in fdec329. I followed your suggestion in the last comment of relying on filename which makes perfect sense (so we'll have a separate disposition=inline parameter or whatever)
Note that for Azure (and other cloud backends) we need to set the content type on the blob itself, as clients are just redirected directly to it. I've done that in e7b1a38, but right now I can't test it because apparently we are not sending the filename when uploading a file, only when downloading. I need to find where in the puzzle this is done (datapub?)
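For reference, with the azure-storage-blob v12 SDK the blob's served Content-Type is set at upload time via a ContentSettings object — a sketch under that assumption, with the helper names being hypothetical:

```python
import mimetypes

from azure.storage.blob import ContentSettings


def guess_mime(filename):
    """Extension-based guess, falling back to a generic binary type."""
    return mimetypes.guess_type(filename)[0] or "application/octet-stream"


def upload_with_content_type(blob_client, data, filename):
    """Upload a blob so Azure itself serves it with the right Content-Type.

    This matters because download clients are redirected straight to the
    blob URL, bypassing Giftless entirely.
    """
    blob_client.upload_blob(
        data,
        overwrite=True,
        content_settings=ContentSettings(content_type=guess_mime(filename)),
    )
```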
Let me know if I'm going in the right direction with #90. If it looks good I'll clean it up and try to add tests
I think it looks good, and definitely in the right direction. I left a few minor comments on the PR.
As for downloading / uploading - download is handled by ckanext-blob-storage (via giftless-client), see the get_resource_download_spec action. Uploading is done via datapub (IIRC) but you can see upload code in giftless-client as well.