I'm conscious that this is not exclusively a Giftless issue, but it is at least in part related to it.
Context
I'm trying to get CKAN view plugins to play nice with blob_storage + Giftless hosted data files. These are basically JS clients that download the data and parse it to render it in the browser (e.g. a PDF viewer, or a Leaflet map for GeoJSON files). In order for these to work properly we need:
- That the Content-Disposition header is not set, or is not set to attachment
- That the Content-Type header is properly set to the actual MIME type of the file
The way we've approached this in other, simpler cloud-based storage backends like ckanext-s3filestore is:
- The view plugin requests the data file (the CKAN download endpoint) with a preview=1 query parameter. Based on this, the storage plugin decides whether or not to add the Content-Disposition header
- During upload, the storage plugin stores the content type from the resource metadata as metadata on the blob
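The two steps above can be sketched roughly as follows. This is a hypothetical illustration, not actual ckanext-s3filestore code; the function name and parameters are my own:

```python
def download_headers(filename, content_type, preview=False):
    """Build response headers for a data file request.

    When preview=1 was passed (the view plugin reading the file), no
    Content-Disposition is set, so the browser renders the data inline.
    Otherwise we force a download dialog with the original filename.
    """
    headers = {'Content-Type': content_type or 'application/octet-stream'}
    if not preview:
        headers['Content-Disposition'] = f'attachment; filename="{filename}"'
    return headers
```

The content type passed in would come from the resource metadata stored on the blob at upload time.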
Content-disposition
In the Giftless case, it is Giftless that adds the Content-Disposition header (when streaming, or when getting the download URL from Azure) whenever a filename is provided. This makes sense, so I'll try to make changes at the ckanext-blob-storage level so that the filename is not sent to Giftless when we want to read the data file rather than download it.
Content-type
@shevron In #80 (comment) you mention:
> Supporting this in ExternalStorage depends on the storage backend, because it is the backend that sends the HTTP reply.
> I know we implemented support for content-type preservation / pre-setting in Azure / GCP but not all backends may support this.
I assume you mean something like setting a property on the remote blob that stores the Content-Type, as described before for S3. Looking at the azure module I don't see anywhere that setting extra properties when uploading the blob is supported. Is this implemented somewhere else, or was it not implemented for Azure?
For local storage, would something like this be acceptable?
```diff
diff --git a/giftless/transfer/basic_streaming.py b/giftless/transfer/basic_streaming.py
index 0557f88..5946a35 100644
--- a/giftless/transfer/basic_streaming.py
+++ b/giftless/transfer/basic_streaming.py
@@ -12,6 +12,7 @@ from typing import Any, Dict, Optional
 from flask import Response, request, url_for
 from flask_classful import route
 from webargs.flaskparser import parser  # type: ignore
+import magic
 
 from giftless.auth.identity import Permission
 from giftless.exc import InvalidPayload, NotFound
@@ -87,6 +88,8 @@ class ObjectsView(BaseView):
         if self.storage.exists(path, oid):
             file = self.storage.get(path, oid)
+            media_type = magic.from_buffer(file.read(2048), mime=True)
+            headers['Content-Type'] = media_type
             return Response(file, direct_passthrough=True, status=200, headers=headers)
         else:
             raise NotFound("The object was not found")
```
Or would you prefer that the client somehow provided (via a query param?) the preferred content type?
I agree with the approach for Content-disposition.
As for Content-Type, I think the approach you mention (for streaming storage backends) is generally correct, but here are some additional thoughts:
We should probably add a method in StreamingStorage that gives us the file's content type, and call that from the code you suggested in basic_streaming. While this complicates the interface a bit, it allows us to vary the behavior per storage backend: for local files we can implement magic-based detection, while for files streamed off Azure / GCP / AWS we can grab the Content-Type from the cloud provider, if we have it, and perhaps fall back to either a static default (application/octet-stream?) or magic-based detection if the cloud provider doesn't know the content type.
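As a rough sketch of that interface idea — the class and method names here are my guesses, not Giftless' actual API, and the hard-coded PDF sniffing merely stands in for real magic-based detection:

```python
class StreamingStorage:
    """Simplified stand-in for Giftless' streaming storage base class."""

    def get_mime_type(self, prefix: str, oid: str) -> str:
        # Backends that don't know better fall back to a static default.
        return "application/octet-stream"


class LocalStorage(StreamingStorage):
    """Local backend: detect the type from the file's leading bytes.

    A real implementation would call magic.from_buffer() on the first
    couple of KB; here we sniff one magic number by hand to stay
    dependency-free.
    """

    def __init__(self, files: dict):
        self._files = files  # oid -> bytes, standing in for disk access

    def get_mime_type(self, prefix: str, oid: str) -> str:
        head = self._files.get(oid, b"")[:2048]
        if head.startswith(b"%PDF-"):
            return "application/pdf"
        return super().get_mime_type(prefix, oid)
```

A cloud-backed subclass would instead return the Content-Type stored on the blob, falling back to the base class default when the provider doesn't have one.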
If we do it the way you suggested, take special care not to read the first 2k bytes from file and then pass it as-is to the response — in some situations this can cause the response to be missing those first 2k bytes. And not all storage adapters return a seekable stream, so doing file.seek(0) before returning may also not work... We need to be more clever here. But maybe we don't need this at all if we go with the approach mentioned in (1).
I'm trying to think if this should be configurable or not, and if there should be a configurable fallback value. In the transfer adapter, we can do something like "if the storage backend doesn't give us content-type, use the configured default" or maybe we can just safely always fall back to application/octet-stream. Maybe having this as a configurable option is overkill.
Another thought: maybe it's worthwhile to split providing the file name from another parameter that says "no content-disposition please". This would allow Giftless to know the file name, and use it for content-type detection via the built-in mimetypes module. Not as good as libmagic I suppose, but it also means fewer dependencies, so maybe there is an advantage. Just a thought.
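Filename-based detection with the stdlib is indeed just a couple of lines — a minimal sketch, with the function name and default being my own choices:

```python
import mimetypes


def guess_content_type(filename, default="application/octet-stream"):
    """Guess a Content-Type from the file name alone, with no file access,
    using only the stdlib mimetypes module."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default
```

Unlike libmagic this only looks at the extension, so unknown or missing extensions fall through to the default.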
@shevron On 1 and 2 let me know if I captured what you suggest in fdec329. I followed your suggestion in the last comment of relying on filename which makes perfect sense (so we'll have a separate disposition=inline parameter or whatever)
Note that for Azure (and other cloud backends) we need to set the content type on the blob itself, as clients are just redirected directly to it. I've done that in e7b1a38, but right now I can't test it because apparently we are not sending the filename when uploading a file, only when downloading. I need to find where in the puzzle this is done (datapub?)
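For reference, with the azure-storage-blob v12 SDK the blob's served Content-Type is set at upload time via a ContentSettings object — a sketch under that assumption, with the helper names being hypothetical:

```python
import mimetypes

from azure.storage.blob import ContentSettings


def guess_mime(filename):
    """Extension-based guess, falling back to a generic binary type."""
    return mimetypes.guess_type(filename)[0] or "application/octet-stream"


def upload_with_content_type(blob_client, data, filename):
    """Upload a blob so Azure itself serves it with the right Content-Type.

    This matters because download clients are redirected straight to the
    blob URL, bypassing Giftless entirely.
    """
    blob_client.upload_blob(
        data,
        overwrite=True,
        content_settings=ContentSettings(content_type=guess_mime(filename)),
    )
```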
Let me know if I'm going in the right direction with #90. If it looks good I'll clean it up and try to add tests
I think it looks good, and definitely in the right direction. I left a few minor comments on the PR.
As for downloading / uploading - download is handled by ckanext-blob-storage (via giftless-client), see the get_resource_download_spec action. Uploading is done via datapub (IIRC) but you can see upload code in giftless-client as well.