Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SELECT fails on pre-signed URLs due to CORS errors in DuckDB-Wasm #1852

Open
coji opened this issue Sep 14, 2024 · 9 comments
Open

SELECT fails on pre-signed URLs due to CORS errors in DuckDB-Wasm #1852

coji opened this issue Sep 14, 2024 · 9 comments

Comments

@coji
Copy link

coji commented Sep 14, 2024

What happens?

When executing a SELECT statement in DuckDB-Wasm on a data source accessed via a pre-signed URL (especially those created for GET requests), the operation fails due to CORS errors. This prevents querying data stored in locations that require pre-signed URLs for access.

To Reproduce

  1. Use the following pre-signed URL for a Parquet file (valid for 7 days from 2024-09-14):
    https://91ff95bcb91fbfa1b1c5c356262b1fe4.r2.cloudflarestorage.com/techtalk/world_populations.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=0d9126cf0fed3ae3c00f20ceb2bb97c3%2F20240914%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240914T091120Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=1bddf8fcc77e83aa20ffa827e771cea7310af373354af06c5ac58f2e181f0182
    
  2. In DuckDB-Wasm or at shell.duckdb.org, attempt to execute a SELECT statement on this data source using the pre-signed URL.

Example SQL query:

SELECT * FROM parquet_scan('https://91ff95bcb91fbfa1b1c5c356262b1fe4.r2.cloudflarestorage.com/techtalk/world_populations.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=0d9126cf0fed3ae3c00f20ceb2bb97c3%2F20240914%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240914T091120Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=1bddf8fcc77e83aa20ffa827e771cea7310af373354af06c5ac58f2e181f0182') LIMIT 10;
  1. Observe that the query fails due to a CORS error, and the data is not accessible.

Note: I tried this query on shell.duckdb.org, and it failed to access the data.

Additional context:
The current behavior seems to be:

  1. DuckDB-Wasm attempts a HEAD request on the pre-signed URL.
  2. The HEAD request fails with a CORS error.
  3. An exception is thrown by xhr.send(null), which is not caught.
  4. The code for performing a range GET request is never reached.
  5. The SELECT statement fails, unable to access the data.

This behavior was observed both in a local DuckDB-Wasm implementation and on shell.duckdb.org.

Importantly, the bucket's CORS policy is set according to the documentation:

[
  {
    "AllowedOrigins": [
      "*"
    ],
    "AllowedMethods": [
      "GET",
      "HEAD"
    ],
    "AllowedHeaders": [
      "*"
    ],
    "ExposeHeaders": [
      "*"
    ],
    "MaxAgeSeconds": 3000
  }
]

Despite this CORS policy allowing both GET and HEAD methods from any origin, the issue persists. This suggests that the problem might be related to how DuckDB-Wasm handles the pre-signed URLs rather than the bucket's CORS configuration.

A possible solution might be to skip the HEAD request for pre-signed URLs or implement exception handling to proceed with the range GET request even if the HEAD request fails.

Browser/Environment:

Chrome 128.0.6613.138

Device:

M2 Macbook Air

DuckDB-Wasm Version:

1.28.1-dev278.0

DuckDB-Wasm Deployment:

shell.duckdb.org

Full Name:

Koji Mizoguchi

Affiliation:

TechTalk Inc.

@carlopi
Copy link
Collaborator

carlopi commented Sep 14, 2024

Using curl, in native:

% curl -I 'https://91ff95bcb91fbfa1b1c5c356262b1fe4.r2.cloudflarestorage.com/techtalk/world_populations.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=0d9126cf0fed3ae3c00f20ceb2bb97c3%2F20240914%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240914T091120Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=1bddf8fcc77e83aa20ffa827e771cea7310af373354af06c5ac58f2e181f0182'
HTTP/1.1 403 Forbidden
Date: Sat, 14 Sep 2024 11:09:08 GMT
Content-Type: text/plain;charset=UTF-8
Connection: keep-alive
Server: cloudflare
CF-RAY: 8c2fec512cac66ea-AMS

So I would interpret that as the HEAD request not being valid at all, it's not a CORS problem but a problem with presigned URLs that I think do allow only GET methods (I think, unsure, this one for sure).

I would need to check this in the shell, sounds possible an error should be dealt with better and recover from it

@seanbirchall
Copy link

I'm having the same issue with R2. I think it might be related to the presigned URL only being valid for whatever method you've specified to access the data. In duckdb wasm when we read or attach to a file in most cases a HEAD and then a GET request are sent. There's a few other issues talking about this #699. Curious if there's any solution?

Sounds like a possible work around is to create a modified version which sends a ranged GET request first instead of the HEAD request. At which point only GET requests would be sent.

@tobilg
Copy link

tobilg commented Sep 14, 2024

Maybe check the following: https://stackoverflow.com/a/76836943

Also, can you access the presigned URL via curl w/o Problem?

@seanbirchall
Copy link

seanbirchall commented Sep 14, 2024

@tobilg can't curl mine either... sorry if URL expires.

curl -I "https://01bec96ddf135b4f6636692059641ffe.r2.cloudflarestorage.com/scrapeable-data/db_test.duckdb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=f7f92bbce532f6ead44c019c43921a18%2F20240914%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240914T184231Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=486f72d518388b248d5d9955743ab7a5eb2023c5ef9723c0d5c6afa132b923ac"
HTTP/1.1 403 Forbidden
Date: Sat, 14 Sep 2024 18:45:21 GMT
Content-Type: text/plain;charset=UTF-8
Connection: keep-alive
Server: cloudflare
CF-RAY: 8c3288989eee8fb8-BOS

Maybe I need to get rid of special characters in my bucket and .duckdb in my object.

@seanbirchall
Copy link

It looks like this PR should've fixed the issue #1367 to #1717

@tobilg
Copy link

tobilg commented Sep 16, 2024

Have you checked adding the Content-Range header, as outline #1717 (comment)

@coji
Copy link
Author

coji commented Sep 16, 2024

Have you checked adding the Content-Range header, as outline #1717 (comment)

Thank you for your confirmation. I added the following setting to CORS:

   "ExposeHeaders": ["*"]

but it still results in an error at the time of the HEAD request.
I've also updated the issue itself with the above information.

@coji
Copy link
Author

coji commented Sep 16, 2024

this is huge!

#1856

@seanbirchall
Copy link

@coji I tried out this example app from @e1arikawa using your above example but still cannot get it to work. So maybe it is a CORS issue, or this app is incomplete for testing this, or the PR doesn't work for our desired use case?

Either way maybe give it a try too and see if you can query your S3 parquet file. I still see

duckdb-browser-eh.worker.js:1 Access to XMLHttpRequest at 'https://91ff95bcb91fbfa1b1c5c356262b1fe4.r2.cloudflarestorage.com/techtalk/world_populations.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=0d9126cf0fed3ae3c00f20ceb2bb97c3%2F20240914%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240914T091120Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=1bddf8fcc77e83aa20ffa827e771cea7310af373354af06c5ac58f2e181f0182' from origin 'http://127.0.0.1:5500' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants