Read parquet files from AWS S3 #652

Merged
merged 30 commits into from
Jul 21, 2023

philss
Member

@philss philss commented Jul 18, 2023

This feature enables reading Parquet files directly from services like AWS S3.
It's also compatible with other services that implement the S3 API (like MinIO, LocalStack, DigitalOcean Spaces, etc.).

This uses Polars' "scan_parquet" feature, which is lazy and also works for reading multiple files at once.

Explorer.DataFrame.from_parquet("s3://test-bucket/*.parquet")

@philss
Member Author

philss commented Jul 19, 2023

I'm having a hard time implementing the integration test using MinIO. It seems that Polars (or Reqwest) does not accept connections that are not using TLS: only HTTPS endpoints are accepted.

My first attempt to have localhost certs was to generate them with mkcert.
I exposed the certs to the MinIO server, but it didn't work for me :/

Then I tested using an ngrok tunnel, which uses TLS, pointing to my MinIO instance on localhost, and it works 🎉

Unfortunately this is not a valid solution, so I'm considering running a MinIO instance on a private server, or using the AWS S3 service, but with instructions on how to reproduce the environment.

If any of you have ideas, please let me know!

@josevalim
Member

Did you try calling this in the object_store configuration? https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.with_allow_http

The docs also mention localstack testing, so it may be worth looking at object_store's own tests in case they use something similar?

@philss
Member Author

philss commented Jul 20, 2023

> Did you try calling this in the object_store configuration? https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.with_allow_http
>
> The docs also mention localstack testing, so it may be worth looking at object_store's own tests in case they use something similar?

@josevalim thank you! I hadn't tried this config. I cannot access the internal builder that this method belongs to, so I'm trying to use the (key, value) configs instead, as we do today. For that, I need to update Polars to v0.31, since the key for allowing HTTP is only available from that version on (they updated object_store, so we can use it).
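
The (key, value) approach described here can be sketched roughly as follows. This is a minimal illustration in plain Rust, not the code from this PR: `s3_options` is a hypothetical helper, and the key names (`aws_endpoint`, `aws_region`, `aws_allow_http`) are assumptions based on object_store's S3 configuration keys, so they should be checked against the object_store version that ships with the Polars release in use.

```rust
/// Hypothetical helper: builds the (key, value) pairs that would be handed
/// to the cloud options for an S3-compatible endpoint.
/// The key names below are assumptions, not taken from this PR.
fn s3_options(endpoint: &str, region: &str) -> Vec<(String, String)> {
    let mut opts = vec![
        ("aws_endpoint".to_string(), endpoint.to_string()),
        ("aws_region".to_string(), region.to_string()),
    ];

    // Plain-HTTP endpoints (for example a local MinIO instance) are rejected
    // unless the client is explicitly told to allow them.
    if endpoint.starts_with("http://") {
        opts.push(("aws_allow_http".to_string(), "true".to_string()));
    }

    opts
}
```

Calling `s3_options("http://localhost:9000", "us-east-1")` would add the allow-HTTP pair, while an `https://` endpoint would leave it out, so TLS endpoints keep the stricter default.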

@philss philss marked this pull request as ready for review July 20, 2023 22:14
@philss
Member Author

philss commented Jul 20, 2023

@josevalim @wojtekmach @Qqwy @jonatanklosko I think it's ready for another pass, if you can review again :)

PS: sorry for the amount of failed builds 😓

Member

@josevalim josevalim left a comment

Amazing work, just some minor final nits. :) Also please add to the README how to run the cloud tests. :)

@philss philss merged commit 21bf119 into main Jul 21, 2023
4 checks passed
@philss philss deleted the ps-read-parquet-from-s3 branch July 21, 2023 16:35