Support integration with Parquet modular encryption #15216
I had a go at seeing whether I could use this callback-based configuration approach to integrate with encryption without DataFusion needing to know anything about Parquet encryption. I tested with Rok's current encryption branch, trying both reading and writing encrypted files, and wrote an example that nearly works here: https://github.com/adamreeve/datafusion/blob/18d1aca67d0e090823ebd407dc26fef2158f26fd/datafusion-examples/examples/parquet_encryption.rs

One obvious downside is that this isn't compatible with conversion of plans to protobuf. I've just ignored the new config fields there, although ideally we would at least raise an error if they're set and we try to convert a plan to protobuf; that might require adding a "parquet" feature to the protobuf crates. I think this is fine as long as you don't want to use this in a distributed query engine.

I could get writing of encrypted Parquet working, but reading fails when trying to infer the schema. I'm not sure how best to work around that. Maybe arrow-rs could be refactored to support reader options that aren't Arrow specific, or maybe DataFusion could change how metadata is read.
So, to play the devil's advocate, here are some arguments for having encryption configurations encoded as plain strings:
So, even though it may be harder to configure encryption from strings, enabling this logic would buy us a lot of flexibility in how end users can access this feature, and it may not restrict us that much. In practice most users want just a few KMS classes that connect to a standard cloud API such as AWS or Azure, and we could provide a few pre-built classes as is done in Java. A final idea could be to use this crate here: we require that classes implementing the KMS trait be serializable, and then we can serialize them to strings or distribute them across hosts as needed. This constraint might even help us in the end.
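The serializable-KMS-configuration idea above can be sketched in plain Rust. This is a minimal illustration only: the `KmsConfig` enum, its variants, and the `scheme:value` encoding are all hypothetical stand-ins (a real implementation would likely use serde and JSON, and the actual arrow-rs/DataFusion types are not used here). The point is just that a closed set of known KMS backends can round-trip through a string, which makes the configuration distributable across hosts:

```rust
// Hypothetical sketch: a tagged enum of known KMS backends that
// round-trips through a string, so the configuration can be shipped
// between hosts. None of these names are real DataFusion/arrow-rs APIs.
#[derive(Debug, Clone, PartialEq)]
enum KmsConfig {
    // e.g. an AWS KMS key ARN
    Aws { key_arn: String },
    // e.g. an Azure Key Vault URI
    Azure { vault_uri: String },
}

impl KmsConfig {
    // Serialize to a simple "scheme:value" string; a real implementation
    // would likely use serde + JSON instead of this toy encoding.
    fn to_config_string(&self) -> String {
        match self {
            KmsConfig::Aws { key_arn } => format!("aws:{key_arn}"),
            KmsConfig::Azure { vault_uri } => format!("azure:{vault_uri}"),
        }
    }

    // Parse the string back into a typed configuration; splitting at the
    // first ':' keeps colons inside ARNs/URIs intact.
    fn from_config_string(s: &str) -> Result<Self, String> {
        match s.split_once(':') {
            Some(("aws", rest)) => Ok(KmsConfig::Aws { key_arn: rest.to_string() }),
            Some(("azure", rest)) => Ok(KmsConfig::Azure { vault_uri: rest.to_string() }),
            _ => Err(format!("unrecognized KMS config: {s}")),
        }
    }
}

fn main() {
    let cfg = KmsConfig::Aws { key_arn: "arn:aws:kms:us-east-1:123:key/abc".to_string() };
    let s = cfg.to_config_string();
    let back = KmsConfig::from_config_string(&s).unwrap();
    assert_eq!(cfg, back);
    println!("round-tripped: {s}");
}
```

A pre-built set of variants like this mirrors the "few standard cloud KMS classes" point: most users would never need anything beyond the provided backends, while the string form keeps everything compatible with plan serialization.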
FWIW I agree with @corwinjoy that string-based configuration is likely needed in several scenarios. Maybe we could do string-based settings for any SQL-based interaction, and then something programmatic for any use case that needs more control.
@alamb @adamreeve With modular encryption essentially complete in arrow-rs, we are interested in beginning to move forward with adding support for this feature in DataFusion and implementing more concretely what this looks like. It looks like with the upgrade to arrow 55.0.0 (#15749) this can now handle the new code? If that is true, I think we are in a position to proceed to add this feature. It may take us a bit of time, as we've had a bunch of new work come up, but we are still interested in seeing this through. Anyway, I just wanted to add a status update that we are still interested in this piece.
With the KMS API not being included in arrow-rs but being built as a third-party crate (apache/arrow-rs#7387 (comment)), I would assume we probably don't want to depend on that in DataFusion, but instead keep the encryption integration flexible enough to allow other approaches. I think this should still be achievable while mostly using string-based configuration, but the encryption configuration might need to be an opaque string or an arbitrary JSON object. We could support configuring encryption keys directly by default without needing programmatic access, but allow registering factory methods that take the configuration string and produce file decryption or encryption properties. I believe you've suggested something like this previously, @corwinjoy.
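The "registered factory" idea in the comment above can be sketched in Rust. Everything here is hypothetical: `FileDecryptionProperties` is a stand-in for the real arrow-rs type, and `DecryptionFactory`, `HexKeyFactory`, and `FactoryRegistry` are invented names showing the shape of the API, not anything that exists in DataFusion:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for arrow-rs's decryption properties; the real struct lives
// in the parquet crate and is richer than this.
#[derive(Debug, Clone, PartialEq)]
struct FileDecryptionProperties {
    footer_key: Vec<u8>,
}

// A factory turns an opaque configuration string (e.g. JSON, or here a
// hex key) into concrete decryption properties. DataFusion would only
// hold the string plus a factory name, staying KMS-agnostic.
trait DecryptionFactory: Send + Sync {
    fn make(&self, config: &str) -> Result<FileDecryptionProperties, String>;
}

// Default factory: the config string is a hex-encoded footer key, which
// covers the "configure keys directly without programmatic access" case.
struct HexKeyFactory;

impl DecryptionFactory for HexKeyFactory {
    fn make(&self, config: &str) -> Result<FileDecryptionProperties, String> {
        let bytes = (0..config.len())
            .step_by(2)
            .map(|i| u8::from_str_radix(&config[i..i + 2], 16))
            .collect::<Result<Vec<u8>, _>>()
            .map_err(|e| e.to_string())?;
        Ok(FileDecryptionProperties { footer_key: bytes })
    }
}

// A registry keyed by factory name, so users can plug in KMS-backed
// factories without DataFusion depending on any KMS crate.
#[derive(Default)]
struct FactoryRegistry {
    factories: HashMap<String, Arc<dyn DecryptionFactory>>,
}

impl FactoryRegistry {
    fn register(&mut self, name: &str, f: Arc<dyn DecryptionFactory>) {
        self.factories.insert(name.to_string(), f);
    }

    fn decryption_properties(
        &self,
        factory: &str,
        config: &str,
    ) -> Result<FileDecryptionProperties, String> {
        self.factories
            .get(factory)
            .ok_or_else(|| format!("unknown factory {factory}"))?
            .make(config)
    }
}

fn main() {
    let mut registry = FactoryRegistry::default();
    registry.register("hex_key", Arc::new(HexKeyFactory));
    let props = registry
        .decryption_properties("hex_key", "00112233445566778899aabbccddeeff")
        .unwrap();
    assert_eq!(props.footer_key.len(), 16);
    println!("footer key has {} bytes", props.footer_key.len());
}
```

The factory name and config string are both plain strings, so they survive SQL options, environment variables, and plan serialization; only the registered factory itself is programmatic.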
I think that is correct. I agree the key question here will be "how to integrate / use a KMS system without making DataFusion dependent on such a system". As I think you have been discussing, the major challenge will be configuration. What I suggest as a next step is to make a public example of reading encrypted parquet files, and then try to integrate that example with the KMS. To get the example working you'll very likely have to add some new APIs to DataFusion, but the example will drive that implementation.
I did this, for example, with the parquet index API: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs Basically I started writing the example and what I wanted to show, and when I couldn't do it with the existing DataFusion APIs I updated / added some new ones. Having the example also helped review, because reviewers could see what was going on and how the newly added APIs got used.
Here is how Spark does encryption configuration: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
My understanding of how this works in Spark from reading this and looking at some of the code:
Starting with something similarly flexible seems reasonable. I'm not that familiar with Java, but from what I understand it's straightforward there to define your own KMS client implementation.
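For reference, the contract that parquet-java's `KmsClient` interface asks users to implement (wrapping and unwrapping data keys against a master key held by the KMS) can be sketched as a Rust trait. The trait and `ToyKms` below are illustrative stand-ins, and the XOR "wrapping" is a placeholder for a real KMS call, not cryptography:

```rust
// Rough Rust analogue of the KmsClient contract that Spark/parquet-java
// users implement to plug in their own KMS. Names loosely follow the
// Java API; nothing here is a real arrow-rs or DataFusion type.
trait KmsClient {
    // Wrap (encrypt) a data key using the master key named by master_key_id.
    fn wrap_key(&self, key_bytes: &[u8], master_key_id: &str) -> Result<String, String>;
    // Unwrap (decrypt) a previously wrapped key.
    fn unwrap_key(&self, wrapped: &str, master_key_id: &str) -> Result<Vec<u8>, String>;
}

// Toy in-memory "KMS" that XORs with a byte derived from the master key
// id and hex-encodes the result. This shows only the shape of the API.
struct ToyKms;

impl KmsClient for ToyKms {
    fn wrap_key(&self, key_bytes: &[u8], master_key_id: &str) -> Result<String, String> {
        let m = master_key_id.bytes().fold(0u8, |a, b| a ^ b);
        Ok(key_bytes.iter().map(|b| format!("{:02x}", b ^ m)).collect())
    }

    fn unwrap_key(&self, wrapped: &str, master_key_id: &str) -> Result<Vec<u8>, String> {
        let m = master_key_id.bytes().fold(0u8, |a, b| a ^ b);
        (0..wrapped.len())
            .step_by(2)
            .map(|i| {
                u8::from_str_radix(&wrapped[i..i + 2], 16)
                    .map(|b| b ^ m)
                    .map_err(|e| e.to_string())
            })
            .collect()
    }
}

fn main() {
    let kms = ToyKms;
    let key = vec![1u8, 2, 3, 4];
    let wrapped = kms.wrap_key(&key, "master-1").unwrap();
    let unwrapped = kms.unwrap_key(&wrapped, "master-1").unwrap();
    assert_eq!(key, unwrapped);
    println!("wrap/unwrap round trip ok");
}
```

Because the engine only ever sees opaque wrapped keys and a master key identifier, a trait boundary like this is what lets Spark (and potentially DataFusion) stay independent of any particular KMS vendor.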
I don't really understand the reason for using |
Actually I think I remember now that this would let us include structs in config types in |
Is your feature request related to a problem or challenge?
arrow-rs is in the process of gaining support for Parquet modular encryption - see apache/arrow-rs#7278. It would be useful to be able to read and write encrypted Parquet files with DataFusion, but it's not clear how to integrate this feature due to the complex configuration required.
Examples of this complex configuration are:
- Support for `FileDecryptionProperties` in arrow-rs to be created with a callback, which can't be represented as a string configuration option (Allow retrieving Parquet decryption keys based on the key metadata, arrow-rs#7257).
- Using a KMS: I have an example of what using a KMS might look like to read and write encrypted files, but this isn't yet merged in arrow-rs: https://github.com/adamreeve/arrow-rs/blob/7afb60e1ee0e4c190468c153b252324235a63d96/parquet/examples/round_trip_encrypted_parquet.rs
Currently all Parquet format options can be easily encoded as strings or primitive types, and live in `datafusion-common`, which has an optional dependency on the parquet crate, although `TableParquetOptions` is always defined even if the parquet feature is disabled.

We're experimenting with using encryption in DataFusion by adding encoded keys to the `ParquetOptions` struct, but this is quite limited and doesn't support the more complex configuration options I mention above.

Describe the solution you'd like
One solution might be to allow users to arbitrarily customize the Parquet writing and reading options, e.g. via optional per-file callbacks on the options struct.
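A minimal sketch of what such callback fields could look like follows. The field names (`writer_props_fn`, `reader_opts_fn`) and the simplified `WriterProperties` / `ArrowReaderOptions` structs are all hypothetical stand-ins for the real DataFusion and parquet crate types, chosen only to show the shape of the idea:

```rust
use std::sync::Arc;

// Simplified stand-ins for the parquet crate's types; the real ones
// carry much more than an optional footer key.
#[derive(Debug, Default)]
struct WriterProperties {
    footer_key: Option<Vec<u8>>,
}

#[derive(Debug, Default)]
struct ArrowReaderOptions {
    footer_key: Option<Vec<u8>>,
}

// Hypothetical extension of TableParquetOptions with per-file callbacks.
// These fields do not exist in DataFusion; they illustrate the proposal.
#[derive(Default)]
struct TableParquetOptions {
    // Called once per file path to customize writer properties.
    writer_props_fn:
        Option<Arc<dyn Fn(&str, WriterProperties) -> WriterProperties + Send + Sync>>,
    // Called once per file path to customize reader options.
    reader_opts_fn:
        Option<Arc<dyn Fn(&str, ArrowReaderOptions) -> ArrowReaderOptions + Send + Sync>>,
}

impl TableParquetOptions {
    fn writer_properties(&self, path: &str) -> WriterProperties {
        let base = WriterProperties::default();
        match &self.writer_props_fn {
            Some(f) => f(path, base),
            None => base,
        }
    }
}

fn main() {
    let key = b"0123456789abcdef".to_vec();
    let opts = TableParquetOptions {
        writer_props_fn: Some(Arc::new(move |path, mut props| {
            // Per-file decisions like this are exactly what plain string
            // options cannot express: here, skip encryption for files
            // under a (hypothetical) public/ prefix.
            if !path.starts_with("public/") {
                props.footer_key = Some(key.clone());
            }
            props
        })),
        reader_opts_fn: None,
    };
    assert!(opts.writer_properties("data/secret.parquet").footer_key.is_some());
    assert!(opts.writer_properties("public/open.parquet").footer_key.is_none());
    println!("per-file encryption decisions work");
}
```

Note how the callback receives the file path, which is what enables generating different encryption properties per file, one of the requirements mentioned below.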
These callbacks would probably need some other inputs, like the file schema, too. This would allow DataFusion users to specify encryption-specific options without DataFusion itself needing to know about them, and might be useful for applying other Parquet options that aren't already exposed in DataFusion. It also supports generating different encryption properties per file.
`TableParquetOptions` can currently be created from environment variables, which wouldn't be possible for these extra fields, but I don't think that should be a problem?

Another minor issue is that `TableParquetOptions` implements `PartialEq`, and I don't think it would be possible to sanely implement equality while allowing custom callbacks like this.

Describe alternatives you've considered
@alamb also suggested in delta-io/delta-rs#3300 that it could be possible to use an `Arc<dyn Any>` to allow passing more complex configuration types through `TableParquetOptions`. I'm not sure exactly what this would look like though. Maybe the option would still hold a callback function but just hidden behind the `Any` trait, or maybe we would want to limit this to encryption-specific configuration options, but I think we'd need to maintain the ability to generate `ArrowReaderOptions` and `WriterProperties` per file.

Additional context
No response