
Support integration with Parquet modular encryption #15216

Open
adamreeve opened this issue Mar 14, 2025 · 10 comments
Labels
enhancement New feature or request

Comments

@adamreeve

Is your feature request related to a problem or challenge?

arrow-rs is in the process of gaining support for Parquet modular encryption - see apache/arrow-rs#7278. It would be useful to be able to read and write encrypted Parquet files with DataFusion, but it's not clear how to integrate this feature due to the complex configuration required.

Examples of this complex configuration are:

  • Users may require different encryption or decryption keys to be specified per Parquet file
  • The encryption and decryption keys specified may depend on the file schema
  • The encryption keys may need to be generated per file by interacting with a user's key management service (KMS)
  • Decryption keys may need to be retrieved dynamically based on the metadata read from Parquet files, which can require interaction with a KMS. This process would be opaque to DataFusion, but requires the arrow-rs FileDecryptionProperties to be created with a callback that can't be represented as a string configuration option (see "Allow retrieving Parquet decryption keys based on the key metadata", apache/arrow-rs#7257).

I have an example of what using a KMS might look like to read and write encrypted files but this isn't yet merged in arrow-rs: https://github.com/adamreeve/arrow-rs/blob/7afb60e1ee0e4c190468c153b252324235a63d96/parquet/examples/round_trip_encrypted_parquet.rs

Currently, all Parquet format options can be encoded as strings or primitive types. They live in datafusion-common, which has an optional dependency on the parquet crate, although TableParquetOptions is always defined even when the parquet feature is disabled.

We're experimenting with using encryption in DataFusion by adding encoded keys to the ParquetOptions struct, but this is quite limited and doesn't support the more complex configuration options I mention above.

Describe the solution you'd like

One solution might be to allow users to arbitrarily customize the Parquet writing and reading options, e.g. with something like:

--- a/datafusion/common/src/config.rs
+++ b/datafusion/common/src/config.rs
@@ -1615,6 +1615,12 @@ pub struct TableParquetOptions {
     /// )
     /// ```
     pub key_value_metadata: HashMap<String, Option<String>>,
+    /// Callback to modify the Parquet WriterPropertiesBuilder with custom configuration
+    #[cfg(feature = "parquet")]
+    pub writer_configuration: Option<Arc<dyn Fn(WriterPropertiesBuilder) -> WriterPropertiesBuilder>>,
+    /// Callback to modify the Parquet ArrowReaderOptions with custom configuration
+    #[cfg(feature = "parquet")]
+    pub read_configuration: Option<Arc<dyn Fn(ArrowReaderOptions) -> ArrowReaderOptions>>,
 }
 
 impl TableParquetOptions {

These callbacks would probably need some other inputs like the file schema too. This would allow DataFusion users to specify encryption specific options without DataFusion itself needing to know about them, and might be useful for applying other Parquet options that aren't already exposed in DataFusion. This also supports generating different encryption properties per file.
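
To make this concrete, here is a rough sketch of how a user might populate these fields to encrypt and decrypt with a static key. The writer_configuration and read_configuration fields are the hypothetical ones from the diff above, and the arrow-rs module paths and builder signatures are assumptions based on the in-progress encryption work in apache/arrow-rs#7278:

```rust
use std::sync::Arc;

use parquet::arrow::arrow_reader::ArrowReaderOptions;
use parquet::encryption::decrypt::FileDecryptionProperties;
use parquet::encryption::encrypt::FileEncryptionProperties;
use parquet::file::properties::WriterPropertiesBuilder;

// 128-bit AES key; in practice this would come from a KMS or a secret store.
const FOOTER_KEY: &[u8; 16] = b"0123456789012345";

let mut options = TableParquetOptions::default();

// Hypothetical field from the diff above: encrypt files as they are written.
options.writer_configuration = Some(Arc::new(|builder: WriterPropertiesBuilder| {
    let encryption = FileEncryptionProperties::builder(FOOTER_KEY.to_vec())
        .build()
        .expect("valid encryption properties");
    builder.with_file_encryption_properties(encryption)
}));

// Hypothetical field from the diff above: decrypt files as they are read.
options.read_configuration = Some(Arc::new(|reader_options: ArrowReaderOptions| {
    let decryption = FileDecryptionProperties::builder(FOOTER_KEY.to_vec())
        .build()
        .expect("valid decryption properties");
    reader_options.with_file_decryption_properties(decryption)
}));
```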

TableParquetOptions can currently be created from environment variables, which wouldn't be possible for these extra fields, but I don't think that should be a problem?

Another minor issue is that TableParquetOptions implements PartialEq, and I don't think it would be possible to sanely implement equality while allowing custom callbacks like this.

Describe alternatives you've considered

@alamb also suggested in delta-io/delta-rs#3300 that it could be possible to use an Arc<dyn Any> to allow passing more complex configuration types through TableParquetOptions.

I'm not sure exactly what this would look like, though. Maybe the option would still hold a callback function, just hidden behind the Any trait, or maybe we would want to limit this to encryption-specific configuration options. Either way, I think we'd need to maintain the ability to generate ArrowReaderOptions and WriterProperties per file.

Additional context

No response

@adamreeve
Author

adamreeve commented Mar 20, 2025

I had a go at seeing if I could use this callback-based configuration approach to integrate with encryption without DataFusion needing to know anything about Parquet encryption.

I tested with Rok's current encryption branch, and tried both reading and writing encrypted files. I wrote an example that nearly works here: https://github.com/adamreeve/datafusion/blob/18d1aca67d0e090823ebd407dc26fef2158f26fd/datafusion-examples/examples/parquet_encryption.rs

One obvious downside is that this isn't compatible with conversion of plans to protobuf. I've just ignored the new config fields there, although ideally we would at least raise an error if they're set and we try to convert a plan to protobuf, but that might require adding a "parquet" feature to the protobuf crates.

I think this is fine though as long as you don't want to use this in a distributed query engine.

I could get writing of encrypted Parquet working, but reading fails when trying to infer the schema, as this uses a ParquetMetaDataReader, which doesn't know about ArrowReaderOptions but instead uses FileDecryptionProperties directly.

If I pass the schema to register_listing_table, it still fails at the same place when fetching statistics during the query execution.

I'm not sure how best to work around that. Maybe arrow-rs could be refactored to support reader options that aren't Arrow-specific and could be passed to the ParquetMetaDataReader?

Or maybe DataFusion could change how metadata is read and use ParquetObjectReader::get_metadata_with_options instead? I made a brief attempt at that but didn't get very far. There is a comment on fetch_parquet_metadata, though, that says "This component is a subject to change in near future"...

@corwinjoy

So, to play the devil's advocate, here are some arguments for having encryption configurations encoded as plain strings:

  1. Users may want to run DataFusion using the CLI. I could see this being useful for ETL tasks or in other settings, where the ability to drive encryption or decryption from a set of environment variables could be quite valuable.
  2. There are a number of contexts, such as Lambdas or ephemeral AWS boxes, where it may be better security practice to load secrets like encryption keys dynamically from environment variables. This may be a better fit, and easier to maintain, than a custom-built KMS.
  3. In a distributed setting, enabling keys via strings is much more doable than trying to distribute objects.

So, even though it may be harder to configure encryption from strings, if we can enable this logic it would buy us a lot of flexibility in how end users can access this feature. It also may not restrict us that much: in practice, most users want just a few KMS classes that connect to standard cloud APIs such as AWS or Azure, and we could provide a few pre-built classes, as is done in Java.

A final idea could be to use the typetag crate:
https://github.com/dtolnay/typetag

We could require that types implementing the KMS trait be serializable. Then we can serialize them to strings or distribute them across hosts as needed. This constraint might even help us in the end.
@adamreeve
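
A rough sketch of what that could look like. The KmsClient trait shape here is hypothetical, loosely modelled on parquet-java's KmsClient; the #[typetag::serde] attributes are the actual typetag API:

```rust
use std::collections::HashMap;

use serde::{Deserialize, Serialize};

// typetag makes trait objects serializable as tagged values, so a concrete
// KMS client can be named in a configuration string and reconstructed on
// another host.
#[typetag::serde]
pub trait KmsClient: Send + Sync {
    fn wrap_key(&self, key_bytes: &[u8], master_key_id: &str) -> Vec<u8>;
    fn unwrap_key(&self, wrapped_key: &[u8], master_key_id: &str) -> Vec<u8>;
}

// A mock client for testing, analogous to Spark's InMemoryKMS.
#[derive(Serialize, Deserialize)]
pub struct InMemoryKms {
    master_keys: HashMap<String, Vec<u8>>,
}

#[typetag::serde]
impl KmsClient for InMemoryKms {
    fn wrap_key(&self, key_bytes: &[u8], master_key_id: &str) -> Vec<u8> {
        // Mock "wrapping" by XOR with the master key (not secure, test only);
        // real implementations would call AWS KMS, Azure Key Vault, etc.
        let master = &self.master_keys[master_key_id];
        key_bytes
            .iter()
            .zip(master.iter().cycle())
            .map(|(k, m)| k ^ m)
            .collect()
    }

    fn unwrap_key(&self, wrapped_key: &[u8], master_key_id: &str) -> Vec<u8> {
        // XOR is its own inverse.
        self.wrap_key(wrapped_key, master_key_id)
    }
}

// A Box<dyn KmsClient> now round-trips through serde_json, e.g.:
// let client: Box<dyn KmsClient> = serde_json::from_str(&config_string)?;
```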

@alamb
Contributor

alamb commented Mar 24, 2025

FWIW I agree with @corwinjoy that having string-based configuration is likely needed in several scenarios.

Maybe we could do string-based settings for any SQL-based interaction, and then something programmatic for any use case that needs more control.

@corwinjoy

@alamb @adamreeve With modular encryption essentially complete in arrow-rs, we are interested in moving forward with adding support for this feature in DataFusion and working out more concretely what it looks like. It looks like, with the upgrade to arrow 55.0.0 (#15749), DataFusion can now handle the new code? If that is true, I think we are in a position to proceed. It may take us a bit of time, as we've had a bunch of new work come up, but I just wanted to add a status update that we are still interested in seeing this through.

@adamreeve
Author

With the KMS API not being included in arrow-rs but instead built as a third-party crate (apache/arrow-rs#7387 (comment)), I assume we probably don't want to depend on that in DataFusion, but rather keep the encryption integration flexible enough to allow other approaches?

I think it should still be possible to achieve this while mostly using string-based configuration, but the encryption configuration might need to be an opaque string or an arbitrary JSON object.

We could support configuring encryption keys directly by default, without needing programmatic access, but allow registering factory methods that take the configuration string and produce file decryption or encryption properties. I believe you've suggested something like this previously @corwinjoy.
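
One hypothetical shape for that registration API (none of these names exist in DataFusion today; FileDecryptionProperties comes from the arrow-rs encryption work, and its module path is an assumption):

```rust
use std::collections::HashMap;
use std::sync::Arc;

use datafusion_common::Result;
use parquet::encryption::decrypt::FileDecryptionProperties;

/// Takes the opaque per-table configuration string (e.g. JSON) and produces
/// decryption properties, possibly by interacting with a KMS.
pub type DecryptionPropertiesFactory =
    Arc<dyn Fn(&str) -> Result<FileDecryptionProperties> + Send + Sync>;

/// Registry of named factories. Only the factory name and its configuration
/// string need to appear in TableParquetOptions, so the options stay
/// string-based and serializable.
#[derive(Default)]
pub struct EncryptionFactoryRegistry {
    decryption_factories: HashMap<String, DecryptionPropertiesFactory>,
}

impl EncryptionFactoryRegistry {
    pub fn register_decryption_factory(
        &mut self,
        name: impl Into<String>,
        factory: DecryptionPropertiesFactory,
    ) {
        self.decryption_factories.insert(name.into(), factory);
    }

    pub fn decryption_factory(&self, name: &str) -> Option<&DecryptionPropertiesFactory> {
        self.decryption_factories.get(name)
    }
}
```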

@alamb
Contributor

alamb commented Apr 28, 2025

> With the KMS API not being included in arrow-rs but instead built as a third-party crate (apache/arrow-rs#7387 (comment)), I assume we probably don't want to depend on that in DataFusion, but rather keep the encryption integration flexible enough to allow other approaches?

I think that is correct.

I agree the key question here will be "how to integrate / use a KMS system without making DataFusion dependent on such a system". As I think you have been discussing, the major challenge will be configuration.

I personally suggest using the Arc<dyn Any> approach, but I don't really understand the requirements well enough to know.

What I suggest as a next step is to make a public example of reading encrypted Parquet files, and then try to integrate that example with the KMS -- to get the example working you'll very likely have to add some new APIs to DataFusion, but the example will drive that implementation.

@alamb
Contributor

alamb commented Apr 28, 2025

I did this, for example, with the parquet index API: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs

Basically, I started writing the example with what I wanted to show, and when I couldn't do it with the existing DataFusion APIs, I updated or added new ones.

Having the example also helped review, because reviewers could see what was going on and how the newly added APIs got used.

@alamb
Contributor

alamb commented May 2, 2025

Here is how Spark does encryption configuration:

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

@adamreeve
Author

> Here is how Spark does encryption configuration

My understanding of how this works in Spark from reading this and looking at some of the code:

  • Spark requires specifying a class used to generate file encryption and/or decryption properties. This is configured with the spark.hadoop.parquet.crypto.factory.class config setting, and the class needs to implement EncryptionPropertiesFactory and/or DecryptionPropertiesFactory to generate file encryption or decryption properties as required. The class gets access to extra context like the file schema so it knows what columns to provide keys for (see the getFileEncryptionProperties method).
  • Spark supports using the KMS based API by providing a built-in PropertiesDrivenCryptoFactory class that implements EncryptionPropertiesFactory and DecryptionPropertiesFactory. This requires also specifying a KmsClient implementation with the spark.hadoop.parquet.encryption.kms.client.class key, and this class must be defined by users (only a mock InMemoryKMS class is provided for testing).
  • In theory a user could also define their own class that implements EncryptionPropertiesFactory and DecryptionPropertiesFactory if they don't want to use the KMS based API, for example if they want to define AES keys directly.

Starting with similarly flexible EncryptionPropertiesFactory and DecryptionPropertiesFactory traits in DataFusion seems like a reasonable approach to me.
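
Translated to Rust, such traits might look roughly like this (all names here are hypothetical; FileEncryptionProperties and FileDecryptionProperties come from the arrow-rs encryption work, and the context arguments mirror what Spark passes to getFileEncryptionProperties):

```rust
use std::sync::Arc;

use arrow_schema::SchemaRef;
use datafusion_common::Result;
use object_store::path::Path;
use parquet::encryption::decrypt::FileDecryptionProperties;
use parquet::encryption::encrypt::FileEncryptionProperties;

/// Produces encryption properties for each file as it is written,
/// analogous to Spark's EncryptionPropertiesFactory.
pub trait EncryptionPropertiesFactory: Send + Sync {
    fn file_encryption_properties(
        &self,
        file_path: &Path,
        file_schema: &SchemaRef,
    ) -> Result<Arc<FileEncryptionProperties>>;
}

/// Produces decryption properties for each file before it is read,
/// analogous to Spark's DecryptionPropertiesFactory.
pub trait DecryptionPropertiesFactory: Send + Sync {
    fn file_decryption_properties(
        &self,
        file_path: &Path,
    ) -> Result<Arc<FileDecryptionProperties>>;
}
```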

I'm not that familiar with Java, but from what I understand it's straightforward to define your own KmsClient in a JAR and then include that at runtime so it's discoverable by the configuration mechanism. This approach doesn't really translate to Rust though. If any custom code is needed it will need to be compiled in unless we use something like WebAssembly or an FFI, but that seems overly complicated and unnecessary. We could maintain some level of string-configurability by letting users statically register named implementations of traits in code and then reference these in configuration strings. Corwin mentioned the typetag crate that can automate this, or it could be more manual.

> I personally suggest using the Arc<dyn Any> approach

I don't really understand the reason for using Any rather than a trait like Arc<dyn EncryptionPropertiesFactory>. At some point an Any would need to be downcast to something that DataFusion understands for it to be usable, right? But I agree we should come up with an example of how we'd like this to work, and that should provide more clarity.

@adamreeve
Author

> I don't really understand the reason for using Any

Actually, I think I remember now that this would let us include structs in config types in datafusion::common::config without needing to expose details of the parquet and encryption implementation.
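
A minimal sketch of that idea, with hypothetical names: datafusion-common would only see an Arc<dyn Any>, and parquet-feature-gated code would downcast it back to a concrete type.

```rust
use std::any::Any;
use std::sync::Arc;

// In datafusion-common: no parquet or encryption types are needed, so this
// field can exist even when the parquet feature is disabled.
pub struct TableParquetOptions {
    // ... existing string/primitive options ...
    pub crypto_config: Option<Arc<dyn Any + Send + Sync>>,
}

// Hypothetical concrete type, defined where the parquet feature is enabled,
// holding e.g. the factories discussed above.
pub struct ParquetEncryptionConfig {
    // factories producing per-file encryption/decryption properties
}

// In parquet-feature-gated code: recover the concrete configuration.
fn encryption_config(options: &TableParquetOptions) -> Option<&ParquetEncryptionConfig> {
    options
        .crypto_config
        .as_ref()
        .and_then(|config| config.downcast_ref::<ParquetEncryptionConfig>())
}
```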
