Skip to content

Add Parquet Modular encryption support (write) #7111

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 101 commits into from
Apr 1, 2025

Conversation

rok
Copy link
Member

@rok rok commented Feb 10, 2025

Which issue does this PR close?

Closes #7327

This PR is based on branch and an internal patch and /pull/6637 and aims to provide basic modular encryption support.

Rationale for this change

See #3511 and #7278. This allows users to write Parquet files with modular encryption.

What changes are included in this PR?

  • Adds a new encryption::encrypt module, containing types for configuring encryption, and types that implement encryption of metadata and page data.
  • Adds a new with_file_encryption_properties method to WriterPropertiesBuilder to enable encryption.
  • Adds a new encryption test library, moves existing tests for reading encrypted Parquet files to this and adds new tests for writing encrypted files.

Are there any user-facing changes?

Yes, this adds the ability to specify encryption properties when writing Parquet files.

Several new types and methods are added when the encryption feature is enabled but there should be no breaking changes.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 10, 2025
@rok rok force-pushed the encryption-basics-fork branch 5 times, most recently from 8e6a4be to d1ccf75 Compare February 17, 2025 18:52
@rok rok force-pushed the encryption-basics-fork branch 3 times, most recently from f6d1155 to 05d4d60 Compare February 21, 2025 12:18
@adamreeve adamreeve force-pushed the encryption-basics-fork branch 3 times, most recently from dad9642 to 5d52ef7 Compare March 4, 2025 07:08
@rok rok force-pushed the encryption-basics-fork branch 2 times, most recently from 36d7c70 to bc75007 Compare March 4, 2025 19:06
@adamreeve adamreeve force-pushed the encryption-basics-fork branch 3 times, most recently from 24e70d9 to cc58a3a Compare March 6, 2025 00:11
@rok rok force-pushed the encryption-basics-fork branch 2 times, most recently from 65fca4b to 44559b0 Compare March 6, 2025 11:26
@rok rok force-pushed the encryption-basics-fork branch 2 times, most recently from c5a692c to 87224ea Compare March 10, 2025 20:05
@rok rok force-pushed the encryption-basics-fork branch 4 times, most recently from cab7dd4 to de5feef Compare March 11, 2025 12:50
@alamb alamb changed the title Parquet Modular encryption support Add Parquet Modular encryption support (write) Mar 12, 2025
@alamb
Copy link
Contributor

alamb commented Mar 26, 2025

I am going to spend some time seeing if I can get it more isolated and hopefully that might serve as an example of how to do it for writing.

Update:

@adamreeve
Copy link
Contributor

I came up with #7337 which I think is easier to work with

Thanks, this example helped. I took another pass at refactoring ThriftMetadataWriter and this is much nicer now.

Since this is a major release, we can make API changes too if needed to support modular encryption.

The one thing that comes to mind is whether we want to change AsyncArrowWriter::finish to return parquet::file::ParquetMetadata instead of parquet::format::FileMetaData. This was suggested in #7254 although closed without comment, but that would probably avoid the problem of writers not being able to use the returned metadata to retrieve statistics due to them being encrypted (https://github.com/apache/arrow-rs/pull/7111/files#r2009421835). But that might be quite a large and painful breaking change, and also might come with a performance hit...

@corwinjoy
Copy link
Contributor

@alamb wrote:

Since this is a major release, we can make API changes too if needed to support modular encryption.

One minor API change that I would like to have considered is this related PR that I feel makes decryption usage better.
#7342

At the suggestion of @adamreeve I have posted this on its own PR.

@alamb
Copy link
Contributor

alamb commented Mar 27, 2025

But that might be quite a large and painful breaking change, and also might come with a performance hit...

We would have to try it I suspect to test

@alamb
Copy link
Contributor

alamb commented Mar 27, 2025

How is this PR doing, btw? Is it ready for another review?

@adamreeve
Copy link
Contributor

Yes I think this is ready for review. There are a couple of comments I'd like some input on (https://github.com/apache/arrow-rs/pull/7111/files#r2015196618 and https://github.com/apache/arrow-rs/pull/7111/files#r2009421835) but otherwise it's looking good to me.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @rok @adamreeve and @corwinjoy 👏

I think this is looking really nice. I had some style comments but I don't think anything is required. I am not an expert in this area of the parquet spec, but I found the code clear, clean, and well commented and tested 🥇 🏆

I do also think it would be great to add some examples (as a follow on PR) that shows how to write an arrow record batch with encryption to a parquet file

The only thing I want to do before approving this is to run some benchmark tests (to make sure it doesn't accidentally slow down writing without encryption). I have started this and will report back

@alamb
Copy link
Contributor

alamb commented Mar 28, 2025

I ran the performance benchmarks and I do not see any changes for parquet writing with this PR (though the numbers do vary somewhat). Nice work. Details below

Details

++ critcmp main encryption-basics-fork
group                                                                     encryption-basics-fork                 main
-----                                                                     ----------------------                 ----
write_batch nested/4096 values primitive list                             1.00      6.5±0.30ms   325.6 MB/sec    1.03      6.7±0.26ms   316.1 MB/sec
write_batch nested/4096 values primitive list non-null                    1.00      8.2±0.37ms   260.8 MB/sec    1.00      8.2±0.29ms   260.1 MB/sec
write_batch primitive/4096 values bool                                    1.01  1022.2±53.98µs  1062.3 KB/sec    1.00  1013.0±40.47µs  1072.0 KB/sec
write_batch primitive/4096 values bool non-null                           1.00   934.1±27.14µs   627.3 KB/sec    1.03   960.6±34.00µs   610.0 KB/sec
write_batch primitive/4096 values float with NaNs                         1.01  1824.6±56.19µs    30.1 MB/sec    1.00  1809.3±31.57µs    30.4 MB/sec
write_batch primitive/4096 values primitive                               1.04      2.9±0.24ms    60.4 MB/sec    1.00      2.8±0.07ms    62.6 MB/sec
write_batch primitive/4096 values primitive non-null                      1.06      2.9±0.25ms    60.5 MB/sec    1.00      2.7±0.13ms    64.0 MB/sec
write_batch primitive/4096 values primitive non-null with bloom filter    1.05     22.5±2.68ms     7.7 MB/sec    1.00     21.5±0.36ms     8.0 MB/sec
write_batch primitive/4096 values primitive with bloom filter             1.05     22.4±3.68ms     7.8 MB/sec    1.00     21.4±0.88ms     8.2 MB/sec
write_batch primitive/4096 values string                                  1.00  1746.6±55.52µs    72.3 MB/sec    1.01  1757.4±46.59µs    71.8 MB/sec
write_batch primitive/4096 values string dictionary                       1.02      4.3±0.40ms   241.6 MB/sec    1.00      4.2±0.18ms   247.1 MB/sec
write_batch primitive/4096 values string dictionary with bloom filter     1.02      5.8±0.33ms   177.5 MB/sec    1.00      5.7±0.27ms   180.3 MB/sec
write_batch primitive/4096 values string non-null                         1.02      6.9±0.24ms   295.6 MB/sec    1.00      6.8±0.25ms   302.8 MB/sec
write_batch primitive/4096 values string non-null with bloom filter       1.00     12.9±0.34ms   159.3 MB/sec    1.00     12.9±0.36ms   158.6 MB/sec
write_batch primitive/4096 values string with bloom filter                1.00     11.1±0.63ms   184.1 MB/sec    1.01     11.2±0.37ms   182.3 MB/sec

Second run

++ critcmp main encryption-basics-fork
group                                                                     encryption-basics-fork                 main
-----                                                                     ----------------------                 ----
write_batch nested/4096 values primitive list                             1.01      6.9±0.29ms   309.3 MB/sec    1.00      6.8±0.27ms   313.5 MB/sec
write_batch nested/4096 values primitive list non-null                    1.00      8.5±0.41ms   251.6 MB/sec    1.04      8.8±0.51ms   241.4 MB/sec
write_batch primitive/4096 values bool                                    1.00  1012.3±34.35µs  1072.7 KB/sec    1.03  1044.5±39.21µs  1039.6 KB/sec
write_batch primitive/4096 values bool non-null                           1.00   945.6±33.77µs   619.6 KB/sec    1.02   962.4±43.08µs   608.8 KB/sec
write_batch primitive/4096 values float with NaNs                         1.01  1876.1±54.31µs    29.3 MB/sec    1.00  1861.8±63.39µs    29.5 MB/sec
write_batch primitive/4096 values primitive                               1.03      3.0±0.13ms    58.5 MB/sec    1.00      2.9±0.12ms    60.4 MB/sec
write_batch primitive/4096 values primitive non-null                      1.04      3.0±0.75ms    57.7 MB/sec    1.00      2.9±0.60ms    60.0 MB/sec
write_batch primitive/4096 values primitive non-null with bloom filter    1.00     21.5±0.50ms     8.0 MB/sec    1.00     21.5±0.65ms     8.0 MB/sec
write_batch primitive/4096 values primitive with bloom filter             1.00     21.4±0.85ms     8.2 MB/sec    1.00     21.4±0.96ms     8.2 MB/sec
write_batch primitive/4096 values string                                  1.00  1790.4±68.70µs    70.5 MB/sec    1.02  1827.0±87.24µs    69.1 MB/sec
write_batch primitive/4096 values string dictionary                       1.00      4.6±0.28ms   225.3 MB/sec    1.06      4.8±0.30ms   213.4 MB/sec
write_batch primitive/4096 values string dictionary with bloom filter     1.00      6.2±0.31ms   166.9 MB/sec    1.01      6.2±0.42ms   165.3 MB/sec
write_batch primitive/4096 values string non-null                         1.00      7.2±0.29ms   285.1 MB/sec    1.09      7.8±0.54ms   261.6 MB/sec
write_batch primitive/4096 values string non-null with bloom filter       1.01     13.4±0.41ms   152.7 MB/sec    1.00     13.3±0.42ms   153.9 MB/sec
write_batch primitive/4096 values string with bloom filter                1.00     11.9±0.48ms   172.3 MB/sec    1.00     11.9±0.35ms   172.7 MB/sec

@adamreeve
Copy link
Contributor

adamreeve commented Mar 30, 2025

Your latest comments should all be addressed now thanks @alamb. I've also just added one extra test to cover being able to use unencrypted row group metadata when writing a file.

I'll follow up with a PR soon to add some examples.

@alamb
Copy link
Contributor

alamb commented Mar 31, 2025

Thanks @adamreeve @corwinjoy and @rok ! I'll plan to merge this tomorrow

This PR appeared to have a conflict so I took the liberty of merging up from main

Screenshot 2025-03-31 at 2 35 30 PM

@alamb
Copy link
Contributor

alamb commented Apr 1, 2025

Merged to resolve another conflict 🙄

@alamb alamb merged commit 3c5d3f5 into apache:main Apr 1, 2025
17 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 1, 2025

Thanks again @rok @etseidl @adamreeve and @corwinjoy 🦾 🚀

@alamb
Copy link
Contributor

alamb commented Apr 1, 2025

I also made a small PR to update the parquet status page (when this is released):

@alamb
Copy link
Contributor

alamb commented Apr 1, 2025

Follow on PR (for full support) is:

I will mark that ready for review when we have released the next version of arrow-rs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support writing Parquet with modular encryption
6 participants