-
Notifications
You must be signed in to change notification settings - Fork 924
Add Parquet Modular decryption (read) support + encryption
flag
#6637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Currently this is a rough rebase of work done by @ggershinsky. As |
@rok let me know if you want any help shoehorning this into |
Is there any help, input or contribution needed here? |
Thanks for the offer @etseidl & @brainslush! I'm making some progress and would definitely appreciate a review! I'll ping once I push. |
fe488b3
to
d263510
Compare
@etseidl could you please do a quick pass to say if this makes sense in respect to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only looking at the metadata bits for now...looks good to me so far. Just a few minor nits. Thanks @rok!
f90d8b4
to
29d55eb
Compare
27e77ad
to
7db06cc
Compare
a4105d5
to
3e7646d
Compare
deedba9
to
951f2fa
Compare
f6b9e88
to
23375d1
Compare
7f94e39
to
177d826
Compare
ac4ac21
to
3241425
Compare
f33c96c
to
e58ae7c
Compare
👁️ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rok @adamreeve and everyone else. I think this looks really nice
I think it would be good to move some of the tests to the "integration test" arrow_reeader
binary rather than as unit tests so the structure visibility is tested (the rules for tests compiled with --lib
is different than external) clearer, but we can do that as a follow on PR
I also verified this code is tested (parquet is tested with all features):
I also re-verified that the latest version works with datafusion (as a way to try and double check this PR has no breaking APIs)
Thanks again
@@ -2336,4 +2459,279 @@ mod tests { | |||
let result = reader.try_collect::<Vec<_>>().await.unwrap(); | |||
assert_eq!(result.len(), 1); | |||
} | |||
|
|||
#[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As they are placed here, these tests:
- Require
#[cfg(feature = "encryption")]
for each - Have different visibility rules than user programs (they can use non public items from the containing module)
Thus I recommend we move the tests to the reader
"integration test" -- https://github.com/apache/arrow-rs/tree/main/parquet/tests/arrow_reader
This is what gets run via cargo test -p parquet --test arrow_reader
We can basically follow the same pattern as checksumming
arrow-rs/parquet/tests/arrow_reader/mod.rs
Lines 39 to 40 in f4fde76
#[cfg(feature = "crc")] | |
mod checksum; |
I think the way this PR has it is fine, however, and I think we should move the tests as part of a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed
- Consolidate encryption parquet tests #7280 to track this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes since my last review seem good. I wish we could avoid the small bit of code duplication in the async_reader, but it's not worth holding this up any longer. Thanks all.
Co-authored-by: Rok Mihevc <[email protected]>
I merged this PR up from main and once the CI tests pass I will merge it in. Thank you all for all your effort to get this over the line! |
I indulged in some ticket organizing and have collected the various follow on items / work in a new epic so we can follow along |
I also made a PR to the parquet site website to note that arrow-rs now supports modular reading ❤️ |
🚀 |
I also filed #7281 to track adding some examples |
…ache#6637) * first commit * Use ParquetMetaDataReader * Fix CI * test * save progress * work * Review feedback * page decompression issue * add update_aad * Change encrypt and decrypt to return Results * Use correct page ordinal and module type in AADs * Tidy up ordinal types * Lint * Fix regular deserialization path * cleaning * Update data checks in test * start non-uniform decryption * Add missing doc comments * Make encryption an optional feature * Handle when a file is encrypted but encryption is disabled or no decryption properties are provided * Allow for plaintext footer * work * Fix method name * work * Minor * work * work * work * Fix reading to end of file * Refactor tests * Fix non-uniform encryption configuration * Don't use footer key for non-encrypted columns * Rebase and cleanup * Cleanup * Cleanup * Cleanup * Cleanup * Cleanup * Cleanup * lint * Remove encryption setup * Fix building with ring on wasm * file_decryptor into a seperate module * lint * FileDecryptionProperties should have at least one key * Move cyphertext reading into decryptor * More tidy up of footer key handling * Get column decryptors as RingGcmBlockDecryptor * Use Arc<dyn BlockDecryptor> * Fix file metadata tests * Handle reading plaintext footer files without decryption properties * Split up encryption modules further * Error instead of panic for AES-GCM-CTR * load_async * new_with_options * Add tests * get_metadata * Add CryptoContext to async_reader * Add row_group_ordinal to InMemoryRowGroup * Adjust docstrings * Apply suggestions from code review Co-authored-by: Adam Reeve <[email protected]> * Review feedback * move file_decryption_properties into ArrowReaderOptions * make create_page_aad method of CryptoContext * Review feedback * Infer ModuleType in create_page_aad * add create_page_header_aad * Review feedback * Update parquet/src/arrow/async_reader/store.rs Co-authored-by: Ed Seidl <[email protected]> * Review feedback * Update parquet/src/encryption/ciphers.rs Co-authored-by: Adam Reeve <[email protected]> * Review feedback * WIP: Decryption shouldn't change the API * WIP: Decryption shouldn't change the API * WIP: Decryption shouldn't change the API * WIP: Decryption shouldn't change the API * WIP: Decryption shouldn't change the API * WIP: Decryption shouldn't change the API * Review feedback * Handle common encryption errors Co-authored-by: Corwin Joy <[email protected]> Co-authored-by: Adam Reeve <[email protected]> * Apply suggestions from code review Co-authored-by: Adam Reeve <[email protected]> * Update parquet/src/arrow/async_reader/mod.rs Co-authored-by: Adam Reeve <[email protected]> * Fix previous commit * Add TestReader::new, less pub functions, add test_non_uniform_encryption_disabled_aad_storage * Review feedback * Add new CI check * Rename decryption module to decrypt. This is because we'll introduce encryption module later and we'll have to name it encrypt to not clash with the super module name (encryption). It would be odd to have sub modules called encrypt and decryption. * Fix failing where encryption is enabled but no decryption properties are provided or where encyption is disabled but file is encrypted. * Apply suggestions from code review Co-authored-by: Adam Reeve <[email protected]> * get_metadata_with_encryption -> get_metadata_with_options * Use ParquetMetaData instead of RowGroupMetaData in InMemoryRowGroup. Change row_group_ordinal to row_group_idx. * Review feedback * Fixes * Continue refactor away from encryption specific APIs and fix async reader load * Add default get_metadata_with_options implementation * Minor tidy ups and test fix * Update parquet/src/encryption/mod.rs * Update parquet/src/lib.rs Co-authored-by: Rok Mihevc <[email protected]> --------- Co-authored-by: Gidon Gershinsky <[email protected]> Co-authored-by: Adam Reeve <[email protected]> Co-authored-by: Ed Seidl <[email protected]> Co-authored-by: Corwin Joy <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
This PR is based on branch and an internal patch and aims to provide basic modular decryption support. Partially closes #3511. We decided to split encryption work into a separate PR.
Rationale for this change
What changes are included in this PR?
This introduces AesGcmV1 cypher decryption to
ArrowReaderMetadata
andParquetRecordBatchReader
. Introduced classes and functions are tested on sample files fromparquet-dataset
.Are there any user-facing changes?
Several new classes and method parameters are introduced. If project is compiled without
encryption
flag changes are not breaking. Ifencryption
flag is on some methods and constructors (e.g.ParquetMetaData::new
) will require new parameters which would be a breaking change.