Skip to content

Add Parquet Modular decryption (read) support + encryption flag #6637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 99 commits into from
Mar 12, 2025

Conversation

rok
Copy link
Member

@rok rok commented Oct 28, 2024

Which issue does this PR close?

This PR is based on branch and an internal patch and aims to provide basic modular decryption support. Partially closes #3511. We decided to split encryption work into a separate PR.

Rationale for this change

What changes are included in this PR?

This introduces AesGcmV1 cypher decryption to ArrowReaderMetadata and ParquetRecordBatchReader. Introduced classes and functions are tested on sample files from parquet-dataset.

Are there any user-facing changes?

Several new classes and method parameters are introduced. If project is compiled without encryption flag changes are not breaking. If encryption flag is on some methods and constructors (e.g. ParquetMetaData::new) will require new parameters which would be a breaking change.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 28, 2024
@rok
Copy link
Member Author

rok commented Oct 28, 2024

Currently this is a rough rebase of work done by @ggershinsky. As ParquetMetaDataReader is now available some refactoring will be required.

@etseidl
Copy link
Contributor

etseidl commented Oct 28, 2024

As ParquetMetaDataReader is now available some refactoring will be required.

@rok let me know if you want any help shoehorning this into ParquetMetaDataReader.

@brainslush
Copy link

Is there any help, input or contribution needed here?

@rok
Copy link
Member Author

rok commented Nov 21, 2024

Thanks for the offer @etseidl & @brainslush! I'm making some progress and would definitely appreciate a review! I'll ping once I push.

@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from fe488b3 to d263510 Compare November 23, 2024 23:06
@rok
Copy link
Member Author

rok commented Dec 4, 2024

As ParquetMetaDataReader is now available some refactoring will be required.

@rok let me know if you want any help shoehorning this into ParquetMetaDataReader.

@etseidl could you please do a quick pass to say if this makes sense in respect to ParquetMetaDataReader?
I'll continue with data decryption.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only looking at the metadata bits for now...looks good to me so far. Just a few minor nits. Thanks @rok!

@rok rok force-pushed the decryption-basics-fork branch from f90d8b4 to 29d55eb Compare December 16, 2024 23:51
@adamreeve adamreeve force-pushed the decryption-basics-fork branch from 27e77ad to 7db06cc Compare December 20, 2024 02:15
@rok rok force-pushed the decryption-basics-fork branch from a4105d5 to 3e7646d Compare January 9, 2025 21:59
@rok rok force-pushed the decryption-basics-fork branch 4 times, most recently from deedba9 to 951f2fa Compare January 21, 2025 20:35
@rok rok changed the title Parquet Modular Encryption support Parquet Modular decryption support Jan 21, 2025
@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from f6b9e88 to 23375d1 Compare January 23, 2025 18:17
@adamreeve adamreeve force-pushed the decryption-basics-fork branch 3 times, most recently from 7f94e39 to 177d826 Compare January 24, 2025 02:46
@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from ac4ac21 to 3241425 Compare February 5, 2025 10:01
@adamreeve adamreeve force-pushed the decryption-basics-fork branch from f33c96c to e58ae7c Compare March 11, 2025 03:08
rok pushed a commit to rok/arrow-rs that referenced this pull request Mar 11, 2025
@rok rok requested a review from alamb March 11, 2025 14:14
@alamb
Copy link
Contributor

alamb commented Mar 11, 2025

👁️

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @rok @adamreeve and everyone else. I think this looks really nice

I think it would be good to move some of the tests to the "integration test" arrow_reeader binary rather than as unit tests so the structure visibility is tested (the rules for tests compiled with --lib is different than external) clearer, but we can do that as a follow on PR

I also verified this code is tested (parquet is tested with all features):

https://github.com/apache/arrow-rs/blob/c4737a4939dbcfd31252fbf747cebb42ffb455ca/.github/workflows/parquet.yml#L63-L62

I also re-verified that the latest version works with datafusion (as a way to try and double check this PR has no breaking APIs)

Thanks again

@@ -2336,4 +2459,279 @@ mod tests {
let result = reader.try_collect::<Vec<_>>().await.unwrap();
assert_eq!(result.len(), 1);
}

#[tokio::test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As they are placed here, these tests:

  1. Require #[cfg(feature = "encryption")] for each
  2. Have different visibility rules than user programs (they can use non public items from the containing module)

Thus I recommend we move the tests to the reader "integration test" -- https://github.com/apache/arrow-rs/tree/main/parquet/tests/arrow_reader

This is what gets run via cargo test -p parquet --test arrow_reader

We can basically follow the same pattern as checksumming

#[cfg(feature = "crc")]
mod checksum;

I think the way this PR has it is fine, however, and I think we should move the tests as part of a follow on PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes since my last review seem good. I wish we could avoid the small bit of code duplication in the async_reader, but it's not worth holding this up any longer. Thanks all.

@alamb
Copy link
Contributor

alamb commented Mar 12, 2025

I merged this PR up from main and once the CI tests pass I will merge it in.

Thank you all for all your effort to get this over the line!

@alamb
Copy link
Contributor

alamb commented Mar 12, 2025

This is a good team effort if I ever saw one:

Screenshot 2025-03-12 at 4 59 13 PM

@alamb
Copy link
Contributor

alamb commented Mar 12, 2025

I indulged in some ticket organizing and have collected the various follow on items / work in a new epic so we can follow along

@alamb
Copy link
Contributor

alamb commented Mar 12, 2025

I also made a PR to the parquet site website to note that arrow-rs now supports modular reading ❤️

@alamb alamb merged commit d5339f3 into apache:main Mar 12, 2025
38 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 12, 2025

🚀

@alamb
Copy link
Contributor

alamb commented Mar 13, 2025

I also filed #7281 to track adding some examples

PinkCrow007 pushed a commit to PinkCrow007/arrow-rs that referenced this pull request Mar 20, 2025
…ache#6637)

* first commit

* Use ParquetMetaDataReader

* Fix CI

* test

* save progress

* work

* Review feedback

* page decompression issue

* add update_aad

* Change encrypt and decrypt to return Results

* Use correct page ordinal and module type in AADs

* Tidy up ordinal types

* Lint

* Fix regular deserialization path

* cleaning

* Update data checks in test

* start non-uniform decryption

* Add missing doc comments

* Make encryption an optional feature

* Handle when a file is encrypted but encryption is disabled or no decryption properties are provided

* Allow for plaintext footer

* work

* Fix method name

* work

* Minor

* work

* work

* work

* Fix reading to end of file

* Refactor tests

* Fix non-uniform encryption configuration

* Don't use footer key for non-encrypted columns

* Rebase and cleanup

* Cleanup

* Cleanup

* Cleanup

* Cleanup

* Cleanup

* Cleanup

* lint

* Remove encryption setup

* Fix building with ring on wasm

* file_decryptor into a seperate module

* lint

* FileDecryptionProperties should have at least one key

* Move cyphertext reading into decryptor

* More tidy up of footer key handling

* Get column decryptors as RingGcmBlockDecryptor

* Use Arc<dyn BlockDecryptor>

* Fix file metadata tests

* Handle reading plaintext footer files without decryption properties

* Split up encryption modules further

* Error instead of panic for AES-GCM-CTR

* load_async

* new_with_options

* Add tests

* get_metadata

* Add CryptoContext to async_reader

* Add row_group_ordinal to InMemoryRowGroup

* Adjust docstrings

* Apply suggestions from code review

Co-authored-by: Adam Reeve <[email protected]>

* Review feedback

* move file_decryption_properties into ArrowReaderOptions

* make create_page_aad method of CryptoContext

* Review feedback

* Infer ModuleType in create_page_aad

* add create_page_header_aad

* Review feedback

* Update parquet/src/arrow/async_reader/store.rs

Co-authored-by: Ed Seidl <[email protected]>

* Review feedback

* Update parquet/src/encryption/ciphers.rs

Co-authored-by: Adam Reeve <[email protected]>

* Review feedback

* WIP: Decryption shouldn't change the API

* WIP: Decryption shouldn't change the API

* WIP: Decryption shouldn't change the API

* WIP: Decryption shouldn't change the API

* WIP: Decryption shouldn't change the API

* WIP: Decryption shouldn't change the API

* Review feedback

* Handle common encryption errors

Co-authored-by: Corwin Joy <[email protected]>
Co-authored-by: Adam Reeve <[email protected]>

* Apply suggestions from code review

Co-authored-by: Adam Reeve <[email protected]>

* Update parquet/src/arrow/async_reader/mod.rs

Co-authored-by: Adam Reeve <[email protected]>

* Fix previous commit

* Add TestReader::new, less pub functions, add test_non_uniform_encryption_disabled_aad_storage

* Review feedback

* Add new CI check

* Rename decryption module to decrypt. This is because we'll introduce encryption module later and we'll have to name it encrypt to not clash with the super module name (encryption). It would be odd to have sub modules called encrypt and decryption.

* Fix failing where encryption is enabled but no decryption properties are provided or where encyption is disabled but file is encrypted.

* Apply suggestions from code review

Co-authored-by: Adam Reeve <[email protected]>

* get_metadata_with_encryption -> get_metadata_with_options

* Use ParquetMetaData instead of RowGroupMetaData in InMemoryRowGroup. Change row_group_ordinal to row_group_idx.

* Review feedback

* Fixes

* Continue refactor away from encryption specific APIs and fix async reader load

* Add default get_metadata_with_options implementation

* Minor tidy ups and test fix

* Update parquet/src/encryption/mod.rs

* Update parquet/src/lib.rs

Co-authored-by: Rok Mihevc <[email protected]>

---------

Co-authored-by: Gidon Gershinsky <[email protected]>
Co-authored-by: Adam Reeve <[email protected]>
Co-authored-by: Ed Seidl <[email protected]>
Co-authored-by: Corwin Joy <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet Modular Encryption support
8 participants