Commit d5339f3

rok, ggershinsky, adamreeve, etseidl, and corwinjoy authored
Add Parquet Modular decryption (read) support + encryption flag (#6637)
* first commit
* Use ParquetMetaDataReader
* Fix CI
* test
* save progress
* work
* Review feedback
* page decompression issue
* add update_aad
* Change encrypt and decrypt to return Results
* Use correct page ordinal and module type in AADs
* Tidy up ordinal types
* Lint
* Fix regular deserialization path
* cleaning
* Update data checks in test
* start non-uniform decryption
* Add missing doc comments
* Make encryption an optional feature
* Handle when a file is encrypted but encryption is disabled or no decryption properties are provided
* Allow for plaintext footer
* work
* Fix method name
* work
* Minor
* work
* work
* work
* Fix reading to end of file
* Refactor tests
* Fix non-uniform encryption configuration
* Don't use footer key for non-encrypted columns
* Rebase and cleanup
* Cleanup
* Cleanup
* Cleanup
* Cleanup
* Cleanup
* Cleanup
* lint
* Remove encryption setup
* Fix building with ring on wasm
* file_decryptor into a separate module
* lint
* FileDecryptionProperties should have at least one key
* Move cyphertext reading into decryptor
* More tidy up of footer key handling
* Get column decryptors as RingGcmBlockDecryptor
* Use Arc<dyn BlockDecryptor>
* Fix file metadata tests
* Handle reading plaintext footer files without decryption properties
* Split up encryption modules further
* Error instead of panic for AES-GCM-CTR
* load_async
* new_with_options
* Add tests
* get_metadata
* Add CryptoContext to async_reader
* Add row_group_ordinal to InMemoryRowGroup
* Adjust docstrings
* Apply suggestions from code review
  Co-authored-by: Adam Reeve <[email protected]>
* Review feedback
* move file_decryption_properties into ArrowReaderOptions
* make create_page_aad method of CryptoContext
* Review feedback
* Infer ModuleType in create_page_aad
* add create_page_header_aad
* Review feedback
* Update parquet/src/arrow/async_reader/store.rs
  Co-authored-by: Ed Seidl <[email protected]>
* Review feedback
* Update parquet/src/encryption/ciphers.rs
  Co-authored-by: Adam Reeve <[email protected]>
* Review feedback
* WIP: Decryption shouldn't change the API
* WIP: Decryption shouldn't change the API
* WIP: Decryption shouldn't change the API
* WIP: Decryption shouldn't change the API
* WIP: Decryption shouldn't change the API
* WIP: Decryption shouldn't change the API
* Review feedback
* Handle common encryption errors
  Co-authored-by: Corwin Joy <[email protected]>
  Co-authored-by: Adam Reeve <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Adam Reeve <[email protected]>
* Update parquet/src/arrow/async_reader/mod.rs
  Co-authored-by: Adam Reeve <[email protected]>
* Fix previous commit
* Add TestReader::new, less pub functions, add test_non_uniform_encryption_disabled_aad_storage
* Review feedback
* Add new CI check
* Rename decryption module to decrypt. This is because we'll introduce encryption module later and we'll have to name it encrypt to not clash with the super module name (encryption). It would be odd to have sub modules called encrypt and decryption.
* Fix failing where encryption is enabled but no decryption properties are provided or where encryption is disabled but file is encrypted.
* Apply suggestions from code review
  Co-authored-by: Adam Reeve <[email protected]>
* get_metadata_with_encryption -> get_metadata_with_options
* Use ParquetMetaData instead of RowGroupMetaData in InMemoryRowGroup. Change row_group_ordinal to row_group_idx.
* Review feedback
* Fixes
* Continue refactor away from encryption specific APIs and fix async reader load
* Add default get_metadata_with_options implementation
* Minor tidy ups and test fix
* Update parquet/src/encryption/mod.rs
* Update parquet/src/lib.rs
  Co-authored-by: Rok Mihevc <[email protected]>

---------

Co-authored-by: Gidon Gershinsky <[email protected]>
Co-authored-by: Adam Reeve <[email protected]>
Co-authored-by: Ed Seidl <[email protected]>
Co-authored-by: Corwin Joy <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
1 parent 3ca4652 commit d5339f3
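
Several items in the commit message ("Use correct page ordinal and module type in AADs", "Infer ModuleType in create_page_aad") concern the additional authenticated data (AAD) that AES-GCM binds to each encrypted module. The sketch below shows how such an AAD could be assembled under the layout described in the Parquet modular encryption specification: the file AAD (AAD prefix plus file-unique bytes), a one-byte module type, then two-byte little-endian row group, column, and (for data pages and data page headers only) page ordinals. The names `ModuleType` and `create_module_aad` and the `String` error type are assumptions made for illustration; they are not claimed to match the signatures added in this commit.

```rust
/// Module type codes as listed in the Parquet modular encryption spec.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum ModuleType {
    Footer = 0,
    ColumnMetaData = 1,
    DataPage = 2,
    DictionaryPage = 3,
    DataPageHeader = 4,
    DictionaryPageHeader = 5,
    ColumnIndex = 6,
    OffsetIndex = 7,
    BloomFilterHeader = 8,
    BloomFilterBitset = 9,
}

/// Convert an ordinal to the 2-byte little-endian form used in module AADs.
fn ordinal_bytes(value: usize, what: &str) -> Result<[u8; 2], String> {
    let short = i16::try_from(value)
        .map_err(|_| format!("{what} ordinal {value} does not fit in 16 bits"))?;
    Ok(short.to_le_bytes())
}

/// Hypothetical helper: build the AAD for one module.
/// `file_aad` is assumed to be the AAD prefix (if any) followed by the
/// file-unique bytes stored in the file's encryption metadata.
pub fn create_module_aad(
    file_aad: &[u8],
    module_type: ModuleType,
    row_group_idx: usize,
    column_ordinal: usize,
    page_ordinal: Option<usize>,
) -> Result<Vec<u8>, String> {
    let mut aad = Vec::with_capacity(file_aad.len() + 7);
    aad.extend_from_slice(file_aad);
    aad.push(module_type as u8);

    // The footer AAD carries no ordinals at all.
    if module_type == ModuleType::Footer {
        return Ok(aad);
    }

    aad.extend_from_slice(&ordinal_bytes(row_group_idx, "row group")?);
    aad.extend_from_slice(&ordinal_bytes(column_ordinal, "column")?);

    // Only data pages and data page headers include a page ordinal.
    if matches!(module_type, ModuleType::DataPage | ModuleType::DataPageHeader) {
        let page = page_ordinal
            .ok_or_else(|| "page ordinal is required for data pages".to_string())?;
        aad.extend_from_slice(&ordinal_bytes(page, "page")?);
    }
    Ok(aad)
}
```

Because the ordinals are part of the authenticated data, a page moved to a different position in the file, or decrypted with the wrong module type, fails the GCM tag check instead of yielding data silently.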

21 files changed, +1890 -76 lines changed

.github/workflows/parquet.yml

Lines changed: 2 additions & 0 deletions
@@ -111,6 +111,8 @@ jobs:
         run: cargo check -p parquet --all-targets --all-features
       - name: Check compilation --all-targets --no-default-features --features json
         run: cargo check -p parquet --all-targets --no-default-features --features json
+      - name: Check compilation --no-default-features --features encryption --features async
+        run: cargo check -p parquet --no-default-features --features encryption --features async

   # test the parquet crate builds against wasm32 in stable rust
   wasm32-build:

parquet/Cargo.toml

Lines changed: 5 additions & 0 deletions
@@ -30,6 +30,8 @@ rust-version = { workspace = true }

 [target.'cfg(target_arch = "wasm32")'.dependencies]
 ahash = { version = "0.8", default-features = false, features = ["compile-time-rng"] }
+# See https://github.com/briansmith/ring/issues/918#issuecomment-2077788925
+ring = { version = "0.17", default-features = false, features = ["wasm32_unknown_unknown_js", "std"], optional = true }

 [target.'cfg(not(target_arch = "wasm32"))'.dependencies]
 ahash = { version = "0.8", default-features = false, features = ["runtime-rng"] }
@@ -70,6 +72,7 @@ half = { version = "2.1", default-features = false, features = ["num-traits"] }
 sysinfo = { version = "0.33.0", optional = true, default-features = false, features = ["system"] }
 crc32fast = { version = "1.4.2", optional = true, default-features = false }
 simdutf8 = { version = "0.1.5", optional = true, default-features = false }
+ring = { version = "0.17", default-features = false, features = ["std"], optional = true }

 [dev-dependencies]
 base64 = { version = "0.22", default-features = false, features = ["std"] }
@@ -117,6 +120,8 @@ sysinfo = ["dep:sysinfo"]
 crc = ["dep:crc32fast"]
 # Enable SIMD UTF-8 validation
 simdutf8 = ["dep:simdutf8"]
+# Enable Parquet modular encryption support
+encryption = ["dep:ring"]


 [[example]]
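
The `encryption` feature above is optional, and the commit message notes that the reader must cope with an encrypted file when the feature is disabled or when no decryption properties are supplied. The following is a minimal sketch of that kind of compile-time gating using `cfg!`. The names `ReadError` and `check_readable` are hypothetical and are not the crate's API, and the sketch deliberately ignores finer cases such as files with a plaintext footer.

```rust
/// Hypothetical error type for this sketch; not part of the parquet crate.
#[derive(Debug, PartialEq, Eq)]
pub enum ReadError {
    /// The file is encrypted but the crate was built without `encryption`.
    EncryptionDisabled,
    /// The feature is enabled but the caller supplied no decryption properties.
    MissingDecryptionProperties,
}

/// Hypothetical pre-flight check run before decoding a file.
pub fn check_readable(
    file_is_encrypted: bool,
    has_decryption_properties: bool,
) -> Result<(), ReadError> {
    // Unencrypted files are readable regardless of how the crate was built.
    if !file_is_encrypted {
        return Ok(());
    }
    // `cfg!` evaluates to a compile-time constant, so the unused branch
    // is trivially removed when the feature is on or off.
    if !cfg!(feature = "encryption") {
        return Err(ReadError::EncryptionDisabled);
    }
    if !has_decryption_properties {
        return Err(ReadError::MissingDecryptionProperties);
    }
    Ok(())
}
```

Returning an error in both situations, rather than panicking or misparsing ciphertext, matches the intent stated in the commit message; the new CI step that checks `--no-default-features --features encryption --features async` exercises one such feature combination at build time.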

parquet/README.md

Lines changed: 3 additions & 0 deletions
@@ -63,6 +63,7 @@ The `parquet` crate provides the following features which may be enabled in your
 - `crc` - enables functionality to automatically verify checksums of each page (if present) when decoding
 - `experimental` - Experimental APIs which may change, even between minor releases
 - `simdutf8` (default) - Use the [`simdutf8`] crate for SIMD-accelerated UTF-8 validation
+- `encryption` - support for reading / writing encrypted Parquet files

 [`arrow`]: https://crates.io/crates/arrow
 [`simdutf8`]: https://crates.io/crates/simdutf8
@@ -76,12 +77,14 @@ The `parquet` crate provides the following features which may be enabled in your
   - [x] Row record reader
   - [x] Arrow record reader
   - [x] Async support (to Arrow)
+  - [x] Encrypted files
 - [x] Statistics support
 - [x] Write support
   - [x] Primitive column value writers
   - [ ] Row record writer
   - [x] Arrow record writer
   - [x] Async support
+  - [ ] Encrypted files
 - [x] Predicate pushdown
 - [x] Parquet format 4.0.0 support

