-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Parquet Modular decryption (read) support + encryption
flag
#6637
base: main
Are you sure you want to change the base?
Conversation
Currently this is a rough rebase of work done by @ggershinsky. As |
@rok let me know if you want any help shoehorning this into |
Is there any help, input or contribution needed here? |
Thanks for the offer @etseidl & @brainslush! I'm making some progress and would definitely appreciate a review! I'll ping once I push. |
fe488b3
to
d263510
Compare
@etseidl could you please do a quick pass to say if this makes sense in respect to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only looking at the metadata bits for now...looks good to me so far. Just a few minor nits. Thanks @rok!
f90d8b4
to
29d55eb
Compare
27e77ad
to
7db06cc
Compare
a4105d5
to
3e7646d
Compare
deedba9
to
951f2fa
Compare
f6b9e88
to
23375d1
Compare
7f94e39
to
177d826
Compare
ac4ac21
to
3241425
Compare
9bdf7f0
to
a0cea3e
Compare
a0cea3e
to
7e9e7aa
Compare
…encryption module later and we'll have to name it encrypt to not clash with the super module name (encryption). It would be odd to have sub modules called encrypt and decryption.
…are provided or where encyption is disabled but file is encrypted.
d5f2605
to
d9e597c
Compare
Thanks for the review @adamreeve, I think I addressed all the points, feel free to do another pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @rok -- I think it is quite close.
I went over the PR quite carefully and only real concern is the API in AsyncFileReader::get_metadata_with_encryption
-- I left comments inline. Everything else is minor / could be done as some follow on clean up PR I think
The CI failure here
https://github.com/apache/arrow-rs/actions/runs/13770783611/job/38509760112?pr=6637
Appears unrelated to the changes in this PR as it happens for me on main as well
andrewlamb@Andrews-MacBook-Pro-2:~/Software/arrow-rs/object_store$ cargo msr verify
error: no such command: `msr`
Did you mean `msrv`?
View all installed commands with `cargo --list`
Find a package to install `msr` with `cargo search cargo-msr`
andrewlamb@Andrews-MacBook-Pro-2:~/Software/arrow-rs/object_store$ cargo msrv verify
[Meta] cargo-msrv 0.17.1
Compatibility Check #1: Rust 1.64.0
[FAIL] Is incompatible
╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ error: failed to parse lock file at: /Users/andrewlamb/Software/arrow-rs/object_store/Cargo.lock │
│ │
│ Caused │
│ by: │
│ lock file version `4` was found, but this version of Cargo does not understand this lock file, per │
│ haps Cargo needs to be updated? │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
Crate source was found to be incompatible with Rust version '1.64.0' specified as MSRV in the Cargo manifest located at '/Users/andrewlamb/Software/arrow-rs/object_store/Cargo.toml'andrewlamb@Andrews-MacBook-Pro-2:~/Software/arrow-rs/object_store$
I also made a PR to test this didn't inadvertently break things downstream and it looks good to me:
#[cfg(feature = "encryption")] | ||
fn get_metadata_with_encryption( | ||
&mut self, | ||
file_decryption_properties: Option<FileDecryptionProperties>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this new API and it seems overly specific to the encryption feature.
It seems like the need is to pass file_decryption_properties
to the underlying ParquetMetadataReader
but there are other potential options on ArrowReaderOptions
that might be useful.
What if instead we added a new method like this:
/// Provides asynchronous access to the [`ParquetMetaData`] of a parquet file,
/// allowing fine-grained control over how metadata is sourced, in particular allowing
/// for caching, pre-fetching, catalog metadata, etc...
///
/// By default calls `get_metadata()`
fn get_metadata_with_options(&mut self, options: &ArrowReaderOptions) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
self.get_metadata()
}
I think that would
- Be backwards compatible
- Permit access to the encryption options necessary to decrype the data
- Allow us eventually to clean up
ArrowReaderMetadata::load_async
function (in some other PR)
As a follow on PR we could potentially deprecate get_metadata
and direct people to implement get_metadata_with_options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does indeed seem like the better API! Changed in 90434d6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed some follow up changes (with Rok's blessing) to continue this refactor and fix the async metadata loading with an ObjectStore
. I think this is addressed now.
@@ -853,6 +929,10 @@ struct InMemoryRowGroup<'a> { | |||
offset_index: Option<&'a [OffsetIndexMetaData]>, | |||
column_chunks: Vec<Option<Arc<ColumnChunkData>>>, | |||
row_count: usize, | |||
#[cfg(feature = "encryption")] | |||
row_group_ordinal: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I think row_group_idx
would be a name that is more consistent with the rest of this crate (and it is used as the local variable name above too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to row_group_idx
.
I don't have a strong preference, but parquet encryption spec uses ordinal while here we do indeed use idx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also don't have a strong preference -- as long as the code is consistent with something (the spec would be fine) I am ok with either
parquet/src/encryption/decrypt.rs
Outdated
// decrypt parquet modules (data pages, dictionary pages, etc.). | ||
#[derive(Debug, Clone)] | ||
pub(crate) struct CryptoContext { | ||
pub(crate) row_group_ordinal: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stylistically, as before I think using idx
or index
for row group and column group would be more consistent with the existing code but it is not a major item
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #6637 (comment). I'm ok with changing page_ordinal and column_ordinal if that's better for readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Rok, yes all my review comments were addressed, I've just left a few small follow up comments.
Co-authored-by: Adam Reeve <[email protected]>
…Change row_group_ordinal to row_group_idx.
9531491
to
67b8cb9
Compare
f33c96c
to
e58ae7c
Compare
👁️ |
Which issue does this PR close?
This PR is based on branch and an internal patch and aims to provide basic modular decryption support. Partially closes #3511. We decided to split encryption work into a separate PR.
Rationale for this change
What changes are included in this PR?
This introduces AesGcmV1 cypher decryption to
ArrowReaderMetadata
andParquetRecordBatchReader
. Introduced classes and functions are tested on sample files fromparquet-dataset
.Are there any user-facing changes?
Several new classes and method parameters are introduced. If project is compiled without
encryption
flag changes are not breaking. Ifencryption
flag is on some methods and constructors (e.g.ParquetMetaData::new
) will require new parameters which would be a breaking change.