Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Parquet Modular decryption (read) support + encryption flag #6637

Open
wants to merge 97 commits into
base: main
Choose a base branch
from

Conversation

rok
Copy link
Member

@rok rok commented Oct 28, 2024

Which issue does this PR close?

This PR is based on branch and an internal patch and aims to provide basic modular decryption support. Partially closes #3511. We decided to split encryption work into a separate PR.

Rationale for this change

What changes are included in this PR?

This introduces AesGcmV1 cypher decryption to ArrowReaderMetadata and ParquetRecordBatchReader. Introduced classes and functions are tested on sample files from parquet-dataset.

Are there any user-facing changes?

Several new classes and method parameters are introduced. If project is compiled without encryption flag changes are not breaking. If encryption flag is on some methods and constructors (e.g. ParquetMetaData::new) will require new parameters which would be a breaking change.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 28, 2024
@rok
Copy link
Member Author

rok commented Oct 28, 2024

Currently this is a rough rebase of work done by @ggershinsky. As ParquetMetaDataReader is now available some refactoring will be required.

@etseidl
Copy link
Contributor

etseidl commented Oct 28, 2024

As ParquetMetaDataReader is now available some refactoring will be required.

@rok let me know if you want any help shoehorning this into ParquetMetaDataReader.

@brainslush
Copy link

Is there any help, input or contribution needed here?

@rok
Copy link
Member Author

rok commented Nov 21, 2024

Thanks for the offer @etseidl & @brainslush! I'm making some progress and would definitely appreciate a review! I'll ping once I push.

@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from fe488b3 to d263510 Compare November 23, 2024 23:06
@rok
Copy link
Member Author

rok commented Dec 4, 2024

As ParquetMetaDataReader is now available some refactoring will be required.

@rok let me know if you want any help shoehorning this into ParquetMetaDataReader.

@etseidl could you please do a quick pass to say if this makes sense in respect to ParquetMetaDataReader?
I'll continue with data decryption.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only looking at the metadata bits for now...looks good to me so far. Just a few minor nits. Thanks @rok!

@rok rok force-pushed the decryption-basics-fork branch from f90d8b4 to 29d55eb Compare December 16, 2024 23:51
@adamreeve adamreeve force-pushed the decryption-basics-fork branch from 27e77ad to 7db06cc Compare December 20, 2024 02:15
@rok rok force-pushed the decryption-basics-fork branch from a4105d5 to 3e7646d Compare January 9, 2025 21:59
@rok rok force-pushed the decryption-basics-fork branch 4 times, most recently from deedba9 to 951f2fa Compare January 21, 2025 20:35
@rok rok changed the title Parquet Modular Encryption support Parquet Modular decryption support Jan 21, 2025
@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from f6b9e88 to 23375d1 Compare January 23, 2025 18:17
@adamreeve adamreeve force-pushed the decryption-basics-fork branch 3 times, most recently from 7f94e39 to 177d826 Compare January 24, 2025 02:46
@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from ac4ac21 to 3241425 Compare February 5, 2025 10:01
@rok rok force-pushed the decryption-basics-fork branch from 9bdf7f0 to a0cea3e Compare March 10, 2025 14:35
@rok rok force-pushed the decryption-basics-fork branch from a0cea3e to 7e9e7aa Compare March 10, 2025 14:39
rok added 3 commits March 10, 2025 15:49
…encryption module later and we'll have to name it encrypt to not clash with the super module name (encryption). It would be odd to have sub modules called encrypt and decryption.
…are provided or where encyption is disabled but file is encrypted.
@rok rok force-pushed the decryption-basics-fork branch from d5f2605 to d9e597c Compare March 10, 2025 17:19
@rok
Copy link
Member Author

rok commented Mar 10, 2025

Thanks for the review @adamreeve, I think I addressed all the points, feel free to do another pass.

rok pushed a commit to rok/arrow-rs that referenced this pull request Mar 10, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @rok -- I think it is quite close.

I went over the PR quite carefully and only real concern is the API in AsyncFileReader::get_metadata_with_encryption -- I left comments inline. Everything else is minor / could be done as some follow on clean up PR I think

The CI failure here
https://github.com/apache/arrow-rs/actions/runs/13770783611/job/38509760112?pr=6637

Appears unrelated to the changes in this PR as it happens for me on main as well

andrewlamb@Andrews-MacBook-Pro-2:~/Software/arrow-rs/object_store$ cargo msr verify
error: no such command: `msr`

	Did you mean `msrv`?

	View all installed commands with `cargo --list`
	Find a package to install `msr` with `cargo search cargo-msr`
andrewlamb@Andrews-MacBook-Pro-2:~/Software/arrow-rs/object_store$ cargo msrv verify
  [Meta]   cargo-msrv 0.17.1

Compatibility Check #1: Rust 1.64.0
  [FAIL]   Is incompatible

  ╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮
  │ error: failed to parse lock file at: /Users/andrewlamb/Software/arrow-rs/object_store/Cargo.lock     │
  │                                                                                                      │
  │ Caused                                                                                               │
  │ by:                                                                                                  │
  │   lock file version `4` was found, but this version of Cargo does not understand this lock file, per │
  │ haps Cargo needs to be updated?                                                                      │
  │                                                                                                      │
  ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯



Crate source was found to be incompatible with Rust version '1.64.0' specified as MSRV in the Cargo manifest located at '/Users/andrewlamb/Software/arrow-rs/object_store/Cargo.toml'andrewlamb@Andrews-MacBook-Pro-2:~/Software/arrow-rs/object_store$

I also made a PR to test this didn't inadvertently break things downstream and it looks good to me:

#[cfg(feature = "encryption")]
fn get_metadata_with_encryption(
&mut self,
file_decryption_properties: Option<FileDecryptionProperties>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this new API and it seems overly specific to the encryption feature.

It seems like the need is to pass file_decryption_properties to the underlying ParquetMetadataReader but there are other potential options on ArrowReaderOptionsthat might be useful.

What if instead we added a new method like this:

    /// Provides asynchronous access to the [`ParquetMetaData`] of a parquet file,
    /// allowing fine-grained control over how metadata is sourced, in particular allowing
    /// for caching, pre-fetching, catalog metadata, etc...
    ///
    /// By default calls `get_metadata()`
    fn get_metadata_with_options(&mut self, options: &ArrowReaderOptions) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
     self.get_metadata()
   }

I think that would

  1. Be backwards compatible
  2. Permit access to the encryption options necessary to decrype the data
  3. Allow us eventually to clean up ArrowReaderMetadata::load_async function (in some other PR)

As a follow on PR we could potentially deprecate get_metadata and direct people to implement get_metadata_with_options

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does indeed seem like the better API! Changed in 90434d6

Copy link
Contributor

@adamreeve adamreeve Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed some follow up changes (with Rok's blessing) to continue this refactor and fix the async metadata loading with an ObjectStore. I think this is addressed now.

@@ -853,6 +929,10 @@ struct InMemoryRowGroup<'a> {
offset_index: Option<&'a [OffsetIndexMetaData]>,
column_chunks: Vec<Option<Arc<ColumnChunkData>>>,
row_count: usize,
#[cfg(feature = "encryption")]
row_group_ordinal: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I think row_group_idx would be a name that is more consistent with the rest of this crate (and it is used as the local variable name above too)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to row_group_idx.
I don't have a strong preference, but parquet encryption spec uses ordinal while here we do indeed use idx.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also don't have a strong preference -- as long as the code is consistent with something (the spec would be fine) I am ok with either

// decrypt parquet modules (data pages, dictionary pages, etc.).
#[derive(Debug, Clone)]
pub(crate) struct CryptoContext {
pub(crate) row_group_ordinal: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stylistically, as before I think using idx or index for row group and column group would be more consistent with the existing code but it is not a major item

Copy link
Member Author

@rok rok Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #6637 (comment). I'm ok with changing page_ordinal and column_ordinal if that's better for readability.

Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Rok, yes all my review comments were addressed, I've just left a few small follow up comments.

@rok rok force-pushed the decryption-basics-fork branch from 9531491 to 67b8cb9 Compare March 11, 2025 01:05
@adamreeve adamreeve force-pushed the decryption-basics-fork branch from f33c96c to e58ae7c Compare March 11, 2025 03:08
rok pushed a commit to rok/arrow-rs that referenced this pull request Mar 11, 2025
@rok rok requested a review from alamb March 11, 2025 14:14
@alamb
Copy link
Contributor

alamb commented Mar 11, 2025

👁️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet Modular Encryption support
8 participants