Skip to content

Remove AsyncFileReader::get_metadata_with_options, add options to AsyncFileReader::get_metadata #7342

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 1, 2025

Conversation

corwinjoy
Copy link
Contributor

@corwinjoy corwinjoy commented Mar 27, 2025

Rationale for this change

In using encryption with datafusion, I found the default implementation for get_metadata_with_options to be more of a hindrance than a help. What happened, is that when working with encrypted parquet, my calls would mysteriously fail and I could not understand why. It took me longer than I like to admit to discover that the default implementation for get_metadata_with_options would quietly drop decryption information, and that was the source of my mysterious file reading failures. I propose that this default implementation should be removed so that developers get a clear compiler message that they should be explicit about what they want. If this is too much of a breaking API change, then a second best solution would be to provide a runtime error. Either way, I feel that the default implementation is simply too likely to cause unexpected and surprising errors.

What changes are included in this PR?

Remove the default implementation for get_metadata_with_options for the AsyncFileReader: Send trait.

Are there any user-facing changes?

Users will need to be explicit about what they want to do for get_metadata_with_options for the AsyncFileReader: Send trait. Having no default implementation is in line with other methods here. I feel that asking users to be explicit is a good change here to prevent potential errors.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 27, 2025
@alamb alamb changed the title Remove default implementation for get_metadata_with_options and expla… Remove default implementation for AsyncFileReader::get_metadata_with_options Mar 27, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @corwinjoy -- this is a good improvement for sure

Since we can change the API in the next breaking release, what do you think about simply removing get_metadata entirely?

That would mean that all downstream crates would be forced to handle or explicitly ignore the options, which I think would be good.

I am talking about removing this:

    /// Provides asynchronous access to the [`ParquetMetaData`] of a parquet file,
    /// allowing fine-grained control over how metadata is sourced, in particular allowing
    /// for caching, pre-fetching, catalog metadata, etc...
    fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>>;

@alamb
Copy link
Contributor

alamb commented Mar 27, 2025

For anyone else following along get_metadata_with_options was added very recently in #6637 so while this is technically a breaking API change I don't expect it to impact many people

@alamb alamb added the api-change Changes to the arrow API label Mar 27, 2025
@kylebarron
Copy link
Contributor

You're suggesting removing get_metadata in favor of keeping get_metadata_with_options? If we don't have both we could keep the get_metadata name but change the signature to

    fn get_metadata(&mut self, options: &ArrowReaderOptions) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>>;

?

@alamb
Copy link
Contributor

alamb commented Mar 27, 2025

If we don't have both we could keep the get_metadata name but change the signature to

I think changing the signature would also be a good choice

@corwinjoy
Copy link
Contributor Author

corwinjoy commented Mar 27, 2025

Looking through how this function is used and implemented, many times users just want to grab metadata and don't care about options. Furthermore, right now we have a lot of code duplication between implementations of get_metadata and get_metadata_with_options. I think that what would make sense is to reduce code duplication and have a single function like:

fn get_metadata(&mut self, options: Option<&ArrowReaderOptions>) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>>;

I think this would make any applicable options explicit while streamlining the existing code.
If this sounds good, I can adjust the PR and we can take a look at that idea.

@alamb
Copy link
Contributor

alamb commented Mar 28, 2025

I think this would make any applicable options explicit while streamlining the existing code.
If this sounds good, I can adjust the PR and we can take a look at that idea.

I think it sounds great

@tustvold and @etseidl and @mbrobbel perhaps you have some opinions to share here

@corwinjoy
Copy link
Contributor Author

OK. I have updated this PR to change the signature of get_metadata to

fn get_metadata<'a>(
        &'a mut self,
        options: Option<&'a ArrowReaderOptions>,
    ) -> BoxFuture<'a, Result<Arc<ParquetMetaData>>> 

as discussed. I think this reduces code duplication and ambiguity.

@alamb alamb changed the title Remove default implementation for AsyncFileReader::get_metadata_with_options Remove AsyncFileReader::get_metadata_with_options, add options to AsyncFileReader::get_metadata Mar 31, 2025
@alamb
Copy link
Contributor

alamb commented Mar 31, 2025

I updated this PR's title to reflect what I think it currently does

It looks like this PR has some merge conflicts, but once those are solved it will be good to go from my perspective

…_with_options

# Conflicts:
#	parquet/src/arrow/async_reader/store.rs
#	parquet/tests/arrow_reader/encryption_async.rs
@corwinjoy
Copy link
Contributor Author

I updated this PR's title to reflect what I think it currently does

It looks like this PR has some merge conflicts, but once those are solved it will be good to go from my perspective

Thank you! I have gone ahead and merged the latest from main to resolve these conflicts.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like how this has turned out. Thanks @corwinjoy!

@alamb alamb merged commit 3a45ae9 into apache:main Apr 1, 2025
16 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 1, 2025

Thanks everyone!

@kylebarron
Copy link
Contributor

kylebarron commented Apr 9, 2025

I was just updating some of my code (parquet-wasm) to use the release candidate, but it doesn't seem possible for third party code to use the options argument in AsyncFileReader::get_metadata because no internal data from ArrowReaderOptions is publicly accessible.

The internal implementation here on AsyncRead and AsyncSeek accesses the page_index and file_decryption_properties attributes:

.with_page_indexes(options.is_some_and(|o| o.page_index));

options.and_then(|o| o.file_decryption_properties.as_ref()),

But third party code can't do that. Is that intentional?

@alamb
Copy link
Contributor

alamb commented Apr 9, 2025

But third party code can't do that. Is that intentional?

It was not intentional to my knowledge. -- maybe we can add some accessors to ArrowReaderOptions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants