Skip to content

Parquet: Support reading Parquet metadata via suffix range requests #7334

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Mar 31, 2025

Conversation

kylebarron
Copy link
Contributor

@kylebarron kylebarron commented Mar 26, 2025

Which issue does this PR close?

Rationale for this change

Avoid needing to know file size in advance to read Parquet files.

What changes are included in this PR?

  • Add MetadataSuffixFetch trait which expands on MetadataFetch to additionally support reading suffix range requests.
  • Breaking: Change ParquetObjectReader::new to a signature of new(store: Arc<dyn ObjectStore>, path: Path, file_size: Option<usize>), so that file_size is no longer required.
  • Implement MetadataSuffixFetch for ParquetObjectReader, using ObjectStore::get_opts.
  • Always prefer reading metadata via bounded range requests if the file size is provided, and only fall back to suffix range requests if the file size is None.
  • Added simple test without file length

Are there any user-facing changes?

Yes, breaking change to the signature of ParquetObjectReader::new. I'd like to get this in to #7084.

Supersedes and closes #6157.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 26, 2025
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good to me, just a few comments so far. I want to do a second pass later today or tomorrow.

Thanks!

@etseidl etseidl added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Mar 26, 2025
@alamb
Copy link
Contributor

alamb commented Mar 28, 2025

I hope to review this tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @kylebarron -- this is a great improvement

I think the only thing that it needs is a test of some sort that shows the suffix was requested (to verify that we are minimizing the number of requests)

I also have some API suggestions and code refactoring ideas but I don't think they are needed.

All in all this is great. Thank you

prefetch: usize,
) -> Result<(ParquetMetaData, Option<(usize, Bytes)>)> {
let suffix = fetch.fetch_suffix(prefetch.max(FOOTER_SIZE)).await?;
let suffix_len = suffix.len();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code from here on down seems like it is copy/pasted from load_metadata -- could you please try and avoid duplication if possible by refactoring the common code into a shared method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mostly copy/pasted but with a few differences between each

  • Whether to call fetch or fetch_suffix
  • arithmetic over byte ranges to read

There might be a way to DRY but I don't see anything super obvious other than maybe these four lines

        let mut footer = [0; FOOTER_SIZE];
        footer.copy_from_slice(&suffix[suffix_len - FOOTER_SIZE..suffix_len]);

        let footer = Self::decode_footer_tail(&footer)?;
        let length = footer.metadata_length();

@kylebarron
Copy link
Contributor Author

I think the only thing that it needs is a test of some sort that shows the suffix was requested (to verify that we are minimizing the number of requests)

Do you have a suggestion of how to do this?

@alamb
Copy link
Contributor

alamb commented Mar 28, 2025

I think the only thing that it needs is a test of some sort that shows the suffix was requested (to verify that we are minimizing the number of requests)

Do you have a suggestion of how to do this?

Perhaps we can implement an object_store wrapper that wraps a InMemory store but records the requests it gets 🤔

We have used that pattern in the past to good effect elsewhere I think

@kylebarron
Copy link
Contributor Author

kylebarron commented Mar 28, 2025

@alamb what do you think about c5e4226 (#7334)?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kylebarron -- this looks great to me. Thank you for the tests

assert_eq!(actual.file_metadata().schema(), expected);
assert_eq!(fetch_count.load(Ordering::SeqCst), 0);
assert_eq!(suffix_fetch_count.load(Ordering::SeqCst), 2);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also please add a test case for passing in the metadata size hint? And the expected number of suffix fetches is 1?

@alamb
Copy link
Contributor

alamb commented Mar 28, 2025

You are the best. Thanks again for this improvement and the tests @kylebarron

@alamb alamb merged commit a3598a9 into apache:main Mar 31, 2025
17 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 31, 2025

Thanks again @kylebarron !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read Parquet metadata via suffix requests
4 participants