-
Notifications
You must be signed in to change notification settings - Fork 924
Parquet: Support reading Parquet metadata via suffix range requests #7334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking pretty good to me, just a few comments so far. I want to do a second pass later today or tomorrow.
Thanks!
I hope to review this tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @kylebarron -- this is a great improvement
I think the only thing that it needs is a test of some sort that shows the suffix was requested (to verify that we are minimizing the number of requests)
I also have some API suggestions and code refactoring ideas but I don't think they are needed.
All in all this is great. Thank you
prefetch: usize, | ||
) -> Result<(ParquetMetaData, Option<(usize, Bytes)>)> { | ||
let suffix = fetch.fetch_suffix(prefetch.max(FOOTER_SIZE)).await?; | ||
let suffix_len = suffix.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code from here on down seems like it is copy/pasted from load_metadata
-- could you please try and avoid duplication if possible by refactoring the common code into a shared method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's mostly copy/pasted but with a few differences between each
- Whether to call
fetch
orfetch_suffix
- arithmetic over byte ranges to read
There might be a way to DRY but I don't see anything super obvious other than maybe these four lines
let mut footer = [0; FOOTER_SIZE];
footer.copy_from_slice(&suffix[suffix_len - FOOTER_SIZE..suffix_len]);
let footer = Self::decode_footer_tail(&footer)?;
let length = footer.metadata_length();
Do you have a suggestion of how to do this? |
Perhaps we can implement an object_store wrapper that wraps a InMemory store but records the requests it gets 🤔 We have used that pattern in the past to good effect elsewhere I think |
@alamb what do you think about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @kylebarron -- this looks great to me. Thank you for the tests
assert_eq!(actual.file_metadata().schema(), expected); | ||
assert_eq!(fetch_count.load(Ordering::SeqCst), 0); | ||
assert_eq!(suffix_fetch_count.load(Ordering::SeqCst), 2); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also please add a test case for passing in the metadata size hint? And the expected number of suffix fetches is 1?
You are the best. Thanks again for this improvement and the tests @kylebarron |
Thanks again @kylebarron ! |
Which issue does this PR close?
Rationale for this change
Avoid needing to know file size in advance to read Parquet files.
What changes are included in this PR?
MetadataSuffixFetch
trait which expands onMetadataFetch
to additionally support reading suffix range requests.ParquetObjectReader::new
to a signature ofnew(store: Arc<dyn ObjectStore>, path: Path, file_size: Option<usize>)
, so thatfile_size
is no longer required.MetadataSuffixFetch
forParquetObjectReader
, usingObjectStore::get_opts
.None
.Are there any user-facing changes?
Yes, breaking change to the signature of
ParquetObjectReader::new
. I'd like to get this in to #7084.Supersedes and closes #6157.