parquet: Add option to cache file metadata #12548
Closed
+168
−9
Inspired by datafusion-examples/examples/advanced_parquet_index.rs
Which issue does this PR close?
This was an attempt to solve #12547, but did not achieve it, and I am not sure it is the right approach.
Rationale for this change
On every query against Parquet tables, DataFusion re-opens every file and parses its metadata. This takes significant time for short queries (in my use case, there is usually a single hit in the Page Index).
My goal was to make these queries near-instant. Unfortunately, I realized after writing this code that the Page Index still needs to be parsed every time, because file metadata is lost through the listing layer (as mentioned in #9964). So this does spare some (negligible?) time parsing metadata, but I'm not sure it's worth the extra complexity, especially in ParquetFormat. What do you think?
What changes are included in this PR?
- Made ParquetFormat carry state (it probably deserves a renaming then...)
- Added CachedParquetFileReaderFactory as an alternative to DefaultParquetFileReaderFactory, and made it usable through a config option
Are these changes tested?
no
Are there any user-facing changes?
Added datafusion.execution.parquet.cache_metadata
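For readers unfamiliar with the idea, here is a minimal, standard-library-only sketch of the caching pattern behind a cached reader factory. It is not the code from this PR: FileMetadata, MetadataCache, get_or_parse and parse_count are hypothetical names, and the "parse" step is a stand-in for opening a file and decoding its Parquet footer.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parsed Parquet file metadata.
#[derive(Debug)]
struct FileMetadata {
    path: String,
    num_rows: u64,
}

/// Hypothetical cache that keeps one parsed metadata object per file path.
struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
    /// Counts how often the (expensive) parse step actually ran.
    parses: AtomicU64,
}

impl MetadataCache {
    fn new() -> Self {
        MetadataCache {
            entries: Mutex::new(HashMap::new()),
            parses: AtomicU64::new(0),
        }
    }

    /// Stand-in for the expensive step: a real reader would open the file
    /// and decode its footer here. The row count is fake (assumption).
    fn parse(&self, path: &str) -> FileMetadata {
        self.parses.fetch_add(1, Ordering::Relaxed);
        FileMetadata {
            path: path.to_string(),
            num_rows: path.len() as u64,
        }
    }

    /// Returns the cached metadata for `path`, parsing it only on a miss.
    fn get_or_parse(&self, path: &str) -> Arc<FileMetadata> {
        let mut entries = self.entries.lock().unwrap();
        if let Some(meta) = entries.get(path) {
            return Arc::clone(meta);
        }
        let meta = Arc::new(self.parse(path));
        entries.insert(path.to_string(), Arc::clone(&meta));
        meta
    }

    /// Number of times `parse` has run so far.
    fn parse_count(&self) -> u64 {
        self.parses.load(Ordering::Relaxed)
    }
}
```

Calling get_or_parse twice with the same path returns the same Arc and runs the parse step once. The sketch never invalidates entries, so a real cache would probably also need to key on something like file size or modification time. As noted in the rationale, caching at this level alone would not avoid re-parsing the Page Index.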