-
Confusingly, the Parquet "Page Index" (https://github.com/apache/parquet-format/blob/master/PageIndex.md) refers to two separate structures in the metadata, the column index and the offset index, both of which are available in the ParquetMetadata.
If you mean how you can get those fields populated with just `decode_metadata`, I am not sure it is possible outside the readers yet. One thing you could do is use https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.new_with_options to ask the reader to fetch the page/offset index, and then use that to read the metadata with the page/offset index populated. That is the approach taken in the datafusion example to read the metadata (without actually reading the data): https://github.com/apache/datafusion/blob/58f79e143e1a90e5caa59eecc9b36dbdd082a7eb/datafusion-examples/examples/advanced_parquet_index.rs#L408-L415 It would be nice to have a nicer API, I think.
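For context on why decoding the footer alone does not populate the page index: a Parquet file ends with the Thrift-encoded footer metadata, followed by a 4-byte little-endian length and the magic bytes `PAR1`, while the column index and offset index are stored separately from that footer. Here is a minimal, standard-library-only sketch of locating the footer bytes that one might cache. The helper name `footer_metadata_bytes` is hypothetical and not part of the parquet crate, and the sketch assumes an unencrypted file whose tail has already been fetched into memory.

```rust
/// Magic bytes that terminate an unencrypted Parquet file.
const PARQUET_MAGIC: [u8; 4] = *b"PAR1";

/// Hypothetical helper (not a parquet crate API): given the tail of a file,
/// return the slice that holds the Thrift-encoded footer metadata.
/// Returns `None` if the trailer is malformed or the tail is too short.
fn footer_metadata_bytes(file_tail: &[u8]) -> Option<&[u8]> {
    let n = file_tail.len();
    if n < 8 {
        return None;
    }
    // The last 8 bytes are: 4-byte little-endian metadata length + magic.
    let (rest, trailer) = file_tail.split_at(n - 8);
    if trailer[4..] != PARQUET_MAGIC[..] {
        return None;
    }
    let len = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]) as usize;
    if len > rest.len() {
        // The caller needs to fetch a larger tail to cover the footer.
        return None;
    }
    // The footer metadata sits immediately before the 8-byte trailer.
    Some(&rest[rest.len() - len..])
}

fn main() {
    // Fake tail: two bytes of "data", three bytes of "metadata", then the trailer.
    let mut tail = vec![0x00, 0x00, 0xAA, 0xBB, 0xCC];
    tail.extend_from_slice(&3u32.to_le_bytes());
    tail.extend_from_slice(b"PAR1");

    let meta = footer_metadata_bytes(&tail).expect("valid trailer");
    println!("footer metadata is {} bytes", meta.len());
}
```

Bytes located this way cover only the footer. The page/offset index would still have to be read from the offsets recorded in the column chunk metadata, which is what the reader options above do for you.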
-
The conversation is on #6002, so let's continue the discussion there.
-
Hi folks,
I'm trying to work with parquet metadata directly. In particular, I'd like to be able to serialize and deserialize it so I can store it in a cache outside of parquet files and avoid slow object store requests. I see that there is a `decode_metadata` function, but it doesn't handle some of the newer features like page indexes. I also don't see an inverse method; the pieces necessary seem to only exist in readers, etc. Short of re-implementing all of this logic myself, is there an easy way to serialize and deserialize entire parquet metadata?
Thanks!