Reconsider default for statistics in parquet writer #15586

deanm0000 · 2024-04-10T19:56:12Z

Description

With polars on the verge of doing native dataset writing, it seems like a good time to rethink the default statistics behavior. Datasets largely depend on having statistics, so the small price of calculating them at write time seems worth it at read time. I suppose the standalone writer could still default to False, but the dataset writer would have to have statistics turned on (wouldn't it?). Having the two defaults differ would probably create more confusion, which is just another reason to turn statistics on by default in the standalone writer.
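To illustrate why readers benefit from statistics, here is a minimal pure-Python sketch of min/max pruning. It is a toy model, not polars' or Parquet's actual implementation: `RowGroup`, `write_row_groups`, and `scan_equal` are hypothetical names invented for this example, and the `statistics` argument only mimics the writer flag under discussion.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroup:
    """Hypothetical stand-in for one row group of a single integer column."""
    values: List[int]
    min_value: Optional[int] = None  # None models a file written without statistics
    max_value: Optional[int] = None


def write_row_groups(values, group_size, statistics):
    """Split values into row groups, optionally computing min/max at write time."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        if statistics:
            groups.append(RowGroup(chunk, min(chunk), max(chunk)))
        else:
            groups.append(RowGroup(chunk))
    return groups


def scan_equal(groups, target):
    """Return (matching values, number of row groups actually decoded)."""
    matches, decoded = [], 0
    for group in groups:
        has_stats = group.min_value is not None and group.max_value is not None
        if has_stats and not (group.min_value <= target <= group.max_value):
            continue  # statistics prove this group cannot contain the target
        decoded += 1
        matches.extend(v for v in group.values if v == target)
    return matches, decoded


values = list(range(1000))  # sorted data, where min/max pruning works best
with_stats = write_row_groups(values, group_size=100, statistics=True)
without_stats = write_row_groups(values, group_size=100, statistics=False)

print(scan_equal(with_stats, 250))     # decodes only the one group that can match
print(scan_equal(without_stats, 250))  # has to decode every group
```

Both scans return the same rows; the only difference is how many row groups had to be decoded. The write-time cost is one `min` and one `max` per group.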

ritchie46 · 2024-04-11T15:27:01Z

Yep, agree. I think we should default to True. Can you open a PR?

kszlim · 2024-04-11T22:45:46Z

I think it'd be good to be aware of page-index-based statistics: https://github.com/apache/parquet-format/blob/master/PageIndex.md. If this is going to be a breaking change, there should be some consideration of whether polars should write and use (on the read side) these statistics instead, which I think are strictly superior?

Relates to #12752
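For context, the page index keeps min/max values per page rather than only per row group, so a reader can skip pages inside a row group that it could not skip as a whole. The sketch below is a toy model of that idea in plain Python, not the Parquet on-disk layout; `build_page_index` and `pages_to_read` are hypothetical helpers made up for illustration.

```python
def build_page_index(values, page_size):
    """Return one (min, max, first_row) entry per page of a column chunk."""
    index = []
    for first_row in range(0, len(values), page_size):
        page = values[first_row:first_row + page_size]
        index.append((min(page), max(page), first_row))
    return index


def pages_to_read(page_index, lo, hi):
    """First-row offsets of pages whose [min, max] range overlaps [lo, hi]."""
    return [
        first_row
        for page_min, page_max, first_row in page_index
        if page_max >= lo and page_min <= hi
    ]


row_group = list(range(1000))  # one row group holding a sorted integer column
row_group_stats = (min(row_group), max(row_group))
page_index = build_page_index(row_group, page_size=100)

# The row-group range covers the whole column, so a filter on 250..260
# cannot skip the row group; the per-page ranges narrow it to one page.
print(row_group_stats)
print(pages_to_read(page_index, 250, 260))
```

How much this helps in practice depends on how the data is sorted or clustered; on randomly ordered values most page ranges overlap the filter and little is skipped.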

ritchie46 · 2024-04-12T09:06:05Z

I don't think this would be breaking. The files would still be read correctly.

deanm0000 · 2024-04-12T10:42:15Z

#15597

deanm0000 added the enhancement (New feature or an improvement of an existing feature), A-io-parquet (Area: reading/writing Parquet files), and A-io-partitioning (Area: reading/writing (Hive) partitioned files) labels on Apr 10, 2024

deanm0000 mentioned this issue Apr 10, 2024

Hive partitioning tracking issue #15441

Open

13 tasks

deanm0000 mentioned this issue Apr 11, 2024

feat: change default to write parquet statistics #15597

Merged

deanm0000 closed this as completed Apr 12, 2024
