Reconsider default for statistics in parquet writer #15586

Closed
deanm0000 opened this issue Apr 10, 2024 · 4 comments
Labels
A-io-parquet Area: reading/writing Parquet files A-io-partitioning Area: reading/writing (Hive) partitioned files enhancement New feature or an improvement of an existing feature

Comments

@deanm0000
Collaborator

Description

With polars on the verge of doing native dataset writing, it seems like a good time to rethink the default statistics behavior. Datasets largely depend on having statistics, so the small price paid to calculate them at write time seems worth it at read time. I suppose the standalone writer could still default to False, but the dataset writer would have to have statistics turned on (wouldn't it?). Having those two defaults differ would probably create more confusion, which is just another reason to turn statistics on by default in the standalone writer.
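To illustrate why readers benefit from statistics, here is a minimal pure-Python sketch of min/max-based row group skipping. It does not use polars or the Parquet format itself; `RowGroup`, `write_row_groups`, and `scan_equal` are hypothetical names made up for this example, and the real reader logic may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with column statistics."""
    values: List[int]
    min_value: Optional[int] = None  # None models a file written without statistics
    max_value: Optional[int] = None


def write_row_groups(values: List[int], group_size: int, statistics: bool) -> List[RowGroup]:
    """Split values into row groups, optionally recording min/max per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        if statistics:
            groups.append(RowGroup(chunk, min(chunk), max(chunk)))
        else:
            groups.append(RowGroup(chunk))
    return groups


def scan_equal(groups: List[RowGroup], target: int) -> Tuple[List[int], int]:
    """Return (matching values, number of row groups actually read)."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        has_stats = group.min_value is not None and group.max_value is not None
        if has_stats and not (group.min_value <= target <= group.max_value):
            continue  # statistics prove the target cannot be in this group
        groups_read += 1
        matches.extend(v for v in group.values if v == target)
    return matches, groups_read


data = list(range(100))
with_stats = write_row_groups(data, group_size=10, statistics=True)
without_stats = write_row_groups(data, group_size=10, statistics=False)

print(scan_equal(with_stats, 42))     # reads only the group covering 40..49
print(scan_equal(without_stats, 42))  # has to read every group
```

In this toy setup the write-time cost is one `min` and one `max` per group, while the read side skips nine of the ten groups for a selective filter.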

@deanm0000 deanm0000 added enhancement New feature or an improvement of an existing feature A-io-parquet Area: reading/writing Parquet files A-io-partitioning Area: reading/writing (Hive) partitioned files labels Apr 10, 2024
@ritchie46
Member

Yep, agreed. I think we should default to True. Can you open a PR?

@kszlim
Contributor

kszlim commented Apr 11, 2024

I think it'd be good to be aware of page-index-based statistics (https://github.com/apache/parquet-format/blob/master/PageIndex.md). If this is going to be a breaking change, there should be some consideration of whether polars should instead write these statistics and use them on the read side. I think they are strictly superior?

Relates to #12752

@ritchie46
Member

I don't think this would be breaking. The files would still be read correctly.

@deanm0000
Collaborator Author

#15597


3 participants