Extract parquet statistics from Interval columns #10752
Comments
take
I actually think these would be helpful: as soon as there are statistics, we can hook them up to the tests. If you have time to write the tests, that would be great. We can then perhaps file a ticket in parquet-rs for supporting writing statistics for interval types.
sure I can do that; from the top of my mind - the
I did some digging to find out why, or where, the writing of those statistics is not supported (yet).
I think this should be possible; put differently, I don't yet see why it is not supported. Perhaps you have some more information on this, @alamb? Otherwise this might be enough information to file a ticket in arrow-rs.
Is your feature request related to a problem or challenge?
Part of #10453, where we are filling out support for extracting statistics for all data types from parquet files
At the moment, even if statistics are extracted for a different type (like Int32), the PruningPredicate will attempt to cast these values to the correct type:
datafusion/datafusion/core/src/physical_optimizer/pruning.rs
Lines 909 to 911 in acd7106
However, in order to be efficient and ensure the cast kernel doesn't add anything incorrectly, we should be extracting the parquet statistics as the correct Array type directly. It turns out we do not do this yet for several types and those types do not have good (or any) test coverage. We almost missed this in #10711 in @xinlifoobar
Thus, we need to add support and tests for the other types.
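To make the distinction concrete, here is a minimal sketch of extracting statistics directly as the target type rather than as the physical type followed by a cast. It does not use the real arrow-rs or DataFusion APIs: `StatArray`, `TargetType`, and `min_statistics` are hypothetical stand-ins invented for this illustration, and it assumes the per-row-group minimums are already available as `Option<i32>` values (Date32 being days since the Unix epoch stored as a 32-bit integer).

```rust
/// Simplified stand-ins for Arrow arrays (hypothetical, not the arrow-rs types).
#[derive(Debug, PartialEq)]
enum StatArray {
    /// Plain 32-bit integers.
    Int32(Vec<Option<i32>>),
    /// Days since the Unix epoch, physically stored as i32.
    Date32(Vec<Option<i32>>),
}

/// The type the table schema expects for the column (hypothetical).
#[derive(Clone, Copy)]
enum TargetType {
    Int32,
    Date32,
}

/// Build the array of per-row-group minimums directly in the target type,
/// so that no later cast is needed. `None` marks a row group without statistics.
fn min_statistics(row_group_mins: &[Option<i32>], target: TargetType) -> StatArray {
    let values = row_group_mins.to_vec();
    match target {
        TargetType::Int32 => StatArray::Int32(values),
        TargetType::Date32 => StatArray::Date32(values),
    }
}

fn main() {
    // Three row groups; the second one has no statistics.
    let mins = [Some(18_262), None, Some(19_000)];

    // Same physical values, but tagged with the type the schema asks for.
    let as_date = min_statistics(&mins, TargetType::Date32);
    let as_int = min_statistics(&mins, TargetType::Int32);

    assert_eq!(as_date, StatArray::Date32(vec![Some(18_262), None, Some(19_000)]));
    assert_ne!(as_date, as_int);
    println!("{as_date:?}");
}
```

The point of the sketch is only that the type decision is made at extraction time; a test for a new type would then assert on the typed result instead of relying on a cast kernel.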
Describe the solution you'd like
… (cargo test --test parquet_exec) with the relevant type
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Lines 61 to 182 in acd7106
Here are some example PRs:
Date32 parquet statistics as Date32Array rather than Int32Array #10593
Describe alternatives you've considered
No response
Additional context
No response