-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Support more types when pruning Parquet data #15742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
After a bit of spelunking through the plan generation code, I figured out why my second suggestion doesn't make sense. Deep down the types will be coerced to a common type for comparison, with the lower precision being coerced to the higher, so that part is all as designed. I'll start on adding more data types to |
More background in #3377 and #3442. It seems like additional data types were planned, but abandoned for some reason. @alamb do you think it would be safe to replace the logic above with if from_type.is_numeric() && to_type.is_numeric() {
Ok(())
} This would bring floating point and unsigned ints into the mix. |
BTW, this issue is somewhat tied up with apache/parquet-format#221. Take for example > select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > 2.0;
+-----+
| x |
+-----+
| NaN |
+-----+
1 row(s) fetched. This return a single > select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > arrow_cast(2.0, 'Float16');
+---+
| x |
+---+
+---+
0 row(s) fetched. Now filters on statistics that have |
I think it would be a great addition I suspect the reason we didn't add more types at the time is that we ere worried about correctness and hadn't thought through the implications of ordering and nans and nulls, etc on floats Now that the pruning code is setup and the test patters are good I think we can expand the support and tests and it woudl be really good Thank you @etseidl for the find |
FYI @adriangb |
Funny enough I just opened #15764 without having seen this issue! It sounds like there may be some complexity with floats... honestly I just wanted dictionary support. I will amend my PR to remove float support since that's tricky. |
I think that might be a historical relic -- and we could potentially fix this ticket by simply deleting the coercion type check. Let me know what you think: |
I'll follow up in #15764 |
Fixed by #15764 |
Is your feature request related to a problem or challenge?
I've been working on implementing a new
ColumnOrder
for floating point columns in Parquet (apache/arrow-rs#7408), and while investigating how to use the new statistics in Datafusion, I found an interesting quirk.I'm looking at explain plans for some queries that should be able to use statistics to prune pages. For example:
will use the column and page statistics (
pruning_predicate
is populated). However, when I tried a similar plan for afloat
column, I was surprised to see that no pruning is done.But, using a
double
column will:Digging into the code, it seems the problem with
float
lies in a) the literal10.0
is treated as adouble
so b)float_col
is cast toFloat64
, which is disallowed indatafusion_physical_optimizer::pruning::verify_support_type_for_prune()
.Interestingly, using an
int
literal works:as does an explicit cast
Describe the solution you'd like
It would be nice if Datafusion always used statistics for floating point columns if they are available. One potential fix is to add more cases to
verify_support_type_for_prune
(datafusion/datafusion/physical-optimizer/src/pruning.rs
Lines 1217 to 1229 in 42a45d1
int
literals tofloat
). I'd be happy to attempt the former, but I'd need pointing to where to attempt the latter 😅.Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: