Skip to content

Support more types when pruning Parquet data #15742

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
etseidl opened this issue Apr 16, 2025 · 9 comments
Closed

Support more types when pruning Parquet data #15742

etseidl opened this issue Apr 16, 2025 · 9 comments
Labels
enhancement New feature or request

Comments

@etseidl
Copy link

etseidl commented Apr 16, 2025

Is your feature request related to a problem or challenge?

I've been working on implementing a new ColumnOrder for floating point columns in Parquet (apache/arrow-rs#7408), and while investigating how to use the new statistics in Datafusion, I found an interesting quirk.

I'm looking at explain plans for some queries that should be able to use statistics to prune pages. For example:

> explain select int_col from 'parquet-testing/data/alltypes_tiny_pages.parquet' where int_col > 10;
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                  |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: parquet-testing/data/alltypes_tiny_pages.parquet.int_col > Int32(10)                                                                                                                                                                                                                          |
|               |   TableScan: parquet-testing/data/alltypes_tiny_pages.parquet projection=[int_col], partial_filters=[parquet-testing/data/alltypes_tiny_pages.parquet.int_col > Int32(10)]                                                                                                                            |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                           |
|               |   FilterExec: int_col@0 > 10                                                                                                                                                                                                                                                                          |
|               |     RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1                                                                                                                                                                                                                              |
|               |       DataSourceExec: file_groups={1 group: [[Users/seidl/src/datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]}, projection=[int_col], file_type=parquet, predicate=int_col@4 > 10, pruning_predicate=int_col_null_count@1 != row_count@2 AND int_col_max@0 > 10, required_guarantees=[] |

will use the column and page statistics (pruning_predicate is populated). However, when I tried a similar plan for a float column, I was surprised to see that no pruning is done.

> explain select float_col from 'parquet-testing/data/alltypes_tiny_pages.parquet' where float_col > 10.0;
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                 |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: CAST(parquet-testing/data/alltypes_tiny_pages.parquet.float_col AS Float64) > Float64(10)                                                                                                                    |
|               |   TableScan: parquet-testing/data/alltypes_tiny_pages.parquet projection=[float_col], partial_filters=[CAST(parquet-testing/data/alltypes_tiny_pages.parquet.float_col AS Float64) > Float64(10)]                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                          |
|               |   FilterExec: CAST(float_col@0 AS Float64) > 10                                                                                                                                                                      |
|               |     RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1                                                                                                                                             |
|               |       DataSourceExec: file_groups={1 group: [[Users/seidl/src/datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]}, projection=[float_col], file_type=parquet, predicate=CAST(float_col@6 AS Float64) > 10 |
|               |      

But, using a double column will:

> explain select double_col from 'parquet-testing/data/alltypes_tiny_pages.parquet' where double_col > 10.0;
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                              |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: parquet-testing/data/alltypes_tiny_pages.parquet.double_col > Float64(10)                                                                                                                                                                                                                                 |
|               |   TableScan: parquet-testing/data/alltypes_tiny_pages.parquet projection=[double_col], partial_filters=[parquet-testing/data/alltypes_tiny_pages.parquet.double_col > Float64(10)]                                                                                                                                |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                       |
|               |   FilterExec: double_col@0 > 10                                                                                                                                                                                                                                                                                   |
|               |     RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1                                                                                                                                                                                                                                          |
|               |       DataSourceExec: file_groups={1 group: [[Users/seidl/src/datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]}, projection=[double_col], file_type=parquet, predicate=double_col@7 > 10, pruning_predicate=double_col_null_count@1 != row_count@2 AND double_col_max@0 > 10, required_guarantees=[] |

Digging into the code, it seems the problem with float lies in a) the literal 10.0 is treated as a double so b) float_col is cast to Float64, which is disallowed in datafusion_physical_optimizer::pruning::verify_support_type_for_prune().

Interestingly, using an int literal works:

> explain select float_col from 'parquet-testing/data/alltypes_tiny_pages.parquet' where float_col > 10;
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                          |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: parquet-testing/data/alltypes_tiny_pages.parquet.float_col > Float32(10)                                                                                                                                                                                                                              |
|               |   TableScan: parquet-testing/data/alltypes_tiny_pages.parquet projection=[float_col], partial_filters=[parquet-testing/data/alltypes_tiny_pages.parquet.float_col > Float32(10)]                                                                                                                              |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                   |
|               |   FilterExec: float_col@0 > 10                                                                                                                                                                                                                                                                                |
|               |     RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1                                                                                                                                                                                                                                      |
|               |       DataSourceExec: file_groups={1 group: [[Users/seidl/src/datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]}, projection=[float_col], file_type=parquet, predicate=float_col@6 > 10, pruning_predicate=float_col_null_count@1 != row_count@2 AND float_col_max@0 > 10, required_guarantees=[] |

as does an explicit cast

> explain select float_col from 'parquet-testing/data/alltypes_tiny_pages.parquet' where float_col > cast(10.0 as float);
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                          |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: parquet-testing/data/alltypes_tiny_pages.parquet.float_col > Float32(10)                                                                                                                                                                                                                              |
|               |   TableScan: parquet-testing/data/alltypes_tiny_pages.parquet projection=[float_col], partial_filters=[parquet-testing/data/alltypes_tiny_pages.parquet.float_col > Float32(10)]                                                                                                                              |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                   |
|               |   FilterExec: float_col@0 > 10                                                                                                                                                                                                                                                                                |
|               |     RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1                                                                                                                                                                                                                                      |
|               |       DataSourceExec: file_groups={1 group: [[Users/seidl/src/datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]}, projection=[float_col], file_type=parquet, predicate=float_col@6 > 10, pruning_predicate=float_col_null_count@1 != row_count@2 AND float_col_max@0 > 10, required_guarantees=[] |

Describe the solution you'd like

It would be nice if Datafusion always used statistics for floating point columns if they are available. One potential fix is to add more cases to verify_support_type_for_prune (

fn verify_support_type_for_prune(from_type: &DataType, to_type: &DataType) -> Result<()> {
// TODO: support other data type for prunable cast or try cast
if matches!(
from_type,
DataType::Int8
| DataType::Int16
| DataType::Int32
| DataType::Int64
| DataType::Decimal128(_, _)
) && matches!(
to_type,
DataType::Int8 | DataType::Int32 | DataType::Int64 | DataType::Decimal128(_, _)
) {
). Another would be to cast floating point literals to a more appropriate type (as it appears is done when casting int literals to float). I'd be happy to attempt the former, but I'd need pointing to where to attempt the latter 😅.

Describe alternatives you've considered

No response

Additional context

No response

@etseidl etseidl added the enhancement New feature or request label Apr 16, 2025
@etseidl
Copy link
Author

etseidl commented Apr 17, 2025

After a bit of spelunking through the plan generation code, I figured out why my second suggestion doesn't make sense. Deep down the types will be coerced to a common type for comparison, with the lower precision being coerced to the higher, so that part is all as designed.

I'll start on adding more data types to verify_support_type_for_prune.

@etseidl
Copy link
Author

etseidl commented Apr 18, 2025

More background in #3377 and #3442. It seems like additional data types were planned, but abandoned for some reason.

@alamb do you think it would be safe to replace the logic above with

    if from_type.is_numeric() && to_type.is_numeric() {
        Ok(())
    }

This would bring floating point and unsigned ints into the mix.

@etseidl
Copy link
Author

etseidl commented Apr 18, 2025

BTW, this issue is somewhat tied up with apache/parquet-format#221. Take for example

> select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > 2.0;
+-----+
| x   |
+-----+
| NaN |
+-----+
1 row(s) fetched. 

This return a single NaN row because the lack of pruning will not filter out pages/column chunks. Rewriting the query to avoid the LHS cast:

> select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > arrow_cast(2.0, 'Float16');
+---+
| x |
+---+
+---+
0 row(s) fetched.

Now filters on statistics that have max == 2.0. Changing verify_support_type_for_prune to support floating point types will now also return 0 rows.

@alamb
Copy link
Contributor

alamb commented Apr 19, 2025

It would be nice if Datafusion always used statistics for floating point columns if they are available. One potential fix is to add more cases to verify_support_type_for_prune (

I think it would be a great addition

I suspect the reason we didn't add more types at the time is that we ere worried about correctness and hadn't thought through the implications of ordering and nans and nulls, etc on floats

Now that the pruning code is setup and the test patters are good I think we can expand the support and tests and it woudl be really good

Thank you @etseidl for the find

@alamb
Copy link
Contributor

alamb commented Apr 19, 2025

FYI @adriangb

@adriangb
Copy link
Contributor

Funny enough I just opened #15764 without having seen this issue!

It sounds like there may be some complexity with floats... honestly I just wanted dictionary support. I will amend my PR to remove float support since that's tricky.

@alamb
Copy link
Contributor

alamb commented Apr 19, 2025

I think that might be a historical relic -- and we could potentially fix this ticket by simply deleting the coercion type check. Let me know what you think:

@etseidl
Copy link
Author

etseidl commented Apr 19, 2025

I'll follow up in #15764

@etseidl
Copy link
Author

etseidl commented Apr 24, 2025

Fixed by #15764

@etseidl etseidl closed this as completed Apr 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants