Use min_value and max_value on statistics to avoid ExecutionPlan.execute #10400

Open
@samuelcolvin

Describe the bug

Maybe related to #5535, but I couldn't find anything identical, so created a fresh issue.

If this is a known bug and you think the fix might be moderate in scope, I'm happy to have a go at fixing it.

To Reproduce

I have a custom TableProvider and ExecutionPlan, where calling execute is somewhat expensive and I want to avoid calling it if no data will match.

The execution plan can return helpful statistics from .statistics(), including for example, for one column:

...
ColumnStatistics {
    null_count: Precision::Exact(0),
    max_value: Precision::Exact(ScalarValue::Int64(Some(4))),
    min_value: Precision::Exact(ScalarValue::Int64(Some(4))),
    distinct_count: Precision::Exact(1),
},

That is, "in this column all values are equal to 4". DataFusion already uses these statistics successfully when I query value is null: the execute() function is never called.

But if I query value > 5 or value < 0, the statistic is ignored and execute() is still called.

Expected behavior

min_value and max_value of ColumnStatistics should be used for pruning and the query plan should not require the "slow" execute method to be called.
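To make the expected behavior concrete, here is a minimal sketch of the pruning check I have in mind. It does not use DataFusion's real types: Precision, ColumnBounds, Predicate and can_skip_execute below are hypothetical, Int64-only stand-ins written for illustration, and the rule that only Exact bounds count as proof is my assumption, not a statement about how DataFusion behaves.

```rust
/// Simplified stand-in for DataFusion's `Precision`; NOT the real type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical Int64-only mirror of the min/max part of `ColumnStatistics`.
#[derive(Debug, Clone, Copy)]
struct ColumnBounds {
    min_value: Precision<i64>,
    max_value: Precision<i64>,
}

/// Hypothetical single-column predicate, e.g. `value > 5`.
#[derive(Debug, Clone, Copy)]
enum Predicate {
    Gt(i64),
    Lt(i64),
    Eq(i64),
}

/// Keep a bound only if it is exact; estimates cannot prove anything.
fn exact(p: Precision<i64>) -> Option<i64> {
    match p {
        Precision::Exact(v) => Some(v),
        Precision::Inexact(_) | Precision::Absent => None,
    }
}

/// Returns true only when the exact bounds prove that no row can match,
/// so the expensive `execute()` call could be skipped entirely.
fn can_skip_execute(bounds: &ColumnBounds, predicate: Predicate) -> bool {
    let min = exact(bounds.min_value);
    let max = exact(bounds.max_value);
    match predicate {
        // `value > lit` cannot match if the largest value is <= lit.
        Predicate::Gt(lit) => max.map_or(false, |m| m <= lit),
        // `value < lit` cannot match if the smallest value is >= lit.
        Predicate::Lt(lit) => min.map_or(false, |m| m >= lit),
        // `value = lit` cannot match if lit lies outside [min, max].
        Predicate::Eq(lit) => {
            max.map_or(false, |m| m < lit) || min.map_or(false, |m| m > lit)
        }
    }
}
```

With the statistics from this issue (min and max both exactly 4), this check says value > 5 and value < 0 can be skipped, while value > 3 and value = 4 still need execute().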

Additional context

I can give a fairly minimal example if required, but I thought it best to report the issue and check whether it was well known before going to that effort.

I've tried this on both main (as of today) and 37.1.0

Labels: enhancement (New feature or request)
