Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use min_value and max_value on statistics to avoid ExecutionPlan.execute #10400

Open
samuelcolvin opened this issue May 6, 2024 · 11 comments
Open
Labels
enhancement New feature or request

Comments

@samuelcolvin
Copy link
Contributor

samuelcolvin commented May 6, 2024

Describe the bug

Maybe related to #5535, but I couldn't find anything identical, so created a fresh issue.

If this is a known bug and you think the fix might be moderate in scope, I'm happy to have a go at fixing it?

To Reproduce

I have a custom TableProvider and ExecutionPlan, where calling execute is somewhat expensive and I want to avoid calling it if no data will match.

The execution plan can return helpful statistics from .statistics(), including for example, for one column:

...
ColumnStatistics {
    null_count: Precision::Exact(0),
    max_value: Precision::Exact(ScalarValue::Int64(Some(4))),
    min_value: Precision::Exact(ScalarValue::Int64(Some(4))),
    distinct_count: Precision::Exact(1),
},

E.g. "in this column all values are equal to 4". This is successfully used by Datafusion if I query value is null, the execute() function is never alled.

But if I query value > 5 or value < 0, the statistic is ignored and execute() is still called.

Expected behavior

min_value and max_value of ColumnStatistics should be used for pruning and the query plan should not require the "slow" execute method to be called.

Additional context

I can give a fairly minimal example if required, but I thought best to report the issue and check if it was well known before going to that effort?

I've tried this on both main (as of today) and 37.1.0

@samuelcolvin samuelcolvin added the bug Something isn't working label May 6, 2024
@samuelcolvin
Copy link
Contributor Author

For anyone else looking for this, the solution is to implement supports_filters_pushdown on the table, then use that to apply filters within scan.

Feel free to close this if you like.

@alamb
Copy link
Contributor

alamb commented May 8, 2024

I think using the reported statistics to prune using datafusion::physical_optimizer::pruning::PruningPredicate would be a nice improvement (so each table provider didn't have to apply the basic min/max filters themselves)

@alamb alamb changed the title min_value and max_value on statistics don't help avoid ExecutionPlan.execute Use min_value and max_value on statistics to avoid ExecutionPlan.execute May 8, 2024
@alamb alamb added enhancement New feature or request and removed bug Something isn't working labels May 8, 2024
@alamb
Copy link
Contributor

alamb commented May 8, 2024

Relabed from bug to feature -- thanks @samuelcolvin

@samuelcolvin
Copy link
Contributor Author

@alamb using PruningPredicate makes sense, but please can you point me at where I need to make changes to add this functionality?

@alamb
Copy link
Contributor

alamb commented May 13, 2024

@alamb using PruningPredicate makes sense, but please can you point me at where I need to make changes to add this functionality?

I recommend updating the existing though poorly named AggregateStatistics pass

And there you could potentially call ExecutionPlan::statistics() and use PruningPredicate to prove that some predicates can't be true and replace them with PlaceholderRowExec

@alamb
Copy link
Contributor

alamb commented May 13, 2024

(Thank you for working on this, bTW)

@edmondop
Copy link
Contributor

@samuelcolvin have you already started to write code?

@samuelcolvin
Copy link
Contributor Author

No code yet, if you'd like to work on this, feel free.

@samuelcolvin
Copy link
Contributor Author

Actually I'm on a flight today, so might have some time to work on this.

@samuelcolvin
Copy link
Contributor Author

Progress update, I've got min & max stats pruning "working" in our code, however I immediately ran into #10536, I'll let you know how I get on.

@alamb
Copy link
Contributor

alamb commented May 15, 2024

You know it occurs to me that @dmitrybugakov / @jayzhan211 may be working on a similar feature with a different approach on #10456 🤔 I

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants