Use min_value and max_value on statistics to avoid ExecutionPlan.execute
#10400
Comments
For anyone else looking for this, the solution is to implement Feel free to close this if you like. |
I think using the reported statistics to prune using datafusion::physical_optimizer::pruning::PruningPredicate would be a nice improvement (so each table provider didn't have to apply the basic min/max filters themselves) |
min_value and max_value on statistics don't help avoid ExecutionPlan.execute
Use min_value and max_value on statistics to avoid ExecutionPlan.execute
Relabeled from bug to feature -- thanks @samuelcolvin |
@alamb using |
I recommend updating the existing though poorly named AggregateStatistics pass And there you could potentially call |
(Thank you for working on this, BTW) |
@samuelcolvin have you already started to write code? |
No code yet, if you'd like to work on this, feel free. |
Actually I'm on a flight today, so might have some time to work on this. |
Progress update, I've got min & max stats pruning "working" in our code, however I immediately ran into #10536, I'll let you know how I get on. |
You know it occurs to me that @dmitrybugakov / @jayzhan211 may be working on a similar feature with a different approach on #10456 🤔 I |
Describe the bug
Maybe related to #5535, but I couldn't find anything identical, so created a fresh issue.
If this is a known bug and you think the fix might be moderate in scope, I'm happy to have a go at fixing it?
To Reproduce
I have a custom TableProvider and ExecutionPlan, where calling execute is somewhat expensive and I want to avoid calling it if no data will match.
The execution plan can return helpful statistics from .statistics(), including, for example, for one column:
E.g. "in this column all values are equal to 4". This is successfully used by DataFusion: if I query value is null, the execute() function is never called.
But if I query value > 5 or value < 0, the statistic is ignored and execute() is still called.
Expected behavior
min_value and max_value of ColumnStatistics should be used for pruning and the query plan should not require the "slow" execute method to be called.
Additional context
I can give a fairly minimal example if required, but I thought best to report the issue and check if it was well known before going to that effort?
I've tried this on both main (as of today) and 37.1.0.
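To make the expected behavior concrete, here is a minimal, self-contained sketch of the min/max check described above. It does not use DataFusion's types: ColumnStats, Predicate and can_skip_execute are hypothetical names invented for illustration, and the sketch assumes the reported bounds are exact rather than estimates. It only shows the reasoning a pruning step could apply before deciding whether execute needs to be called.

```rust
/// Hypothetical, simplified stand-in for per-column statistics.
/// `None` means the bound is unknown. Bounds are assumed to be exact.
#[derive(Debug, Clone, Copy)]
pub struct ColumnStats {
    pub min_value: Option<i64>,
    pub max_value: Option<i64>,
}

/// Hypothetical single-column comparison predicates.
#[derive(Debug, Clone, Copy)]
pub enum Predicate {
    /// value > x
    Gt(i64),
    /// value < x
    Lt(i64),
    /// value = x
    Eq(i64),
}

/// Returns true only when the statistics prove that no row can match,
/// so the expensive scan could be skipped. Unknown bounds never prune.
pub fn can_skip_execute(stats: &ColumnStats, predicate: Predicate) -> bool {
    match predicate {
        // `value > x` cannot match if the largest value is <= x.
        Predicate::Gt(x) => stats.max_value.map_or(false, |max| max <= x),
        // `value < x` cannot match if the smallest value is >= x.
        Predicate::Lt(x) => stats.min_value.map_or(false, |min| min >= x),
        // `value = x` cannot match if x lies outside [min, max].
        Predicate::Eq(x) => {
            stats.min_value.map_or(false, |min| x < min)
                || stats.max_value.map_or(false, |max| x > max)
        }
    }
}
```

With the statistics from the report (all values equal to 4, so min and max are both 4), this check would prune both value > 5 and value < 0, while value > 3 would still require a scan. The sketch is deliberately conservative: when a bound is missing it keeps the scan rather than risk dropping rows.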