Add `value_from_statisics` to AggregateUDFImpl, remove special case for min/max/count aggregate statistics #12296

edmondop · 2024-09-02T18:55:57Z

Which issue does this PR close?

Closes #11151 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jayzhan211 · 2024-09-02T23:52:50Z

datafusion/expr/src/udaf.rs

@@ -262,6 +262,19 @@ impl AggregateUDF {
        self.inner.is_descending()
    }

+    /// Returns true if the function is min. Used by the optimizer
+    pub fn is_min(&self) -> bool {


This is not we should do...
We need to understand the context and have a general function instead of a specialize name matching function.

default_value is a good example, it is only used in count for now, but it could extend to any function if needed.

@jayzhan211 looking to this comment from @alamb #11151 (comment) it might be that those "general functions" might not exists, will do some homework

If there is really no "general context", I think it is fine to just leave them as it is. Having specific function name matching in Impl Trait doesn't make sense to me 🤔

I agree with @jayzhan211 that it is important that these methods describe some property about the function rather than "what the function is"

What if we added a function like this:

/// Return the value of this aggregate function from statistics /// /// If the value of this aggregate, can be determined using only the /// statistics, return `Some(value)`, otherwise return `None` (the default) /// /// # Arguments /// * `statistics`: the statistics describing the input to this aggregate functions /// * `args`: the arguments passed to the aggregate function /// /// The value of some aggregate functions such as `COUNT`, `MIN` and `MAX` /// can be determined using statistics, if known /// fn value_from_stats(&self, statistics: &Statistics, arguments: &[Arc<PhysicalExpr>]) -> Option<ScalarValue> { None }

I think you could then implement this function for min / max and count (moving logic out of the aggregate statistics optimizer). It might need some other information (like schema for types, for example) but I think it would be pretty straight forward

alamb

Thanks @edmondop and @jayzhan211 - I have a suggestion about how to proceed. Let me know if that makes sense

Sorry (again) for the delay

alamb · 2024-09-06T14:53:42Z

datafusion/expr/src/udaf.rs

@@ -262,6 +262,19 @@ impl AggregateUDF {
        self.inner.is_descending()
    }

+    /// Returns true if the function is min. Used by the optimizer
+    pub fn is_min(&self) -> bool {


I agree with @jayzhan211 that it is important that these methods describe some property about the function rather than "what the function is"

What if we added a function like this:

/// Return the value of this aggregate function from statistics /// /// If the value of this aggregate, can be determined using only the /// statistics, return `Some(value)`, otherwise return `None` (the default) /// /// # Arguments /// * `statistics`: the statistics describing the input to this aggregate functions /// * `args`: the arguments passed to the aggregate function /// /// The value of some aggregate functions such as `COUNT`, `MIN` and `MAX` /// can be determined using statistics, if known /// fn value_from_stats(&self, statistics: &Statistics, arguments: &[Arc<PhysicalExpr>]) -> Option<ScalarValue> { None }

I think you could then implement this function for min / max and count (moving logic out of the aggregate statistics optimizer). It might need some other information (like schema for types, for example) but I think it would be pretty straight forward

alamb · 2024-09-11T14:56:07Z

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

edmondop · 2024-09-13T23:34:05Z

Thanks @edmondop and @jayzhan211 - I have a suggestion about how to proceed. Let me know if that makes sense

Sorry (again) for the delay

I tried to implement that solution, but the problem is now here

https://github.com/apache/datafusion/pull/12296/files#diff-bca16ed42e39a82d942b706ad36b0d49502c12c9bc1c50e33ed443ce8c4d0437

that there is no real distinction between count and min_max. So the non distinct count doesn't get taken at line 61 but only at line https://github.com/apache/datafusion/pull/12296/files#diff-bca16ed42e39a82d942b706ad36b0d49502c12c9bc1c50e33ed443ce8c4d0437R65

We have two options imho:

passing distinct to get_value from stats
have a separate method for count?

jayzhan211 · 2024-09-14T00:34:14Z

datafusion/physical-optimizer/src/aggregate_statistics.rs

-        }
-    }
-    None
+    let value = agg_expr.fun().value_from_stats(


@edmondop
We have agg_expr.is_distinct() here, so you can differentiate distinct count and non-distinct case.

Maybe we can have StatisticsArgs similar to AccumulatorArgs.

pub struct StatisticsArgs<'a> { statistics: &'a Statistics, return_type: &'a DataType, /// Whether the aggregate function is distinct. /// /// ```sql /// SELECT COUNT(DISTINCT column1) FROM t; /// ``` pub is_distinct: bool, /// The physical expression of arguments the aggregate function takes. pub exprs: &'a [Arc<dyn PhysicalExpr>], }

So the function for count would return None if the aggregate is distinct? It felt like leaking in the UDF logic required by physical optimiser

It felt like leaking in the UDF logic required by physical optimiser

I think it makes sense, by default it has None, but we can also tune the function for optimizer 🤔

Maybe we need to find a better name than value_from_stats? I am not very clear why the distinct for count is important if you are simply "getting the value" from stats, if the precision is Exact

It seems that distinct has specifically to do with the optimizer and not with "getting a value for the specific UDF from statistics"

The optimization for count is mainly for count(*), since it doesn't make sense for distinct case, so we only care about non-distinct count for value_from_stats, but not distinct count.

It seems that distinct has specifically to do with the optimizer and not with "getting a value for the specific UDF from statistics"

distinct is for value_from_stats to know what kind of function we have, either count or distinct count.

alamb · 2024-09-27T11:08:43Z

How is this PR coming @edmondop ? Do you think you'll be able to finish it in the near term ?

If not, perhaps we can put up the issue as a good first issue for someone as it has a "pretty close" implementation that just needs some help finishing

edmondop · 2024-09-27T11:48:04Z

How is this PR coming @edmondop ? Do you think you'll be able to finish it in the near term ?

If not, perhaps we can put up the issue as a good first issue for someone as it has a "pretty close" implementation that just needs some help finishing

I will wrap up this weekend, it's so close you are right

alamb

This is great @edmondop thank you

I think this is a really nice API and generalizing the optimization is 👍

alamb · 2024-09-29T12:00:59Z

datafusion/expr/src/udaf.rs

@@ -93,6 +94,19 @@ impl fmt::Display for AggregateUDF {
    }
 }

+pub struct StatisticsArgs<'a> {


❤️

I think it would also be great to add some documentation. Perhaps like this:

Suggested change

pub struct StatisticsArgs<'a> {

/// Arguments passed to [`AggregateUDFImpl::value_from_stats`]

pub struct StatisticsArgs<'a> {

alamb · 2024-09-29T13:36:06Z

datafusion/expr/src/udaf.rs

@@ -574,6 +595,10 @@ pub trait AggregateUDFImpl: Debug + Send + Sync {
    fn is_descending(&self) -> Option<bool> {
        None
    }
+    // Return the value of the current UDF from the statistics


Suggested change

// Return the value of the current UDF from the statistics

/// Return the value of this UDF for the query if it can be determined entirely from

/// statistics and arguments.

///

/// For example, if the minimum value of column `x` is known exactly in the statistics,

/// then `MIN(x)` can be replaced by that value, significantly improving query performance.

alamb · 2024-09-29T13:36:29Z

datafusion/functions-aggregate/src/count.rs

@@ -291,6 +294,36 @@ impl AggregateUDFImpl for Count {
    fn default_value(&self, _data_type: &DataType) -> Result<ScalarValue> {
        Ok(ScalarValue::Int64(Some(0)))
    }
+
+    fn value_from_stats(&self, statistics_args: &StatisticsArgs) -> Option<ScalarValue> {
+        if statistics_args.is_distinct {


jayzhan211

👍

alamb · 2024-09-30T19:09:56Z

Thanks again so much @edmondop and @jayzhan211

~~I'll make a PR to add the documentation suggestions~~
Update: PR with documentation: #12689

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules functions labels Sep 2, 2024

Removes min/max/count comparison based on name in aggregate statistics

f478344

edmondop force-pushed the issue-11151 branch from b9262ec to f478344 Compare September 2, 2024 18:57

jayzhan211 reviewed Sep 2, 2024

View reviewed changes

alamb reviewed Sep 6, 2024

View reviewed changes

This was referenced Sep 6, 2024

Remove special casting of Min / Max built in AggregateFunctions #11151

Closed

DataFusion weekly project plan (Andrew Lamb) - Sep 2, 2024 #12336

Closed

alamb marked this pull request as draft September 11, 2024 14:56

edmondop added 2 commits September 13, 2024 23:17

Abstracting away value from statistics

751e35a

Removing imports

016decf

jayzhan211 reviewed Sep 14, 2024

View reviewed changes

Introduced StatisticsArgs

f41dd3e

github-actions bot added the physical-expr Physical Expressions label Sep 28, 2024

Fixed docs

faf6c45

edmondop marked this pull request as ready for review September 28, 2024 15:02

alamb approved these changes Sep 29, 2024

View reviewed changes

alamb changed the title ~~Removes min/max/count comparison based on name in aggregate statistics~~ Add value_from_statisics to AggregateUDFImpl, remove special case for min/max/count aggregate statistics Sep 29, 2024

jayzhan211 approved these changes Sep 29, 2024

View reviewed changes

alamb merged commit 0242767 into apache:main Sep 30, 2024
25 checks passed

alamb mentioned this pull request Sep 30, 2024

Improve documentation for AggregateUDFImpl::value_from_stats #12689

Merged

edmondop mentioned this pull request Sep 30, 2024

Remove Min/Max references from AggregateStatistics #11153

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `value_from_statisics` to AggregateUDFImpl, remove special case for min/max/count aggregate statistics #12296

Add `value_from_statisics` to AggregateUDFImpl, remove special case for min/max/count aggregate statistics #12296

edmondop commented Sep 2, 2024

jayzhan211 Sep 2, 2024

edmondop Sep 3, 2024

jayzhan211 Sep 3, 2024 •

edited

Loading

alamb Sep 6, 2024

alamb left a comment

alamb Sep 6, 2024

alamb commented Sep 11, 2024

edmondop commented Sep 13, 2024

jayzhan211 Sep 14, 2024 •

edited

Loading

edmondop Sep 14, 2024 •

edited

Loading

jayzhan211 Sep 14, 2024

edmondop Sep 14, 2024

jayzhan211 Sep 14, 2024

alamb commented Sep 27, 2024

edmondop commented Sep 27, 2024

alamb left a comment

alamb Sep 29, 2024

alamb Sep 29, 2024

alamb Sep 29, 2024

jayzhan211 left a comment

alamb commented Sep 30, 2024 •

edited

Loading

	pub struct StatisticsArgs<'a> {
	/// Arguments passed to [`AggregateUDFImpl::value_from_stats`]
	pub struct StatisticsArgs<'a> {

-    // Return the value of the current UDF from the statistics
+    /// Return the value of this UDF for the query if it can be determined entirely from
+    /// statistics and arguments.
+    ///
+    /// For example, if the minimum value of column `x` is known exactly in the statistics,
+    /// then `MIN(x)` can be replaced by that value, significantly improving query performance.

Add value_from_statisics to AggregateUDFImpl, remove special case for min/max/count aggregate statistics #12296

Add value_from_statisics to AggregateUDFImpl, remove special case for min/max/count aggregate statistics #12296

Conversation

edmondop commented Sep 2, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan211 Sep 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Sep 11, 2024

edmondop commented Sep 13, 2024

jayzhan211 Sep 14, 2024 • edited Loading

Choose a reason for hiding this comment

edmondop Sep 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Sep 27, 2024

edmondop commented Sep 27, 2024

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan211 left a comment

Choose a reason for hiding this comment

alamb commented Sep 30, 2024 • edited Loading

Add `value_from_statisics` to AggregateUDFImpl, remove special case for min/max/count aggregate statistics #12296

Add `value_from_statisics` to AggregateUDFImpl, remove special case for min/max/count aggregate statistics #12296

jayzhan211 Sep 3, 2024 •

edited

Loading

jayzhan211 Sep 14, 2024 •

edited

Loading

edmondop Sep 14, 2024 •

edited

Loading

alamb commented Sep 30, 2024 •

edited

Loading