Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add value_from_statisics to AggregateUDFImpl, remove special case for min/max/count aggregate statistics #12296

Merged
merged 5 commits into from
Sep 30, 2024

Conversation

edmondop
Copy link
Contributor

@edmondop edmondop commented Sep 2, 2024

Which issue does this PR close?

Closes #11151 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules functions labels Sep 2, 2024
@@ -262,6 +262,19 @@ impl AggregateUDF {
self.inner.is_descending()
}

/// Returns true if the function is min. Used by the optimizer
pub fn is_min(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not we should do...
We need to understand the context and have a general function instead of a specialize name matching function.

default_value is a good example, it is only used in count for now, but it could extend to any function if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jayzhan211 looking to this comment from @alamb #11151 (comment) it might be that those "general functions" might not exists, will do some homework

Copy link
Contributor

@jayzhan211 jayzhan211 Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is really no "general context", I think it is fine to just leave them as it is. Having specific function name matching in Impl Trait doesn't make sense to me 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jayzhan211 that it is important that these methods describe some property about the function rather than "what the function is"

What if we added a function like this:

/// Return the value of this aggregate function from statistics
///
/// If the value of this aggregate, can be determined using only the 
/// statistics, return `Some(value)`, otherwise return `None` (the default)
///
/// # Arguments
/// * `statistics`: the statistics describing the input to this aggregate functions
/// * `args`: the arguments passed to the aggregate function 
///
/// The value of some aggregate functions such as `COUNT`, `MIN` and `MAX` 
/// can be determined using statistics, if known
///
fn value_from_stats(&self, statistics: &Statistics, arguments: &[Arc<PhysicalExpr>]) -> Option<ScalarValue> { None }

I think you could then implement this function for min / max and count (moving logic out of the aggregate statistics optimizer). It might need some other information (like schema for types, for example) but I think it would be pretty straight forward

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @edmondop and @jayzhan211 - I have a suggestion about how to proceed. Let me know if that makes sense

Sorry (again) for the delay

@@ -262,6 +262,19 @@ impl AggregateUDF {
self.inner.is_descending()
}

/// Returns true if the function is min. Used by the optimizer
pub fn is_min(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jayzhan211 that it is important that these methods describe some property about the function rather than "what the function is"

What if we added a function like this:

/// Return the value of this aggregate function from statistics
///
/// If the value of this aggregate, can be determined using only the 
/// statistics, return `Some(value)`, otherwise return `None` (the default)
///
/// # Arguments
/// * `statistics`: the statistics describing the input to this aggregate functions
/// * `args`: the arguments passed to the aggregate function 
///
/// The value of some aggregate functions such as `COUNT`, `MIN` and `MAX` 
/// can be determined using statistics, if known
///
fn value_from_stats(&self, statistics: &Statistics, arguments: &[Arc<PhysicalExpr>]) -> Option<ScalarValue> { None }

I think you could then implement this function for min / max and count (moving logic out of the aggregate statistics optimizer). It might need some other information (like schema for types, for example) but I think it would be pretty straight forward

@alamb
Copy link
Contributor

alamb commented Sep 11, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@edmondop
Copy link
Contributor Author

Thanks @edmondop and @jayzhan211 - I have a suggestion about how to proceed. Let me know if that makes sense

Sorry (again) for the delay

I tried to implement that solution, but the problem is now here

https://github.com/apache/datafusion/pull/12296/files#diff-bca16ed42e39a82d942b706ad36b0d49502c12c9bc1c50e33ed443ce8c4d0437

that there is no real distinction between count and min_max. So the non distinct count doesn't get taken at line 61 but only at line https://github.com/apache/datafusion/pull/12296/files#diff-bca16ed42e39a82d942b706ad36b0d49502c12c9bc1c50e33ed443ce8c4d0437R65

We have two options imho:

  • passing distinct to get_value from stats
  • have a separate method for count?

}
}
None
let value = agg_expr.fun().value_from_stats(
Copy link
Contributor

@jayzhan211 jayzhan211 Sep 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edmondop
We have agg_expr.is_distinct() here, so you can differentiate distinct count and non-distinct case.

Maybe we can have StatisticsArgs similar to AccumulatorArgs.

pub struct StatisticsArgs<'a> {
    statistics: &'a Statistics,
    return_type: &'a DataType,
    /// Whether the aggregate function is distinct.
    ///
    /// ```sql
    /// SELECT COUNT(DISTINCT column1) FROM t;
    /// ```
    pub is_distinct: bool,
    /// The physical expression of arguments the aggregate function takes.
    pub exprs: &'a [Arc<dyn PhysicalExpr>],
}

Copy link
Contributor Author

@edmondop edmondop Sep 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the function for count would return None if the aggregate is distinct? It felt like leaking in the UDF logic required by physical optimiser

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It felt like leaking in the UDF logic required by physical optimiser

I think it makes sense, by default it has None, but we can also tune the function for optimizer 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to find a better name than value_from_stats? I am not very clear why the distinct for count is important if you are simply "getting the value" from stats, if the precision is Exact

It seems that distinct has specifically to do with the optimizer and not with "getting a value for the specific UDF from statistics"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The optimization for count is mainly for count(*), since it doesn't make sense for distinct case, so we only care about non-distinct count for value_from_stats, but not distinct count.

It seems that distinct has specifically to do with the optimizer and not with "getting a value for the specific UDF from statistics"

distinct is for value_from_stats to know what kind of function we have, either count or distinct count.

@alamb
Copy link
Contributor

alamb commented Sep 27, 2024

How is this PR coming @edmondop ? Do you think you'll be able to finish it in the near term ?

If not, perhaps we can put up the issue as a good first issue for someone as it has a "pretty close" implementation that just needs some help finishing

@edmondop
Copy link
Contributor Author

How is this PR coming @edmondop ? Do you think you'll be able to finish it in the near term ?

If not, perhaps we can put up the issue as a good first issue for someone as it has a "pretty close" implementation that just needs some help finishing

I will wrap up this weekend, it's so close you are right

@github-actions github-actions bot added the physical-expr Physical Expressions label Sep 28, 2024
@edmondop edmondop marked this pull request as ready for review September 28, 2024 15:02
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great @edmondop thank you

so-beautiful

I think this is a really nice API and generalizing the optimization is 👍

@@ -93,6 +94,19 @@ impl fmt::Display for AggregateUDF {
}
}

pub struct StatisticsArgs<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

I think it would also be great to add some documentation. Perhaps like this:

Suggested change
pub struct StatisticsArgs<'a> {
/// Arguments passed to [`AggregateUDFImpl::value_from_stats`]
pub struct StatisticsArgs<'a> {

@@ -574,6 +595,10 @@ pub trait AggregateUDFImpl: Debug + Send + Sync {
fn is_descending(&self) -> Option<bool> {
None
}
// Return the value of the current UDF from the statistics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Return the value of the current UDF from the statistics
/// Return the value of this UDF for the query if it can be determined entirely from
/// statistics and arguments.
///
/// For example, if the minimum value of column `x` is known exactly in the statistics,
/// then `MIN(x)` can be replaced by that value, significantly improving query performance.

@@ -291,6 +294,36 @@ impl AggregateUDFImpl for Count {
fn default_value(&self, _data_type: &DataType) -> Result<ScalarValue> {
Ok(ScalarValue::Int64(Some(0)))
}

fn value_from_stats(&self, statistics_args: &StatisticsArgs) -> Option<ScalarValue> {
if statistics_args.is_distinct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb changed the title Removes min/max/count comparison based on name in aggregate statistics Add value_from_statisics to AggregateUDFImpl, remove special case for min/max/count aggregate statistics Sep 29, 2024
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 0242767 into apache:main Sep 30, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 30, 2024

Thanks again so much @edmondop and @jayzhan211

I'll make a PR to add the documentation suggestions
Update: PR with documentation: #12689

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove special casting of Min / Max built in AggregateFunctions
3 participants