Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ResolveGroupingAnalytics analyzer rule #5749

Closed
wants to merge 10 commits into from

Conversation

mingmwang
Copy link
Contributor

@mingmwang mingmwang commented Mar 27, 2023

Which issue does this PR close?

Closes #5672
Closes #5649
Closes #5647.

Some TPC-DS(q27, q36, q70, q86)queries will be impacted by this PR also. Those queries include the GROUPING() functions
and the actually physical plans were not able to evaluate before this PR.

Rationale for this change

What changes are included in this PR?

  1. Add new ResolveGroupingAnalytics rule, move the enumerate_grouping_sets logic to this rule
  2. Add two internal hidden columns when necessary
// Internal column used to represent the grouping_id, used by the grouping functions.
// It is "spark_grouping_id" in Spark
const INTERNAL_GROUPING_ID: &str = "grouping_id";
// Internal column used to represent different grouping sets when there are duplicated grouping sets
const INTERNAL_GROUPING_SET_ID: &str = "grouping_set_id";
  1. Support GROUPING() and GROUPING_ID() functions.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner labels Mar 27, 2023
@mingmwang
Copy link
Contributor Author

@jackwener @yukkit @alamb

@alamb
Copy link
Contributor

alamb commented Apr 4, 2023

I am not an expert in GROUPING sets / etc. I wonder if someone else has the time to review this PR?

@jackwener jackwener self-requested a review April 5, 2023 11:08
@mingmwang
Copy link
Contributor Author

@yjshen Could you please help to review this PR ?

@jackwener
Copy link
Member

jackwener commented Apr 10, 2023

It looks like no one has reviewed this PR, and I'm not an expert about GROUP.
so I'm going to review this PR while looking up information. It need some time.

@jackwener
Copy link
Member

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mingmwang , It's a excellent job!

Comment on lines +73 to +76
match self {
AggregateFunction::GroupingId => {
write!(f, "GROUPING_ID")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look like we don't need it

@@ -110,6 +110,28 @@ async fn csv_query_group_by_boolean() -> Result<()> {
Ok(())
}

#[tokio::test]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A future ticket: add some test in slt

@@ -2856,8 +2856,8 @@ fn aggregate_with_rollup() {
let sql =
"SELECT id, state, age, COUNT(*) FROM person GROUP BY id, ROLLUP (state, age)";
let expected = "Projection: person.id, person.state, person.age, COUNT(UInt8(1))\
\n Aggregate: groupBy=[[GROUPING SETS ((person.id), (person.id, person.state), (person.id, person.state, person.age))]], aggr=[[COUNT(UInt8(1))]]\
\n TableScan: person";
\n Aggregate: groupBy=[[person.id, ROLLUP (person.state, person.age)]], aggr=[[COUNT(UInt8(1))]]\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@@ -202,6 +202,77 @@ impl PartialEq<dyn Any> for UnKnownColumn {
}
}

#[derive(Debug, Hash, PartialEq, Eq, Clone)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +732 to +742
let mut group_data = group
.iter()
.enumerate()
.map(|(idx, is_null)| {
if *is_null {
null_exprs_value[idx].clone()
} else {
exprs_value[idx].clone()
}
})
.collect::<Vec<_>>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can extract a function named get_group_data.

Comment on lines +754 to +764
group
.iter()
.enumerate()
.map(|(idx, is_null)| {
if *is_null {
null_exprs_value[idx].clone()
} else {
exprs_value[idx].clone()
}
})
.collect::<Vec<_>>()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use function get_group_data

/// Hidden grouping set expr in the grouping set
hidden_grouping_set_expr: Vec<(Arc<dyn PhysicalExpr>, String)>,
/// Distinct result expr for the grouping set, used to generate output schema
result_expr: Vec<(Arc<dyn PhysicalExpr>, String)>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question.

Is result_expr a part of grouping_set_expr?
If yes, maybe we can use a index vec to point grouping_set_expr

Comment on lines +652 to +654
if !expr.contains_hidden_columns()
&& cols.iter().all(|c| group_expr_columns.contains(c))
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hiddent columns may influent optimizer rule, it may cause some bug.
We should add more test to cover it in the following PR.

@@ -1364,6 +1423,8 @@ fn create_name(e: &Expr) -> Result<String> {
"Create name does not support qualified wildcard".to_string(),
)),
Expr::Placeholder { id, .. } => Ok((*id).to_string()),
Expr::HiddenColumn(_, c) => Ok(format!("#{}", c)),
Expr::HiddenExpr(first, _) => Ok(format!("#{}", first)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should HiddenExpe be Ok(format!("{}", first))? look like # is wrong.

Comment on lines +34 to +35

pub struct ResolveGroupingAnalytics;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add some comment to explain this rule resolve what problem.

@alamb
Copy link
Contributor

alamb commented May 3, 2023

What is the status of this PR? It has accumulated non trivial conflicts. Shall we mark it as a draft ?

@mingmwang
Copy link
Contributor Author

What is the status of this PR? It has accumulated non trivial conflicts. Shall we mark it as a draft ?

Yes, please mark it as a draft.

@jackwener jackwener marked this pull request as draft May 4, 2023 11:10
@alamb
Copy link
Contributor

alamb commented Apr 8, 2024

Since this has been open for more than a year, closing it down. Feel free to reopen if/when you keep working on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner
Projects
None yet
3 participants