Skip to content

fix: make columnize_expr resistant to display_name collisions #10459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 14, 2024

Conversation

jonahgao
Copy link
Member

@jonahgao jonahgao commented May 11, 2024

Which issue does this PR close?

Closes #10413.

Rationale for this change

Background

columnize_expr is used during planning projections. Usually, a projected expr can only reference the output data of the input plan through a column reference, but this does not work for aggregation and window functions. For example

let aggr_expr = sum(col("c2"));
let df = test_table()
      .await?
      .aggregate(vec![], vec![aggr_expr.clone()])?
      .select(vec![aggr_expr])?;

The aggregate expression sum(col("c2")) is calculated in the aggregate logical plan, and the projection plan references its result through the aggregate expression directly. To make it work appropriately, we need to convert the aggregate expression in the projection into a column reference of the aggregate plan, which is done by columnize_expr.

Problem And Fix

Previously, columnize_expr searched for the corresponding column from the input schema through the display name of projected expr. However, different expressions may have the same display name. It is easy to reproduce by creating the same alias for another differenct expression. Collisions of display name will cause columnize_expr to find the wrong column, producing incorrect results as in issue #10413.

The fix is to search from input plan's output expressions directly based on the expression itself, instead of searching by display name.

What changes are included in this PR?

Refactor columnize_expr so that it can correctly find the column corresponding to the expression.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes. columnize_expr is a public method, although it may only be used internally in DataFusion.

@jonahgao jonahgao marked this pull request as draft May 11, 2024 15:55
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels May 11, 2024
@jonahgao jonahgao marked this pull request as ready for review May 12, 2024 13:23
alias_expr,
Arc::new(outer_aggr),
)?)))
Ok(Some(project(outer_aggr, alias_expr)?))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function project() will internally call columnize_expr, which can simplify the code here.

@@ -1154,12 +1154,12 @@ mod test {
let table_scan = test_table_scan()?;

let plan = LogicalPlanBuilder::from(table_scan)
.project(vec![lit(1) + col("a")])?
.project(vec![lit(1) + col("a"), col("a")])?
Copy link
Member Author

@jonahgao jonahgao May 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It corresponds to select a+1 from (select a+1 from t), which is not a valid query.

> select a+1 from (select a+1 from t);
Schema error: No field named a. Valid fields are "t.a + Int64(1)"

columnize_expr currently only works when the input plan is aggregate or window.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jonahgao -- I went through this PR quite carefully and it looks really nice to me ❤️

I made a note of some place we maybe could avoid some clones, but I don't think it is required

cc @peter-toth and @erratic-pattern

@@ -1619,7 +1619,16 @@ select count(1) from v;
query I
select a + b from (select 1 as a, 2 as b, 1 as "a + b");
----
1
3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@@ -228,37 +228,18 @@ impl OptimizerRule for SingleDistinctToGroupBy {
.collect::<Result<Vec<_>>>()?;

// construct the inner AggrPlan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @appletreeisyellow - this PR contains some changes which may conflict with #10295

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome -- thank you @jonahgao

Really really nice

@alamb alamb merged commit cd36ee3 into apache:main May 14, 2024
23 checks passed
@jonahgao
Copy link
Member Author

Thanks for reviewing @alamb .

@jonahgao jonahgao deleted the fix_columnize_expr branch May 14, 2024 12:52
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…he#10459)

* fix: make `columnize_expr` resistant to display_name collisions

* fix simple_window_function test

* remove Projection

* add tests

* retry ci

* fix DataFrame tests

* Update datafusion/expr/src/logical_plan/plan.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Remove copies

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect results with expression resolution
2 participants