Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement basic unnest function #6796

Closed
wants to merge 4 commits into from

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Jun 29, 2023

Which issue does this PR close?

Ref #6555

Closes #.

Rationale for this change

Support unnest for arrays

What changes are included in this PR?

  • select unnest()
  • select * from unnest()
  • unnest alias
  • unnest multidimensional arrays
  • unnest args more than 2
  • select partial columns
  • deterministic result for datafusion.execution.target_partitions > 1 (We might not need this?)

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Jun 29, 2023
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jun 29, 2023

Current implementation.

There are two workflows for unnest

unnest_arrays is the core function for building an Unnest Logical Plan.

Workflow 1: select unnest(), we will get statements from select.projection. SQL::Function unnest is then transformed to Expr::Unnest. We will then build the LogicalPlan in optimize phase. In optimize rule, when we see Expr::Unnest, build the Unnest Logical Plan with unnest_arrays.

Workflow 2:select * from unnest(), we can get the table relation Unnest directly. In this case, we can build an Unnest Logical Plan with unnest_arrays without Expr::Unnest involved.

@github-actions github-actions bot removed the physical-expr Physical Expressions label Jun 29, 2023
@jayzhan211 jayzhan211 marked this pull request as ready for review June 29, 2023 06:46
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jun 29, 2023

Code cleanup is done, and ready for review.
@izveigor @alamb

set datafusion.execution.target_partitions = 1;

query ?
select unnest(make_array(1,2,3,4,5))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also recommend to create test with one of aggregate functions as stated in the description of the issue: #6555

SELECT sum(a) AS total FROM (SELECT unnest(make_array(3, 5, 6) AS a) AS b;
----
14

Copy link
Contributor Author

@jayzhan211 jayzhan211 Jun 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one also. Add to task list

@jayzhan211 jayzhan211 marked this pull request as draft July 2, 2023 01:32
@jayzhan211 jayzhan211 marked this pull request as ready for review July 2, 2023 06:52
)?);

self.apply_expr_alias(apply_name_plan, alias.columns)
let plan = self.apply_expr_alias(plan, alias.columns)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order is changed so that we can get either expr or table. expr.

Example in array.slt
table.expr: SELECT sum(data.a), sum(data.b) FROM unnest(make_array(1, 2, 3), make_array(4, 5, 6, 7)) as data(a, b);
expr: SELECT sum(a), sum(b) FROM unnest(make_array(1, 2, 3), make_array(4, 5, 6, 7)) as data(a, b);

@jayzhan211 jayzhan211 marked this pull request as draft July 3, 2023 02:50
@jayzhan211
Copy link
Contributor Author

Wait for sqlparser 0.36

@alamb
Copy link
Contributor

alamb commented Jul 3, 2023

I was planning to release sqlparser in a few weeks -- but if it is blocking this PR I can find time to make a release sooner

@jayzhan211
Copy link
Contributor Author

I was planning to release sqlparser in a few weeks -- but if it is blocking this PR I can find time to make a release sooner

I hope sqlparser can be released soon! Thanks!

@alamb
Copy link
Contributor

alamb commented Jul 3, 2023

Next release is tracked by apache/datafusion-sqlparser-rs#905

I will try and prioritize it but this week is short for me because of the US holiday (and I am spending most of my DataFusion time on #6800)

@jayzhan211
Copy link
Contributor Author

@alamb Kindly remind you. :)

@jayzhan211 jayzhan211 force-pushed the unnest_array branch 2 times, most recently from 9127da7 to e55a2fb Compare July 28, 2023 04:27
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jul 28, 2023

Possible improvement:
Utilize the existing unnest_columns function instead building a new one.

TODO in another PR

  • Support column-wise unnest
  • Able to preserve nulls
  • Select partial columns
  • Partial Unnest
  • Subquery

@jayzhan211 jayzhan211 marked this pull request as ready for review July 28, 2023 07:09
@jayzhan211 jayzhan211 marked this pull request as draft July 28, 2023 07:10
@jayzhan211 jayzhan211 marked this pull request as ready for review July 28, 2023 07:24
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- this is getting there but I don't think it is ready to merge yet (for reasons I left in comments below)

My major concerns are that we aren't using the LogicalPlan::Unnest and UnnestExec physical operator -- doing so would mean that improvements we plan (like making UnnestExec more efficient) will not apply to using unnest as a function

As this PR is large and I would like to move it along, I suggest

  1. Remove flatten (we can put that into another PR)
  2. Make a PR with just the Expr::Unnest -- maybe this could simply create a LogicalPlan::Unnest?
  3. Make a PR to add any additional features needed for unnest (like multiple column support, for example)

What do you think?

cc @izveigor @vincev and @smiklos

# TODO: Unnest columns fails
query error DataFusion error: SQL error: TokenizerError\("Unterminated string literal at Line: 2, Column 95"\)
caused by
Internal error: UNNEST only supports list type\. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this appears still to be NotImplemented

datafusion/expr/src/expr.rs Outdated Show resolved Hide resolved
datafusion/expr/src/expr.rs Show resolved Hide resolved
args: flatten_args,
}))
}
_ => Err(DataFusionError::Internal(format!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ => Err(DataFusionError::Internal(format!(
_ => Err(DataFusionError::NotYetImplemented(format!(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datafusion/expr/src/expr_schema.rs Outdated Show resolved Hide resolved
@@ -1038,6 +1042,114 @@ impl LogicalPlanBuilder {
pub fn unnest_column(self, column: impl Into<Column>) -> Result<Self> {
Ok(Self::from(unnest(self.plan, column.into())?))
}

pub fn join_unnest_plans(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use https://docs.rs/datafusion/latest/datafusion/physical_plan/unnest/struct.UnnestExec.html directly rather than building up a special LogicalPlan? It seems like UnnestExec already does this

datafusion/expr/src/logical_plan/builder.rs Show resolved Hide resolved
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Aug 8, 2023

As this PR is large and I would like to move it along, I suggest

  1. Remove flatten (we can put that into another PR)
  2. Make a PR with just the Expr::Unnest -- maybe this could simply create a LogicalPlan::Unnest?
  3. Make a PR to add any additional features needed for unnest (like multiple column support, for example)

Let me work on #6995 first while thinking about step 2, since I don't think simply creating LogicalPlan from Expr::Unnest is trivial.

@alamb alamb marked this pull request as draft August 16, 2023 18:52
@alamb
Copy link
Contributor

alamb commented Aug 16, 2023

Marking as draft to signify it is not waiting on review

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Aug 30, 2023

  1. Make a PR with just the Expr::Unnest -- maybe this could simply create a LogicalPlan::Unnest?

To create LogicalPlan::Unnest, it seems we can only do that in the optimization step. But this can't avoid schema change, due to name unnest([1,2,3]) -> [1,2,3] and type list(i64) -> i64.

Convert expr to logical plan in select_to_plan works, cool.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Sep 12, 2023

The Unnest function is now processing in the planning step, converting Expr::Unnest to LogicalPlan::Unnest in this phase. Nothing to do in the optimize phase and only the existing UnnestExec is processed in the PhysicalPlan phase. Not sure if this is a good design but it works :)

Maybe we can review the overall design, and then I will address the comments and resolve conflicts.

@jayzhan211 jayzhan211 marked this pull request as ready for review September 12, 2023 12:52
@alamb
Copy link
Contributor

alamb commented Sep 12, 2023

I will try and review this more carefully later today and provide feedback. Thank you @jayzhan211

.collect::<Result<Vec<_>>>()?;

if data_types.is_empty() {
return Err(DataFusionError::NotImplemented(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: use macro

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@alamb
Copy link
Contributor

alamb commented Oct 30, 2023

Marking this one as draft as it has bitrotted for a while and I think we need to work on simplifing the array code we have before we add more

@alamb
Copy link
Contributor

alamb commented Apr 8, 2024

I think DataFusion now supports UNNEST -- is this PR still being worked on?

@jayzhan211
Copy link
Contributor Author

I think DataFusion now supports UNNEST -- is this PR still being worked on?

We support single columns only, the challenging one multiple columns has not yet been supported, I remember @jonahgao is working on it. I forgot which PR you mentioned.

@jonahgao
Copy link
Member

jonahgao commented Apr 9, 2024

I think DataFusion now supports UNNEST -- is this PR still being worked on?

We support single columns only, the challenging one multiple columns has not yet been supported, I remember @jonahgao is working on it. I forgot which PR you mentioned.

Not started yet. I just discovered that it hasn't been implemented in #9592 (comment), but I can start working on it from now on.

@jonahgao
Copy link
Member

jonahgao commented Apr 9, 2024

I think I might have misunderstood. Neither of the two queries below has been implemented, the one discussed in this PR seems to be the latter.

DataFusion CLI v37.0.0
❯ select unnest(ARRAY[1,2]), unnest(ARRAY[3,4,5]);
This feature is not implemented: Only support single unnest expression for now

❯ select * from unnest(ARRAY[1,2], ARRAY[2,3]);
This feature is not implemented: unnest() does not support multiple arguments yet

I plan to try to implement them both.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants