Skip to content

Create UNION plan node with correct schema #14380

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
findepi opened this issue Jan 31, 2025 · 18 comments
Open

Create UNION plan node with correct schema #14380

findepi opened this issue Jan 31, 2025 · 18 comments
Labels
enhancement New feature or request

Comments

@findepi
Copy link
Member

findepi commented Jan 31, 2025

Is your feature request related to a problem or challenge?

When building a query plan union() behavior doesn't type-coerce the expressions and it takes the left node's schema.
This creates an invalid plan which later needs to be fixed. It would be better to create a valid plan from the start.

Note: #14356 moves the code around, but doesn't change the behavior. See @alamb's #14356 (comment)

Describe the solution you'd like

Create correctly coerced plan if possible

Describe alternatives you've considered

Keep creating incorrect plan. Rely on some later logic to fix it some time later.

Additional context

@rkrishn7
Copy link
Contributor

rkrishn7 commented Feb 6, 2025

Hello @findepi!

I'd be happy to work on this. I believe it's as simple as shifting union schema coercion from the analyzer to logical plan building.

However, since wildcard expansion happens later on, we wouldn't want to attempt coercing the plan inputs. Which makes me wonder - is there benefit to producing a plan that has a correct schema but may have non-coerced input plans? 🤔

@findepi
Copy link
Member Author

findepi commented Feb 6, 2025

You have good point about the wildcards

is there benefit to producing a plan that has a correct schema but may have non-coerced input plans?

i don't think there is
you are right it would require expanding the wildcards. do you think it's not possible?

@jonahgao
Copy link
Member

jonahgao commented Feb 7, 2025

We need to rely on the TypeCoercion analyzer rule to obtain the correct schema of union inputs. See #11961 for some details.

We might need to perform union coercion twice: once in the builder, and once in TypeCoercion(after coercion of exprs).

@jonahgao
Copy link
Member

jonahgao commented Feb 7, 2025

I also think it's necessary to move the ExpandWildcardRule to the builder. This helps to solve the current problem. We won't need to perform wildcard expansion when computing schemas, which is a duplicated operation. #14118 also requires it.

Perhaps @goldmedal have some thoughts on it.

@findepi
Copy link
Member Author

findepi commented Feb 7, 2025

I also think it's necessary to move the ExpandWildcardRule to the builder.

That would be great!

That could potentially unlock removing Expr::Wildcard. It's not really an expression (just like Expr::Alias isn't #1468 and Sort weren't #12193)

@rkrishn7
Copy link
Contributor

rkrishn7 commented Feb 7, 2025

+1 on moving ExpandWildcardRule!

We might need to perform union coercion twice: once in the builder, and once in TypeCoercion(after coercion of exprs).

Hmm, my immediate thought here is that if we cannot guarantee correctly coerced inputs during the first pass, then it's probably not worth attempting to coerce in the builder at all. But, I could be missing something!

@goldmedal
Copy link
Contributor

goldmedal commented Feb 8, 2025

Perhaps @goldmedal have some thoughts on it.

The original discussion about wildcard expansion is #11639 (comment). It's used to delay the wildcard expansion to decrease the cost when facing huge column numbers (#11639 (comment)).

We won't need to perform wildcard expansion when computing schemas, which is a duplicated operation.

Indeed. I have no idea how to avoid this 🤔. Maybe moving wildcard expansion back to the builder is the only choice (?

That could potentially unlock removing Expr::Wildcard. It's not really an expression (just like Expr::Alias isn't #1468 and Sort weren't #12193)

I'm not pretty sure if Expr::Wildcard is meaningful in the logical plan phase but our project (Wren AI) will generate a different plan for the table according to Wildcard usage. However, if removing Wildcard from the logical plan phase is beneficial for the logical planner, it's ok for me to remove it. We can find another approach to do the same thing.

Or maybe we can allow the customized behavior for the wildcard planning, just like the original reason of #11639

@jonahgao
Copy link
Member

jonahgao commented Feb 8, 2025

The original discussion about wildcard expansion is #11639 (comment). It's used to delay the wildcard expansion to decrease the cost when facing huge column numbers (#11639 (comment)).

Since exprlist_to_fields is called in the builder, it seems that wildcard expansion still hasn't been delayed.

Or maybe we can allow the customized behavior for the wildcard planning, just like the original reason of #11639

Using ExprPlanner makes sense to me.

@jonahgao
Copy link
Member

jonahgao commented Feb 8, 2025

Hmm, my immediate thought here is that if we cannot guarantee correctly coerced inputs during the first pass, then it's probably not worth attempting to coerce in the builder at all. But, I could be missing something!

I think most cases can be correctly coerced during the first pass; it is a best-effort behavior.

@goldmedal
Copy link
Contributor

Since exprlist_to_fields is called in the builder, it seems that wildcard expansion still hasn't been delayed.

I see. I think we can revert the change #11681 first. Also cc @jayzhan211

Or maybe we can allow the customized behavior for the wildcard planning, just like the original reason of #11639

Using ExprPlanner makes sense to me.

👍

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 10, 2025

Hmm, my immediate thought here is that if we cannot guarantee correctly coerced inputs during the first pass, then it's probably not worth attempting to coerce in the builder at all. But, I could be missing something!

I think most cases can be correctly coerced during the first pass; it is a best-effort behavior.

We might need to perform union coercion twice: once in the builder, and once in TypeCoercion(after coercion of exprs).

Doing coercion twice doesn't make sense to me.

Should we move TypeCoercion into builder 🤔 ? If not, I don't think having coercion in builder is what we need

@jayzhan211
Copy link
Contributor

That could potentially unlock removing Expr::Wildcard. It's not really an expression (just like Expr::Alias isn't #1468 and Sort weren't #12193)

With Expr::Wildcard, it represents all the columns which might be helpful for wide columns 1000+

@findepi
Copy link
Member Author

findepi commented Feb 10, 2025

I appreciate the conciseness of representing all columns of a table with a wildcard.
This doesn't change the fact that Expr::Wildcard is not an expression.
If there are any doubts about that, then #7765 is the right place for the discussion.

Of course, there is * and t.* syntax and it needs to remain supported.

@jonahgao
Copy link
Member

jonahgao commented Feb 10, 2025

Should we move TypeCoercion into builder 🤔 ?

I also thought about that, but some users do not want the casts introduced by TypeCoercion. See #12105 (comment)
And it is a major breaking change.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 11, 2025

but some users do not want the casts introduced by TypeCoercion

We can make it optional or even customizable

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 11, 2025

Since exprlist_to_fields is called in the builder, it seems that wildcard expansion still hasn't been delayed.

Computing a schema for a wildcard differs from expanding the actual expressions that match the wildcard

@findepi
Copy link
Member Author

findepi commented Feb 11, 2025

but some users do not want the casts introduced by TypeCoercion

This is very broad statement.
I don't like the DataFusion coercion logic per se because it has bugs (eg #14273), but seems infeasible to implement a query execution engine with either (a) some coercions or (b) requirement that all arguments are pre-coerced (no-op coercion logic). TypeCoercion does the (a), a new IR (#12723) would do the (b). But we cannot say "some user's don't want (a) and don't want/have (b)". Saying so turns DataFusion into syntactical bag of functionalities without a governing design or contract. This just doesn't allow future development.

@jonahgao
Copy link
Member

Computing a schema for a wildcard differs from expanding the actual expressions that match the wildcard

I think their logic is basically the same, otherwise there would be issues similar to #14102.

Another potential issue is that in exprlist_to_fields, we don't handle replace items, but ExpandWildcardRule does.
select * replace ('foo' as a) from t will give the wrong datatype in schema before executing analyzer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants