Skip to content

Improved error for expand wildcard rule #15287

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

Jiashu-Hu
Copy link
Contributor

@Jiashu-Hu Jiashu-Hu commented Mar 18, 2025

Which issue does this PR close?

Rationale for this change

The current error messages for non-aggregate columns in GROUP BY queries are technically accurate but not user-friendly. They use technical terminology like "non-aggregate values" without clearly explaining the problem or providing actionable solutions. This PR improves these error messages to be more intuitive and helpful, especially for users who might not be familiar with SQL aggregation concepts.

What changes are included in this PR?

added two function in to expand_wildcard_rule.rs:

  1. fn validate_columns_in_group_by_or_aggregate(has_group_by: bool, expanded: &Vec<Expr>, group_by_columns: &Vec<Expr>, aggregate_columns: &Vec<Expr>) -> Result<(), DataFusionError>
  • Verifies all selected columns appear either in the GROUP BY clause or within aggregate functions
  • Returns a clear SchemaError::GroupByColumnNotInSelectList error when validation fails
  1. added a new SchemaError type to shows the error message - datafusion_common::SchemaError::GroupByColumnNotInSelectList

In datafusion/sql/src/utils.rs:

  • Updated message_prefix() method to provide clearer error message prefixes
  • Improved help text in check_column_satisfies_expr() to suggest specific solutions
  • for easy understanding as requested in #15004

Are these changes tested?

There are several existing tests that directly evaluate this function during the CICL process, and all of them have been successfully passed.

Are there any user-facing changes?

Yes. Users will now see more intuitive error messages when they encounter GROUP BY validation issues. For example, instead of seeing:

Error during planning: Projection references non-aggregate values: Expression foo.b could not be resolved from available columns: foo.a

They will see:

SchemaError: Column 'foo.b' must appear in the GROUP BY clause or be used in an aggregate function

This change makes errors more understandable and provides clearer guidance on how to fix the issue.

@github-actions github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Mar 18, 2025
@@ -826,6 +827,13 @@ impl<S: ContextProvider> SqlToRel<'_, S> {
.map(|expr| rebase_expr(expr, &aggr_projection_exprs, input))
.collect::<Result<Vec<Expr>>>()?;

// check if the columns in the SELECT list are in the GROUP BY clause
// or are part of an aggregate function, if not, throw an error.
validate_columns_in_group_by_or_aggregate(
Copy link
Contributor

@jayzhan211 jayzhan211 Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the error should be handled in the check_columns_satisfy_exprs below with ProjectionMustReferenceAggregate, is it possible to unify these 2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey jay, thanks for your review! I've unified these 2 also updated error message in the test file.

@github-actions github-actions bot removed the common Related to common crate label Mar 18, 2025
@alamb alamb changed the title Improvement/improve wildcard error 15004 Improved error for expand wildcard rule Mar 21, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @Jiashu-Hu

I also verified that the example in the original reproducer now is much better 😍

DataFusion CLI v46.0.1
> create table foo(a int, b int, c timestamp) as values (1,2,'2025-01-01T12:01:02');

SELECT *
FROM foo
WHERE c >= NOW() - INTERVAL '1 hour'
GROUP BY a;
0 row(s) fetched.
Elapsed 0.037 seconds.

Error during planning: Column in SELECT must be in GROUP BY or an aggregate function: While expanding wildcard, column "foo.b" must appear in the GROUP BY clause or must be part of an aggregate function, currently only "foo.a" appears in the SELECT clause satisfies this requirement

cc @mhilton

@jayzhan211 jayzhan211 merged commit c2c43ec into apache:main Mar 22, 2025
28 checks passed
@jayzhan211
Copy link
Contributor

Thanks @alamb @Jiashu-Hu

@ozankabak
Copy link
Contributor

ozankabak commented Mar 22, 2025

This PR seems to have broken main. It is an easy fix (error messages do not match).

@alamb -- Extended tests break very frequently these days. We should prioritize completing the work on running them before merge.

@alamb
Copy link
Contributor

alamb commented Mar 23, 2025

This PR seems to have broken main. It is an easy fix (error messages do not match).

@alamb -- Extended tests break very frequently these days. We should prioritize completing the work on running them before merge.

@ozankabak I think there is a PR waiting on review that would let us more easily run the tests before merge. I am not sure how we would have known to run them in this case though

I feel like what would be idea is some workflow that actually ran the tests before merge but that didn't need us to babysit the PR -- something like letting a bot merge it or a merge queue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improved error for expand wildcard rule
4 participants