Skip to content

Scalars are too verbose in column name output #15395

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
blaginin opened this issue Mar 24, 2025 · 9 comments
Open

Scalars are too verbose in column name output #15395

blaginin opened this issue Mar 24, 2025 · 9 comments
Labels
enhancement New feature or request

Comments

@blaginin
Copy link
Contributor

blaginin commented Mar 24, 2025

Is your feature request related to a problem or challenge?

When parsing scalar expressions, DataFusion makes queries quite complicated:

>  select 3, array_length([1, 2, 4, 5, 10000, 2.4]);;
+----------+-----------------------------------------------------------------------------------------+
| Int64(3) | array_length(make_array(Int64(1),Int64(2),Int64(4),Int64(5),Int64(10000),Float64(2.4))) |
+----------+-----------------------------------------------------------------------------------------+
| 3        | 6                                                                                       |
+----------+-----------------------------------------------------------------------------------------+

Such detailed comparison isn't needed in most cases and adds unnecessary cognitive complexity for developers. Moreover, it also breaks the original query - if you put select Int64(3), you'll get Error during planning: Invalid function 'int64'

Describe the solution you'd like

In most cases, the simplest version will be enough. On the example above it is:

> select 3, array_length([1, 2, 4, 5, 10000, 2.4]);
+---+---------------------------------------------+
| 3 | array_length(make_array(1,2,4,5,10000,2.4)) |
+---+---------------------------------------------+
| 3 | 6                                           |
+---+---------------------------------------------+

IMO there are three ways how to fix this problem:

  1. Just remove :? here for all cases
    Expr::Literal(v) => write!(f, "{v:?}"),

    This will make queries simpler everywhere. The downside is that we lose some information, e.g., type info in the shell (which one can argue isn't needed and can be checked separately)
  2. Disable verbose as a ConfigOptions param and make it changeable (via datafusion-cli or sdk).
  3. Use short names only if parsing back preserves the correct type. For example, when formatting Int64(0), output 0 (since parsing 0 results in Int64(0)). When formatting Int32(0), keep Int32(0).

Additional context

May be quite easy to update after we finish #15178

@blaginin blaginin added the enhancement New feature or request label Mar 24, 2025
@blaginin
Copy link
Contributor Author

Created an issue to ask what others think - the change is quite simple, but maybe there were previous discussions or consequences I don't see?

@Omega359
Copy link
Contributor

for context duckdb:

D select 3, array_length([1, 2, 4, 5, 10000, 2.4]);
┌───────┬───────────────────────────────────────────────────────┐
│   3   │ array_length(main.list_value(1, 2, 4, 5, 10000, 2.4)) │
│ int32 │                         int64                         │
├───────┼───────────────────────────────────────────────────────┤
│   3   │                           6                           │
└───────┴───────────────────────────────────────────────────────┘

@blaginin
Copy link
Contributor Author

Thank you Bruce!!! Btw adding column type (second line of the first row) can actually be a good thing to implement separately

@jayzhan211
Copy link
Contributor

If we have column type, we don't need to display type for inner elements. Maybe we can work on column type first?

@alamb alamb changed the title Scalars are too verbose Scalars are too verbose in column name output Mar 25, 2025
@alamb
Copy link
Contributor

alamb commented Mar 25, 2025

Here is a related ticket:

Created an issue to ask what others think - the change is quite simple, but maybe there were previous discussions or consequences I don't see?

I think @irenjj added similar code recently to support new duckdb style tree explain plans via Expr::sql_name:

However, it seems like the display is still somewhat 🤮 (I think b/c the column name is bad)

> explain format tree select 3, array_length([1, 2, 4, 5, 10000, 2.4]);;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │       ProjectionExec      │ |
|               | │    --------------------   │ |
|               | │        Int64(3): 3        │ |
|               | │                           │ |
|               | │  array_length(make_array  │ |
|               | │     (Int64(1),Int64(2)    │ |
|               | │     ,Int64(4),Int64(5)    │ |
|               | │   ,Int64(10000),Float64   │ |
|               | │          (2.4))):         │ |
|               | │             6             │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │     PlaceholderRowExec    │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.

Perhaps we should switch to tree by default as well as use sql_name for the column name 🤔

@irenjj
Copy link
Contributor

irenjj commented Mar 26, 2025

Here is a related ticket:

Created an issue to ask what others think - the change is quite simple, but maybe there were previous discussions or consequences I don't see?

I think @irenjj added similar code recently to support new duckdb style tree explain plans via Expr::sql_name:

However, it seems like the display is still somewhat 🤮 (I think b/c the column name is bad)

explain format tree select 3, array_length([1, 2, 4, 5, 10000, 2.4]);;
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ ProjectionExec │ |
| | │ -------------------- │ |
| | │ Int64(3): 3 │ |
| | │ │ |
| | │ array_length(make_array │ |
| | │ (Int64(1),Int64(2) │ |
| | │ ,Int64(4),Int64(5) │ |
| | │ ,Int64(10000),Float64 │ |
| | │ (2.4))): │ |
| | │ 6 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ PlaceholderRowExec │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.
Perhaps we should switch to tree by default as well as use sql_name for the column name 🤔

sql_name only works in aggr func, need more effort for projection, window, ...

@blaginin
Copy link
Contributor Author

If we have column type, we don't need to display type for inner elements. Maybe we can work on column type first?

Thank you!!! That's fair, I've created a separate ticket: #15442

@blaginin
Copy link
Contributor Author

I think @irenjj added similar code recently to support new duckdb style tree explain plans via Expr::sql_name

We can potentially use human_display instead of schema_name when creating schemas. Though tbf I really like the idea of always rendering data in a more compact way (the first option from my list) - it should be even easier to implement 🤗

@alamb
Copy link
Contributor

alamb commented Mar 26, 2025

always rendering data in a more compact way (the first option from my list) -

I think it is a better choice too

The challenge is that it will change the schema of the output (the output column name will change) as will the internal column names

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants