Skip to content

Convert indexed ARRAY_AGG to NTH_VALUE #9213

Open
@mustafasrepo

Description

@mustafasrepo

Is your feature request related to a problem or challenge?

Query below

SELECT a, ARRAY_AGG(c ORDER BY c)[1] as result
                        FROM multiple_ordered_table
                        GROUP BY a;

and

SELECT a, NTH_VALUE(c, 1 ORDER BY c) as result
                        FROM multiple_ordered_table
                        GROUP BY a;

produces same results. However, first query generates following plan

"ProjectionExec: expr=[a@0 as a, (ARRAY_AGG(multiple_ordered_table.c) ORDER BY [multiple_ordered_table.c ASC NULLS LAST]@1).[1] as result]",
"  AggregateExec: mode=Single, gby=[a@0 as a], aggr=[ARRAY_AGG(multiple_ordered_table.c)], ordering_mode=Sorted",
"    CsvExec: file_groups={1 group: [[CSV_PATH]]}, projection=[a, c], output_orderings=[[a@0 ASC NULLS LAST], [c@1 ASC NULLS LAST]], has_header=true",

whereas second query generates following plan

"ProjectionExec: expr=[a@0 as a, NTH_VALUE(multiple_ordered_table.c,Int64(1)) ORDER BY [multiple_ordered_table.c ASC NULLS LAST]@1 as result]",
"  AggregateExec: mode=Single, gby=[a@0 as a], aggr=[NTH_VALUE(multiple_ordered_table.c,Int64(1))], ordering_mode=Sorted",
"    CsvExec: file_groups={1 group: [[CSV_PATH]]}, projection=[a, c], output_orderings=[[a@0 ASC NULLS LAST], [c@1 ASC NULLS LAST]], has_header=true",

Describe the solution you'd like

we can rewrite first query as second one, which executes faster with less memory. Because it no longer needs to keep all results in the array_agg.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions