Skip to content

perf(array-agg): add fast path for array agg for merge_batch #14299

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

rluvaton
Copy link
Contributor

@rluvaton rluvaton commented Jan 25, 2025

Which issue does this PR close?

N/A

Rationale for this change

We want faster DataFusion

What changes are included in this PR?

Added fast path for array_agg accumulator for merge_batch that return the underlying values of the list in case there is no null that point to the non empty list starting from the first valid index and until last valid index

Are these changes tested?

Existing tests

Are there any user-facing changes?

Nope


This depend on:

as we want to see how this change affect the performance.

Running locally giving promising results (almost 2 times faster for the optimized cases and no regression for not optimized):

Local benchmark result
     Running benches/array_agg.rs
array_agg i64 merge_batch no nulls
                        time:   [56.934 ns 57.381 ns 57.866 ns]
                        change: [-99.986% -99.985% -99.984%] (p = 0.00 < 0.05)
                        Performance has improved.

array_agg i64 merge_batch all nulls, 100% of nulls point to a zero length array
                        time:   [27.782 ns 27.815 ns 27.848 ns]
                        change: [-99.494% -99.492% -99.491%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

array_agg i64 merge_batch all nulls, 90% of nulls point to a zero length array
                        time:   [27.915 ns 27.962 ns 28.053 ns]
                        change: [-99.490% -99.489% -99.487%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe

array_agg i64 merge_batch 30% nulls, 100% of nulls point to a zero length array
                        time:   [3.0493 µs 3.0511 µs 3.0529 µs]
                        change: [-98.935% -98.916% -98.904%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

array_agg i64 merge_batch 70% nulls, 100% of nulls point to a zero length array
                        time:   [3.0350 µs 3.0403 µs 3.0478 µs]
                        change: [-97.645% -97.633% -97.618%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

array_agg i64 merge_batch 30% nulls, 99% of nulls point to a zero length array
                        time:   [276.64 µs 277.36 µs 278.18 µs]
                        change: [+0.1651% +0.5937% +1.0052%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

array_agg i64 merge_batch 70% nulls, 99% of nulls point to a zero length array
                        time:   [128.58 µs 129.70 µs 132.16 µs]
                        change: [-0.4591% +0.2833% +1.6758%] (p = 0.72 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

array_agg i64 merge_batch 30% nulls, 90% of nulls point to a zero length array
                        time:   [280.05 µs 280.64 µs 281.21 µs]
                        change: [-2.7929% -1.0385% +0.2092%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

array_agg i64 merge_batch 70% nulls, 90% of nulls point to a zero length array
                        time:   [129.17 µs 129.35 µs 129.56 µs]
                        change: [+0.1147% +0.4826% +0.8998%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

array_agg i64 merge_batch 30% nulls, 50% of nulls point to a zero length array
                        time:   [281.86 µs 282.80 µs 284.59 µs]
                        change: [-0.2974% +0.0394% +0.4039%] (p = 0.84 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

array_agg i64 merge_batch 70% nulls, 50% of nulls point to a zero length array
                        time:   [128.73 µs 128.85 µs 129.02 µs]
                        change: [-0.4212% -0.1809% +0.0508%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

array_agg i64 merge_batch 30% nulls, 0% of nulls point to a zero length array
                        time:   [281.88 µs 282.14 µs 282.41 µs]
                        change: [+0.1493% +0.6065% +1.0981%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  12 (12.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

array_agg i64 merge_batch 70% nulls, 0% of nulls point to a zero length array
                        time:   [129.18 µs 129.30 µs 129.43 µs]
                        change: [-0.3113% -0.0708% +0.2198%] (p = 0.61 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

@alamb
Copy link
Contributor

alamb commented Jan 26, 2025

I ran the newly added array_agg benchmark and got some pretty serious performance improvements:

group                                                                              improve-performance-for-array-agg-merge-batch    main
-----                                                                              ---------------------------------------------    ----
array_agg i64 merge_batch all nulls, 100% of nulls point to a zero length array    1.00     87.3±0.09ns        ? ?/sec              225.81    19.7±0.02µs        ? ?/sec
array_agg i64 merge_batch all nulls, 90% of nulls point to a zero length array     1.00     86.3±0.07ns        ? ?/sec              228.62    19.7±0.10µs        ? ?/sec

I am not sure why all the runs aren't listed. I am looking into that

@alamb
Copy link
Contributor

alamb commented Jan 26, 2025

Here is my entire benchmark run

++ critcmp main improve-performance-for-array-agg-merge-batch
                                                                      improve-performance-for-array-agg-merge-batch    main
-----                                                                              ---------------------------------------------    ----
array_agg i64 merge_batch 30% nulls, 0% of nulls point to a zero length array      1.00    565.6±1.11µs        ? ?/sec              1.00    565.4±1.29µs        ? ?/sec
array_agg i64 merge_batch 30% nulls, 100% of nulls point to a zero length array    1.00      7.7±0.01µs        ? ?/sec              73.20   562.7±1.07µs        ? ?/sec
array_agg i64 merge_batch 30% nulls, 50% of nulls point to a zero length array     1.02    579.6±3.49µs        ? ?/sec              1.00   568.4±28.98µs        ? ?/sec
array_agg i64 merge_batch 30% nulls, 90% of nulls point to a zero length array     1.00    565.2±0.70µs        ? ?/sec              1.00    563.3±0.61µs        ? ?/sec
array_agg i64 merge_batch 30% nulls, 99% of nulls point to a zero length array     1.01    566.8±0.52µs        ? ?/sec              1.00    562.4±1.00µs        ? ?/sec
array_agg i64 merge_batch 70% nulls, 0% of nulls point to a zero length array      1.01   262.1±17.69µs        ? ?/sec              1.00   259.3±15.89µs        ? ?/sec
array_agg i64 merge_batch 70% nulls, 100% of nulls point to a zero length array    1.00      7.5±0.01µs        ? ?/sec              34.02   256.2±1.94µs        ? ?/sec
array_agg i64 merge_batch 70% nulls, 50% of nulls point to a zero length array     1.06    271.8±1.50µs        ? ?/sec              1.00    256.5±0.53µs        ? ?/sec
array_agg i64 merge_batch 70% nulls, 90% of nulls point to a zero length array     1.02   261.5±17.90µs        ? ?/sec              1.00    256.0±0.54µs        ? ?/sec
array_agg i64 merge_batch 70% nulls, 99% of nulls point to a zero length array     1.00    259.0±0.30µs        ? ?/sec              1.00   259.6±16.06µs        ? ?/sec
array_agg i64 merge_batch all nulls, 100% of nulls point to a zero length array    1.00     85.5±0.08ns        ? ?/sec              230.91    19.7±0.01µs        ? ?/sec
array_agg i64 merge_batch all nulls, 90% of nulls point to a zero length array     1.00     85.8±0.27ns        ? ?/sec              229.79    19.7±0.02µs        ? ?/sec
array_agg i64 merge_batch no nulls                                                 1.00    103.4±0.21ns        ? ?/sec              6215.33   642.8±1.08µs        ? ?/sec

TLDR is that this PR appears to have a very significant performance improvement. 🚀

If no one beats me to it I will give it a good look in the upcoming week

@alamb
Copy link
Contributor

alamb commented Jan 26, 2025

Nice work @rluvaton

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @rluvaton -- I reviewed the code and this makes sense to me and the benchmark numbers are 👌 very nice

BTW if we want to make AGG_ARRAY really fast we probably need to implement a GroupsAccumlator for it -- @Rachelint has an example of doing this in #13681

@rluvaton
Copy link
Contributor Author

Already has a local implementation of array aggregation distinct with groups, that will try to upstream soon

// If no nulls than just use the fast path
// This is ok as the state is a ListArray rather than a ListViewArray so all the values are consecutive
if null_count == 0 {
// According to Arrow specification, the first offset can be non-zero
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while writing this code I first found out that offsets does not have to start from 0

@alamb alamb merged commit 62000b4 into apache:main Jan 29, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Jan 29, 2025

Thanks again @rluvaton

@rluvaton rluvaton deleted the improve-performance-for-array-agg-merge-batch branch January 29, 2025 17:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions Changes to functions implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants