Perf: Support automatically concat_batches for sort which will improve performance #15375
take

cc @alamb

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ concat_batches_for_sort ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1 │ 2241.04ms │ 1816.69ms │ +1.23x faster │
│ Q2 │ 1841.01ms │ 1496.73ms │ +1.23x faster │
│ Q3 │ 12755.85ms │ 12770.18ms │ no change │
│ Q4 │ 4433.49ms │ 3278.70ms │ +1.35x faster │
│ Q5 │ 4414.15ms │ 4409.04ms │ no change │
│ Q6 │ 4543.09ms │ 4597.32ms │ no change │
│ Q7 │ 8012.85ms │ 9026.30ms │ 1.13x slower │
│ Q8 │ 6572.37ms │ 6049.51ms │ +1.09x faster │
│ Q9 │ 6734.63ms │ 6345.69ms │ +1.06x faster │
│ Q10 │ 9896.16ms │ 9564.17ms │ no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main) │ 61444.64ms │
│ Total Time (concat_batches_for_sort) │ 59354.33ms │
│ Average Time (main) │ 6144.46ms │
│ Average Time (concat_batches_for_sort) │ 5935.43ms │
│ Queries Faster │ 5 │
│ Queries Slower │ 1 │
│ Queries with No Change │ 4 │
└────────────────────────────────────────┴────────────┘
This is a great observation, and the POC optimization has a high ROI. Here are some additional thoughts:

This is just intuition, but I think the sorting phase should be faster than the sort-preserving merge phase: the sorting implementation is much simpler, since it can rely on the existing optimized arrow row format and the quicksort implementation from the standard library. The merging phase, on the contrary, uses an in-house implementation of a loser tree heap; it's a bit complex, so it may also be hard to optimize manually.

Example: there is a sort query to run in 4 partitions, and each partition will process 100 input batches.

Implementation: the POC will copy the batches with
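To make the intuition above concrete, here is a minimal, self-contained sketch (not DataFusion's actual code) contrasting the two strategies on plain `Vec<i32>` "batches": concatenating everything and sorting once with the standard library's quicksort, versus sorting each batch and then k-way merging the sorted runs with a min-heap standing in for the loser tree. Both must produce identical output; the claim in the comment is only about which is faster.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Concat-then-sort: copy all batches into one buffer and sort once.
/// Analogous to concatenating record batches and sorting the row format.
fn concat_and_sort(batches: &[Vec<i32>]) -> Vec<i32> {
    let mut all: Vec<i32> = batches.iter().flatten().copied().collect();
    all.sort_unstable(); // std's pattern-defeating quicksort
    all
}

/// Sort each batch, then k-way merge the sorted runs with a min-heap
/// (a simple stand-in for the loser tree used in the merge phase).
fn sort_and_merge(batches: &[Vec<i32>]) -> Vec<i32> {
    let sorted: Vec<Vec<i32>> = batches
        .iter()
        .map(|b| {
            let mut s = b.clone();
            s.sort_unstable();
            s
        })
        .collect();

    // Heap entries are (value, run index, position in run); Reverse makes
    // the max-heap behave as a min-heap.
    let mut heap = BinaryHeap::new();
    for (run, values) in sorted.iter().enumerate() {
        if let Some(&first) = values.first() {
            heap.push(Reverse((first, run, 0usize)));
        }
    }

    let mut out = Vec::new();
    while let Some(Reverse((value, run, idx))) = heap.pop() {
        out.push(value);
        if idx + 1 < sorted[run].len() {
            heap.push(Reverse((sorted[run][idx + 1], run, idx + 1)));
        }
    }
    out
}

fn main() {
    // One partition with 100 small input batches, as in the example above.
    let batches: Vec<Vec<i32>> = (0..100)
        .map(|i| (0..64).map(|j| (i * 31 + j * 17) % 1000).collect())
        .collect();
    assert_eq!(concat_and_sort(&batches), sort_and_merge(&batches));
    println!("both strategies produce identical sorted output");
}
```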
Thank you @2010YOUY01 for the review and good suggestions; I will improve my POC code and add more testing.
Really nice observation! I think we should drive this further. Some further observations I had when looking at the current implementation on master for the in-memory merging part:
Thank you @Dandandan, I addressed your comments. We can make this the first version, and in the future we may improve it as described by @2010YOUY01. I also think the current implementation is reasonable because `sort_in_place_threshold_bytes` is an already-used config; we can first reuse it to decide when to concat batches, which is safe.
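The decision described above can be sketched as a simple byte-size gate. This is a hedged illustration, not DataFusion's actual code: the `Batch` struct and `should_concat` helper are hypothetical, and the 1 MiB threshold is only illustrative; the real config key `sort_in_place_threshold_bytes` comes from the discussion above.

```rust
/// Hypothetical stand-in for a record batch; only its memory size matters here.
struct Batch {
    bytes: usize,
}

/// Returns true when the total buffered size is small enough that
/// concatenating into a single batch and sorting in place is preferable
/// to a multi-way sort-preserving merge.
fn should_concat(batches: &[Batch], sort_in_place_threshold_bytes: usize) -> bool {
    let total: usize = batches.iter().map(|b| b.bytes).sum();
    total <= sort_in_place_threshold_bytes
}

fn main() {
    let threshold = 1024 * 1024; // illustrative 1 MiB threshold

    let small: Vec<Batch> = (0..10).map(|_| Batch { bytes: 4 * 1024 }).collect();
    let large: Vec<Batch> = (0..10).map(|_| Batch { bytes: 4 * 1024 * 1024 }).collect();

    // ~40 KiB buffered: concat and sort in place.
    assert!(should_concat(&small, threshold));
    // ~40 MiB buffered: fall back to sort-preserving merge.
    assert!(!should_concat(&large, threshold));
    println!("threshold gate behaves as expected");
}
```

Reusing an existing config this way keeps the first version safe: no new knob is introduced, and the behavior change is bounded by a limit operators already tune.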
@2010YOUY01 that sounds like a very promising future direction. I might try experimenting with this soon if no one beats me to it.
Is your feature request related to a problem or challenge?
We should investigate and improve the sort code to support `concat_batches` for more cases beyond the case above.
See details about the performance improvement:
#15348 (comment)
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response