Perf: Optimize in memory sort #15380

Open
zhuqi-lucas wants to merge 17 commits into main

Conversation

zhuqi-lucas
Contributor

@zhuqi-lucas zhuqi-lucas commented Mar 24, 2025

Which issue does this PR close?

Rationale for this change

Perf: Support automatically concatenating batches (concat_batches) for sort, which improves performance.

This first version is mergeable; we can improve it later according to the comments in:

#15375 (comment)
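
For illustration only, here is a minimal, hypothetical sketch of the concat-then-sort idea (not the code in this PR): glue the buffered batches into one batch with arrow's concat_batches kernel, compute a sort permutation with lexsort_to_indices, and gather the columns with take. The helper name and the single sort key are assumptions made for the example.

use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::compute::{concat_batches, lexsort_to_indices, take, SortColumn};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Hypothetical helper (illustration only): concatenate the buffered batches
// once and sort the result in a single pass on the first column.
fn concat_and_sort(
    schema: Arc<Schema>,
    batches: &[RecordBatch],
) -> Result<RecordBatch, ArrowError> {
    // One copy: glue the small batches into a single contiguous batch.
    let merged = concat_batches(&schema, batches)?;

    // Compute the sort permutation (single sort key for simplicity).
    let sort_column = SortColumn {
        values: Arc::clone(merged.column(0)),
        options: None,
    };
    let indices = lexsort_to_indices(&[sort_column], None)?;

    // Second copy: gather every column into sorted order.
    let columns = merged
        .columns()
        .iter()
        .map(|c| take(c.as_ref(), &indices, None))
        .collect::<Result<Vec<ArrayRef>, ArrowError>>()?;

    RecordBatch::try_new(schema, columns)
}

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, false)]));
    let b1 = RecordBatch::try_new(
        Arc::clone(&schema),
        vec![Arc::new(Int64Array::from(vec![3, 1])) as ArrayRef],
    )?;
    let b2 = RecordBatch::try_new(
        Arc::clone(&schema),
        vec![Arc::new(Int64Array::from(vec![2, 4])) as ArrayRef],
    )?;
    let sorted = concat_and_sort(schema, &[b1, b2])?;
    assert_eq!(sorted.num_rows(), 4);
    Ok(())
}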

What changes are included in this PR?

Perf: Support automatically concatenating batches (concat_batches) for sort, which improves performance.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@zhuqi-lucas zhuqi-lucas marked this pull request as draft March 24, 2025 09:42
let mut current_batches = Vec::new();
let mut current_size = 0;

for batch in std::mem::take(&mut self.in_mem_batches) {
Contributor

I think it would be nice to use pop (while let Some(batch) = v.pop) here to remove the batch from the vec once sorted to reduce memory usage. Now the batch is AFAIK retained until after the loop.

Contributor Author

I think it would be nice to use pop (while let Some(batch) = v.pop) here to remove the batch from the vec once sorted to reduce memory usage. Now the batch is AFAIK retained until after the loop.

Thank you @Dandandan for the review and the good suggestion; addressed!
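
For illustration, a tiny hypothetical sketch of the drain-with-pop pattern suggested above (a plain Vec<u8> stands in for a record batch; this is not the PR's code):

// Each batch is removed from the vector and dropped as soon as it has been
// processed, instead of the whole Vec being retained until after the loop.
fn process_batches(in_mem_batches: &mut Vec<Vec<u8>>) -> usize {
    let mut total_bytes = 0;
    while let Some(batch) = in_mem_batches.pop() {
        total_bytes += batch.len();
        // `batch` goes out of scope here, so its memory can be freed now
        // rather than when the whole vector is dropped.
    }
    total_bytes
}

fn main() {
    let mut batches = vec![vec![0u8; 4], vec![0u8; 8]];
    assert_eq!(process_batches(&mut batches), 12);
    assert!(batches.is_empty());
}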

@Dandandan
Contributor

I think this is already looking quite nice. What do you need to finalize this, @zhuqi-lucas?

@zhuqi-lucas
Contributor Author

I think this is already looking quite nice. What do you need to finalize this, @zhuqi-lucas?

Thank you @Dandandan for the review; I think for the next step we just need to add the benchmark results for this PR.

This first version is mergeable; we can improve it later according to the comments in:

#15375 (comment)

@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review April 12, 2025 14:41
@zhuqi-lucas zhuqi-lucas changed the title PoC (Perf: Support automatically concat_batches for sort which will improve performance) Perf: Support automatically concat_batches for sort which will improve performance Apr 12, 2025
@zhuqi-lucas
Contributor Author

@alamb Do we have the CI benchmark running now? If not, I need your help to run it... Thanks a lot!

Also, I have run sort-tpch itself for the improvement result, but not the other benchmarks.

Previous sort-tpch:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  2241.04ms │               1816.69ms │ +1.23x faster │
│ Q2           │  1841.01ms │               1496.73ms │ +1.23x faster │
│ Q3           │ 12755.85ms │              12770.18ms │     no change │
│ Q4           │  4433.49ms │               3278.70ms │ +1.35x faster │
│ Q5           │  4414.15ms │               4409.04ms │     no change │
│ Q6           │  4543.09ms │               4597.32ms │     no change │
│ Q7           │  8012.85ms │               9026.30ms │  1.13x slower │
│ Q8           │  6572.37ms │               6049.51ms │ +1.09x faster │
│ Q9           │  6734.63ms │               6345.69ms │ +1.06x faster │
│ Q10          │  9896.16ms │               9564.17ms │     no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                      │ 61444.64ms │
│ Total Time (concat_batches_for_sort)   │ 59354.33ms │
│ Average Time (main)                    │  6144.46ms │
│ Average Time (concat_batches_for_sort) │  5935.43ms │
│ Queries Faster                         │          5 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │          4 │
└────────────────────────────────────────┴────────────┘

@zhuqi-lucas
Contributor Author

Latest result based on the current code:

--------------------
Benchmark sort_tpch1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │ 153.49ms │                137.57ms │ +1.12x faster │
│ Q2           │ 131.29ms │                120.93ms │ +1.09x faster │
│ Q3           │ 980.57ms │                982.22ms │     no change │
│ Q4           │ 252.25ms │                245.09ms │     no change │
│ Q5           │ 464.81ms │                449.27ms │     no change │
│ Q6           │ 481.44ms │                455.45ms │ +1.06x faster │
│ Q7           │ 810.73ms │                709.74ms │ +1.14x faster │
│ Q8           │ 498.10ms │                491.12ms │     no change │
│ Q9           │ 503.80ms │                510.20ms │     no change │
│ Q10          │ 789.02ms │                706.45ms │ +1.12x faster │
│ Q11          │ 417.39ms │                411.50ms │     no change │
└──────────────┴──────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                      │ 5482.89ms │
│ Total Time (concat_batches_for_sort)   │ 5219.53ms │
│ Average Time (main)                    │  498.44ms │
│ Average Time (concat_batches_for_sort) │  474.50ms │
│ Queries Faster                         │         5 │
│ Queries Slower                         │         0 │
│ Queries with No Change                 │         6 │
└────────────────────────────────────────┴───────────┘
--------------------
Benchmark sort_tpch10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  2243.52ms │               1825.64ms │ +1.23x faster │
│ Q2           │  1842.11ms │               1639.00ms │ +1.12x faster │
│ Q3           │ 12446.31ms │              11981.63ms │     no change │
│ Q4           │  4047.55ms │               3715.96ms │ +1.09x faster │
│ Q5           │  4364.46ms │               4503.51ms │     no change │
│ Q6           │  4561.01ms │               4688.31ms │     no change │
│ Q7           │  8158.01ms │               7915.54ms │     no change │
│ Q8           │  6077.40ms │               5524.08ms │ +1.10x faster │
│ Q9           │  6347.21ms │               5853.44ms │ +1.08x faster │
│ Q10          │ 11561.03ms │              14235.69ms │  1.23x slower │
│ Q11          │  6069.42ms │               5666.77ms │ +1.07x faster │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                      │ 67718.04ms │
│ Total Time (concat_batches_for_sort)   │ 67549.58ms │
│ Average Time (main)                    │  6156.19ms │
│ Average Time (concat_batches_for_sort) │  6140.87ms │
│ Queries Faster                         │          6 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │          4 │
└────────────────────────────────────────┴────────────┘

@Dandandan
Contributor

Thanks for sharing the results @zhuqi-lucas this is really interesting!

I think it mainly shows that we probably should try and use more efficient in memory sorting (e.g. an arrow kernel that sorts multiple batches) here rather than use SortPreservingMergeStream which is intended to be used on data streams.
The arrow kernel would avoid the regressions of concat.

@alamb
Contributor

alamb commented Apr 14, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (6063bc5) to 0b01fdf diff
Benchmarks: clickbench_1 clickbench_partitioned sort_tpch1
Results will be posted here when complete

@alamb
Contributor

alamb commented Apr 14, 2025

Thanks for sharing the results @zhuqi-lucas this is really interesting!

I think it mainly shows that we probably should try and use more efficient in memory sorting (e.g. an arrow kernel that sorts multiple batches) here rather than use SortPreservingMergeStream which is intended to be used on data streams. The arrow kernel would avoid the regressions of concat.

I think the SortPreservingMergeStream is about as efficient as we know how to make it

Maybe we can look into what overhead makes concat'ing better 🤔 Any per-stream overhead we can improve in SortPreservingMergeStream would likely flow directly to any query that does sorts


@Dandandan
Contributor

Hm, that doesn't make much sense, as:

Thanks for sharing the results @zhuqi-lucas this is really interesting!
I think it mainly shows that we probably should try and use more efficient in memory sorting (e.g. an arrow kernel that sorts multiple batches) here rather than use SortPreservingMergeStream which is intended to be used on data streams. The arrow kernel would avoid the regressions of concat.

I think the SortPreservingMergeStream is about as efficient as we know how to make it

Maybe we can look into what overhead makes concat'ing better 🤔 Any per-stream overhead we can improve in SortPreservingMergeStream would likely flow directly to any query that does sorts

Hm 🤔 ... but that will still take a separate step of sorting the input batches, which, next to sorting, involves a full extra copy using take (slower than concat) followed by merging the batches. Also, the built-in sort on the entire output is likely to be much faster than doing a merge on the outputs.

I think the most efficient way would be to sort the indices to the arrays in one step followed by interleave, without either concat or sort-followed-by-merge; this would benefit the most from the built-in sort algorithm and avoid copying the data.

@zhuqi-lucas
Contributor Author

It seems that when we merge the sorted batches, we are already using interleave to merge by the sorted indices; here is the code:

    /// Drains the in_progress row indexes, and builds a new RecordBatch from them
    ///
    /// Will then drop any batches for which all rows have been yielded to the output
    ///
    /// Returns `None` if no pending rows
    pub fn build_record_batch(&mut self) -> Result<Option<RecordBatch>> {
        if self.is_empty() {
            return Ok(None);
        }

        let columns = (0..self.schema.fields.len())
            .map(|column_idx| {
                let arrays: Vec<_> = self
                    .batches
                    .iter()
                    .map(|(_, batch)| batch.column(column_idx).as_ref())
                    .collect();
                Ok(interleave(&arrays, &self.indices)?)
            })
            .collect::<Result<Vec<_>>>()?;

        self.indices.clear();

But in this PR we also concat some batches into one batch; do you mean we could use the indices from each batch to build one output batch, just like the merge phase does?

@zhuqi-lucas
Contributor Author

🤖 ./gh_compare_branch.sh Benchmark Script Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux Comparing concat_batches_for_sort (6063bc5) to 0b01fdf diff Benchmarks: clickbench_1 clickbench_partitioned sort_tpch1 Results will be posted here when complete

Thanks @alamb for triggering this; it seems to be stuck.

@Dandandan
Contributor

Dandandan commented Apr 15, 2025

But in this PR we also concat some batches into one batch; do you mean we could use the indices from each batch to build one output batch, just like the merge phase does?

I mean theoretically we don't have to merge anything as all the batches are in memory.

The merging is useful for sorting streams of data, but I think it is expected that the process of sorting batches first, followed by a custom merge implementation, is slower than a single sorting pass based on Rust's std unstable sort (which is optimized for doing a minimal number of comparisons quickly).

@Dandandan
Contributor

A more complete rationale / explanation of the same idea was written here by @2010YOUY01 #15375 (comment)

An alternative to try to avoid copies is: first sort all elements' indices (a 2-level index consisting of (batch_idx, row_idx)), and get a permutation array.
Use the interleave kernel to construct the final result https://docs.rs/arrow/latest/arrow/compute/kernels/interleave/fn.interleave.html
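
A minimal, hypothetical sketch of that idea (not this PR's implementation), assuming a single non-null Int64 sort key for simplicity: sort a (batch_idx, row_idx) permutation once, then gather with the interleave kernel so the row data is copied only once.

use arrow::array::{Array, ArrayRef, Int64Array};
use arrow::compute::interleave;
use arrow::error::ArrowError;

// Sort a 2-level index over all in-memory batches, then interleave.
fn sort_batches_by_index(key_columns: &[Int64Array]) -> Result<ArrayRef, ArrowError> {
    // 2-level index: one (batch_idx, row_idx) entry per input row.
    let mut indices: Vec<(usize, usize)> = key_columns
        .iter()
        .enumerate()
        .flat_map(|(batch_idx, col)| (0..col.len()).map(move |row_idx| (batch_idx, row_idx)))
        .collect();

    // Single sorting pass over the indices with the std unstable sort.
    indices.sort_unstable_by_key(|&(batch_idx, row_idx)| key_columns[batch_idx].value(row_idx));

    // One copy of the data, gathered directly into sorted order.
    let arrays: Vec<&dyn Array> = key_columns.iter().map(|c| c as &dyn Array).collect();
    interleave(&arrays, &indices)
}

fn main() -> Result<(), ArrowError> {
    let b1 = Int64Array::from(vec![3, 1]);
    let b2 = Int64Array::from(vec![2, 4]);
    let sorted = sort_batches_by_index(&[b1, b2])?;
    assert_eq!(sorted.len(), 4);
    Ok(())
}

In a real sorter the non-key columns would be gathered with the same permutation, and the key comparison would come from the Row format or a lexicographic comparator rather than a single Int64 column.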

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Apr 15, 2025

But in this PR we also concat some batches into one batch; do you mean we could use the indices from each batch to build one output batch, just like the merge phase does?

I mean theoretically we don't have to merge anything as all the batches are in memory.

The merging is useful for sorting streams of data, but I think it is expected the process of sorting batches first followed by a custom merge implementation is slower than a single sorting pass based on rust std unstable sort (which is optimized for doing a minimal amount of comparisons quickly).

I think I got it now, thank you @Dandandan. Since we already have those batches in memory, we just need to sort all elements' indices first (a 2-level index of (batch_idx, row_idx)); we don't need to construct the StreamingMergeBuilder for the in-memory sort, we can just sort in a single pass.

Let me try this approach and compare the performance!

@zhuqi-lucas
Contributor Author

Very interesting. As a first try I now merge all in-memory batches and do a single sort; some queries become much faster and some much slower. I think that's because:

  1. We sort in memory without a merge, which is similar to sorting a single partition without partition parallelism?
  2. Previously, some of the merging benefited from partition parallelism?

So as a next step, can we try to make the in-memory sort parallel?

--------------------
Benchmark sort_tpch10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  2243.52ms │               1416.52ms │ +1.58x faster │
│ Q2           │  1842.11ms │               1096.12ms │ +1.68x faster │
│ Q3           │ 12446.31ms │              12535.45ms │     no change │
│ Q4           │  4047.55ms │               1964.73ms │ +2.06x faster │
│ Q5           │  4364.46ms │               5955.70ms │  1.36x slower │
│ Q6           │  4561.01ms │               6275.39ms │  1.38x slower │
│ Q7           │  8158.01ms │              19145.68ms │  2.35x slower │
│ Q8           │  6077.40ms │               5146.80ms │ +1.18x faster │
│ Q9           │  6347.21ms │               5544.48ms │ +1.14x faster │
│ Q10          │ 11561.03ms │              23572.68ms │  2.04x slower │
│ Q11          │  6069.42ms │               4810.88ms │ +1.26x faster │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                      │ 67718.04ms │
│ Total Time (concat_batches_for_sort)   │ 87464.44ms │
│ Average Time (main)                    │  6156.19ms │
│ Average Time (concat_batches_for_sort) │  7951.31ms │
│ Queries Faster                         │          6 │
│ Queries Slower                         │          4 │
│ Queries with No Change                 │          1 │
└────────────────────────────────────────┴────────────┘

Patch tried:

diff --git a/datafusion/physical-plan/src/sorts/sort.rs b/datafusion/physical-plan/src/sorts/sort.rs
index 7fd1c2b16..ec3cd89f3 100644
--- a/datafusion/physical-plan/src/sorts/sort.rs
+++ b/datafusion/physical-plan/src/sorts/sort.rs
@@ -671,85 +671,14 @@ impl ExternalSorter {
             return self.sort_batch_stream(batch, metrics, reservation);
         }

-        // If less than sort_in_place_threshold_bytes, concatenate and sort in place
-        if self.reservation.size() < self.sort_in_place_threshold_bytes {
-            // Concatenate memory batches together and sort
-            let batch = concat_batches(&self.schema, &self.in_mem_batches)?;
-            self.in_mem_batches.clear();
-            self.reservation
-                .try_resize(get_reserved_byte_for_record_batch(&batch))?;
-            let reservation = self.reservation.take();
-            return self.sort_batch_stream(batch, metrics, reservation);
-        }
-
-        let mut merged_batches = Vec::new();
-        let mut current_batches = Vec::new();
-        let mut current_size = 0;
-
-        // Drain in_mem_batches using pop() to release memory earlier.
-        // This avoids holding onto the entire vector during iteration.
-        // Note:
-        // Now we use `sort_in_place_threshold_bytes` to determine, in future we can make it more dynamic.
-        while let Some(batch) = self.in_mem_batches.pop() {
-            let batch_size = get_reserved_byte_for_record_batch(&batch);
-
-            // If adding this batch would exceed the memory threshold, merge current_batches.
-            if current_size + batch_size > self.sort_in_place_threshold_bytes
-                && !current_batches.is_empty()
-            {
-                // Merge accumulated batches into one.
-                let merged = concat_batches(&self.schema, &current_batches)?;
-                current_batches.clear();
-
-                // Update memory reservation.
-                self.reservation.try_shrink(current_size)?;
-                let merged_size = get_reserved_byte_for_record_batch(&merged);
-                self.reservation.try_grow(merged_size)?;
-
-                merged_batches.push(merged);
-                current_size = 0;
-            }
-
-            current_batches.push(batch);
-            current_size += batch_size;
-        }
-
-        // Merge any remaining batches after the loop.
-        if !current_batches.is_empty() {
-            let merged = concat_batches(&self.schema, &current_batches)?;
-            self.reservation.try_shrink(current_size)?;
-            let merged_size = get_reserved_byte_for_record_batch(&merged);
-            self.reservation.try_grow(merged_size)?;
-            merged_batches.push(merged);
-        }
-
-        // Create sorted streams directly without using spawn_buffered.
-        // This allows for sorting to happen inline and enables earlier batch drop.
-        let streams = merged_batches
-            .into_iter()
-            .map(|batch| {
-                let metrics = self.metrics.baseline.intermediate();
-                let reservation = self
-                    .reservation
-                    .split(get_reserved_byte_for_record_batch(&batch));
-
-                // Sort the batch inline.
-                let input = self.sort_batch_stream(batch, metrics, reservation)?;
-                Ok(input)
-            })
-            .collect::<Result<_>>()?;
-
-        let expressions: LexOrdering = self.expr.iter().cloned().collect();
-
-        StreamingMergeBuilder::new()
-            .with_streams(streams)
-            .with_schema(Arc::clone(&self.schema))
-            .with_expressions(expressions.as_ref())
-            .with_metrics(metrics)
-            .with_batch_size(self.batch_size)
-            .with_fetch(None)
-            .with_reservation(self.merge_reservation.new_empty())
-            .build()
+        // Because batches are all in memory, we can sort them in place
+        // Concatenate memory batches together and sort
+        let batch = concat_batches(&self.schema, &self.in_mem_batches)?;
+        self.in_mem_batches.clear();
+        self.reservation
+            .try_resize(get_reserved_byte_for_record_batch(&batch))?;
+        let reservation = self.reservation.take();
+        self.sort_batch_stream(batch, metrics, reservation)
     }

@Dandandan
Contributor

Dandandan commented Apr 15, 2025

I think concat followed by sort is slower in some cases because

  • Concat involves copying the entire batch (rather than only the keys to be sorted)
  • sort_batch_stream can be slower, as lexsort_to_indices is slower than the Row format in cases with many columns

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

The core improvements that I think are important:

  • Minimizing copying of the input batches to one (only once for the output)
  • Sorting once on the input batches rather than sort followed by merge
  • A good heuristic on when to switch from lexsort_to_indices-like sorting to RowConverter + sorting.
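
For the last point, a purely illustrative sketch of what such a heuristic could look like (the rule below is an assumption for the example, not something settled in this PR):

use arrow::datatypes::DataType;

// Hypothetical heuristic: lexsort_to_indices compares column by column, which
// works well for a single primitive sort key, while the Row format pays a
// one-time conversion cost but makes comparisons cheap for many keys or
// variable-length keys.
fn prefer_row_format(sort_key_types: &[DataType]) -> bool {
    let has_variable_length = sort_key_types.iter().any(|t| {
        matches!(
            t,
            DataType::Utf8 | DataType::LargeUtf8 | DataType::Binary | DataType::LargeBinary
        )
    });
    sort_key_types.len() > 1 || has_variable_length
}

fn main() {
    assert!(!prefer_row_format(&[DataType::Int64]));
    assert!(prefer_row_format(&[DataType::Utf8, DataType::Int64]));
}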

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Apr 15, 2025

I think concat followed by sort is slower in some cases because

  • Concat involves copying the entire batch (rather than only the keys to be sorted)
  • sort_batch_stream can be slower, as lexsort_to_indices is slower than the Row format in cases with many columns

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

The core improvements that I think are important:

  • Minimizing copying of the input batches to one (only once for the output)
  • Sorting once on the input batches rather than sort followed by merge
  • A good heuristic on when to switch from lexsort_to_indices-like sorting to RowConverter + sorting.

Good explanation.

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

I see, execute is already run per partition:

fn execute(
        &self,
        partition: usize,
        context: Arc<TaskContext>,
    ) -> Result<SendableRecordBatchStream> {

@2010YOUY01
Contributor

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

In this case, the final merging might become the bottleneck, because SPM does not have internal parallelism either; during the final merge only 1 core is busy.
I think 2 stages of sort-preserving merge are still needed, because ExternalSorter is blocking but SPM is not; this setup can keep all the cores busy after the partial sort is finished.
We just have to ensure they don't have a very large merge degree and become slow (with optimizations like this PR).

@Dandandan
Contributor

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

In this case, the final merging might become the bottleneck, because SPM does not have internal parallelism either; during the final merge only 1 core is busy. I think 2 stages of sort-preserving merge are still needed, because ExternalSorter is blocking but SPM is not; this setup can keep all the cores busy after the partial sort is finished. We just have to ensure they don't have a very large merge degree and become slow (with optimizations like this PR).

Yes, to be clear, I'm not arguing for removing SortPreservingMergeExec or sorting in two phases altogether or something similar; I was just reacting to the idea of adding more parallelism in in_mem_sort_stream, which probably won't help much.

SortPreserveMergeExec <= Does k-way merging based on input streams, with minimal memory overhead, maximizing input parallelism
     SortExec partitions[1,2,3,4,5,6,7,8,9,10] <= Performs in memory *sorting* if possible, for each input partition in parallel, only resorting to spill/merge when does not fit into memory 

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Apr 15, 2025

Thank you @2010YOUY01 @Dandandan, this is very interesting. I am thinking:

  1. Since the total size of all batches is fixed, we can first calculate the computed size of each partition, call it partition_cal_size.
  2. Then we set a min_sort_size and a max_sort_size, and determine the final_merged_batch_size:

final_merged_batch_size =
    if partition_cal_size < min_sort_size { min_sort_size }
    else if partition_cal_size > max_sort_size { max_sort_size }
    else { partition_cal_size }
    // i.e. partition_cal_size.clamp(min_sort_size, max_sort_size)

This prevents creating too many small batches (which can fragment merge tasks) or overly large batches.
It looks like a first version of the heuristic.

But how can we calculate the min_sort_size and max_sort_size?

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

In this case, the final merging might become the bottleneck, because SPM does not have internal parallelism either; during the final merge only 1 core is busy. I think 2 stages of sort-preserving merge are still needed, because ExternalSorter is blocking but SPM is not; this setup can keep all the cores busy after the partial sort is finished. We just have to ensure they don't have a very large merge degree and become slow (with optimizations like this PR).

Yes, to be clear, I'm not arguing for removing SortPreservingMergeExec or sorting in two phases altogether or something similar; I was just reacting to the idea of adding more parallelism in in_mem_sort_stream, which probably won't help much.

SortPreserveMergeExec <= Does k-way merging based on input streams, with minimal memory overhead, maximizing input parallelism
     SortExec partitions[1,2,3,4,5,6,7,8,9,10] <= Performs in memory *sorting* if possible, for each input partition in parallel, only resorting to spill/merge when does not fit into memory 

@alamb
Contributor

alamb commented Apr 15, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (6063bc5) to 0b01fdf diff
Benchmarks: clickbench_1 clickbench_partitioned sort_tpch
Results will be posted here when complete

@alamb
Contributor

alamb commented Apr 15, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux Comparing concat_batches_for_sort (6063bc5) to 0b01fdf diff Benchmarks: clickbench_1 clickbench_partitioned sort_tpch1 Results will be posted here when complete

Thanks @alamb for triggering this; it seems to be stuck.

yeah, sorry, I had a bug; retriggered it

@alamb
Contributor

alamb commented Apr 15, 2025

I think the most efficient way would be to sort the indices to the arrays in one step followed by interleave, without either concat or sort followed by merge which would benefit the most from the built in sort algorithm and avoids copying the data.

I wonder if we can skip interleave / copying entirely?

Specifically, what if we sorted to indices, as you suggested, but then instead of calling interleave (which will copy the data) before sending it to merge_streams, maybe we could have some way for the merge cursors to also take the indices -- so we would only copy data once 🤔

@alamb
Contributor

alamb commented Apr 15, 2025

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.56ms │                  0.56ms │     no change │
│ QQuery 1     │    80.70ms │                 78.53ms │     no change │
│ QQuery 2     │   116.89ms │                114.69ms │     no change │
│ QQuery 3     │   130.38ms │                125.49ms │     no change │
│ QQuery 4     │   861.43ms │                756.05ms │ +1.14x faster │
│ QQuery 5     │   878.02ms │                869.13ms │     no change │
│ QQuery 6     │     0.67ms │                  0.64ms │     no change │
│ QQuery 7     │   100.33ms │                 93.51ms │ +1.07x faster │
│ QQuery 8     │   979.95ms │                956.50ms │     no change │
│ QQuery 9     │  1283.74ms │               1245.12ms │     no change │
│ QQuery 10    │   304.60ms │                306.39ms │     no change │
│ QQuery 11    │   342.54ms │                340.89ms │     no change │
│ QQuery 12    │   930.04ms │                933.37ms │     no change │
│ QQuery 13    │  1337.30ms │               1341.28ms │     no change │
│ QQuery 14    │   869.61ms │                883.04ms │     no change │
│ QQuery 15    │  1088.81ms │               1083.02ms │     no change │
│ QQuery 16    │  1841.74ms │               1788.14ms │     no change │
│ QQuery 17    │  1680.12ms │               1638.39ms │     no change │
│ QQuery 18    │  3128.65ms │               3139.26ms │     no change │
│ QQuery 19    │   127.46ms │                120.42ms │ +1.06x faster │
│ QQuery 20    │  1169.35ms │               1195.58ms │     no change │
│ QQuery 21    │  1472.42ms │               1457.18ms │     no change │
│ QQuery 22    │  2595.51ms │               2696.08ms │     no change │
│ QQuery 23    │  8475.08ms │               8735.96ms │     no change │
│ QQuery 24    │   510.59ms │                515.80ms │     no change │
│ QQuery 25    │   441.39ms │                439.44ms │     no change │
│ QQuery 26    │   569.36ms │                581.32ms │     no change │
│ QQuery 27    │  1850.31ms │               1844.63ms │     no change │
│ QQuery 28    │ 13503.59ms │              13185.12ms │     no change │
│ QQuery 29    │   587.04ms │                548.23ms │ +1.07x faster │
│ QQuery 30    │   872.06ms │                861.85ms │     no change │
│ QQuery 31    │   924.05ms │                992.86ms │  1.07x slower │
│ QQuery 32    │  2763.71ms │               2715.48ms │     no change │
│ QQuery 33    │  3455.95ms │               3450.90ms │     no change │
│ QQuery 34    │  3466.53ms │               3478.02ms │     no change │
│ QQuery 35    │  1342.93ms │               1336.22ms │     no change │
│ QQuery 36    │   179.89ms │                185.83ms │     no change │
│ QQuery 37    │   106.98ms │                102.42ms │     no change │
│ QQuery 38    │   169.57ms │                181.57ms │  1.07x slower │
│ QQuery 39    │   263.27ms │                261.33ms │     no change │
│ QQuery 40    │    90.39ms │                 86.94ms │     no change │
│ QQuery 41    │    85.04ms │                 83.29ms │     no change │
│ QQuery 42    │    78.33ms │                 78.86ms │     no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 61056.89ms │
│ Total Time (concat_batches_for_sort)   │ 60829.33ms │
│ Average Time (HEAD)                    │  1419.93ms │
│ Average Time (concat_batches_for_sort) │  1414.64ms │
│ Queries Faster                         │          4 │
│ Queries Slower                         │          2 │
│ Queries with No Change                 │         37 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.39ms │                  2.57ms │ 1.07x slower │
│ QQuery 1     │    36.11ms │                 36.43ms │    no change │
│ QQuery 2     │    91.64ms │                 90.30ms │    no change │
│ QQuery 3     │    97.91ms │                 96.78ms │    no change │
│ QQuery 4     │   732.04ms │                776.35ms │ 1.06x slower │
│ QQuery 5     │   833.38ms │                872.17ms │    no change │
│ QQuery 6     │     2.10ms │                  2.26ms │ 1.07x slower │
│ QQuery 7     │    41.00ms │                 39.72ms │    no change │
│ QQuery 8     │   942.92ms │                946.92ms │    no change │
│ QQuery 9     │  1216.74ms │               1198.20ms │    no change │
│ QQuery 10    │   268.97ms │                280.66ms │    no change │
│ QQuery 11    │   302.80ms │                311.61ms │    no change │
│ QQuery 12    │   908.31ms │                943.03ms │    no change │
│ QQuery 13    │  1238.75ms │               1400.87ms │ 1.13x slower │
│ QQuery 14    │   862.33ms │                880.29ms │    no change │
│ QQuery 15    │  1072.63ms │               1053.58ms │    no change │
│ QQuery 16    │  1755.68ms │               1765.66ms │    no change │
│ QQuery 17    │  1653.94ms │               1624.45ms │    no change │
│ QQuery 18    │  3109.98ms │               3131.55ms │    no change │
│ QQuery 19    │    86.54ms │                 85.77ms │    no change │
│ QQuery 20    │  1148.95ms │               1170.51ms │    no change │
│ QQuery 21    │  1333.80ms │               1392.35ms │    no change │
│ QQuery 22    │  2373.02ms │               2456.49ms │    no change │
│ QQuery 23    │  8411.11ms │               8608.23ms │    no change │
│ QQuery 24    │   469.47ms │                488.99ms │    no change │
│ QQuery 25    │   399.27ms │                411.12ms │    no change │
│ QQuery 26    │   537.55ms │                546.69ms │    no change │
│ QQuery 27    │  1685.85ms │               1739.53ms │    no change │
│ QQuery 28    │ 12957.91ms │              12866.18ms │    no change │
│ QQuery 29    │   546.32ms │                542.10ms │    no change │
│ QQuery 30    │   846.79ms │                852.23ms │    no change │
│ QQuery 31    │   887.81ms │                891.51ms │    no change │
│ QQuery 32    │  2723.16ms │               2728.31ms │    no change │
│ QQuery 33    │  3360.97ms │               3394.86ms │    no change │
│ QQuery 34    │  3409.45ms │               3368.33ms │    no change │
│ QQuery 35    │  1284.03ms │               1297.77ms │    no change │
│ QQuery 36    │   127.54ms │                133.40ms │    no change │
│ QQuery 37    │    56.79ms │                 57.49ms │    no change │
│ QQuery 38    │   129.27ms │                127.87ms │    no change │
│ QQuery 39    │   211.07ms │                209.72ms │    no change │
│ QQuery 40    │    49.05ms │                 51.83ms │ 1.06x slower │
│ QQuery 41    │    47.29ms │                 46.37ms │    no change │
│ QQuery 42    │    39.37ms │                 39.91ms │    no change │
└──────────────┴────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 58292.02ms │
│ Total Time (concat_batches_for_sort)   │ 58960.95ms │
│ Average Time (HEAD)                    │  1355.63ms │
│ Average Time (concat_batches_for_sort) │  1371.18ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          5 │
│ Queries with No Change                 │         38 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  376.52ms │                326.59ms │ +1.15x faster │
│ Q2           │  307.50ms │                283.70ms │ +1.08x faster │
│ Q3           │ 1217.16ms │               1216.87ms │     no change │
│ Q4           │  430.52ms │                476.70ms │  1.11x slower │
│ Q5           │  433.11ms │                499.16ms │  1.15x slower │
│ Q6           │  467.50ms │                521.98ms │  1.12x slower │
│ Q7           │  960.82ms │               1016.45ms │  1.06x slower │
│ Q8           │  793.48ms │                842.76ms │  1.06x slower │
│ Q9           │  833.53ms │                862.74ms │     no change │
│ Q10          │ 1275.22ms │               1279.19ms │     no change │
│ Q11          │  771.89ms │                762.28ms │     no change │
└──────────────┴───────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 7867.25ms │
│ Total Time (concat_batches_for_sort)   │ 8088.43ms │
│ Average Time (HEAD)                    │  715.20ms │
│ Average Time (concat_batches_for_sort) │  735.31ms │
│ Queries Faster                         │         2 │
│ Queries Slower                         │         5 │
│ Queries with No Change                 │         4 │
└────────────────────────────────────────┴───────────┘

@zhuqi-lucas
Contributor Author

I think the most efficient way would be to sort the indices to the arrays in one step followed by interleave, without either concat or sort followed by merge which would benefit the most from the built in sort algorithm and avoids copying the data.

I wonder if we can skip interleave / copying entirely?

Specifically, what if we sorted to indices, as you suggested, but then instead of calling interleave (which will copy the data) before sending it to merge_streams, maybe we could have some way for the merge cursors to also take the indices -- so we would only copy data once 🤔

Thanks @alamb , it looks promising.

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Apr 16, 2025

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.56ms │                  0.56ms │     no change │
│ QQuery 1     │    80.70ms │                 78.53ms │     no change │
│ QQuery 2     │   116.89ms │                114.69ms │     no change │
│ QQuery 3     │   130.38ms │                125.49ms │     no change │
│ QQuery 4     │   861.43ms │                756.05ms │ +1.14x faster │
│ QQuery 5     │   878.02ms │                869.13ms │     no change │
│ QQuery 6     │     0.67ms │                  0.64ms │     no change │
│ QQuery 7     │   100.33ms │                 93.51ms │ +1.07x faster │
│ QQuery 8     │   979.95ms │                956.50ms │     no change │
│ QQuery 9     │  1283.74ms │               1245.12ms │     no change │
│ QQuery 10    │   304.60ms │                306.39ms │     no change │
│ QQuery 11    │   342.54ms │                340.89ms │     no change │
│ QQuery 12    │   930.04ms │                933.37ms │     no change │
│ QQuery 13    │  1337.30ms │               1341.28ms │     no change │
│ QQuery 14    │   869.61ms │                883.04ms │     no change │
│ QQuery 15    │  1088.81ms │               1083.02ms │     no change │
│ QQuery 16    │  1841.74ms │               1788.14ms │     no change │
│ QQuery 17    │  1680.12ms │               1638.39ms │     no change │
│ QQuery 18    │  3128.65ms │               3139.26ms │     no change │
│ QQuery 19    │   127.46ms │                120.42ms │ +1.06x faster │
│ QQuery 20    │  1169.35ms │               1195.58ms │     no change │
│ QQuery 21    │  1472.42ms │               1457.18ms │     no change │
│ QQuery 22    │  2595.51ms │               2696.08ms │     no change │
│ QQuery 23    │  8475.08ms │               8735.96ms │     no change │
│ QQuery 24    │   510.59ms │                515.80ms │     no change │
│ QQuery 25    │   441.39ms │                439.44ms │     no change │
│ QQuery 26    │   569.36ms │                581.32ms │     no change │
│ QQuery 27    │  1850.31ms │               1844.63ms │     no change │
│ QQuery 28    │ 13503.59ms │              13185.12ms │     no change │
│ QQuery 29    │   587.04ms │                548.23ms │ +1.07x faster │
│ QQuery 30    │   872.06ms │                861.85ms │     no change │
│ QQuery 31    │   924.05ms │                992.86ms │  1.07x slower │
│ QQuery 32    │  2763.71ms │               2715.48ms │     no change │
│ QQuery 33    │  3455.95ms │               3450.90ms │     no change │
│ QQuery 34    │  3466.53ms │               3478.02ms │     no change │
│ QQuery 35    │  1342.93ms │               1336.22ms │     no change │
│ QQuery 36    │   179.89ms │                185.83ms │     no change │
│ QQuery 37    │   106.98ms │                102.42ms │     no change │
│ QQuery 38    │   169.57ms │                181.57ms │  1.07x slower │
│ QQuery 39    │   263.27ms │                261.33ms │     no change │
│ QQuery 40    │    90.39ms │                 86.94ms │     no change │
│ QQuery 41    │    85.04ms │                 83.29ms │     no change │
│ QQuery 42    │    78.33ms │                 78.86ms │     no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 61056.89ms │
│ Total Time (concat_batches_for_sort)   │ 60829.33ms │
│ Average Time (HEAD)                    │  1419.93ms │
│ Average Time (concat_batches_for_sort) │  1414.64ms │
│ Queries Faster                         │          4 │
│ Queries Slower                         │          2 │
│ Queries with No Change                 │         37 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.39ms │                  2.57ms │ 1.07x slower │
│ QQuery 1     │    36.11ms │                 36.43ms │    no change │
│ QQuery 2     │    91.64ms │                 90.30ms │    no change │
│ QQuery 3     │    97.91ms │                 96.78ms │    no change │
│ QQuery 4     │   732.04ms │                776.35ms │ 1.06x slower │
│ QQuery 5     │   833.38ms │                872.17ms │    no change │
│ QQuery 6     │     2.10ms │                  2.26ms │ 1.07x slower │
│ QQuery 7     │    41.00ms │                 39.72ms │    no change │
│ QQuery 8     │   942.92ms │                946.92ms │    no change │
│ QQuery 9     │  1216.74ms │               1198.20ms │    no change │
│ QQuery 10    │   268.97ms │                280.66ms │    no change │
│ QQuery 11    │   302.80ms │                311.61ms │    no change │
│ QQuery 12    │   908.31ms │                943.03ms │    no change │
│ QQuery 13    │  1238.75ms │               1400.87ms │ 1.13x slower │
│ QQuery 14    │   862.33ms │                880.29ms │    no change │
│ QQuery 15    │  1072.63ms │               1053.58ms │    no change │
│ QQuery 16    │  1755.68ms │               1765.66ms │    no change │
│ QQuery 17    │  1653.94ms │               1624.45ms │    no change │
│ QQuery 18    │  3109.98ms │               3131.55ms │    no change │
│ QQuery 19    │    86.54ms │                 85.77ms │    no change │
│ QQuery 20    │  1148.95ms │               1170.51ms │    no change │
│ QQuery 21    │  1333.80ms │               1392.35ms │    no change │
│ QQuery 22    │  2373.02ms │               2456.49ms │    no change │
│ QQuery 23    │  8411.11ms │               8608.23ms │    no change │
│ QQuery 24    │   469.47ms │                488.99ms │    no change │
│ QQuery 25    │   399.27ms │                411.12ms │    no change │
│ QQuery 26    │   537.55ms │                546.69ms │    no change │
│ QQuery 27    │  1685.85ms │               1739.53ms │    no change │
│ QQuery 28    │ 12957.91ms │              12866.18ms │    no change │
│ QQuery 29    │   546.32ms │                542.10ms │    no change │
│ QQuery 30    │   846.79ms │                852.23ms │    no change │
│ QQuery 31    │   887.81ms │                891.51ms │    no change │
│ QQuery 32    │  2723.16ms │               2728.31ms │    no change │
│ QQuery 33    │  3360.97ms │               3394.86ms │    no change │
│ QQuery 34    │  3409.45ms │               3368.33ms │    no change │
│ QQuery 35    │  1284.03ms │               1297.77ms │    no change │
│ QQuery 36    │   127.54ms │                133.40ms │    no change │
│ QQuery 37    │    56.79ms │                 57.49ms │    no change │
│ QQuery 38    │   129.27ms │                127.87ms │    no change │
│ QQuery 39    │   211.07ms │                209.72ms │    no change │
│ QQuery 40    │    49.05ms │                 51.83ms │ 1.06x slower │
│ QQuery 41    │    47.29ms │                 46.37ms │    no change │
│ QQuery 42    │    39.37ms │                 39.91ms │    no change │
└──────────────┴────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 58292.02ms │
│ Total Time (concat_batches_for_sort)   │ 58960.95ms │
│ Average Time (HEAD)                    │  1355.63ms │
│ Average Time (concat_batches_for_sort) │  1371.18ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          5 │
│ Queries with No Change                 │         38 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  376.52ms │                326.59ms │ +1.15x faster │
│ Q2           │  307.50ms │                283.70ms │ +1.08x faster │
│ Q3           │ 1217.16ms │               1216.87ms │     no change │
│ Q4           │  430.52ms │                476.70ms │  1.11x slower │
│ Q5           │  433.11ms │                499.16ms │  1.15x slower │
│ Q6           │  467.50ms │                521.98ms │  1.12x slower │
│ Q7           │  960.82ms │               1016.45ms │  1.06x slower │
│ Q8           │  793.48ms │                842.76ms │  1.06x slower │
│ Q9           │  833.53ms │                862.74ms │     no change │
│ Q10          │ 1275.22ms │               1279.19ms │     no change │
│ Q11          │  771.89ms │                762.28ms │     no change │
└──────────────┴───────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 7867.25ms │
│ Total Time (concat_batches_for_sort)   │ 8088.43ms │
│ Average Time (HEAD)                    │  715.20ms │
│ Average Time (concat_batches_for_sort) │  735.31ms │
│ Queries Faster                         │         2 │
│ Queries Slower                         │         5 │
│ Queries with No Change                 │         4 │
└────────────────────────────────────────┴───────────┘

No performance improvement in these benchmarks; I believe it's mostly because the benchmark batch size > sort_in_place size, so they don't gain from this PR. Sort-tpch 10 should gain performance, but it is not in this benchmark list.

@github-actions github-actions bot added documentation Improvements or additions to documentation common Related to common crate labels Apr 18, 2025
@Dandandan
Contributor

Thanks @zhuqi-lucas for sticking with this issue!

I think we're close to having a PR that can be merged, that improves sort performance, and that gives some good insights into where to spend time on follow-up work.

@zhuqi-lucas
Contributor Author

Thanks @zhuqi-lucas for sticking with this issue!

I think we're close to having a PR that can be merged, that improves sort performance, and that gives some good insights into where to spend time on follow-up work.

Thank you @Dandandan for the patient review and the great guidance and suggestions!

@github-actions github-actions bot added the core Core DataFusion crate label Apr 18, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 18, 2025
Contributor

@Dandandan Dandandan left a comment


To me this looks like a good improvement. @alamb can we rerun the benchmarks on this to see if we don't get regressions?

@alamb
Contributor

alamb commented May 2, 2025

To me this looks like a good improvement. @alamb can we rerun the benchmarks on this to see if we don't get regressions?

Will do

Just FYI, I would like to find some way that I am not the only one who can easily run benchmarks -- I am using my own home-grown scripts here: https://github.com/alamb/datafusion-benchmarking

But I suspect it would be straightforward for others to use them

@alamb
Contributor

alamb commented May 2, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (6a3b4e7) to 061ee09 diff
Benchmarks: sort_tpch
Results will be posted here when complete

@alamb
Contributor

alamb commented May 2, 2025

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  371.51ms │                315.35ms │ +1.18x faster │
│ Q2           │  307.31ms │                241.42ms │ +1.27x faster │
│ Q3           │ 1167.09ms │               1517.17ms │  1.30x slower │
│ Q4           │  440.75ms │                373.42ms │ +1.18x faster │
│ Q5           │  430.28ms │                424.70ms │     no change │
│ Q6           │  454.68ms │                462.34ms │     no change │
│ Q7           │  924.82ms │                941.63ms │     no change │
│ Q8           │  780.68ms │                792.67ms │     no change │
│ Q9           │  798.52ms │                811.82ms │     no change │
│ Q10          │ 1231.02ms │               1218.71ms │     no change │
│ Q11          │  742.85ms │                748.07ms │     no change │
└──────────────┴───────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 7649.50ms │
│ Total Time (concat_batches_for_sort)   │ 7847.31ms │
│ Average Time (HEAD)                    │  695.41ms │
│ Average Time (concat_batches_for_sort) │  713.39ms │
│ Queries Faster                         │         3 │
│ Queries Slower                         │         1 │
│ Queries with No Change                 │         7 │
└────────────────────────────────────────┴───────────┘

@alamb
Contributor

alamb commented May 2, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (6a3b4e7) to 061ee09 diff
Benchmarks: sort
Results will be posted here when complete

@Dandandan
Contributor

To me this looks like a good improvement. @alamb can we rerun the benchmarks on this to see if we don't get regressions?

Will do

Just FYI I would like to find some way that I am not the only one who can easily run benchmarks -- I am using my own home grown scripts here: https://github.com/alamb/datafusion-benchmarking

But I suspect it would be straightforward for others to use them

I guess #15583 would be optimal, but I can check out and run the scripts myself as well :)

@alamb
Contributor

alamb commented May 2, 2025

I guess #15583 would be optimal but I guess I can checkout and run the scripts myself as well :)

Yeah, the thing that is missing there is some stable set of machines to run the scripts on

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented May 2, 2025

Thank you @alamb @Dandandan, I can check out the scripts too.

🤖: Benchmark completed

Details

This is the first result on the small sort-tpch data set; it seems Q3 has a regression, and I am confused why it did not happen in the previous run on my local Mac.

Attached is the latest PR result:

--------------------
Benchmark sort_tpch1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │ 153.49ms │                141.31ms │ +1.09x faster │
│ Q2           │ 131.29ms │                113.30ms │ +1.16x faster │
│ Q3           │ 980.57ms │                986.14ms │     no change │
│ Q4           │ 252.25ms │                215.03ms │ +1.17x faster │
│ Q5           │ 464.81ms │                454.00ms │     no change │
│ Q6           │ 481.44ms │                467.38ms │     no change │
│ Q7           │ 810.73ms │                695.84ms │ +1.17x faster │
│ Q8           │ 498.10ms │                502.47ms │     no change │
│ Q9           │ 503.80ms │                506.87ms │     no change │
│ Q10          │ 789.02ms │                709.34ms │ +1.11x faster │
│ Q11          │ 417.39ms │                411.24ms │     no change │
└──────────────┴──────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                      │ 5482.89ms │
│ Total Time (concat_batches_for_sort)   │ 5202.89ms │
│ Average Time (main)                    │  498.44ms │
│ Average Time (concat_batches_for_sort) │  472.99ms │
│ Queries Faster                         │         5 │
│ Queries Slower                         │         0 │
│ Queries with No Change                 │         6 │
└────────────────────────────────────────┴───────────┘
--------------------
Benchmark sort_tpch10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  2243.52ms │               1602.46ms │ +1.40x faster │
│ Q2           │  1842.11ms │               1185.03ms │ +1.55x faster │
│ Q3           │ 12446.31ms │              12203.05ms │     no change │
│ Q4           │  4047.55ms │               2096.56ms │ +1.93x faster │
│ Q5           │  4364.46ms │               4446.55ms │     no change │
│ Q6           │  4561.01ms │               4592.95ms │     no change │
│ Q7           │  8158.01ms │               7827.94ms │     no change │
│ Q8           │  6077.40ms │               6379.98ms │     no change │
│ Q9           │  6347.21ms │               6225.14ms │     no change │
│ Q10          │ 11561.03ms │               9491.51ms │ +1.22x faster │
│ Q11          │  6069.42ms │               5676.88ms │ +1.07x faster │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                      │ 67718.04ms │
│ Total Time (concat_batches_for_sort)   │ 61728.05ms │
│ Average Time (main)                    │  6156.19ms │
│ Average Time (concat_batches_for_sort) │  5611.64ms │
│ Queries Faster                         │          5 │
│ Queries Slower                         │          0 │
│ Queries with No Change                 │          6 │
└────────────────────────────────────────┴────────────┘

And the performance improvement is obvious for the large data set (sort-tpch 10) in my previous run on my local Mac.


@zhuqi-lucas
Contributor Author

It seems the latest benchmark was not triggered.

@alamb
Contributor

alamb commented May 5, 2025

It seems the latest benchmark was not triggered.

I think it was due to

@alamb
Contributor

alamb commented May 5, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (6a3b4e7) to 061ee09 diff
Benchmarks: sort_tpch
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented May 5, 2025

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  376.58ms │                315.39ms │ +1.19x faster │
│ Q2           │  309.30ms │                247.77ms │ +1.25x faster │
│ Q3           │ 1153.49ms │               1551.83ms │  1.35x slower │
│ Q4           │  442.26ms │                378.16ms │ +1.17x faster │
│ Q5           │  422.86ms │                417.10ms │     no change │
│ Q6           │  466.48ms │                464.71ms │     no change │
│ Q7           │  937.61ms │                936.46ms │     no change │
│ Q8           │  785.12ms │                783.71ms │     no change │
│ Q9           │  813.77ms │                810.37ms │     no change │
│ Q10          │ 1237.84ms │               1215.64ms │     no change │
│ Q11          │  742.69ms │                745.74ms │     no change │
└──────────────┴───────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 7688.02ms │
│ Total Time (concat_batches_for_sort)   │ 7866.87ms │
│ Average Time (HEAD)                    │  698.91ms │
│ Average Time (concat_batches_for_sort) │  715.17ms │
│ Queries Faster                         │         3 │
│ Queries Slower                         │         1 │
│ Queries with No Change                 │         7 │
└────────────────────────────────────────┴───────────┘

@alamb
Copy link
Contributor

alamb commented May 5, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (6a3b4e7) to 061ee09 diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented May 5, 2025

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  1798.43ms │               1940.33ms │ 1.08x slower │
│ QQuery 1     │   765.15ms │                745.38ms │    no change │
│ QQuery 2     │  1469.33ms │               1455.93ms │    no change │
│ QQuery 3     │   714.01ms │                711.25ms │    no change │
│ QQuery 4     │  1440.15ms │               1494.27ms │    no change │
│ QQuery 5     │ 15077.27ms │              15375.55ms │    no change │
│ QQuery 6     │  2103.95ms │               2076.60ms │    no change │
│ QQuery 7     │  2639.24ms │               3057.50ms │ 1.16x slower │
└──────────────┴────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 26007.52ms │
│ Total Time (concat_batches_for_sort)   │ 26856.83ms │
│ Average Time (HEAD)                    │  3250.94ms │
│ Average Time (concat_batches_for_sort) │  3357.10ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          2 │
│ Queries with No Change                 │          6 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.30ms │                  2.32ms │     no change │
│ QQuery 1     │    38.02ms │                 37.35ms │     no change │
│ QQuery 2     │    94.96ms │                 93.59ms │     no change │
│ QQuery 3     │   100.08ms │                 98.32ms │     no change │
│ QQuery 4     │   789.12ms │                746.42ms │ +1.06x faster │
│ QQuery 5     │   877.73ms │                862.93ms │     no change │
│ QQuery 6     │     2.27ms │                  2.19ms │     no change │
│ QQuery 7     │    45.37ms │                 44.36ms │     no change │
│ QQuery 8     │   901.21ms │                926.34ms │     no change │
│ QQuery 9     │  1218.91ms │               1177.91ms │     no change │
│ QQuery 10    │   270.25ms │                279.97ms │     no change │
│ QQuery 11    │   308.43ms │                312.84ms │     no change │
│ QQuery 12    │   919.73ms │                920.03ms │     no change │
│ QQuery 13    │  1366.14ms │               1367.29ms │     no change │
│ QQuery 14    │   860.82ms │                846.70ms │     no change │
│ QQuery 15    │  1040.40ms │               1028.77ms │     no change │
│ QQuery 16    │  1731.65ms │               1740.68ms │     no change │
│ QQuery 17    │  1602.81ms │               1610.68ms │     no change │
│ QQuery 18    │  3103.53ms │               3092.98ms │     no change │
│ QQuery 19    │    86.08ms │                 85.34ms │     no change │
│ QQuery 20    │  1145.40ms │               1145.07ms │     no change │
│ QQuery 21    │  1358.37ms │               1305.73ms │     no change │
│ QQuery 22    │  2263.53ms │               2197.97ms │     no change │
│ QQuery 23    │  8571.27ms │               8547.03ms │     no change │
│ QQuery 24    │   481.50ms │                472.74ms │     no change │
│ QQuery 25    │   402.66ms │                386.01ms │     no change │
│ QQuery 26    │   546.18ms │                537.46ms │     no change │
│ QQuery 27    │  1733.45ms │               1701.90ms │     no change │
│ QQuery 28    │ 12841.03ms │              12641.37ms │     no change │
│ QQuery 29    │   527.30ms │                549.88ms │     no change │
│ QQuery 30    │   826.29ms │                816.81ms │     no change │
│ QQuery 31    │   877.56ms │                860.70ms │     no change │
│ QQuery 32    │  2718.95ms │               2698.24ms │     no change │
│ QQuery 33    │  3394.98ms │               3375.37ms │     no change │
│ QQuery 34    │  3416.80ms │               3398.81ms │     no change │
│ QQuery 35    │  1294.85ms │               1312.09ms │     no change │
│ QQuery 36    │   131.52ms │                129.00ms │     no change │
│ QQuery 37    │    57.34ms │                 57.84ms │     no change │
│ QQuery 38    │   131.68ms │                124.24ms │ +1.06x faster │
│ QQuery 39    │   201.63ms │                203.26ms │     no change │
│ QQuery 40    │    48.82ms │                 49.53ms │     no change │
│ QQuery 41    │    44.78ms │                 47.60ms │  1.06x slower │
│ QQuery 42    │    38.47ms │                 39.98ms │     no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 58414.16ms │
│ Total Time (concat_batches_for_sort)   │ 57875.61ms │
│ Average Time (HEAD)                    │  1358.47ms │
│ Average Time (concat_batches_for_sort) │  1345.94ms │
│ Queries Faster                         │          2 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │         40 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ concat_batches_for_sort ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 124.58ms │                122.90ms │ no change │
│ QQuery 2     │  23.69ms │                 24.03ms │ no change │
│ QQuery 3     │  35.10ms │                 35.20ms │ no change │
│ QQuery 4     │  21.21ms │                 20.89ms │ no change │
│ QQuery 5     │  55.52ms │                 55.41ms │ no change │
│ QQuery 6     │  12.28ms │                 12.28ms │ no change │
│ QQuery 7     │ 104.30ms │                101.37ms │ no change │
│ QQuery 8     │  26.56ms │                 26.78ms │ no change │
│ QQuery 9     │  63.58ms │                 62.79ms │ no change │
│ QQuery 10    │  57.36ms │                 57.53ms │ no change │
│ QQuery 11    │  13.08ms │                 13.02ms │ no change │
│ QQuery 12    │  45.60ms │                 45.08ms │ no change │
│ QQuery 13    │  29.78ms │                 30.57ms │ no change │
│ QQuery 14    │  10.09ms │                 10.24ms │ no change │
│ QQuery 15    │  25.56ms │                 25.16ms │ no change │
│ QQuery 16    │  22.85ms │                 23.40ms │ no change │
│ QQuery 17    │  97.53ms │                 96.30ms │ no change │
│ QQuery 18    │ 243.31ms │                242.98ms │ no change │
│ QQuery 19    │  28.34ms │                 28.25ms │ no change │
│ QQuery 20    │  39.12ms │                 40.89ms │ no change │
│ QQuery 21    │ 171.44ms │                176.11ms │ no change │
│ QQuery 22    │  17.63ms │                 17.85ms │ no change │
└──────────────┴──────────┴─────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 1268.49ms │
│ Total Time (concat_batches_for_sort)   │ 1269.04ms │
│ Average Time (HEAD)                    │   57.66ms │
│ Average Time (concat_batches_for_sort) │   57.68ms │
│ Queries Faster                         │         0 │
│ Queries Slower                         │         0 │
│ Queries with No Change                 │        22 │
└────────────────────────────────────────┴───────────┘

@zhuqi-lucas
Copy link
Contributor Author

Thank you @alamb, the Q3 regression can be reproduced. I will take a look to see whether we can optimize it further.
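
One purely illustrative direction (my assumption, not something this PR implements): gate the concat-based path behind a size threshold, so that very large inputs like Q3 keep the existing sort-then-merge path. A hypothetical sketch, where MAX_CONCAT_BYTES and should_concat_before_sort are made-up names and the real cut-off (if any) would need benchmarking:

use arrow::array::RecordBatch;

// Made-up threshold for this sketch only.
const MAX_CONCAT_BYTES: usize = 64 * 1024 * 1024;

/// Return true if the buffered batches are small enough that concatenating
/// them before sorting is likely to pay off.
fn should_concat_before_sort(batches: &[RecordBatch]) -> bool {
    let total: usize = batches.iter().map(|b| b.get_array_memory_size()).sum();
    total <= MAX_CONCAT_BYTES
}

fn main() {
    let batches: Vec<RecordBatch> = Vec::new();
    println!("concat first: {}", should_concat_before_sort(&batches));
}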

Labels
common (Related to common crate), core (Core DataFusion crate), documentation (Improvements or additions to documentation), sqllogictest (SQL Logic Tests (.slt))
Development

Successfully merging this pull request may close these issues.

Perf: Support automatically concat_batches for sort which will improve performance