Perf: Optimize in memory sort #15380

zhuqi-lucas · 2025-03-24T09:42:24Z

Which issue does this PR close?

Closes #15375

Rationale for this change

Perf: Support automatically concat_batches for sort which will improve performance

And it's mergable for the first version, later we can improve it according to comments:

#15375 (comment)

What changes are included in this PR?

Perf: Support automatically concat_batches for sort which will improve performance

Are these changes tested?

Yes

Are there any user-facing changes?

No

Dandandan · 2025-04-12T07:35:03Z

datafusion/physical-plan/src/sorts/sort.rs

+        let mut current_batches = Vec::new();
+        let mut current_size = 0;
+
+        for batch in std::mem::take(&mut self.in_mem_batches) {


I think it would be nice to use pop (while let Some(batch) = v.pop) here to remove the batch from the vec once sorted to reduce memory usage. Now the batch is AFAIK retained until after the loop.

I think it would be nice to use pop (while let Some(batch) = v.pop) here to remove the batch from the vec once sorted to reduce memory usage. Now the batch is AFAIK retained until after the loop.

Thank you @Dandandan for review and good suggestion, addressed your suggestion!

Dandandan · 2025-04-12T07:37:13Z

I think this is already looking quite nice. What do you need to finalize this @zhuqi-lucas

zhuqi-lucas · 2025-04-12T14:40:14Z

I think this is already looking quite nice. What do you need to finalize this @zhuqi-lucas

Thank you @Dandandan for review, i think we just need to add the benchmark result for this PR for next step.

And it's mergable for the first version, later we can improve it according to comments:

#15375 (comment)

zhuqi-lucas · 2025-04-12T15:10:44Z

@alamb Do we have the CI benchmark running now? If no, i need your help to run... Thanks a lot!

And also for the sort-tpch itself, i was running for the improvement result, but not for other benchmark running.

Previous sort-tpch:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  2241.04ms │               1816.69ms │ +1.23x faster │
│ Q2           │  1841.01ms │               1496.73ms │ +1.23x faster │
│ Q3           │ 12755.85ms │              12770.18ms │     no change │
│ Q4           │  4433.49ms │               3278.70ms │ +1.35x faster │
│ Q5           │  4414.15ms │               4409.04ms │     no change │
│ Q6           │  4543.09ms │               4597.32ms │     no change │
│ Q7           │  8012.85ms │               9026.30ms │  1.13x slower │
│ Q8           │  6572.37ms │               6049.51ms │ +1.09x faster │
│ Q9           │  6734.63ms │               6345.69ms │ +1.06x faster │
│ Q10          │  9896.16ms │               9564.17ms │     no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                      │ 61444.64ms │
│ Total Time (concat_batches_for_sort)   │ 59354.33ms │
│ Average Time (main)                    │  6144.46ms │
│ Average Time (concat_batches_for_sort) │  5935.43ms │
│ Queries Faster                         │          5 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │          4 │
└────────────────────────────────────────┴────────────┘

zhuqi-lucas · 2025-04-12T15:40:57Z

Latest result based current latest code:

--------------------
Benchmark sort_tpch1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │ 153.49ms │                137.57ms │ +1.12x faster │
│ Q2           │ 131.29ms │                120.93ms │ +1.09x faster │
│ Q3           │ 980.57ms │                982.22ms │     no change │
│ Q4           │ 252.25ms │                245.09ms │     no change │
│ Q5           │ 464.81ms │                449.27ms │     no change │
│ Q6           │ 481.44ms │                455.45ms │ +1.06x faster │
│ Q7           │ 810.73ms │                709.74ms │ +1.14x faster │
│ Q8           │ 498.10ms │                491.12ms │     no change │
│ Q9           │ 503.80ms │                510.20ms │     no change │
│ Q10          │ 789.02ms │                706.45ms │ +1.12x faster │
│ Q11          │ 417.39ms │                411.50ms │     no change │
└──────────────┴──────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                      │ 5482.89ms │
│ Total Time (concat_batches_for_sort)   │ 5219.53ms │
│ Average Time (main)                    │  498.44ms │
│ Average Time (concat_batches_for_sort) │  474.50ms │
│ Queries Faster                         │         5 │
│ Queries Slower                         │         0 │
│ Queries with No Change                 │         6 │
└────────────────────────────────────────┴───────────┘
--------------------
Benchmark sort_tpch10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  2243.52ms │               1825.64ms │ +1.23x faster │
│ Q2           │  1842.11ms │               1639.00ms │ +1.12x faster │
│ Q3           │ 12446.31ms │              11981.63ms │     no change │
│ Q4           │  4047.55ms │               3715.96ms │ +1.09x faster │
│ Q5           │  4364.46ms │               4503.51ms │     no change │
│ Q6           │  4561.01ms │               4688.31ms │     no change │
│ Q7           │  8158.01ms │               7915.54ms │     no change │
│ Q8           │  6077.40ms │               5524.08ms │ +1.10x faster │
│ Q9           │  6347.21ms │               5853.44ms │ +1.08x faster │
│ Q10          │ 11561.03ms │              14235.69ms │  1.23x slower │
│ Q11          │  6069.42ms │               5666.77ms │ +1.07x faster │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                      │ 67718.04ms │
│ Total Time (concat_batches_for_sort)   │ 67549.58ms │
│ Average Time (main)                    │  6156.19ms │
│ Average Time (concat_batches_for_sort) │  6140.87ms │
│ Queries Faster                         │          6 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │          4 │
└────────────────────────────────────────┴────────────┘

Dandandan · 2025-04-12T19:00:12Z

Thanks for sharing the results @zhuqi-lucas this is really interesting!

I think it mainly shows that we probably should try and use more efficient in memory sorting (e.g. an arrow kernel that sorts multiple batches) here rather than use SortPreservingMergeStream which is intended to be used on data streams.
The arrow kernel would avoid the regressions of concat.

alamb · 2025-04-14T20:16:54Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (6063bc5) to 0b01fdf diff
Benchmarks: clickbench_1 clickbench_partitioned sort_tpch1
Results will be posted here when complete

alamb · 2025-04-14T20:28:34Z

Thanks for sharing the results @zhuqi-lucas this is really interesting!

I think it mainly shows that we probably should try and use more efficient in memory sorting (e.g. an arrow kernel that sorts multiple batches) here rather than use SortPreservingMergeStream which is intended to be used on data streams. The arrow kernel would avoid the regressions of concat.

I think the SortPreservingMergeStream is about as efficient as we know how to make it

Maybe we can look into what overhead makes concat'ing better 🤔 Any per-stream overhead we can improve in SortPreservingMergeStream would likely flow directly to any query that does sorts

Dandandan · 2025-04-15T01:35:14Z

Hm that doesn't make much sense as

Thanks for sharing the results @zhuqi-lucas this is really interesting!
I think it mainly shows that we probably should try and use more efficient in memory sorting (e.g. an arrow kernel that sorts multiple batches) here rather than use SortPreservingMergeStream which is intended to be used on data streams. The arrow kernel would avoid the regressions of concat.

I think the SortPreservingMergeStream is about as efficient as we know how to make it

Maybe we can look into what overhead makes concat'ing better 🤔 Any per-stream overhead we can improve in SortPreservingMergeStream would likely flow directly to any query that does sorts

Hm 🤔 ... but that will still take a separate step of sorting the input bathes, which next to sorting involves a full extra copy using take (slower than concat) followed by merging the batches. Also the built-in sort on the entire output is likely to be much faster than doing a merge on the outputs.

I think the most efficient way would be to sort the indices to the arrays in one step followed by interleave, without either concat or sort followed by merge which would benefit the most from the built in sort algorithm and avoids copying the data.

zhuqi-lucas · 2025-04-15T05:49:16Z

It seems when we merge the sorted batch, we already using the interleave to merge the sorted indices, here is the code:

    /// Drains the in_progress row indexes, and builds a new RecordBatch from them
    ///
    /// Will then drop any batches for which all rows have been yielded to the output
    ///
    /// Returns `None` if no pending rows
    pub fn build_record_batch(&mut self) -> Result<Option<RecordBatch>> {
        if self.is_empty() {
            return Ok(None);
        }

        let columns = (0..self.schema.fields.len())
            .map(|column_idx| {
                let arrays: Vec<_> = self
                    .batches
                    .iter()
                    .map(|(_, batch)| batch.column(column_idx).as_ref())
                    .collect();
                Ok(interleave(&arrays, &self.indices)?)
            })
            .collect::<Result<Vec<_>>>()?;

        self.indices.clear();

But this PR, we also concat some batches into one batch, do you mean we can also use the indices from each batch to one batch just like the merge phase?

zhuqi-lucas · 2025-04-15T05:51:05Z

🤖 ./gh_compare_branch.sh Benchmark Script Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux Comparing concat_batches_for_sort (6063bc5) to 0b01fdf diff Benchmarks: clickbench_1 clickbench_partitioned sort_tpch1 Results will be posted here when complete

Thanks @alamb for this triggering, it seems stuck.

Dandandan · 2025-04-15T07:09:10Z

But this PR, we also concat some batches into one batch, do you mean we can also use the indices from each batch to one batch just like the merge phase?

I mean theoretically we don't have to merge anything as all the batches are in memory.

The merging is useful for sorting streams of data, but I think it is expected the process of sorting batches first followed by a custom merge implementation is slower than a single sorting pass based on rust std unstable sort (which is optimized for doing a minimal amount of comparisons quickly).

Dandandan · 2025-04-15T07:12:51Z

A more complete rationale / explanation of the same idea was written here by @2010YOUY01 #15375 (comment)

An alternative to try to avoid copies is: first sort all elements' indices (2-level index consists of (batch_idx, row_idx)), and get a permutation array.
Use the interleave kernel to construct the final result https://docs.rs/arrow/latest/arrow/compute/kernels/interleave/fn.interleave.html

zhuqi-lucas · 2025-04-15T07:25:21Z

But this PR, we also concat some batches into one batch, do you mean we can also use the indices from each batch to one batch just like the merge phase?

I mean theoretically we don't have to merge anything as all the batches are in memory.

The merging is useful for sorting streams of data, but I think it is expected the process of sorting batches first followed by a custom merge implementation is slower than a single sorting pass based on rust std unstable sort (which is optimized for doing a minimal amount of comparisons quickly).

I think i got it now, thank you @Dandandan, it means we already have those in memory batch, we just need to first sort all elements' indices (2-level index consists of (batch_idx, row_idx)), we don't need to construct the StreamingMergeBuilder for in memory sort, we just need to sort it as a single sorting pass.

Let me try this way, and compare the performance!

zhuqi-lucas · 2025-04-15T08:16:21Z

Very interesting, firstly i now try merge all memory batch, and single sort, some query become crazy fast and some crazy slow, i think because:

We sort in memory without merge, it's similar to sort single partition without partition parallel ？
Previous some merge will have partition parallel?

So next step, we can try to make the in memory sort with parallel?

--------------------
Benchmark sort_tpch10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  2243.52ms │               1416.52ms │ +1.58x faster │
│ Q2           │  1842.11ms │               1096.12ms │ +1.68x faster │
│ Q3           │ 12446.31ms │              12535.45ms │     no change │
│ Q4           │  4047.55ms │               1964.73ms │ +2.06x faster │
│ Q5           │  4364.46ms │               5955.70ms │  1.36x slower │
│ Q6           │  4561.01ms │               6275.39ms │  1.38x slower │
│ Q7           │  8158.01ms │              19145.68ms │  2.35x slower │
│ Q8           │  6077.40ms │               5146.80ms │ +1.18x faster │
│ Q9           │  6347.21ms │               5544.48ms │ +1.14x faster │
│ Q10          │ 11561.03ms │              23572.68ms │  2.04x slower │
│ Q11          │  6069.42ms │               4810.88ms │ +1.26x faster │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                      │ 67718.04ms │
│ Total Time (concat_batches_for_sort)   │ 87464.44ms │
│ Average Time (main)                    │  6156.19ms │
│ Average Time (concat_batches_for_sort) │  7951.31ms │
│ Queries Faster                         │          6 │
│ Queries Slower                         │          4 │
│ Queries with No Change                 │          1 │
└────────────────────────────────────────┴────────────┘

Patch tried:

diff --git a/datafusion/physical-plan/src/sorts/sort.rs b/datafusion/physical-plan/src/sorts/sort.rs
index 7fd1c2b16..ec3cd89f3 100644
--- a/datafusion/physical-plan/src/sorts/sort.rs
+++ b/datafusion/physical-plan/src/sorts/sort.rs
@@ -671,85 +671,14 @@ impl ExternalSorter {
             return self.sort_batch_stream(batch, metrics, reservation);
         }

-        // If less than sort_in_place_threshold_bytes, concatenate and sort in place
-        if self.reservation.size() < self.sort_in_place_threshold_bytes {
-            // Concatenate memory batches together and sort
-            let batch = concat_batches(&self.schema, &self.in_mem_batches)?;
-            self.in_mem_batches.clear();
-            self.reservation
-                .try_resize(get_reserved_byte_for_record_batch(&batch))?;
-            let reservation = self.reservation.take();
-            return self.sort_batch_stream(batch, metrics, reservation);
-        }
-
-        let mut merged_batches = Vec::new();
-        let mut current_batches = Vec::new();
-        let mut current_size = 0;
-
-        // Drain in_mem_batches using pop() to release memory earlier.
-        // This avoids holding onto the entire vector during iteration.
-        // Note:
-        // Now we use `sort_in_place_threshold_bytes` to determine, in future we can make it more dynamic.
-        while let Some(batch) = self.in_mem_batches.pop() {
-            let batch_size = get_reserved_byte_for_record_batch(&batch);
-
-            // If adding this batch would exceed the memory threshold, merge current_batches.
-            if current_size + batch_size > self.sort_in_place_threshold_bytes
-                && !current_batches.is_empty()
-            {
-                // Merge accumulated batches into one.
-                let merged = concat_batches(&self.schema, &current_batches)?;
-                current_batches.clear();
-
-                // Update memory reservation.
-                self.reservation.try_shrink(current_size)?;
-                let merged_size = get_reserved_byte_for_record_batch(&merged);
-                self.reservation.try_grow(merged_size)?;
-
-                merged_batches.push(merged);
-                current_size = 0;
-            }
-
-            current_batches.push(batch);
-            current_size += batch_size;
-        }
-
-        // Merge any remaining batches after the loop.
-        if !current_batches.is_empty() {
-            let merged = concat_batches(&self.schema, &current_batches)?;
-            self.reservation.try_shrink(current_size)?;
-            let merged_size = get_reserved_byte_for_record_batch(&merged);
-            self.reservation.try_grow(merged_size)?;
-            merged_batches.push(merged);
-        }
-
-        // Create sorted streams directly without using spawn_buffered.
-        // This allows for sorting to happen inline and enables earlier batch drop.
-        let streams = merged_batches
-            .into_iter()
-            .map(|batch| {
-                let metrics = self.metrics.baseline.intermediate();
-                let reservation = self
-                    .reservation
-                    .split(get_reserved_byte_for_record_batch(&batch));
-
-                // Sort the batch inline.
-                let input = self.sort_batch_stream(batch, metrics, reservation)?;
-                Ok(input)
-            })
-            .collect::<Result<_>>()?;
-
-        let expressions: LexOrdering = self.expr.iter().cloned().collect();
-
-        StreamingMergeBuilder::new()
-            .with_streams(streams)
-            .with_schema(Arc::clone(&self.schema))
-            .with_expressions(expressions.as_ref())
-            .with_metrics(metrics)
-            .with_batch_size(self.batch_size)
-            .with_fetch(None)
-            .with_reservation(self.merge_reservation.new_empty())
-            .build()
+        // Because batches are all in memory, we can sort them in place
+        // Concatenate memory batches together and sort
+        let batch = concat_batches(&self.schema, &self.in_mem_batches)?;
+        self.in_mem_batches.clear();
+        self.reservation
+            .try_resize(get_reserved_byte_for_record_batch(&batch))?;
+        let reservation = self.reservation.take();
+        self.sort_batch_stream(batch, metrics, reservation)
     }

Dandandan · 2025-04-15T08:47:00Z

I think concat followed by sort is slower in some cases because

Concat involves copying the entire batch (rather than only the keys to be sorted)
sort_batch_stream Can be slower as lexsort_to_indices is in cases with many columns slower than the Row Format

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

The core improvements that I think are important:

Minimizing copying of the input batches to one (only once for the output)
Sorting once on the input batches rather than sort followed by merge
A good heuristic on when to switch from lexsort_to_indices-like sorting to RowConverter + sorting.

zhuqi-lucas · 2025-04-15T09:07:42Z

I think concat followed by sort is slower in some cases because

Concat involves copying the entire batch (rather than only the keys to be sorted)

sort_batch_stream Can be slower as lexsort_to_indices is in cases with many columns slower than the Row Format

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

The core improvements that I think are important:

Minimizing copying of the input batches to one (only once for the output)

Sorting once on the input batches rather than sort followed by merge

A good heuristic on when to switch from lexsort_to_indices-like sorting to RowConverter + sorting.

Good explain.

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

I see, the execute already using partition:

fn execute(
        &self,
        partition: usize,
        context: Arc<TaskContext>,
    ) -> Result<SendableRecordBatchStream> {

2010YOUY01 · 2025-04-15T09:46:57Z

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

In this case, the final merging might become the bottleneck, because SPM does not have internal parallelism either, during the final merge only 1 core is busy.
I think 2 stages of sort-preserving merge is still needed, becuase ExternalSorter is blocking, but SPM is not, this setup can keep all the cores busy after partial sort is finished.
We just have to ensure they don't have a very large merge degree to become slow (with the optimizations like this PR)

Dandandan · 2025-04-15T10:06:39Z

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

In this case, the final merging might become the bottleneck, because SPM does not have internal parallelism either, during the final merge only 1 core is busy. I think 2 stages of sort-preserving merge is still needed, becuase ExternalSorter is blocking, but SPM is not, this setup can keep all the cores busy after partial sort is finished. We just have to ensure they don't have a very large merge degree to become slow (with the optimizations like this PR)

Yes, to be clear I don't argue to remove SortPreservingMergeExec or sorting in two fases altogether or something similar, just was reacting to the idea of adding more parallelism in in_mem_sort_stream which probably won't help much.

SortPreserveMergeExec <= Does k-way merging based on input streams, with minimal memory overhead, maximizing input parallelism
     SortExec partitions[1,2,3,4,5,6,7,8,9,10] <= Performs in memory *sorting* if possible, for each input partition in parallel, only resorting to spill/merge when does not fit into memory

zhuqi-lucas · 2025-04-15T10:26:08Z

Thank you @2010YOUY01 @Dandandan , it's very interesting, i am thinking:

Since the all batch size sum is fixed, we can first calculate the compute size of each partition, call it partition_cal_size.
Then we setting a min_sort_size and max_sort_size, so we will determine the final_merged_batch_size:

final_merged_batch_size = 
  if (partition_cal_size < min_sort_size) => min_sort_size
  else if (partition_cal_size > max_sort_size) => max_sort_size
  else => partition_cal_size

This prevents creating too many small batches (which can fragment merge tasks) or overly large batches.
It looks like the first version of heuristic

But how can we calculate the min_sort_size and max_sort_size?

I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help).

In this case, the final merging might become the bottleneck, because SPM does not have internal parallelism either, during the final merge only 1 core is busy. I think 2 stages of sort-preserving merge is still needed, becuase ExternalSorter is blocking, but SPM is not, this setup can keep all the cores busy after partial sort is finished. We just have to ensure they don't have a very large merge degree to become slow (with the optimizations like this PR)

Yes, to be clear I don't argue to remove SortPreservingMergeExec or sorting in two fases altogether or something similar, just was reacting to the idea of adding more parallelism in in_mem_sort_stream which probably won't help much.
SortPreserveMergeExec <= Does k-way merging based on input streams, with minimal memory overhead, maximizing input parallelism
     SortExec partitions[1,2,3,4,5,6,7,8,9,10] <= Performs in memory *sorting* if possible, for each input partition in parallel, only resorting to spill/merge when does not fit into memory 

alamb · 2025-04-15T16:53:55Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (6063bc5) to 0b01fdf diff
Benchmarks: clickbench_1 clickbench_partitioned sort_tpch
Results will be posted here when complete

alamb · 2025-04-15T16:54:34Z

🤖 ./gh_compare_branch.sh Benchmark Script Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux Comparing concat_batches_for_sort (6063bc5) to 0b01fdf diff Benchmarks: clickbench_1 clickbench_partitioned sort_tpch1 Results will be posted here when complete

Thanks @alamb for this triggering, it seems stuck.

yeah, sorry I had a bug retriggered

alamb · 2025-04-15T16:56:53Z

I think the most efficient way would be to sort the indices to the arrays in one step followed by interleave, without either concat or sort followed by merge which would benefit the most from the built in sort algorithm and avoids copying the data.

I wonder if we can skip interleave / copying entirely?

Specifically, what if we sorted to indices, as you suggested, but then instead of calling interleave (which will copy the data) before sending it to merge_streams) maybe we could have some way to have the merge cursors also take the indicies -- so we could only copy data once 🤔

alamb · 2025-04-15T17:15:44Z

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.56ms │                  0.56ms │     no change │
│ QQuery 1     │    80.70ms │                 78.53ms │     no change │
│ QQuery 2     │   116.89ms │                114.69ms │     no change │
│ QQuery 3     │   130.38ms │                125.49ms │     no change │
│ QQuery 4     │   861.43ms │                756.05ms │ +1.14x faster │
│ QQuery 5     │   878.02ms │                869.13ms │     no change │
│ QQuery 6     │     0.67ms │                  0.64ms │     no change │
│ QQuery 7     │   100.33ms │                 93.51ms │ +1.07x faster │
│ QQuery 8     │   979.95ms │                956.50ms │     no change │
│ QQuery 9     │  1283.74ms │               1245.12ms │     no change │
│ QQuery 10    │   304.60ms │                306.39ms │     no change │
│ QQuery 11    │   342.54ms │                340.89ms │     no change │
│ QQuery 12    │   930.04ms │                933.37ms │     no change │
│ QQuery 13    │  1337.30ms │               1341.28ms │     no change │
│ QQuery 14    │   869.61ms │                883.04ms │     no change │
│ QQuery 15    │  1088.81ms │               1083.02ms │     no change │
│ QQuery 16    │  1841.74ms │               1788.14ms │     no change │
│ QQuery 17    │  1680.12ms │               1638.39ms │     no change │
│ QQuery 18    │  3128.65ms │               3139.26ms │     no change │
│ QQuery 19    │   127.46ms │                120.42ms │ +1.06x faster │
│ QQuery 20    │  1169.35ms │               1195.58ms │     no change │
│ QQuery 21    │  1472.42ms │               1457.18ms │     no change │
│ QQuery 22    │  2595.51ms │               2696.08ms │     no change │
│ QQuery 23    │  8475.08ms │               8735.96ms │     no change │
│ QQuery 24    │   510.59ms │                515.80ms │     no change │
│ QQuery 25    │   441.39ms │                439.44ms │     no change │
│ QQuery 26    │   569.36ms │                581.32ms │     no change │
│ QQuery 27    │  1850.31ms │               1844.63ms │     no change │
│ QQuery 28    │ 13503.59ms │              13185.12ms │     no change │
│ QQuery 29    │   587.04ms │                548.23ms │ +1.07x faster │
│ QQuery 30    │   872.06ms │                861.85ms │     no change │
│ QQuery 31    │   924.05ms │                992.86ms │  1.07x slower │
│ QQuery 32    │  2763.71ms │               2715.48ms │     no change │
│ QQuery 33    │  3455.95ms │               3450.90ms │     no change │
│ QQuery 34    │  3466.53ms │               3478.02ms │     no change │
│ QQuery 35    │  1342.93ms │               1336.22ms │     no change │
│ QQuery 36    │   179.89ms │                185.83ms │     no change │
│ QQuery 37    │   106.98ms │                102.42ms │     no change │
│ QQuery 38    │   169.57ms │                181.57ms │  1.07x slower │
│ QQuery 39    │   263.27ms │                261.33ms │     no change │
│ QQuery 40    │    90.39ms │                 86.94ms │     no change │
│ QQuery 41    │    85.04ms │                 83.29ms │     no change │
│ QQuery 42    │    78.33ms │                 78.86ms │     no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 61056.89ms │
│ Total Time (concat_batches_for_sort)   │ 60829.33ms │
│ Average Time (HEAD)                    │  1419.93ms │
│ Average Time (concat_batches_for_sort) │  1414.64ms │
│ Queries Faster                         │          4 │
│ Queries Slower                         │          2 │
│ Queries with No Change                 │         37 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.39ms │                  2.57ms │ 1.07x slower │
│ QQuery 1     │    36.11ms │                 36.43ms │    no change │
│ QQuery 2     │    91.64ms │                 90.30ms │    no change │
│ QQuery 3     │    97.91ms │                 96.78ms │    no change │
│ QQuery 4     │   732.04ms │                776.35ms │ 1.06x slower │
│ QQuery 5     │   833.38ms │                872.17ms │    no change │
│ QQuery 6     │     2.10ms │                  2.26ms │ 1.07x slower │
│ QQuery 7     │    41.00ms │                 39.72ms │    no change │
│ QQuery 8     │   942.92ms │                946.92ms │    no change │
│ QQuery 9     │  1216.74ms │               1198.20ms │    no change │
│ QQuery 10    │   268.97ms │                280.66ms │    no change │
│ QQuery 11    │   302.80ms │                311.61ms │    no change │
│ QQuery 12    │   908.31ms │                943.03ms │    no change │
│ QQuery 13    │  1238.75ms │               1400.87ms │ 1.13x slower │
│ QQuery 14    │   862.33ms │                880.29ms │    no change │
│ QQuery 15    │  1072.63ms │               1053.58ms │    no change │
│ QQuery 16    │  1755.68ms │               1765.66ms │    no change │
│ QQuery 17    │  1653.94ms │               1624.45ms │    no change │
│ QQuery 18    │  3109.98ms │               3131.55ms │    no change │
│ QQuery 19    │    86.54ms │                 85.77ms │    no change │
│ QQuery 20    │  1148.95ms │               1170.51ms │    no change │
│ QQuery 21    │  1333.80ms │               1392.35ms │    no change │
│ QQuery 22    │  2373.02ms │               2456.49ms │    no change │
│ QQuery 23    │  8411.11ms │               8608.23ms │    no change │
│ QQuery 24    │   469.47ms │                488.99ms │    no change │
│ QQuery 25    │   399.27ms │                411.12ms │    no change │
│ QQuery 26    │   537.55ms │                546.69ms │    no change │
│ QQuery 27    │  1685.85ms │               1739.53ms │    no change │
│ QQuery 28    │ 12957.91ms │              12866.18ms │    no change │
│ QQuery 29    │   546.32ms │                542.10ms │    no change │
│ QQuery 30    │   846.79ms │                852.23ms │    no change │
│ QQuery 31    │   887.81ms │                891.51ms │    no change │
│ QQuery 32    │  2723.16ms │               2728.31ms │    no change │
│ QQuery 33    │  3360.97ms │               3394.86ms │    no change │
│ QQuery 34    │  3409.45ms │               3368.33ms │    no change │
│ QQuery 35    │  1284.03ms │               1297.77ms │    no change │
│ QQuery 36    │   127.54ms │                133.40ms │    no change │
│ QQuery 37    │    56.79ms │                 57.49ms │    no change │
│ QQuery 38    │   129.27ms │                127.87ms │    no change │
│ QQuery 39    │   211.07ms │                209.72ms │    no change │
│ QQuery 40    │    49.05ms │                 51.83ms │ 1.06x slower │
│ QQuery 41    │    47.29ms │                 46.37ms │    no change │
│ QQuery 42    │    39.37ms │                 39.91ms │    no change │
└──────────────┴────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 58292.02ms │
│ Total Time (concat_batches_for_sort)   │ 58960.95ms │
│ Average Time (HEAD)                    │  1355.63ms │
│ Average Time (concat_batches_for_sort) │  1371.18ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          5 │
│ Queries with No Change                 │         38 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  376.52ms │                326.59ms │ +1.15x faster │
│ Q2           │  307.50ms │                283.70ms │ +1.08x faster │
│ Q3           │ 1217.16ms │               1216.87ms │     no change │
│ Q4           │  430.52ms │                476.70ms │  1.11x slower │
│ Q5           │  433.11ms │                499.16ms │  1.15x slower │
│ Q6           │  467.50ms │                521.98ms │  1.12x slower │
│ Q7           │  960.82ms │               1016.45ms │  1.06x slower │
│ Q8           │  793.48ms │                842.76ms │  1.06x slower │
│ Q9           │  833.53ms │                862.74ms │     no change │
│ Q10          │ 1275.22ms │               1279.19ms │     no change │
│ Q11          │  771.89ms │                762.28ms │     no change │
└──────────────┴───────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 7867.25ms │
│ Total Time (concat_batches_for_sort)   │ 8088.43ms │
│ Average Time (HEAD)                    │  715.20ms │
│ Average Time (concat_batches_for_sort) │  735.31ms │
│ Queries Faster                         │         2 │
│ Queries Slower                         │         5 │
│ Queries with No Change                 │         4 │
└────────────────────────────────────────┴───────────┘

zhuqi-lucas · 2025-04-16T03:11:54Z

I think the most efficient way would be to sort the indices to the arrays in one step followed by interleave, without either concat or sort followed by merge which would benefit the most from the built in sort algorithm and avoids copying the data.

I wonder if we can skip interleave / copying entirely?

Specifically, what if we sorted to indices, as you suggested, but then instead of calling interleave (which will copy the data) before sending it to merge_streams) maybe we could have some way to have the merge cursors also take the indicies -- so we could only copy data once 🤔

Thanks @alamb , it looks promising.

zhuqi-lucas · 2025-04-16T03:15:12Z

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.56ms │                  0.56ms │     no change │
│ QQuery 1     │    80.70ms │                 78.53ms │     no change │
│ QQuery 2     │   116.89ms │                114.69ms │     no change │
│ QQuery 3     │   130.38ms │                125.49ms │     no change │
│ QQuery 4     │   861.43ms │                756.05ms │ +1.14x faster │
│ QQuery 5     │   878.02ms │                869.13ms │     no change │
│ QQuery 6     │     0.67ms │                  0.64ms │     no change │
│ QQuery 7     │   100.33ms │                 93.51ms │ +1.07x faster │
│ QQuery 8     │   979.95ms │                956.50ms │     no change │
│ QQuery 9     │  1283.74ms │               1245.12ms │     no change │
│ QQuery 10    │   304.60ms │                306.39ms │     no change │
│ QQuery 11    │   342.54ms │                340.89ms │     no change │
│ QQuery 12    │   930.04ms │                933.37ms │     no change │
│ QQuery 13    │  1337.30ms │               1341.28ms │     no change │
│ QQuery 14    │   869.61ms │                883.04ms │     no change │
│ QQuery 15    │  1088.81ms │               1083.02ms │     no change │
│ QQuery 16    │  1841.74ms │               1788.14ms │     no change │
│ QQuery 17    │  1680.12ms │               1638.39ms │     no change │
│ QQuery 18    │  3128.65ms │               3139.26ms │     no change │
│ QQuery 19    │   127.46ms │                120.42ms │ +1.06x faster │
│ QQuery 20    │  1169.35ms │               1195.58ms │     no change │
│ QQuery 21    │  1472.42ms │               1457.18ms │     no change │
│ QQuery 22    │  2595.51ms │               2696.08ms │     no change │
│ QQuery 23    │  8475.08ms │               8735.96ms │     no change │
│ QQuery 24    │   510.59ms │                515.80ms │     no change │
│ QQuery 25    │   441.39ms │                439.44ms │     no change │
│ QQuery 26    │   569.36ms │                581.32ms │     no change │
│ QQuery 27    │  1850.31ms │               1844.63ms │     no change │
│ QQuery 28    │ 13503.59ms │              13185.12ms │     no change │
│ QQuery 29    │   587.04ms │                548.23ms │ +1.07x faster │
│ QQuery 30    │   872.06ms │                861.85ms │     no change │
│ QQuery 31    │   924.05ms │                992.86ms │  1.07x slower │
│ QQuery 32    │  2763.71ms │               2715.48ms │     no change │
│ QQuery 33    │  3455.95ms │               3450.90ms │     no change │
│ QQuery 34    │  3466.53ms │               3478.02ms │     no change │
│ QQuery 35    │  1342.93ms │               1336.22ms │     no change │
│ QQuery 36    │   179.89ms │                185.83ms │     no change │
│ QQuery 37    │   106.98ms │                102.42ms │     no change │
│ QQuery 38    │   169.57ms │                181.57ms │  1.07x slower │
│ QQuery 39    │   263.27ms │                261.33ms │     no change │
│ QQuery 40    │    90.39ms │                 86.94ms │     no change │
│ QQuery 41    │    85.04ms │                 83.29ms │     no change │
│ QQuery 42    │    78.33ms │                 78.86ms │     no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 61056.89ms │
│ Total Time (concat_batches_for_sort)   │ 60829.33ms │
│ Average Time (HEAD)                    │  1419.93ms │
│ Average Time (concat_batches_for_sort) │  1414.64ms │
│ Queries Faster                         │          4 │
│ Queries Slower                         │          2 │
│ Queries with No Change                 │         37 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.39ms │                  2.57ms │ 1.07x slower │
│ QQuery 1     │    36.11ms │                 36.43ms │    no change │
│ QQuery 2     │    91.64ms │                 90.30ms │    no change │
│ QQuery 3     │    97.91ms │                 96.78ms │    no change │
│ QQuery 4     │   732.04ms │                776.35ms │ 1.06x slower │
│ QQuery 5     │   833.38ms │                872.17ms │    no change │
│ QQuery 6     │     2.10ms │                  2.26ms │ 1.07x slower │
│ QQuery 7     │    41.00ms │                 39.72ms │    no change │
│ QQuery 8     │   942.92ms │                946.92ms │    no change │
│ QQuery 9     │  1216.74ms │               1198.20ms │    no change │
│ QQuery 10    │   268.97ms │                280.66ms │    no change │
│ QQuery 11    │   302.80ms │                311.61ms │    no change │
│ QQuery 12    │   908.31ms │                943.03ms │    no change │
│ QQuery 13    │  1238.75ms │               1400.87ms │ 1.13x slower │
│ QQuery 14    │   862.33ms │                880.29ms │    no change │
│ QQuery 15    │  1072.63ms │               1053.58ms │    no change │
│ QQuery 16    │  1755.68ms │               1765.66ms │    no change │
│ QQuery 17    │  1653.94ms │               1624.45ms │    no change │
│ QQuery 18    │  3109.98ms │               3131.55ms │    no change │
│ QQuery 19    │    86.54ms │                 85.77ms │    no change │
│ QQuery 20    │  1148.95ms │               1170.51ms │    no change │
│ QQuery 21    │  1333.80ms │               1392.35ms │    no change │
│ QQuery 22    │  2373.02ms │               2456.49ms │    no change │
│ QQuery 23    │  8411.11ms │               8608.23ms │    no change │
│ QQuery 24    │   469.47ms │                488.99ms │    no change │
│ QQuery 25    │   399.27ms │                411.12ms │    no change │
│ QQuery 26    │   537.55ms │                546.69ms │    no change │
│ QQuery 27    │  1685.85ms │               1739.53ms │    no change │
│ QQuery 28    │ 12957.91ms │              12866.18ms │    no change │
│ QQuery 29    │   546.32ms │                542.10ms │    no change │
│ QQuery 30    │   846.79ms │                852.23ms │    no change │
│ QQuery 31    │   887.81ms │                891.51ms │    no change │
│ QQuery 32    │  2723.16ms │               2728.31ms │    no change │
│ QQuery 33    │  3360.97ms │               3394.86ms │    no change │
│ QQuery 34    │  3409.45ms │               3368.33ms │    no change │
│ QQuery 35    │  1284.03ms │               1297.77ms │    no change │
│ QQuery 36    │   127.54ms │                133.40ms │    no change │
│ QQuery 37    │    56.79ms │                 57.49ms │    no change │
│ QQuery 38    │   129.27ms │                127.87ms │    no change │
│ QQuery 39    │   211.07ms │                209.72ms │    no change │
│ QQuery 40    │    49.05ms │                 51.83ms │ 1.06x slower │
│ QQuery 41    │    47.29ms │                 46.37ms │    no change │
│ QQuery 42    │    39.37ms │                 39.91ms │    no change │
└──────────────┴────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 58292.02ms │
│ Total Time (concat_batches_for_sort)   │ 58960.95ms │
│ Average Time (HEAD)                    │  1355.63ms │
│ Average Time (concat_batches_for_sort) │  1371.18ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          5 │
│ Queries with No Change                 │         38 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  376.52ms │                326.59ms │ +1.15x faster │
│ Q2           │  307.50ms │                283.70ms │ +1.08x faster │
│ Q3           │ 1217.16ms │               1216.87ms │     no change │
│ Q4           │  430.52ms │                476.70ms │  1.11x slower │
│ Q5           │  433.11ms │                499.16ms │  1.15x slower │
│ Q6           │  467.50ms │                521.98ms │  1.12x slower │
│ Q7           │  960.82ms │               1016.45ms │  1.06x slower │
│ Q8           │  793.48ms │                842.76ms │  1.06x slower │
│ Q9           │  833.53ms │                862.74ms │     no change │
│ Q10          │ 1275.22ms │               1279.19ms │     no change │
│ Q11          │  771.89ms │                762.28ms │     no change │
└──────────────┴───────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 7867.25ms │
│ Total Time (concat_batches_for_sort)   │ 8088.43ms │
│ Average Time (HEAD)                    │  715.20ms │
│ Average Time (concat_batches_for_sort) │  735.31ms │
│ Queries Faster                         │         2 │
│ Queries Slower                         │         5 │
│ Queries with No Change                 │         4 │
└────────────────────────────────────────┴───────────┘

No performance improvement for benchmark, i believe mostly the benchmark batch size > sort_in_place size, it will not gain from this PR. Sort-tpch 10 should gain performance not in this benchmark list.

Dandandan · 2025-05-20T15:14:19Z

Looking at the earlier result

│ Q1 │ 333.47ms │ 375.58ms │ 1.13x slower │

This is the query

        SELECT l_linenumber, l_partkey
        FROM lineitem
        ORDER BY l_linenumber

So it might actually be the case that the changed code is a bit slower for this case. In the query there is only little data to copy (so concat batches -> concat sort keys doesn't help that much) while maybe the overhead of using interleave_batches is higher 🤔

alamb · 2025-05-20T15:49:16Z

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  1865.36ms │               2095.69ms │ 1.12x slower │
│ QQuery 1     │   701.32ms │                710.76ms │    no change │
│ QQuery 2     │  1453.42ms │               1442.06ms │    no change │
│ QQuery 3     │   700.26ms │                711.29ms │    no change │
│ QQuery 4     │  1437.90ms │               1482.90ms │    no change │
│ QQuery 5     │ 15296.81ms │              15446.47ms │    no change │
│ QQuery 6     │  2051.26ms │               2069.51ms │    no change │
│ QQuery 7     │  2136.98ms │               2120.09ms │    no change │
│ QQuery 8     │  2079.21ms │               2103.34ms │    no change │
└──────────────┴────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 27722.52ms │
│ Total Time (concat_batches_for_sort)   │ 28182.11ms │
│ Average Time (HEAD)                    │  3080.28ms │
│ Average Time (concat_batches_for_sort) │  3131.35ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │          8 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.27ms │                  2.34ms │     no change │
│ QQuery 1     │    33.89ms │                 35.06ms │     no change │
│ QQuery 2     │    80.84ms │                 81.71ms │     no change │
│ QQuery 3     │    96.89ms │                 98.27ms │     no change │
│ QQuery 4     │   587.18ms │                592.23ms │     no change │
│ QQuery 5     │   871.05ms │                846.05ms │     no change │
│ QQuery 6     │     2.29ms │                  2.29ms │     no change │
│ QQuery 7     │    40.92ms │                 39.49ms │     no change │
│ QQuery 8     │   916.50ms │                921.84ms │     no change │
│ QQuery 9     │  1235.05ms │               1205.14ms │     no change │
│ QQuery 10    │   270.45ms │                266.99ms │     no change │
│ QQuery 11    │   308.22ms │                308.32ms │     no change │
│ QQuery 12    │   906.16ms │                911.28ms │     no change │
│ QQuery 13    │  1190.03ms │               1364.47ms │  1.15x slower │
│ QQuery 14    │   861.45ms │                837.70ms │     no change │
│ QQuery 15    │   840.69ms │                824.48ms │     no change │
│ QQuery 16    │  1738.36ms │               1717.85ms │     no change │
│ QQuery 17    │  1605.80ms │               1603.35ms │     no change │
│ QQuery 18    │  3088.42ms │               3074.28ms │     no change │
│ QQuery 19    │    83.36ms │                 88.70ms │  1.06x slower │
│ QQuery 20    │  1133.60ms │               1134.33ms │     no change │
│ QQuery 21    │  1320.19ms │               1312.32ms │     no change │
│ QQuery 22    │  2224.06ms │               2200.69ms │     no change │
│ QQuery 23    │  8080.77ms │               8128.07ms │     no change │
│ QQuery 24    │   477.97ms │                478.34ms │     no change │
│ QQuery 25    │   390.14ms │                391.28ms │     no change │
│ QQuery 26    │   552.37ms │                539.99ms │     no change │
│ QQuery 27    │  1591.16ms │               1564.84ms │     no change │
│ QQuery 28    │ 13161.07ms │              12339.15ms │ +1.07x faster │
│ QQuery 29    │   547.45ms │                538.58ms │     no change │
│ QQuery 30    │   813.88ms │                818.90ms │     no change │
│ QQuery 31    │   856.38ms │                871.60ms │     no change │
│ QQuery 32    │  2624.66ms │               2669.71ms │     no change │
│ QQuery 33    │  3320.68ms │               3363.16ms │     no change │
│ QQuery 34    │  3382.83ms │               3434.36ms │     no change │
│ QQuery 35    │  1312.99ms │               1276.70ms │     no change │
│ QQuery 36    │   123.92ms │                126.90ms │     no change │
│ QQuery 37    │    58.15ms │                 55.63ms │     no change │
│ QQuery 38    │   123.37ms │                123.18ms │     no change │
│ QQuery 39    │   201.66ms │                196.83ms │     no change │
│ QQuery 40    │    49.85ms │                 48.28ms │     no change │
│ QQuery 41    │    46.54ms │                 44.78ms │     no change │
│ QQuery 42    │    40.40ms │                 38.56ms │     no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 57193.89ms │
│ Total Time (concat_batches_for_sort)   │ 56518.02ms │
│ Average Time (HEAD)                    │  1330.09ms │
│ Average Time (concat_batches_for_sort) │  1314.37ms │
│ Queries Faster                         │          1 │
│ Queries Slower                         │          2 │
│ Queries with No Change                 │         40 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 124.69ms │                123.12ms │     no change │
│ QQuery 2     │  24.66ms │                 23.74ms │     no change │
│ QQuery 3     │  35.50ms │                 36.17ms │     no change │
│ QQuery 4     │  20.90ms │                 21.45ms │     no change │
│ QQuery 5     │  56.48ms │                 55.99ms │     no change │
│ QQuery 6     │  12.21ms │                 12.46ms │     no change │
│ QQuery 7     │ 110.00ms │                103.26ms │ +1.07x faster │
│ QQuery 8     │  25.99ms │                 27.11ms │     no change │
│ QQuery 9     │  63.52ms │                 63.21ms │     no change │
│ QQuery 10    │  60.10ms │                 58.58ms │     no change │
│ QQuery 11    │  13.17ms │                 13.12ms │     no change │
│ QQuery 12    │  47.29ms │                 43.23ms │ +1.09x faster │
│ QQuery 13    │  30.98ms │                 30.00ms │     no change │
│ QQuery 14    │  10.32ms │                 10.57ms │     no change │
│ QQuery 15    │  26.03ms │                 25.47ms │     no change │
│ QQuery 16    │  23.95ms │                 23.14ms │     no change │
│ QQuery 17    │ 103.56ms │                103.39ms │     no change │
│ QQuery 18    │ 241.08ms │                240.55ms │     no change │
│ QQuery 19    │  27.15ms │                 28.33ms │     no change │
│ QQuery 20    │  39.67ms │                 40.16ms │     no change │
│ QQuery 21    │ 173.56ms │                174.10ms │     no change │
│ QQuery 22    │  18.03ms │                 18.25ms │     no change │
└──────────────┴──────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 1288.83ms │
│ Total Time (concat_batches_for_sort)   │ 1275.42ms │
│ Average Time (HEAD)                    │   58.58ms │
│ Average Time (concat_batches_for_sort) │   57.97ms │
│ Queries Faster                         │         2 │
│ Queries Slower                         │         0 │
│ Queries with No Change                 │        20 │
└────────────────────────────────────────┴───────────┘

alamb · 2025-05-20T15:49:18Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (485499f) to 3e30f77 diff
Benchmarks: sort_tpch
Results will be posted here when complete

alamb · 2025-05-20T15:50:40Z

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  346.79ms │                372.60ms │  1.07x slower │
│ Q2           │  309.33ms │                273.93ms │ +1.13x faster │
│ Q3           │ 1213.14ms │               1202.90ms │     no change │
│ Q4           │  431.58ms │                432.34ms │     no change │
│ Q5           │  427.71ms │                444.55ms │     no change │
│ Q6           │  478.91ms │                456.74ms │     no change │
│ Q7           │  942.76ms │                921.41ms │     no change │
│ Q8           │  785.64ms │                792.43ms │     no change │
│ Q9           │  826.21ms │                831.16ms │     no change │
│ Q10          │ 1235.28ms │               1216.72ms │     no change │
│ Q11          │  737.28ms │                775.00ms │  1.05x slower │
└──────────────┴───────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 7734.64ms │
│ Total Time (concat_batches_for_sort)   │ 7719.79ms │
│ Average Time (HEAD)                    │  703.15ms │
│ Average Time (concat_batches_for_sort) │  701.80ms │
│ Queries Faster                         │         1 │
│ Queries Slower                         │         2 │
│ Queries with No Change                 │         8 │
└────────────────────────────────────────┴───────────┘

zhuqi-lucas · 2025-05-21T07:03:33Z

Looking at the earlier result

│ Q1 │ 333.47ms │ 375.58ms │ 1.13x slower │

This is the query
        SELECT l_linenumber, l_partkey
        FROM lineitem
        ORDER BY l_linenumber
So it might actually be the case that the changed code is a bit slower for this case. In the query there is only little data to copy (so concat batches -> concat sort keys doesn't help that much) while maybe the overhead of using interleave_batches is higher 🤔

Yeah @Dandandan , even the latest result, it still has regression about Q1.

zhuqi-lucas · 2025-06-01T08:07:34Z

So it might actually be the case that the changed code is a bit slower for this case. In the query there is only little data to copy (so concat batches -> concat sort keys doesn't help that much) while maybe the overhead of using interleave_batches is higher 🤔

I am wandering if we can have a adaptive solution, make it's always optimized for the memory sort.

…sort

alamb · 2025-06-19T15:18:49Z

I'll rerun the benchmarks and see if we can see something different.

alamb · 2025-06-19T15:19:25Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (ae88893) to e6df27c diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-06-19T15:51:38Z

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ concat_batches_for_sort ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  1934.65 ms │              1946.36 ms │ no change │
│ QQuery 1     │   707.66 ms │               709.44 ms │ no change │
│ QQuery 2     │  1373.01 ms │              1353.06 ms │ no change │
│ QQuery 3     │   669.98 ms │               679.36 ms │ no change │
│ QQuery 4     │  1362.20 ms │              1409.75 ms │ no change │
│ QQuery 5     │ 15239.36 ms │             15216.19 ms │ no change │
│ QQuery 6     │  2039.94 ms │              2072.63 ms │ no change │
│ QQuery 7     │  1864.59 ms │              1904.60 ms │ no change │
│ QQuery 8     │   821.75 ms │               821.43 ms │ no change │
└──────────────┴─────────────┴─────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 26013.13ms │
│ Total Time (concat_batches_for_sort)   │ 26112.81ms │
│ Average Time (HEAD)                    │  2890.35ms │
│ Average Time (concat_batches_for_sort) │  2901.42ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          0 │
│ Queries with No Change                 │          9 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    15.09 ms │                14.97 ms │     no change │
│ QQuery 1     │    32.99 ms │                33.77 ms │     no change │
│ QQuery 2     │    82.39 ms │                81.42 ms │     no change │
│ QQuery 3     │    98.22 ms │                99.00 ms │     no change │
│ QQuery 4     │   664.93 ms │               588.03 ms │ +1.13x faster │
│ QQuery 5     │   889.54 ms │               852.78 ms │     no change │
│ QQuery 6     │    23.86 ms │                24.59 ms │     no change │
│ QQuery 7     │    38.37 ms │                36.93 ms │     no change │
│ QQuery 8     │   882.82 ms │               851.79 ms │     no change │
│ QQuery 9     │  1214.55 ms │              1149.55 ms │ +1.06x faster │
│ QQuery 10    │   254.96 ms │               250.89 ms │     no change │
│ QQuery 11    │   284.67 ms │               290.41 ms │     no change │
│ QQuery 12    │   862.64 ms │               880.92 ms │     no change │
│ QQuery 13    │  1271.56 ms │              1214.31 ms │     no change │
│ QQuery 14    │   807.47 ms │               800.81 ms │     no change │
│ QQuery 15    │   770.12 ms │               760.78 ms │     no change │
│ QQuery 16    │  1619.27 ms │              1616.24 ms │     no change │
│ QQuery 17    │  1615.25 ms │              1619.69 ms │     no change │
│ QQuery 18    │  2876.77 ms │              2875.57 ms │     no change │
│ QQuery 19    │    83.93 ms │                86.33 ms │     no change │
│ QQuery 20    │  1150.17 ms │              1170.58 ms │     no change │
│ QQuery 21    │  1299.19 ms │              1296.39 ms │     no change │
│ QQuery 22    │  2143.30 ms │              2174.35 ms │     no change │
│ QQuery 23    │  7403.55 ms │              7529.36 ms │     no change │
│ QQuery 24    │   441.66 ms │               445.08 ms │     no change │
│ QQuery 25    │   303.00 ms │               310.22 ms │     no change │
│ QQuery 26    │   435.06 ms │               448.35 ms │     no change │
│ QQuery 27    │  1523.92 ms │              1550.28 ms │     no change │
│ QQuery 28    │ 11722.53 ms │             11864.55 ms │     no change │
│ QQuery 29    │   532.72 ms │               534.76 ms │     no change │
│ QQuery 30    │   768.35 ms │               773.97 ms │     no change │
│ QQuery 31    │   791.70 ms │               816.20 ms │     no change │
│ QQuery 32    │  2484.58 ms │              2422.51 ms │     no change │
│ QQuery 33    │  3173.47 ms │              3178.03 ms │     no change │
│ QQuery 34    │  3162.30 ms │              3187.03 ms │     no change │
│ QQuery 35    │  1246.51 ms │              1258.48 ms │     no change │
│ QQuery 36    │   124.92 ms │               124.37 ms │     no change │
│ QQuery 37    │    58.76 ms │                57.60 ms │     no change │
│ QQuery 38    │   127.48 ms │               126.65 ms │     no change │
│ QQuery 39    │   199.04 ms │               204.35 ms │     no change │
│ QQuery 40    │    47.27 ms │                47.81 ms │     no change │
│ QQuery 41    │    44.42 ms │                45.85 ms │     no change │
│ QQuery 42    │    38.85 ms │                40.42 ms │     no change │
└──────────────┴─────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 53612.17ms │
│ Total Time (concat_batches_for_sort)   │ 53736.00ms │
│ Average Time (HEAD)                    │  1246.79ms │
│ Average Time (concat_batches_for_sort) │  1249.67ms │
│ Queries Faster                         │          2 │
│ Queries Slower                         │          0 │
│ Queries with No Change                 │         41 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ concat_batches_for_sort ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │  99.28 ms │                99.05 ms │ no change │
│ QQuery 2     │  20.41 ms │                21.25 ms │ no change │
│ QQuery 3     │  32.10 ms │                33.15 ms │ no change │
│ QQuery 4     │  18.75 ms │                18.72 ms │ no change │
│ QQuery 5     │  50.28 ms │                49.10 ms │ no change │
│ QQuery 6     │  11.94 ms │                11.94 ms │ no change │
│ QQuery 7     │  87.10 ms │                88.28 ms │ no change │
│ QQuery 8     │  24.79 ms │                24.81 ms │ no change │
│ QQuery 9     │  54.25 ms │                55.35 ms │ no change │
│ QQuery 10    │  43.31 ms │                42.81 ms │ no change │
│ QQuery 11    │  11.41 ms │                11.57 ms │ no change │
│ QQuery 12    │  34.70 ms │                34.68 ms │ no change │
│ QQuery 13    │  26.62 ms │                26.04 ms │ no change │
│ QQuery 14    │   9.98 ms │                10.02 ms │ no change │
│ QQuery 15    │  19.30 ms │                19.68 ms │ no change │
│ QQuery 16    │  18.73 ms │                19.20 ms │ no change │
│ QQuery 17    │  94.29 ms │                96.46 ms │ no change │
│ QQuery 18    │ 201.99 ms │               199.62 ms │ no change │
│ QQuery 19    │  25.18 ms │                25.23 ms │ no change │
│ QQuery 20    │  31.56 ms │                31.48 ms │ no change │
│ QQuery 21    │ 147.57 ms │               149.92 ms │ no change │
│ QQuery 22    │  15.53 ms │                15.07 ms │ no change │
└──────────────┴───────────┴─────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 1079.08ms │
│ Total Time (concat_batches_for_sort)   │ 1083.41ms │
│ Average Time (HEAD)                    │   49.05ms │
│ Average Time (concat_batches_for_sort) │   49.25ms │
│ Queries Faster                         │         0 │
│ Queries Slower                         │         0 │
│ Queries with No Change                 │        22 │
│ Queries with Failure                   │         0 │
└────────────────────────────────────────┴───────────┘

alamb · 2025-06-19T15:51:40Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (ae88893) to e6df27c diff
Benchmarks: sort_tpch
Results will be posted here when complete

alamb · 2025-06-19T15:52:59Z

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Q1           │  334.77 ms │               334.65 ms │ no change │
│ Q2           │  306.13 ms │               292.37 ms │ no change │
│ Q3           │ 1202.95 ms │              1171.68 ms │ no change │
│ Q4           │  418.47 ms │               421.97 ms │ no change │
│ Q5           │  418.14 ms │               417.66 ms │ no change │
│ Q6           │  454.33 ms │               441.03 ms │ no change │
│ Q7           │  900.60 ms │               898.49 ms │ no change │
│ Q8           │  774.41 ms │               769.11 ms │ no change │
│ Q9           │  820.96 ms │               811.73 ms │ no change │
│ Q10          │ 1204.89 ms │              1200.38 ms │ no change │
│ Q11          │  731.77 ms │               742.05 ms │ no change │
└──────────────┴────────────┴─────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 7567.43ms │
│ Total Time (concat_batches_for_sort)   │ 7501.12ms │
│ Average Time (HEAD)                    │  687.95ms │
│ Average Time (concat_batches_for_sort) │  681.92ms │
│ Queries Faster                         │         0 │
│ Queries Slower                         │         0 │
│ Queries with No Change                 │        11 │
│ Queries with Failure                   │         0 │
└────────────────────────────────────────┴───────────┘

zhuqi-lucas · 2025-06-20T03:06:54Z

Thank you @alamb , the result no regression now, but also no obvious performance improvement. Let me try to increase the memory internal sort size to see the result.

…sort

zhuqi-lucas · 2025-06-20T14:40:22Z

I do the experiment change for the mem sort size in latest PR, may be we can trigger a new clickbench benchmark to see the result, @alamb thanks!

alamb · 2025-06-20T20:17:29Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (b6bc0bb) to a0eaf51 diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-06-20T20:57:44Z

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ concat_batches_for_sort ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  1924.72 ms │              1964.31 ms │    no change │
│ QQuery 1     │   712.36 ms │               777.36 ms │ 1.09x slower │
│ QQuery 2     │  1363.01 ms │              1427.46 ms │    no change │
│ QQuery 3     │   685.89 ms │               656.36 ms │    no change │
│ QQuery 4     │  1363.91 ms │              1385.15 ms │    no change │
│ QQuery 5     │ 15140.65 ms │             15403.65 ms │    no change │
│ QQuery 6     │  2050.60 ms │              2042.47 ms │    no change │
│ QQuery 7     │  2022.51 ms │              2095.08 ms │    no change │
│ QQuery 8     │   827.72 ms │               821.06 ms │    no change │
└──────────────┴─────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 26091.37ms │
│ Total Time (concat_batches_for_sort)   │ 26572.90ms │
│ Average Time (HEAD)                    │  2899.04ms │
│ Average Time (concat_batches_for_sort) │  2952.54ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │          8 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.21 ms │                 2.26 ms │     no change │
│ QQuery 1     │    33.41 ms │                34.14 ms │     no change │
│ QQuery 2     │    82.74 ms │                81.51 ms │     no change │
│ QQuery 3     │    97.16 ms │                98.32 ms │     no change │
│ QQuery 4     │   578.27 ms │               584.29 ms │     no change │
│ QQuery 5     │   852.71 ms │               865.58 ms │     no change │
│ QQuery 6     │     2.21 ms │                 2.29 ms │     no change │
│ QQuery 7     │    38.01 ms │                39.52 ms │     no change │
│ QQuery 8     │   848.81 ms │               857.45 ms │     no change │
│ QQuery 9     │  1140.39 ms │              1157.19 ms │     no change │
│ QQuery 10    │   256.05 ms │               257.82 ms │     no change │
│ QQuery 11    │   282.42 ms │               293.77 ms │     no change │
│ QQuery 12    │   868.71 ms │               857.33 ms │     no change │
│ QQuery 13    │  1243.76 ms │              1269.42 ms │     no change │
│ QQuery 14    │   796.55 ms │               813.71 ms │     no change │
│ QQuery 15    │   768.80 ms │               796.38 ms │     no change │
│ QQuery 16    │  1609.37 ms │              1632.41 ms │     no change │
│ QQuery 17    │  1592.55 ms │              1607.83 ms │     no change │
│ QQuery 18    │  2900.11 ms │              2933.52 ms │     no change │
│ QQuery 19    │    84.56 ms │                88.38 ms │     no change │
│ QQuery 20    │  1129.96 ms │              1117.45 ms │     no change │
│ QQuery 21    │  1277.09 ms │              1277.51 ms │     no change │
│ QQuery 22    │  2135.68 ms │              2110.81 ms │     no change │
│ QQuery 23    │  7361.27 ms │              7494.75 ms │     no change │
│ QQuery 24    │   455.36 ms │               451.82 ms │     no change │
│ QQuery 25    │   379.68 ms │               398.65 ms │     no change │
│ QQuery 26    │   520.18 ms │               530.36 ms │     no change │
│ QQuery 27    │  1564.37 ms │              1566.59 ms │     no change │
│ QQuery 28    │ 11931.35 ms │             12629.15 ms │  1.06x slower │
│ QQuery 29    │   530.79 ms │               529.35 ms │     no change │
│ QQuery 30    │   780.90 ms │               776.44 ms │     no change │
│ QQuery 31    │   817.88 ms │               806.49 ms │     no change │
│ QQuery 32    │  2491.17 ms │              2543.27 ms │     no change │
│ QQuery 33    │  3170.59 ms │              3172.60 ms │     no change │
│ QQuery 34    │  3222.82 ms │              3239.30 ms │     no change │
│ QQuery 35    │  1268.17 ms │              1215.30 ms │     no change │
│ QQuery 36    │   120.85 ms │               120.47 ms │     no change │
│ QQuery 37    │    52.27 ms │                51.32 ms │     no change │
│ QQuery 38    │   118.87 ms │               118.27 ms │     no change │
│ QQuery 39    │   192.09 ms │               193.24 ms │     no change │
│ QQuery 40    │    44.23 ms │                41.89 ms │ +1.06x faster │
│ QQuery 41    │    38.07 ms │                37.55 ms │     no change │
│ QQuery 42    │    31.40 ms │                31.71 ms │     no change │
└──────────────┴─────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 53713.88ms │
│ Total Time (concat_batches_for_sort)   │ 54727.38ms │
│ Average Time (HEAD)                    │  1249.16ms │
│ Average Time (concat_batches_for_sort) │  1272.73ms │
│ Queries Faster                         │          1 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │         41 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ concat_batches_for_sort ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  99.07 ms │                99.06 ms │     no change │
│ QQuery 2     │  20.67 ms │                21.26 ms │     no change │
│ QQuery 3     │  31.53 ms │                30.81 ms │     no change │
│ QQuery 4     │  18.52 ms │                18.14 ms │     no change │
│ QQuery 5     │  48.23 ms │                49.40 ms │     no change │
│ QQuery 6     │  11.76 ms │                11.74 ms │     no change │
│ QQuery 7     │  87.51 ms │                86.28 ms │     no change │
│ QQuery 8     │  24.48 ms │                24.73 ms │     no change │
│ QQuery 9     │  54.53 ms │                52.95 ms │     no change │
│ QQuery 10    │  43.94 ms │                43.14 ms │     no change │
│ QQuery 11    │  11.26 ms │                11.26 ms │     no change │
│ QQuery 12    │  35.25 ms │                35.45 ms │     no change │
│ QQuery 13    │  26.33 ms │                26.33 ms │     no change │
│ QQuery 14    │   9.50 ms │                 9.76 ms │     no change │
│ QQuery 15    │  19.05 ms │                18.84 ms │     no change │
│ QQuery 16    │  19.03 ms │                19.07 ms │     no change │
│ QQuery 17    │  94.40 ms │                93.55 ms │     no change │
│ QQuery 18    │ 200.43 ms │               189.52 ms │ +1.06x faster │
│ QQuery 19    │  24.77 ms │                24.79 ms │     no change │
│ QQuery 20    │  31.64 ms │                31.85 ms │     no change │
│ QQuery 21    │ 146.14 ms │               146.71 ms │     no change │
│ QQuery 22    │  14.85 ms │                15.06 ms │     no change │
└──────────────┴───────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 1072.89ms │
│ Total Time (concat_batches_for_sort)   │ 1059.71ms │
│ Average Time (HEAD)                    │   48.77ms │
│ Average Time (concat_batches_for_sort) │   48.17ms │
│ Queries Faster                         │         1 │
│ Queries Slower                         │         0 │
│ Queries with No Change                 │        21 │
│ Queries with Failure                   │         0 │
└────────────────────────────────────────┴───────────┘

alamb · 2025-06-20T20:57:46Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing concat_batches_for_sort (b6bc0bb) to a0eaf51 diff
Benchmarks: sort_tpch
Results will be posted here when complete

alamb · 2025-06-20T20:59:06Z

🤖: Benchmark completed

Details

Comparing HEAD and concat_batches_for_sort
--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ concat_batches_for_sort ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Q1           │  356.60 ms │               354.72 ms │ no change │
│ Q2           │  304.89 ms │               298.48 ms │ no change │
│ Q3           │ 1186.23 ms │              1183.68 ms │ no change │
│ Q4           │  419.87 ms │               407.76 ms │ no change │
│ Q5           │  427.22 ms │               413.47 ms │ no change │
│ Q6           │  451.48 ms │               459.65 ms │ no change │
│ Q7           │  905.93 ms │               902.46 ms │ no change │
│ Q8           │  778.09 ms │               775.25 ms │ no change │
│ Q9           │  816.55 ms │               811.49 ms │ no change │
│ Q10          │ 1210.37 ms │              1201.37 ms │ no change │
│ Q11          │  792.29 ms │               793.98 ms │ no change │
└──────────────┴────────────┴─────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 7649.52ms │
│ Total Time (concat_batches_for_sort)   │ 7602.31ms │
│ Average Time (HEAD)                    │  695.41ms │
│ Average Time (concat_batches_for_sort) │  691.12ms │
│ Queries Faster                         │         0 │
│ Queries Slower                         │         0 │
│ Queries with No Change                 │        11 │
│ Queries with Failure                   │         0 │
└────────────────────────────────────────┴───────────┘

zhuqi-lucas · 2025-06-21T03:05:25Z

Still no performance gain, will investigate later.

Dandandan · 2025-06-21T08:06:52Z

Thank you @zhuqi-lucas for experimenting on this. Maybe it's a good idea to do some profiling to see the hots spots?

For example, this is the profile I get from the sort-tpch benchmark.

You can see here most of the work is concentrated in SortPreservingMerge, rather than the sorts, so perhaps in this case making the SortExec faster won't help a ton to improve the total performance. Maybe we can use target_partitions=1 to concentrate more work on SortExec so we can have a look.
I made a change here that's Speedup interleave_views (4-7x faster) arrow-rs#7695 that will probably help a quite a bit with the performance of SortPreserveMergeExec andSortExec, maybe we can look at where the next hotspots after this change, I think probably a lot in converting to Row, doing comparison on byte slices and doing allocations. But also some parts seem related that we don't handle views as efficiently as possible.
One example I see is for example we do call .gc() which currently does a not-fast implementation.

Another one, GenericByteViewArray::compare_unchecked:

zhuqi-lucas · 2025-06-21T09:34:16Z

Thank you @zhuqi-lucas for experimenting on this. Maybe it's a good idea to do some profiling to see the hots spots?

For example, this is the profile I get from the sort-tpch benchmark.
* You can see here most of the work is concentrated in SortPreservingMerge, rather than the sorts, so perhaps in this case making the `SortExec` faster won't help a ton to improve the total performance. Maybe we can use `target_partitions=1` to concentrate more work on `SortExec` so we can have a look. * I made a change here that's [Speedup `interleave_views` (4-7x faster) arrow-rs#7695](https://github.com/apache/arrow-rs/pull/7695) that will probably help a quite a bit with the performance of `SortPreserveMergeExec` and`SortExec`, maybe we can look at where the next hotspots after this change, I think probably a lot in converting to `Row`, doing comparison on byte slices and doing allocations. But also some parts seem related that we don't handle views as efficiently as possible. * One example I see is for example we do call `.gc()` which currently does a not-fast implementation. * Another one, GenericByteViewArray::compare_unchecked:

Thank you @Dandandan , this is really helpful and valuable for further investigation, i will do some investigation based on these directions.

And i also take one of the above topic:
apache/arrow-rs#7621 (comment)

May be i can start from it to see if we can benefit from it, thanks again!

zhuqi-lucas marked this pull request as draft March 24, 2025 09:42

Dandandan reviewed Apr 12, 2025

View reviewed changes

zhuqi-lucas marked this pull request as ready for review April 12, 2025 14:41

zhuqi-lucas changed the title ~~PoC (Perf: Support automatically concat_batches for sort which will improve performance)~~ Perf: Support automatically concat_batches for sort which will improve performance Apr 12, 2025

This comment was marked as outdated.

Sign in to view

2010YOUY01 mentioned this pull request May 21, 2025

[Track]: Enable Arrow Row format by default in sort execution #16131

Open

zhuqi-lucas added 2 commits June 19, 2025 21:40

Merge remote-tracking branch 'upstream/main' into concat_batches_for_…

3472352

…sort

fmt

ae88893

zhuqi-lucas added 3 commits June 20, 2025 22:24

try to increase in mem sort size

001284a

doc change

ddc9773

Merge remote-tracking branch 'upstream/main' into concat_batches_for_…

b6bc0bb

…sort

github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Jun 20, 2025

Merge branch 'main' into concat_batches_for_sort

00f5bec

Perf: Optimize in memory sort #15380

Are you sure you want to change the base?

Perf: Optimize in memory sort #15380

Uh oh!

Conversation

zhuqi-lucas commented Mar 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Dandandan Apr 12, 2025

Choose a reason for hiding this comment

Uh oh!

zhuqi-lucas Apr 12, 2025

Choose a reason for hiding this comment

Uh oh!

Dandandan commented Apr 12, 2025

Uh oh!

zhuqi-lucas commented Apr 12, 2025

Uh oh!

zhuqi-lucas commented Apr 12, 2025

Uh oh!

zhuqi-lucas commented Apr 12, 2025

Uh oh!

Dandandan commented Apr 12, 2025

Uh oh!

alamb commented Apr 14, 2025

Uh oh!

alamb commented Apr 14, 2025

Uh oh!

This comment was marked as outdated.

Dandandan commented Apr 15, 2025

Uh oh!

zhuqi-lucas commented Apr 15, 2025

Uh oh!

zhuqi-lucas commented Apr 15, 2025

Uh oh!

Dandandan commented Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Dandandan commented Apr 15, 2025

Uh oh!

zhuqi-lucas commented Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zhuqi-lucas commented Apr 15, 2025

Uh oh!

Dandandan commented Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zhuqi-lucas commented Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

2010YOUY01 commented Apr 15, 2025

Uh oh!

Dandandan commented Apr 15, 2025

Uh oh!

zhuqi-lucas commented Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb commented Apr 15, 2025

Uh oh!

alamb commented Apr 15, 2025

Uh oh!

alamb commented Apr 15, 2025

Uh oh!

alamb commented Apr 15, 2025

Uh oh!

zhuqi-lucas commented Apr 16, 2025

Uh oh!

zhuqi-lucas commented Apr 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Dandandan commented May 20, 2025

Uh oh!

alamb commented May 20, 2025

Uh oh!

zhuqi-lucas commented Mar 24, 2025 •

edited

Loading

Dandandan commented Apr 15, 2025 •

edited

Loading

zhuqi-lucas commented Apr 15, 2025 •

edited

Loading

Dandandan commented Apr 15, 2025 •

edited

Loading

zhuqi-lucas commented Apr 15, 2025 •

edited

Loading

zhuqi-lucas commented Apr 15, 2025 •

edited

Loading

zhuqi-lucas commented Apr 16, 2025 •

edited

Loading

Dandandan commented Jun 21, 2025 •

edited

Loading