Apply pre-selection and computation skipping to short-circuit optimization #15694

acking-you · 2025-04-12T15:13:33Z

Which issue does this PR close?

Closes Add more short-circuit optimization scenarios for OR and AND #15636

You can learn the reason behind this PR by reviewing the discussion in this issue: #15631 (comment)

Rationale for this change

Many thanks to @kosiew for doing a tremendous amount of work. Based on his PR #15648, I have made the following changes:

Performance test cases have been added for "all false" in AND operations and "all true" in OR operations.
Reduce some unnecessary function calls: for instance, false_count and true_count can be unified as values().count_set_bits() in non-null cases, which would eliminate two calls to nulls_count and one call to true_count.
Add heuristic pre-selection under the and operation. For more details on the principle, you can refer to: Improve the performance of early exit evaluation in binary_expr #15631 (comment).

What changes are included in this PR?

Wonderful work done by kosiew: #15648

Discussion of the related issue for this PR: #15631 (comment)

Are these changes tested?

Are there any user-facing changes?

acking-you · 2025-04-12T19:30:24Z

The error in cargo test is caused by an incorrect calculation of the pre-selection. The correct steps for calculating the pre-selection are as follows:

Compute the boolean array on the left-hand side.
Filter to obtain a new record batch based on the left-hand boolean array.
Compute the boolean array on the right-hand side using the new record batch (note that this boolean array will have incorrect positions since it corresponds to the new record batch).
Combine the left-hand and right-hand boolean arrays to produce the correct boolean array (modify the positions in the left-hand array marked as true based on the values from the right-hand array).

To illustrate this, I’ve drawn a diagram (it’s quite rough since I used a mouse, so I hope you don’t mind 😂):

The current code only handles up to the second step, which is why the error occurs. If you use evaluate_selection, the issue persists because its internal call to scatter (which corresponds to completing the fourth step) fills in the missing parts of the right-hand boolean array with null values, resulting in an incorrect outcome.

I’m currently working on implementing the fourth step to fix the issue, but it may take some time.

acking-you · 2025-04-13T19:24:21Z

The relevant bug fixes have been completed, and corresponding performance tests have been conducted. The results show that pre-selection has achieved significant gains! @Dandandan @alamb

Compare the current optimization with the main branch using cargo bench --bench binary_op. The results are as follows, where fluctuations within ±5% are considered as no changes.

Performance Comparison of the AND Logic Group

Test Case	main	short-and-optimize	Ratio	Change
all_false	62.623 ns	65.923 ns	0.95x	no changes
one_true_first	448.69 µs	195.60 µs	2.29x	↑ ✅
one_true_last	452.00 µs	171.91 µs	2.63x	↑ ✅
one_true_middle	453.12 µs	173.94 µs	2.60x	↑ ✅
one_true_middle_left	453.06 µs	165.70 µs	2.73x	↑ ✅
one_true_middle_right	459.61 µs	171.53 µs	2.68x	↑ ✅
all_true_in_and	450.03 µs	445.76 µs	1.01x	no changes

Performance Comparison of the OR Logic Group

Test Case	main	short-and-optimize	Ratio	Change
all_true	61.162 ns	64.430 ns	0.95x	no changes
one_false_first	448.51 µs	439.92 µs	1.02x	no changes
one_false_last	447.38 µs	453.64 µs	0.99x	no changes
one_false_middle	457.79 µs	447.15 µs	1.02x	no changes
one_false_middle_left	452.78 µs	447.75 µs	1.01x	no changes
one_false_middle_right	451.21 µs	444.23 µs	1.02x	no changes
all_false_in_or	449.90 µs	442.36 µs	1.02x	no changes

Possible next step(extend to nulls)

Short-circuit optimization cannot be extended to nulls

The current short-circuit optimization is only applicable to cases without null values. However, based on the calculation principles of "and" and "or", if the left-hand side (lhs) evaluates to null, then the final result can only be determined by continuing to calculate the right-hand side (rhs). Therefore, optimization for this scenario is not feasible. Below is an example of a calculation where lhs is null:

❯ select null and true;
+------------------------+
| NULL AND Boolean(true) |
+------------------------+
|                        |
+------------------------+
1 row in set. Query took 0.000 seconds.

❯ select null and false;
+-------------------------+
| NULL AND Boolean(false) |
+-------------------------+
| false                   |
+-------------------------+
1 row in set. Query took 0.000 seconds.

❯ select null or false;
+------------------------+
| NULL OR Boolean(false) |
+------------------------+
|                        |
+------------------------+
1 row in set. Query took 0.000 seconds.

❯ select null or true;
+-----------------------+
| NULL OR Boolean(true) |
+-----------------------+
| true                  |
+-----------------------+
1 row in set. Query took 0.000 seconds.

Pre-selection can be extended to include nulls

As I explained earlier: #15694 (comment), pre-selection can actually be extended to cover cases involving null values. However, one point needs to be confirmed: filter_record_batch will retain rows that are null.

Combine the left-hand and right-hand boolean arrays to produce the correct boolean array (modify the positions in the left-hand array marked as true based on the values from the right-hand array).

Afterward, we only need to modify the fourth step of the pre-selection process mentioned earlier to complete the extension that supports nulls.

kosiew

@acking-you ,
pre_selection_scatter is a GREAT idea!!

kosiew · 2025-04-14T02:46:49Z

datafusion/physical-expr/src/expressions/binary.rs

+/// Based on the results calculated from the left side of the short-circuit operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;


Suggested change

const PRE_SELECTIO_THRESHOLD: f32 = 0.2;

const PRE_SELECTION_THRESHOLD: f32 = 0.2;

kosiew · 2025-04-14T04:00:22Z

datafusion/physical-expr/src/expressions/binary.rs

+/// FIXME: Perhaps it would be better to modify `left_result` directly without creating a copy?
+/// In practice, `left_result` should have only one owner, so making changes should be safe.
+/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].


Suggested change

/// FIXME: Perhaps it would be better to modify `left_result` directly without creating a copy?

/// In practice, `left_result` should have only one owner, so making changes should be safe.

/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].

/// FIXME: Perhaps it would be better to modify `left_result` directly without creating a copy?

/// In practice, `left_result` should have only one owner, so making changes should be safe.

/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].

/// Creates a new boolean array based on the evaluation of the right expression,

/// but only for positions where the left_result is true.

///

/// This function is used for short-circuit evaluation optimization of logical AND operations:

/// - When left_result has few true values, we only evaluate the right expression for those positions

/// - Values are copied from right_array where left_result is true

/// - All other positions are filled with false values

///

/// @param left_result Boolean array with selection mask (typically from left side of AND)

/// @param right_result Result of evaluating right side of expression (only for selected positions)

/// @return A combined ColumnarValue with values from right_result where left_result is true

kosiew · 2025-04-14T04:04:46Z

datafusion/physical-expr/src/expressions/binary.rs

+    let right_array = if let ColumnarValue::Array(array) = right_result {
+        array
+    } else {
+        return Ok(right_result);
+    };


// Return scalar values as-is if let ColumnarValue::Scalar(_) = right_result { return Ok(right_result); } let right_array = right_result.into_array(left_result.len())?;

Maybe this reads easier?

Perhaps changing it to match would be better? Calling into_array(left_result.len()) might make people think that right_result has the same length as left_result, right?

good point.

kosiew · 2025-04-14T04:17:25Z

datafusion/physical-expr/src/expressions/binary.rs

+    let mut filled = 0;
+    // keep track of current position we have in right boolean array
+    let mut right_array_pos = 0;
+
+    let mut result_array_builder = BooleanArray::builder(result_len);
+    SlicesIterator::new(left_result).for_each(|(start, end)| {
+        // the gap needs to be filled with false
+        if start > filled {
+            (filled..start).for_each(|_| result_array_builder.append_value(false));
+        }
+        // fill with right_result values
+        let len = end - start;
+        right_boolean_array
+            .slice(right_array_pos, len)
+            .iter()
+            .for_each(|v| result_array_builder.append_option(v));
+
+        right_array_pos += len;
+        filled = end;
+    });
+    // the remaining part is falsy
+    if filled < result_len {
+        (filled..result_len).for_each(|_| result_array_builder.append_value(false));
+    }
+    let boolean_result = result_array_builder.finish();
+
+    Ok(ColumnarValue::Array(Arc::new(boolean_result)))


Here's an alternative approach using last_end, instead of filled.

last_end tracks the end of the last filled region.

// Build the resulting array efficiently let mut result_array_builder = BooleanArray::builder(result_len); let mut right_pos = 0; // Process the array using iterator of slices (contiguous true regions) let mut last_end = 0; for (start, end) in SlicesIterator::new(left_result) { // Fill the gap between last processed position and current slice start with false result_array_builder.append_n(start - last_end, false); // Copy values from right array for this slice let slice_len = end - start; right_boolean_array .slice(right_pos, slice_len) .iter() .for_each(|v| result_array_builder.append_option(v)); right_pos += slice_len; last_end = end; } // Fill any remaining positions with false if last_end < result_len { result_array_builder.append_n(result_len - last_end, false); } Ok(ColumnarValue::Array(Arc::new(result_array_builder.finish())))

Interestingly, this improved the performance further from

to

append_n is indeed a more efficient choice. I checked its source code and found that it doesn't handle append_n(0, true) separately to skip it. However, for append_n(0, false), it will be detected and skipped during self.advance. Since we will only append false, we don't need a separate check, but we should add a comment to clarify this behavior.

I continued to check the source code and found that its self.null_buffer_builder.append_n_non_nulls(additional) will append true, so we still need an if statement to check whether it is 0.

We might also do something like this (set bits on left side based on set_indices

(pseudocode)

for (index, left_index) in left.values().set_indices().iter().enumerate() { left_map.set_bit(left_map.get_bit(left_index) & right_map.get_bit(index))) }

This avoids building a new map and only does some bit indexing / efficiently iterating the left side.

We might also do something like this (set bits on left side based on set_indices

(pseudocode)

for (index, left_index) in left.values().set_indices().iter().enumerate() { left_map.set_bit(left_map.get_bit(left_index) & right_map.get_bit(index))) }

This avoids building a new map and only does some bit indexing / efficiently iterating the left side.

Yes, I mentioned this in the comments, but the immutability of Arc and BooleanArray makes this operation difficult to complete:

/// Perhaps it would be better to modify `left_result` directly without creating a copy? /// In practice, `left_result` should have only one owner, so making changes should be safe. /// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].

One possibility is copying the left buffer and then modifying it, this is faster than creating it from scratch using a builder :)

One possibility is copying the left buffer and then modifying it, this is faster than creating it from scratch using a builder :)

I think it should be about the same, because append won't allocate new memory—it just assigns values at the corresponding positions. There shouldn't be much difference compared to directly copying and then assigning values. This corresponds to the pre-allocation logic of the builder:

let mut result_array_builder = BooleanArray::builder(result_len);

Dandandan · 2025-04-14T08:24:38Z

| However, one point needs to be confirmed: filter_record_batch will retain rows that are null.

filter_record_batch removes rows that are null.

Dandandan · 2025-04-14T08:27:28Z

Very cool.
It would be nice to run some e2e benchmarks (TPC-H, clickbench) with this to see the impact here.

acking-you · 2025-04-14T09:15:55Z

| However, one point needs to be confirmed: filter_record_batch will retain rows that are null.

filter_record_batch removes rows that are null.

If that's the case, we may need to do some hacky things to extend it to nulls.

acking-you · 2025-04-14T09:50:03Z

Very cool. It would be nice to run some e2e benchmarks (TPC-H, clickbench) with this to see the impact here.

I tried running clickbench, and there wasn't a significant improvement. The optimization is still relatively selective about the cost of expression evaluation. For example, if the computation on the right involves only simple integer comparisons or short string comparisons, skipping a large amount of computation will not significantly improve performance..

New ideas

One potential area for significant improvement that I thought of is: heuristically replacing the computation process of filter_and_project.

The current state of this function is:

Pass in the batch to compute the predicate and obtain a boolean array.
Perform the actual filtering process on the batch based on the boolean array and return the result.

Pros

Since the filtering process involves copying to produce a new batch, calculating a boolean array based on the predicate only requires a single copy. If filtering is performed during the predicate computation, multiple copies will occur.

Cons

If the predicate is an AND expression, after the left side is computed, it may already have filtered out most of the batch and computed a new batch. However, the final batch generation is still based on the previous batch and the boolean array, which introduces unnecessary copy overhead.

Conclusion

Considering both aspects, I believe we can strike a balance. For example, if it is discovered during the AND computation process that most of the batch can be filtered out, then early filtering can be performed as the final result.
However, this optimization scenario is very limited: it can only be applied when the predicate is entirely an AND computation, such as a AND b AND c .... Once an OR appears in the middle, we cannot perform early filtering to produce the result.

The amount of code involved in implementing this optimization may be substantial,and it would only apply to predicates in the form of a AND b AND c.... It might not be worth the effort.

alamb · 2025-04-15T17:15:46Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Comparing short-and-optmize (d104e3b) to 0b01fdf diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-04-15T17:58:06Z

🤖: Benchmark completed

Details

Comparing HEAD and short-and-optmize
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ short-and-optmize ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  1918.00ms │         1941.59ms │ no change │
│ QQuery 1     │   711.53ms │          702.35ms │ no change │
│ QQuery 2     │  1429.36ms │         1446.13ms │ no change │
│ QQuery 3     │   704.72ms │          727.05ms │ no change │
│ QQuery 4     │  1516.84ms │         1514.23ms │ no change │
│ QQuery 5     │ 17398.39ms │        17271.95ms │ no change │
│ QQuery 6     │  2026.72ms │         2124.80ms │ no change │
└──────────────┴────────────┴───────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 25705.56ms │
│ Total Time (short-and-optmize)   │ 25728.10ms │
│ Average Time (HEAD)              │  3672.22ms │
│ Average Time (short-and-optmize) │  3675.44ms │
│ Queries Faster                   │          0 │
│ Queries Slower                   │          0 │
│ Queries with No Change           │          7 │
└──────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ short-and-optmize ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.74ms │            2.25ms │ +1.21x faster │
│ QQuery 1     │    36.47ms │           36.63ms │     no change │
│ QQuery 2     │    90.59ms │           88.44ms │     no change │
│ QQuery 3     │   101.66ms │           95.74ms │ +1.06x faster │
│ QQuery 4     │   802.73ms │          840.27ms │     no change │
│ QQuery 5     │   891.98ms │          914.36ms │     no change │
│ QQuery 6     │     2.10ms │            2.14ms │     no change │
│ QQuery 7     │    41.09ms │           42.92ms │     no change │
│ QQuery 8     │   948.80ms │          961.56ms │     no change │
│ QQuery 9     │  1269.90ms │         1292.89ms │     no change │
│ QQuery 10    │   264.01ms │          274.69ms │     no change │
│ QQuery 11    │   301.65ms │          316.08ms │     no change │
│ QQuery 12    │   943.62ms │          966.90ms │     no change │
│ QQuery 13    │  1405.86ms │         1426.41ms │     no change │
│ QQuery 14    │   900.03ms │          898.57ms │     no change │
│ QQuery 15    │  1079.97ms │         1069.93ms │     no change │
│ QQuery 16    │  1778.50ms │         1782.02ms │     no change │
│ QQuery 17    │  1665.53ms │         1666.25ms │     no change │
│ QQuery 18    │  3160.53ms │         3186.64ms │     no change │
│ QQuery 19    │    86.25ms │           92.80ms │  1.08x slower │
│ QQuery 20    │  1170.29ms │         1183.35ms │     no change │
│ QQuery 21    │  1355.56ms │         1374.25ms │     no change │
│ QQuery 22    │  2374.52ms │         2322.80ms │     no change │
│ QQuery 23    │  8507.41ms │         8569.51ms │     no change │
│ QQuery 24    │   470.15ms │          480.90ms │     no change │
│ QQuery 25    │   394.74ms │          408.10ms │     no change │
│ QQuery 26    │   539.09ms │          561.82ms │     no change │
│ QQuery 27    │  1706.49ms │         1747.01ms │     no change │
│ QQuery 28    │ 12891.10ms │        12837.09ms │     no change │
│ QQuery 29    │   536.68ms │          522.79ms │     no change │
│ QQuery 30    │   826.10ms │          849.08ms │     no change │
│ QQuery 31    │   879.52ms │          918.11ms │     no change │
│ QQuery 32    │  2784.99ms │         2694.41ms │     no change │
│ QQuery 33    │  3398.18ms │         3423.14ms │     no change │
│ QQuery 34    │  3435.31ms │         3521.83ms │     no change │
│ QQuery 35    │  1320.36ms │         1303.37ms │     no change │
│ QQuery 36    │   126.83ms │          132.86ms │     no change │
│ QQuery 37    │    57.78ms │           59.40ms │     no change │
│ QQuery 38    │   126.83ms │          125.35ms │     no change │
│ QQuery 39    │   201.53ms │          208.57ms │     no change │
│ QQuery 40    │    51.48ms │           47.72ms │ +1.08x faster │
│ QQuery 41    │    47.15ms │           46.73ms │     no change │
│ QQuery 42    │    40.77ms │           39.91ms │     no change │
└──────────────┴────────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 59016.87ms │
│ Total Time (short-and-optmize)   │ 59335.58ms │
│ Average Time (HEAD)              │  1372.49ms │
│ Average Time (short-and-optmize) │  1379.90ms │
│ Queries Faster                   │          3 │
│ Queries Slower                   │          1 │
│ Queries with No Change           │         39 │
└──────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ short-and-optmize ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 126.24ms │          123.04ms │     no change │
│ QQuery 2     │  24.28ms │           24.40ms │     no change │
│ QQuery 3     │  36.24ms │           37.03ms │     no change │
│ QQuery 4     │  20.48ms │           21.62ms │  1.06x slower │
│ QQuery 5     │  57.43ms │           57.95ms │     no change │
│ QQuery 6     │   8.32ms │           12.39ms │  1.49x slower │
│ QQuery 7     │ 108.70ms │          111.45ms │     no change │
│ QQuery 8     │  27.29ms │           27.24ms │     no change │
│ QQuery 9     │  64.45ms │           64.35ms │     no change │
│ QQuery 10    │  60.42ms │           59.08ms │     no change │
│ QQuery 11    │  13.32ms │           14.03ms │  1.05x slower │
│ QQuery 12    │  38.77ms │           46.44ms │  1.20x slower │
│ QQuery 13    │  28.91ms │           30.29ms │     no change │
│ QQuery 14    │   9.96ms │           10.23ms │     no change │
│ QQuery 15    │  25.93ms │           26.97ms │     no change │
│ QQuery 16    │  24.03ms │           23.96ms │     no change │
│ QQuery 17    │ 101.44ms │          100.13ms │     no change │
│ QQuery 18    │ 251.74ms │          245.84ms │     no change │
│ QQuery 19    │  30.50ms │           27.81ms │ +1.10x faster │
│ QQuery 20    │  40.06ms │           40.93ms │     no change │
│ QQuery 21    │ 176.83ms │          178.37ms │     no change │
│ QQuery 22    │  17.83ms │           17.84ms │     no change │
└──────────────┴──────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 1293.17ms │
│ Total Time (short-and-optmize)   │ 1301.40ms │
│ Average Time (HEAD)              │   58.78ms │
│ Average Time (short-and-optmize) │   59.15ms │
│ Queries Faster                   │         1 │
│ Queries Slower                   │         4 │
│ Queries with No Change           │        17 │
└──────────────────────────────────┴───────────┘

acking-you · 2025-04-15T18:32:26Z

🤖: Benchmark completed

Details

Comparing HEAD and short-and-optmize
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ short-and-optmize ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  1918.00ms │         1941.59ms │ no change │
│ QQuery 1     │   711.53ms │          702.35ms │ no change │
│ QQuery 2     │  1429.36ms │         1446.13ms │ no change │
│ QQuery 3     │   704.72ms │          727.05ms │ no change │
│ QQuery 4     │  1516.84ms │         1514.23ms │ no change │
│ QQuery 5     │ 17398.39ms │        17271.95ms │ no change │
│ QQuery 6     │  2026.72ms │         2124.80ms │ no change │
└──────────────┴────────────┴───────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 25705.56ms │
│ Total Time (short-and-optmize)   │ 25728.10ms │
│ Average Time (HEAD)              │  3672.22ms │
│ Average Time (short-and-optmize) │  3675.44ms │
│ Queries Faster                   │          0 │
│ Queries Slower                   │          0 │
│ Queries with No Change           │          7 │
└──────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ short-and-optmize ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.74ms │            2.25ms │ +1.21x faster │
│ QQuery 1     │    36.47ms │           36.63ms │     no change │
│ QQuery 2     │    90.59ms │           88.44ms │     no change │
│ QQuery 3     │   101.66ms │           95.74ms │ +1.06x faster │
│ QQuery 4     │   802.73ms │          840.27ms │     no change │
│ QQuery 5     │   891.98ms │          914.36ms │     no change │
│ QQuery 6     │     2.10ms │            2.14ms │     no change │
│ QQuery 7     │    41.09ms │           42.92ms │     no change │
│ QQuery 8     │   948.80ms │          961.56ms │     no change │
│ QQuery 9     │  1269.90ms │         1292.89ms │     no change │
│ QQuery 10    │   264.01ms │          274.69ms │     no change │
│ QQuery 11    │   301.65ms │          316.08ms │     no change │
│ QQuery 12    │   943.62ms │          966.90ms │     no change │
│ QQuery 13    │  1405.86ms │         1426.41ms │     no change │
│ QQuery 14    │   900.03ms │          898.57ms │     no change │
│ QQuery 15    │  1079.97ms │         1069.93ms │     no change │
│ QQuery 16    │  1778.50ms │         1782.02ms │     no change │
│ QQuery 17    │  1665.53ms │         1666.25ms │     no change │
│ QQuery 18    │  3160.53ms │         3186.64ms │     no change │
│ QQuery 19    │    86.25ms │           92.80ms │  1.08x slower │
│ QQuery 20    │  1170.29ms │         1183.35ms │     no change │
│ QQuery 21    │  1355.56ms │         1374.25ms │     no change │
│ QQuery 22    │  2374.52ms │         2322.80ms │     no change │
│ QQuery 23    │  8507.41ms │         8569.51ms │     no change │
│ QQuery 24    │   470.15ms │          480.90ms │     no change │
│ QQuery 25    │   394.74ms │          408.10ms │     no change │
│ QQuery 26    │   539.09ms │          561.82ms │     no change │
│ QQuery 27    │  1706.49ms │         1747.01ms │     no change │
│ QQuery 28    │ 12891.10ms │        12837.09ms │     no change │
│ QQuery 29    │   536.68ms │          522.79ms │     no change │
│ QQuery 30    │   826.10ms │          849.08ms │     no change │
│ QQuery 31    │   879.52ms │          918.11ms │     no change │
│ QQuery 32    │  2784.99ms │         2694.41ms │     no change │
│ QQuery 33    │  3398.18ms │         3423.14ms │     no change │
│ QQuery 34    │  3435.31ms │         3521.83ms │     no change │
│ QQuery 35    │  1320.36ms │         1303.37ms │     no change │
│ QQuery 36    │   126.83ms │          132.86ms │     no change │
│ QQuery 37    │    57.78ms │           59.40ms │     no change │
│ QQuery 38    │   126.83ms │          125.35ms │     no change │
│ QQuery 39    │   201.53ms │          208.57ms │     no change │
│ QQuery 40    │    51.48ms │           47.72ms │ +1.08x faster │
│ QQuery 41    │    47.15ms │           46.73ms │     no change │
│ QQuery 42    │    40.77ms │           39.91ms │     no change │
└──────────────┴────────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 59016.87ms │
│ Total Time (short-and-optmize)   │ 59335.58ms │
│ Average Time (HEAD)              │  1372.49ms │
│ Average Time (short-and-optmize) │  1379.90ms │
│ Queries Faster                   │          3 │
│ Queries Slower                   │          1 │
│ Queries with No Change           │         39 │
└──────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ short-and-optmize ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 126.24ms │          123.04ms │     no change │
│ QQuery 2     │  24.28ms │           24.40ms │     no change │
│ QQuery 3     │  36.24ms │           37.03ms │     no change │
│ QQuery 4     │  20.48ms │           21.62ms │  1.06x slower │
│ QQuery 5     │  57.43ms │           57.95ms │     no change │
│ QQuery 6     │   8.32ms │           12.39ms │  1.49x slower │
│ QQuery 7     │ 108.70ms │          111.45ms │     no change │
│ QQuery 8     │  27.29ms │           27.24ms │     no change │
│ QQuery 9     │  64.45ms │           64.35ms │     no change │
│ QQuery 10    │  60.42ms │           59.08ms │     no change │
│ QQuery 11    │  13.32ms │           14.03ms │  1.05x slower │
│ QQuery 12    │  38.77ms │           46.44ms │  1.20x slower │
│ QQuery 13    │  28.91ms │           30.29ms │     no change │
│ QQuery 14    │   9.96ms │           10.23ms │     no change │
│ QQuery 15    │  25.93ms │           26.97ms │     no change │
│ QQuery 16    │  24.03ms │           23.96ms │     no change │
│ QQuery 17    │ 101.44ms │          100.13ms │     no change │
│ QQuery 18    │ 251.74ms │          245.84ms │     no change │
│ QQuery 19    │  30.50ms │           27.81ms │ +1.10x faster │
│ QQuery 20    │  40.06ms │           40.93ms │     no change │
│ QQuery 21    │ 176.83ms │          178.37ms │     no change │
│ QQuery 22    │  17.83ms │           17.84ms │     no change │
└──────────────┴──────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 1293.17ms │
│ Total Time (short-and-optmize)   │ 1301.40ms │
│ Average Time (HEAD)              │   58.78ms │
│ Average Time (short-and-optmize) │   59.15ms │
│ Queries Faster                   │         1 │
│ Queries Slower                   │         4 │
│ Queries with No Change           │        17 │
└──────────────────────────────────┴───────────┘

The result met expectations. This optimization primarily reduces the subsequent filter computation load by making early decisions. This is useful when the filter computation is heavy and short-circuiting fails. But I believe there are at least three areas for further optimization:

Not only considering the ratio of true_count, but also taking into account the data size of the batch.
Striving to implement zero-copy for the lhs.
Extending the approach to handle cases where lhs contains null values.

alamb

Thank you @acking-you -- I think this is looking really nice -- thank you and @kosiew . It is amazing to see this level of optimization

The only thing I think is needed in this PR is a few more tests for the pre_selection_scatter function and then it will be ready to go

I think we should plan to merge this once we create a datafusion 47 release

Again, really nice work and thank you

alamb · 2025-04-15T22:42:06Z

datafusion/physical-expr/src/expressions/binary.rs

+
+/// Based on the results calculated from the left side of the short-circuit operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is an `and`,
+/// the `RecordBatch` will be filtered in advance.


this seems reasonable to me

alamb · 2025-04-15T22:44:55Z

datafusion/physical-expr/src/expressions/binary.rs

+/// # Note
+/// Perhaps it would be better to modify `left_result` directly without creating a copy?
+/// In practice, `left_result` should have only one owner, so making changes should be safe.
+/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].


I think we could potentially use kernels like unary_mut to reuse the allocation: https://docs.rs/arrow/latest/arrow/compute/fn.unary_mut.html

However, I don't think that is currently supported in arrow-rs for BooleanArray -- we could add it upstream if we wanted to explore this optimization

However each BooleanArray is likely to have 8192 elements , so 1K bytes, where maybe the overhead is ok

Yes, I also think that copying 1KB of data is perfectly acceptable, as it should only incur a nanosecond-level overhead.

datafusion/physical-expr/src/expressions/binary.rs

- Delay evaluation of the right-hand side (RHS) unless necessary. - Optimize short-circuiting for `Operator::And` and `Operator::Or` by checking LHS alone first. - Introduce `get_short_circuit_result` function to determine short-circuit conditions based on LHS and RHS. - Update tests to cover various short-circuit scenarios for both `AND` and `OR` operations.

…esult and update assertions - Renamed the test function for clarity. - Updated assertions to use get_short_circuit_result instead of check_short_circuit. - Added additional test cases for AND and OR operations with expected results.

…lt function for null - Updated AND and OR short-circuit conditions to only trigger when all values are either false or true, respectively, and there are no nulls in the array. - Adjusted test case to reflect the change in expected output.

…rcuit checks

…HS is only evaluated when necessary

…ort_circuit_result function

… BinaryExpr evaluation

…ssion tests

…d enhance documentation for get_short_circuit_result

…t evaluation in BinaryExpr

…values in BinaryExpr

…tors - Renamed `arg` to `lhs` for clarity in the `get_short_circuit_result` function. - Updated handling of Boolean data types to return `None` for null values. - Simplified short-circuit checks for AND/OR operations by consolidating logic. - Enhanced readability and maintainability of the code by restructuring match statements.

…optimize logical operations

…Expr to optimize logical operations" This reverts commit a62df47.

…circuit evaluation

…luation in BinaryExpr - Replaced the lazy evaluation of the right-hand side (RHS) with immediate evaluation based on short-circuiting logic. - Introduced a new function `check_short_circuit` to determine if short-circuiting can be applied for logical operators. - Updated the logic to return early for `Operator::And` and `Operator::Or` based on the evaluation of the left-hand side (LHS) and the conditions of the RHS. - Improved clarity and efficiency of the short-circuit evaluation process by eliminating unnecessary evaluations.

…uit function

…nction - Simplified logic for AND/OR operations by prioritizing false/true counts to enhance performance. - Updated documentation to reflect changes in array handling techniques.

…_short_circuit logic - Introduced a new helper function `count_boolean_values` to count true and false values in a BooleanArray, improving readability and performance. - Updated `check_short_circuit` to utilize the new helper function for counting, reducing redundant operations and enhancing clarity in the evaluation logic for AND/OR operations. - Adjusted comments for better understanding of the short-circuiting conditions based on the new counting mechanism.

…ze check_short_circuit logic" This reverts commit e2b9f77.

…uit function

…-selection scatter function documentation

acking-you · 2025-04-16T16:51:42Z

The only thing I think is needed in this PR is a few more tests for the pre_selection_scatter function and then it will be ready to go

done

alamb

THank you again @acking-you and @kosiew -- really nice work

alamb · 2025-04-17T17:04:24Z

datafusion/physical-expr/src/expressions/binary.rs

+        fn create_bool_array(bools: Vec<bool>) -> BooleanArray {
+            BooleanArray::from(bools.into_iter().map(Some).collect::<Vec<_>>())
+        }
+        // Test sparse left with interleaved true/false


…ation (apache#15694) * Enhance short-circuit evaluation for binary expressions - Delay evaluation of the right-hand side (RHS) unless necessary. - Optimize short-circuiting for `Operator::And` and `Operator::Or` by checking LHS alone first. - Introduce `get_short_circuit_result` function to determine short-circuit conditions based on LHS and RHS. - Update tests to cover various short-circuit scenarios for both `AND` and `OR` operations. * refactor: rename test_check_short_circuit to test_get_short_circuit_result and update assertions - Renamed the test function for clarity. - Updated assertions to use get_short_circuit_result instead of check_short_circuit. - Added additional test cases for AND and OR operations with expected results. * fix: enhance short-circuit evaluation logic in get_short_circuit_result function for null - Updated AND and OR short-circuit conditions to only trigger when all values are either false or true, respectively, and there are no nulls in the array. - Adjusted test case to reflect the change in expected output. * feat: add debug logging for binary expression evaluation and short-circuit checks * fix: improve short-circuit evaluation logic in BinaryExpr to ensure RHS is only evaluated when necessary * fix: restrict short-circuit evaluation to logical operators in get_short_circuit_result function * add more println!("==> "); * fix: remove duplicate data type checks for left and right operands in BinaryExpr evaluation * feat: add debug prints for dictionary values and keys in binary expression tests * Tests pass * fix: remove redundant short-circuit evaluation check in BinaryExpr and enhance documentation for get_short_circuit_result * refactor: remove unnecessary debug prints and streamline short-circuit evaluation in BinaryExpr * test: enhance short-circuit evaluation tests for nullable and scalar values in BinaryExpr * add benchmark * refactor: improve short-circuit logic in BinaryExpr for logical operators - Renamed `arg` to `lhs` for clarity in the `get_short_circuit_result` function. - Updated handling of Boolean data types to return `None` for null values. - Simplified short-circuit checks for AND/OR operations by consolidating logic. - Enhanced readability and maintainability of the code by restructuring match statements. * refactor: enhance short-circuit evaluation strategy in BinaryExpr to optimize logical operations * Revert "refactor: enhance short-circuit evaluation strategy in BinaryExpr to optimize logical operations" This reverts commit a62df47. * bench: add benchmark for OR operation with all false values in short-circuit evaluation * refactor: add ShortCircuitStrategy enum to optimize short-circuit evaluation in BinaryExpr - Replaced the lazy evaluation of the right-hand side (RHS) with immediate evaluation based on short-circuiting logic. - Introduced a new function `check_short_circuit` to determine if short-circuiting can be applied for logical operators. - Updated the logic to return early for `Operator::And` and `Operator::Or` based on the evaluation of the left-hand side (LHS) and the conditions of the RHS. - Improved clarity and efficiency of the short-circuit evaluation process by eliminating unnecessary evaluations. * refactor: simplify short-circuit evaluation logic in check_short_circuit function * datafusion_expr::lit as expr_lit * refactor: optimize short-circuit evaluation in check_short_circuit function - Simplified logic for AND/OR operations by prioritizing false/true counts to enhance performance. - Updated documentation to reflect changes in array handling techniques. * refactor: add count_boolean_values helper function and optimize check_short_circuit logic - Introduced a new helper function `count_boolean_values` to count true and false values in a BooleanArray, improving readability and performance. - Updated `check_short_circuit` to utilize the new helper function for counting, reducing redundant operations and enhancing clarity in the evaluation logic for AND/OR operations. - Adjusted comments for better understanding of the short-circuiting conditions based on the new counting mechanism. * Revert "refactor: add count_boolean_values helper function and optimize check_short_circuit logic" This reverts commit e2b9f77. * optimise evaluate * optimise evaluate 2 * refactor op:AND, lhs all false op:OR, lhs all true to be faster * fix clippy warning * refactor: optimize short-circuit evaluation logic in check_short_circuit function * fix clippy warning * add pre selection * add some comments * [WIP] fix pre-selection result * fix: Error in calculating the ratio * fix: Correct typo in pre-selection threshold constant and improve pre-selection scatter function documentation * fix doctest error * fix cargo doc * fix cargo doc * test: Add unit tests for pre_selection_scatter function --------- Co-authored-by: Siew Kam Onn <[email protected]>

github-actions bot added the physical-expr Changes to the physical-expr crates label Apr 12, 2025

acking-you mentioned this pull request Apr 12, 2025

Improve the performance of early exit evaluation in binary_expr #15631

Open

kosiew reviewed Apr 14, 2025

View reviewed changes

kosiew mentioned this pull request Apr 14, 2025

Optimize BinaryExpr Evaluation with Short-Circuiting for AND/OR Operators #15648

Closed

Dandandan mentioned this pull request Apr 14, 2025

Optimize TopK with threshold filter ~1.4x speedup #15697

Open

alamb approved these changes Apr 15, 2025

View reviewed changes

alamb added the performance Make DataFusion faster label Apr 15, 2025

kosiew added 14 commits April 17, 2025 00:48

feat: add debug logging for binary expression evaluation and short-ci…

8fda5af

…rcuit checks

fix: improve short-circuit evaluation logic in BinaryExpr to ensure R…

0da5b61

…HS is only evaluated when necessary

fix: restrict short-circuit evaluation to logical operators in get_sh…

8f8d615

…ort_circuit_result function

add more println!("==> ");

c322c78

fix: remove duplicate data type checks for left and right operands in…

2101f4e

… BinaryExpr evaluation

feat: add debug prints for dictionary values and keys in binary expre…

4b5c870

…ssion tests

Tests pass

044756d

fix: remove redundant short-circuit evaluation check in BinaryExpr an…

f03e11c

…d enhance documentation for get_short_circuit_result

refactor: remove unnecessary debug prints and streamline short-circui…

a1d1f6b

…t evaluation in BinaryExpr

test: enhance short-circuit evaluation tests for nullable and scalar …

58d1c74

…values in BinaryExpr

add benchmark

345878a

kosiew and others added 25 commits April 17, 2025 00:48

refactor: enhance short-circuit evaluation strategy in BinaryExpr to …

1a8b7f9

…optimize logical operations

Revert "refactor: enhance short-circuit evaluation strategy in Binary…

e653a1d

…Expr to optimize logical operations" This reverts commit a62df47.

bench: add benchmark for OR operation with all false values in short-…

05a715b

…circuit evaluation

refactor: simplify short-circuit evaluation logic in check_short_circ…

b342b40

…uit function

datafusion_expr::lit as expr_lit

d9171be

refactor: optimize short-circuit evaluation in check_short_circuit fu…

24ed343

…nction - Simplified logic for AND/OR operations by prioritizing false/true counts to enhance performance. - Updated documentation to reflect changes in array handling techniques.

Revert "refactor: add count_boolean_values helper function and optimi…

66b0bf4

…ze check_short_circuit logic" This reverts commit e2b9f77.

optimise evaluate

281dcc5

optimise evaluate 2

3b3c81b

refactor op:AND, lhs all false op:OR, lhs all true to be faster

6fe51c1

fix clippy warning

be3c38c

refactor: optimize short-circuit evaluation logic in check_short_circ…

7f01930

…uit function

fix clippy warning

3aebed6

add pre selection

10c5b56

add some comments

96c4b55

[WIP] fix pre-selection result

08de684

fix: Error in calculating the ratio

9f3f452

fix: Correct typo in pre-selection threshold constant and improve pre…

070eaf8

…-selection scatter function documentation

fix doctest error

ab50eaa

fix cargo doc

4d276ff

fix cargo doc

bc381c8

test: Add unit tests for pre_selection_scatter function

983e81e

acking-you force-pushed the short-and-optmize branch from eb50121 to 983e81e Compare April 16, 2025 16:48

alamb approved these changes Apr 17, 2025

View reviewed changes

alamb merged commit 4818966 into apache:main Apr 17, 2025
27 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apply pre-selection and computation skipping to short-circuit optimization #15694

Apply pre-selection and computation skipping to short-circuit optimization #15694

acking-you commented Apr 12, 2025 •

edited

Loading

acking-you commented Apr 12, 2025

acking-you commented Apr 13, 2025

kosiew left a comment •

edited

Loading

kosiew Apr 14, 2025

kosiew Apr 14, 2025

kosiew Apr 14, 2025

acking-you Apr 14, 2025

kosiew Apr 16, 2025

kosiew Apr 14, 2025

kosiew Apr 14, 2025 •

edited

Loading

acking-you Apr 14, 2025 •

edited

Loading

Dandandan Apr 14, 2025 •

edited

Loading

acking-you Apr 14, 2025

Dandandan Apr 14, 2025

acking-you Apr 15, 2025

Dandandan commented Apr 14, 2025

Dandandan commented Apr 14, 2025

acking-you commented Apr 14, 2025

acking-you commented Apr 14, 2025 •

edited

Loading

alamb commented Apr 15, 2025

alamb commented Apr 15, 2025

acking-you commented Apr 15, 2025 •

edited

Loading

alamb left a comment

alamb Apr 15, 2025

alamb Apr 15, 2025

acking-you Apr 16, 2025

acking-you commented Apr 16, 2025

alamb left a comment

alamb Apr 17, 2025

	const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
	const PRE_SELECTION_THRESHOLD: f32 = 0.2;

-/// FIXME: Perhaps it would be better to modify `left_result` directly without creating a copy?
-/// In practice, `left_result` should have only one owner, so making changes should be safe.
-/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].
+/// FIXME: Perhaps it would be better to modify `left_result` directly without creating a copy?
+/// In practice, `left_result` should have only one owner, so making changes should be safe.
+/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].
+/// Creates a new boolean array based on the evaluation of the right expression,
+/// but only for positions where the left_result is true.
+///
+/// This function is used for short-circuit evaluation optimization of logical AND operations:
+/// - When left_result has few true values, we only evaluate the right expression for those positions
+/// - Values are copied from right_array where left_result is true
+/// - All other positions are filled with false values
+///
+/// @param left_result Boolean array with selection mask (typically from left side of AND)
+/// @param right_result Result of evaluating right side of expression (only for selected positions)
+/// @return A combined ColumnarValue with values from right_result where left_result is true

Apply pre-selection and computation skipping to short-circuit optimization #15694

Apply pre-selection and computation skipping to short-circuit optimization #15694

Conversation

acking-you commented Apr 12, 2025 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

acking-you commented Apr 12, 2025

acking-you commented Apr 13, 2025

Performance Comparison of the AND Logic Group

Performance Comparison of the OR Logic Group

Possible next step(extend to nulls)

Short-circuit optimization cannot be extended to nulls

Pre-selection can be extended to include nulls

kosiew left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kosiew Apr 14, 2025 • edited Loading

Choose a reason for hiding this comment

acking-you Apr 14, 2025 • edited Loading

Choose a reason for hiding this comment

Dandandan Apr 14, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dandandan commented Apr 14, 2025

Dandandan commented Apr 14, 2025

acking-you commented Apr 14, 2025

acking-you commented Apr 14, 2025 • edited Loading

New ideas

Pros

Cons

Conclusion

alamb commented Apr 15, 2025

alamb commented Apr 15, 2025

acking-you commented Apr 15, 2025 • edited Loading

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

acking-you commented Apr 16, 2025

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

acking-you commented Apr 12, 2025 •

edited

Loading

kosiew left a comment •

edited

Loading

kosiew Apr 14, 2025 •

edited

Loading

acking-you Apr 14, 2025 •

edited

Loading

Dandandan Apr 14, 2025 •

edited

Loading

acking-you commented Apr 14, 2025 •

edited

Loading

acking-you commented Apr 15, 2025 •

edited

Loading