Skip to content

Apply pre-selection and computation skipping to short-circuit optimization #15694

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Apr 17, 2025

Conversation

acking-you
Copy link
Contributor

@acking-you acking-you commented Apr 12, 2025

Which issue does this PR close?

You can learn the reason behind this PR by reviewing the discussion in this issue: #15631 (comment)

Rationale for this change

Many thanks to @kosiew for doing a tremendous amount of work. Based on his PR #15648, I have made the following changes:

  1. Performance test cases have been added for "all false" in AND operations and "all true" in OR operations.
  2. Reduce some unnecessary function calls: for instance, false_count and true_count can be unified as values().count_set_bits() in non-null cases, which would eliminate two calls to nulls_count and one call to true_count.
  3. Add heuristic pre-selection under the and operation. For more details on the principle, you can refer to: Improve the performance of early exit evaluation in binary_expr #15631 (comment).

What changes are included in this PR?

Wonderful work done by kosiew: #15648

Discussion of the related issue for this PR: #15631 (comment)

Are these changes tested?

Are there any user-facing changes?

@acking-you
Copy link
Contributor Author

The error in cargo test is caused by an incorrect calculation of the pre-selection. The correct steps for calculating the pre-selection are as follows:

  1. Compute the boolean array on the left-hand side.
  2. Filter to obtain a new record batch based on the left-hand boolean array.
  3. Compute the boolean array on the right-hand side using the new record batch (note that this boolean array will have incorrect positions since it corresponds to the new record batch).
  4. Combine the left-hand and right-hand boolean arrays to produce the correct boolean array (modify the positions in the left-hand array marked as true based on the values from the right-hand array).

To illustrate this, I’ve drawn a diagram (it’s quite rough since I used a mouse, so I hope you don’t mind 😂):

image

The current code only handles up to the second step, which is why the error occurs. If you use evaluate_selection, the issue persists because its internal call to scatter (which corresponds to completing the fourth step) fills in the missing parts of the right-hand boolean array with null values, resulting in an incorrect outcome.

I’m currently working on implementing the fourth step to fix the issue, but it may take some time.

@acking-you
Copy link
Contributor Author

The relevant bug fixes have been completed, and corresponding performance tests have been conducted. The results show that pre-selection has achieved significant gains! @Dandandan @alamb

Compare the current optimization with the main branch using cargo bench --bench binary_op. The results are as follows, where fluctuations within ±5% are considered as no changes.

Performance Comparison of the AND Logic Group

Test Case main short-and-optimize Ratio Change
all_false 62.623 ns 65.923 ns 0.95x no changes
one_true_first 448.69 µs 195.60 µs 2.29x ↑ ✅
one_true_last 452.00 µs 171.91 µs 2.63x ↑ ✅
one_true_middle 453.12 µs 173.94 µs 2.60x ↑ ✅
one_true_middle_left 453.06 µs 165.70 µs 2.73x ↑ ✅
one_true_middle_right 459.61 µs 171.53 µs 2.68x ↑ ✅
all_true_in_and 450.03 µs 445.76 µs 1.01x no changes

Performance Comparison of the OR Logic Group

Test Case main short-and-optimize Ratio Change
all_true 61.162 ns 64.430 ns 0.95x no changes
one_false_first 448.51 µs 439.92 µs 1.02x no changes
one_false_last 447.38 µs 453.64 µs 0.99x no changes
one_false_middle 457.79 µs 447.15 µs 1.02x no changes
one_false_middle_left 452.78 µs 447.75 µs 1.01x no changes
one_false_middle_right 451.21 µs 444.23 µs 1.02x no changes
all_false_in_or 449.90 µs 442.36 µs 1.02x no changes

Possible next step(extend to nulls)

Short-circuit optimization cannot be extended to nulls

The current short-circuit optimization is only applicable to cases without null values. However, based on the calculation principles of "and" and "or", if the left-hand side (lhs) evaluates to null, then the final result can only be determined by continuing to calculate the right-hand side (rhs). Therefore, optimization for this scenario is not feasible. Below is an example of a calculation where lhs is null:

select null and true;
+------------------------+
| NULL AND Boolean(true) |
+------------------------+
|                        |
+------------------------+
1 row in set. Query took 0.000 seconds.

❯ select null and false;
+-------------------------+
| NULL AND Boolean(false) |
+-------------------------+
| false                   |
+-------------------------+
1 row in set. Query took 0.000 seconds.

❯ select null or false;
+------------------------+
| NULL OR Boolean(false) |
+------------------------+
|                        |
+------------------------+
1 row in set. Query took 0.000 seconds.

❯ select null or true;
+-----------------------+
| NULL OR Boolean(true) |
+-----------------------+
| true                  |
+-----------------------+
1 row in set. Query took 0.000 seconds.

Pre-selection can be extended to include nulls

As I explained earlier: #15694 (comment), pre-selection can actually be extended to cover cases involving null values. However, one point needs to be confirmed: filter_record_batch will retain rows that are null.

  1. Combine the left-hand and right-hand boolean arrays to produce the correct boolean array (modify the positions in the left-hand array marked as true based on the values from the right-hand array).

Afterward, we only need to modify the fourth step of the pre-selection process mentioned earlier to complete the extension that supports nulls.

Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@acking-you ,
pre_selection_scatter is a GREAT idea!!

/// Based on the results calculated from the left side of the short-circuit operation,
/// if the proportion of `true` is less than 0.2 and the current operation is an `and`,
/// the `RecordBatch` will be filtered in advance.
const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
const PRE_SELECTION_THRESHOLD: f32 = 0.2;

Comment on lines 940 to 975
/// FIXME: Perhaps it would be better to modify `left_result` directly without creating a copy?
/// In practice, `left_result` should have only one owner, so making changes should be safe.
/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// FIXME: Perhaps it would be better to modify `left_result` directly without creating a copy?
/// In practice, `left_result` should have only one owner, so making changes should be safe.
/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].
/// FIXME: Perhaps it would be better to modify `left_result` directly without creating a copy?
/// In practice, `left_result` should have only one owner, so making changes should be safe.
/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].
/// Creates a new boolean array based on the evaluation of the right expression,
/// but only for positions where the left_result is true.
///
/// This function is used for short-circuit evaluation optimization of logical AND operations:
/// - When left_result has few true values, we only evaluate the right expression for those positions
/// - Values are copied from right_array where left_result is true
/// - All other positions are filled with false values
///
/// @param left_result Boolean array with selection mask (typically from left side of AND)
/// @param right_result Result of evaluating right side of expression (only for selected positions)
/// @return A combined ColumnarValue with values from right_result where left_result is true

Comment on lines 947 to 983
let right_array = if let ColumnarValue::Array(array) = right_result {
array
} else {
return Ok(right_result);
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    // Return scalar values as-is
    if let ColumnarValue::Scalar(_) = right_result {
        return Ok(right_result);
    }

    let right_array = right_result.into_array(left_result.len())?;

Maybe this reads easier?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps changing it to match would be better? Calling into_array(left_result.len()) might make people think that right_result has the same length as left_result, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.

Comment on lines 956 to 1017
let mut filled = 0;
// keep track of current position we have in right boolean array
let mut right_array_pos = 0;

let mut result_array_builder = BooleanArray::builder(result_len);
SlicesIterator::new(left_result).for_each(|(start, end)| {
// the gap needs to be filled with false
if start > filled {
(filled..start).for_each(|_| result_array_builder.append_value(false));
}
// fill with right_result values
let len = end - start;
right_boolean_array
.slice(right_array_pos, len)
.iter()
.for_each(|v| result_array_builder.append_option(v));

right_array_pos += len;
filled = end;
});
// the remaining part is falsy
if filled < result_len {
(filled..result_len).for_each(|_| result_array_builder.append_value(false));
}
let boolean_result = result_array_builder.finish();

Ok(ColumnarValue::Array(Arc::new(boolean_result)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an alternative approach using last_end, instead of filled.

last_end tracks the end of the last filled region.

    // Build the resulting array efficiently
    let mut result_array_builder = BooleanArray::builder(result_len);
    let mut right_pos = 0;

    // Process the array using iterator of slices (contiguous true regions)
    let mut last_end = 0;
    for (start, end) in SlicesIterator::new(left_result) {
        // Fill the gap between last processed position and current slice start with false
        result_array_builder.append_n(start - last_end, false);

        // Copy values from right array for this slice
        let slice_len = end - start;
        right_boolean_array
            .slice(right_pos, slice_len)
            .iter()
            .for_each(|v| result_array_builder.append_option(v));

        right_pos += slice_len;
        last_end = end;
    }

    // Fill any remaining positions with false
    if last_end < result_len {
        result_array_builder.append_n(result_len - last_end, false);
    }

    Ok(ColumnarValue::Array(Arc::new(result_array_builder.finish())))

Copy link
Contributor

@kosiew kosiew Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, this improved the performance further from

image

to

image

Copy link
Contributor Author

@acking-you acking-you Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

append_n is indeed a more efficient choice. I checked its source code and found that it doesn't handle append_n(0, true) separately to skip it. However, for append_n(0, false), it will be detected and skipped during self.advance. Since we will only append false, we don't need a separate check, but we should add a comment to clarify this behavior.

I continued to check the source code and found that its self.null_buffer_builder.append_n_non_nulls(additional) will append true, so we still need an if statement to check whether it is 0.

Copy link
Contributor

@Dandandan Dandandan Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also do something like this (set bits on left side based on set_indices

(pseudocode)

for (index, left_index) in left.values().set_indices().iter().enumerate() {
    left_map.set_bit(left_map.get_bit(left_index) & right_map.get_bit(index)))
}

This avoids building a new map and only does some bit indexing / efficiently iterating the left side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also do something like this (set bits on left side based on set_indices

(pseudocode)

for (index, left_index) in left.values().set_indices().iter().enumerate() {
    left_map.set_bit(left_map.get_bit(left_index) & right_map.get_bit(index)))
}

This avoids building a new map and only does some bit indexing / efficiently iterating the left side.

Yes, I mentioned this in the comments, but the immutability of Arc and BooleanArray makes this operation difficult to complete:

/// Perhaps it would be better to modify `left_result` directly without creating a copy?
/// In practice, `left_result` should have only one owner, so making changes should be safe.
/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility is copying the left buffer and then modifying it, this is faster than creating it from scratch using a builder :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility is copying the left buffer and then modifying it, this is faster than creating it from scratch using a builder :)

I think it should be about the same, because append won't allocate new memory—it just assigns values at the corresponding positions. There shouldn't be much difference compared to directly copying and then assigning values. This corresponds to the pre-allocation logic of the builder:

    let mut result_array_builder = BooleanArray::builder(result_len);

@Dandandan
Copy link
Contributor

| However, one point needs to be confirmed: filter_record_batch will retain rows that are null.

filter_record_batch removes rows that are null.

@Dandandan
Copy link
Contributor

Very cool.
It would be nice to run some e2e benchmarks (TPC-H, clickbench) with this to see the impact here.

@acking-you
Copy link
Contributor Author

| However, one point needs to be confirmed: filter_record_batch will retain rows that are null.

filter_record_batch removes rows that are null.

If that's the case, we may need to do some hacky things to extend it to nulls.

@acking-you
Copy link
Contributor Author

acking-you commented Apr 14, 2025

Very cool. It would be nice to run some e2e benchmarks (TPC-H, clickbench) with this to see the impact here.

I tried running clickbench, and there wasn't a significant improvement. The optimization is still relatively selective about the cost of expression evaluation. For example, if the computation on the right involves only simple integer comparisons or short string comparisons, skipping a large amount of computation will not significantly improve performance..

New ideas

One potential area for significant improvement that I thought of is: heuristically replacing the computation process of filter_and_project.

The current state of this function is:

  1. Pass in the batch to compute the predicate and obtain a boolean array.
  2. Perform the actual filtering process on the batch based on the boolean array and return the result.

Pros

Since the filtering process involves copying to produce a new batch, calculating a boolean array based on the predicate only requires a single copy. If filtering is performed during the predicate computation, multiple copies will occur.

Cons

If the predicate is an AND expression, after the left side is computed, it may already have filtered out most of the batch and computed a new batch. However, the final batch generation is still based on the previous batch and the boolean array, which introduces unnecessary copy overhead.

Conclusion

Considering both aspects, I believe we can strike a balance. For example, if it is discovered during the AND computation process that most of the batch can be filtered out, then early filtering can be performed as the final result.
However, this optimization scenario is very limited: it can only be applied when the predicate is entirely an AND computation, such as a AND b AND c .... Once an OR appears in the middle, we cannot perform early filtering to produce the result.

The amount of code involved in implementing this optimization may be substantial,and it would only apply to predicates in the form of a AND b AND c.... It might not be worth the effort.

@alamb
Copy link
Contributor

alamb commented Apr 15, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Comparing short-and-optmize (d104e3b) to 0b01fdf diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Apr 15, 2025

🤖: Benchmark completed

Details

Comparing HEAD and short-and-optmize
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ short-and-optmize ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  1918.00ms │         1941.59ms │ no change │
│ QQuery 1     │   711.53ms │          702.35ms │ no change │
│ QQuery 2     │  1429.36ms │         1446.13ms │ no change │
│ QQuery 3     │   704.72ms │          727.05ms │ no change │
│ QQuery 4     │  1516.84ms │         1514.23ms │ no change │
│ QQuery 5     │ 17398.39ms │        17271.95ms │ no change │
│ QQuery 6     │  2026.72ms │         2124.80ms │ no change │
└──────────────┴────────────┴───────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 25705.56ms │
│ Total Time (short-and-optmize)   │ 25728.10ms │
│ Average Time (HEAD)              │  3672.22ms │
│ Average Time (short-and-optmize) │  3675.44ms │
│ Queries Faster                   │          0 │
│ Queries Slower                   │          0 │
│ Queries with No Change           │          7 │
└──────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ short-and-optmize ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.74ms │            2.25ms │ +1.21x faster │
│ QQuery 1     │    36.47ms │           36.63ms │     no change │
│ QQuery 2     │    90.59ms │           88.44ms │     no change │
│ QQuery 3     │   101.66ms │           95.74ms │ +1.06x faster │
│ QQuery 4     │   802.73ms │          840.27ms │     no change │
│ QQuery 5     │   891.98ms │          914.36ms │     no change │
│ QQuery 6     │     2.10ms │            2.14ms │     no change │
│ QQuery 7     │    41.09ms │           42.92ms │     no change │
│ QQuery 8     │   948.80ms │          961.56ms │     no change │
│ QQuery 9     │  1269.90ms │         1292.89ms │     no change │
│ QQuery 10    │   264.01ms │          274.69ms │     no change │
│ QQuery 11    │   301.65ms │          316.08ms │     no change │
│ QQuery 12    │   943.62ms │          966.90ms │     no change │
│ QQuery 13    │  1405.86ms │         1426.41ms │     no change │
│ QQuery 14    │   900.03ms │          898.57ms │     no change │
│ QQuery 15    │  1079.97ms │         1069.93ms │     no change │
│ QQuery 16    │  1778.50ms │         1782.02ms │     no change │
│ QQuery 17    │  1665.53ms │         1666.25ms │     no change │
│ QQuery 18    │  3160.53ms │         3186.64ms │     no change │
│ QQuery 19    │    86.25ms │           92.80ms │  1.08x slower │
│ QQuery 20    │  1170.29ms │         1183.35ms │     no change │
│ QQuery 21    │  1355.56ms │         1374.25ms │     no change │
│ QQuery 22    │  2374.52ms │         2322.80ms │     no change │
│ QQuery 23    │  8507.41ms │         8569.51ms │     no change │
│ QQuery 24    │   470.15ms │          480.90ms │     no change │
│ QQuery 25    │   394.74ms │          408.10ms │     no change │
│ QQuery 26    │   539.09ms │          561.82ms │     no change │
│ QQuery 27    │  1706.49ms │         1747.01ms │     no change │
│ QQuery 28    │ 12891.10ms │        12837.09ms │     no change │
│ QQuery 29    │   536.68ms │          522.79ms │     no change │
│ QQuery 30    │   826.10ms │          849.08ms │     no change │
│ QQuery 31    │   879.52ms │          918.11ms │     no change │
│ QQuery 32    │  2784.99ms │         2694.41ms │     no change │
│ QQuery 33    │  3398.18ms │         3423.14ms │     no change │
│ QQuery 34    │  3435.31ms │         3521.83ms │     no change │
│ QQuery 35    │  1320.36ms │         1303.37ms │     no change │
│ QQuery 36    │   126.83ms │          132.86ms │     no change │
│ QQuery 37    │    57.78ms │           59.40ms │     no change │
│ QQuery 38    │   126.83ms │          125.35ms │     no change │
│ QQuery 39    │   201.53ms │          208.57ms │     no change │
│ QQuery 40    │    51.48ms │           47.72ms │ +1.08x faster │
│ QQuery 41    │    47.15ms │           46.73ms │     no change │
│ QQuery 42    │    40.77ms │           39.91ms │     no change │
└──────────────┴────────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 59016.87ms │
│ Total Time (short-and-optmize)   │ 59335.58ms │
│ Average Time (HEAD)              │  1372.49ms │
│ Average Time (short-and-optmize) │  1379.90ms │
│ Queries Faster                   │          3 │
│ Queries Slower                   │          1 │
│ Queries with No Change           │         39 │
└──────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ short-and-optmize ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 126.24ms │          123.04ms │     no change │
│ QQuery 2     │  24.28ms │           24.40ms │     no change │
│ QQuery 3     │  36.24ms │           37.03ms │     no change │
│ QQuery 4     │  20.48ms │           21.62ms │  1.06x slower │
│ QQuery 5     │  57.43ms │           57.95ms │     no change │
│ QQuery 6     │   8.32ms │           12.39ms │  1.49x slower │
│ QQuery 7     │ 108.70ms │          111.45ms │     no change │
│ QQuery 8     │  27.29ms │           27.24ms │     no change │
│ QQuery 9     │  64.45ms │           64.35ms │     no change │
│ QQuery 10    │  60.42ms │           59.08ms │     no change │
│ QQuery 11    │  13.32ms │           14.03ms │  1.05x slower │
│ QQuery 12    │  38.77ms │           46.44ms │  1.20x slower │
│ QQuery 13    │  28.91ms │           30.29ms │     no change │
│ QQuery 14    │   9.96ms │           10.23ms │     no change │
│ QQuery 15    │  25.93ms │           26.97ms │     no change │
│ QQuery 16    │  24.03ms │           23.96ms │     no change │
│ QQuery 17    │ 101.44ms │          100.13ms │     no change │
│ QQuery 18    │ 251.74ms │          245.84ms │     no change │
│ QQuery 19    │  30.50ms │           27.81ms │ +1.10x faster │
│ QQuery 20    │  40.06ms │           40.93ms │     no change │
│ QQuery 21    │ 176.83ms │          178.37ms │     no change │
│ QQuery 22    │  17.83ms │           17.84ms │     no change │
└──────────────┴──────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 1293.17ms │
│ Total Time (short-and-optmize)   │ 1301.40ms │
│ Average Time (HEAD)              │   58.78ms │
│ Average Time (short-and-optmize) │   59.15ms │
│ Queries Faster                   │         1 │
│ Queries Slower                   │         4 │
│ Queries with No Change           │        17 │
└──────────────────────────────────┴───────────┘

@acking-you
Copy link
Contributor Author

acking-you commented Apr 15, 2025

🤖: Benchmark completed

Details

Comparing HEAD and short-and-optmize
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ short-and-optmize ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  1918.00ms │         1941.59ms │ no change │
│ QQuery 1     │   711.53ms │          702.35ms │ no change │
│ QQuery 2     │  1429.36ms │         1446.13ms │ no change │
│ QQuery 3     │   704.72ms │          727.05ms │ no change │
│ QQuery 4     │  1516.84ms │         1514.23ms │ no change │
│ QQuery 5     │ 17398.39ms │        17271.95ms │ no change │
│ QQuery 6     │  2026.72ms │         2124.80ms │ no change │
└──────────────┴────────────┴───────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 25705.56ms │
│ Total Time (short-and-optmize)   │ 25728.10ms │
│ Average Time (HEAD)              │  3672.22ms │
│ Average Time (short-and-optmize) │  3675.44ms │
│ Queries Faster                   │          0 │
│ Queries Slower                   │          0 │
│ Queries with No Change           │          7 │
└──────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ short-and-optmize ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.74ms │            2.25ms │ +1.21x faster │
│ QQuery 1     │    36.47ms │           36.63ms │     no change │
│ QQuery 2     │    90.59ms │           88.44ms │     no change │
│ QQuery 3     │   101.66ms │           95.74ms │ +1.06x faster │
│ QQuery 4     │   802.73ms │          840.27ms │     no change │
│ QQuery 5     │   891.98ms │          914.36ms │     no change │
│ QQuery 6     │     2.10ms │            2.14ms │     no change │
│ QQuery 7     │    41.09ms │           42.92ms │     no change │
│ QQuery 8     │   948.80ms │          961.56ms │     no change │
│ QQuery 9     │  1269.90ms │         1292.89ms │     no change │
│ QQuery 10    │   264.01ms │          274.69ms │     no change │
│ QQuery 11    │   301.65ms │          316.08ms │     no change │
│ QQuery 12    │   943.62ms │          966.90ms │     no change │
│ QQuery 13    │  1405.86ms │         1426.41ms │     no change │
│ QQuery 14    │   900.03ms │          898.57ms │     no change │
│ QQuery 15    │  1079.97ms │         1069.93ms │     no change │
│ QQuery 16    │  1778.50ms │         1782.02ms │     no change │
│ QQuery 17    │  1665.53ms │         1666.25ms │     no change │
│ QQuery 18    │  3160.53ms │         3186.64ms │     no change │
│ QQuery 19    │    86.25ms │           92.80ms │  1.08x slower │
│ QQuery 20    │  1170.29ms │         1183.35ms │     no change │
│ QQuery 21    │  1355.56ms │         1374.25ms │     no change │
│ QQuery 22    │  2374.52ms │         2322.80ms │     no change │
│ QQuery 23    │  8507.41ms │         8569.51ms │     no change │
│ QQuery 24    │   470.15ms │          480.90ms │     no change │
│ QQuery 25    │   394.74ms │          408.10ms │     no change │
│ QQuery 26    │   539.09ms │          561.82ms │     no change │
│ QQuery 27    │  1706.49ms │         1747.01ms │     no change │
│ QQuery 28    │ 12891.10ms │        12837.09ms │     no change │
│ QQuery 29    │   536.68ms │          522.79ms │     no change │
│ QQuery 30    │   826.10ms │          849.08ms │     no change │
│ QQuery 31    │   879.52ms │          918.11ms │     no change │
│ QQuery 32    │  2784.99ms │         2694.41ms │     no change │
│ QQuery 33    │  3398.18ms │         3423.14ms │     no change │
│ QQuery 34    │  3435.31ms │         3521.83ms │     no change │
│ QQuery 35    │  1320.36ms │         1303.37ms │     no change │
│ QQuery 36    │   126.83ms │          132.86ms │     no change │
│ QQuery 37    │    57.78ms │           59.40ms │     no change │
│ QQuery 38    │   126.83ms │          125.35ms │     no change │
│ QQuery 39    │   201.53ms │          208.57ms │     no change │
│ QQuery 40    │    51.48ms │           47.72ms │ +1.08x faster │
│ QQuery 41    │    47.15ms │           46.73ms │     no change │
│ QQuery 42    │    40.77ms │           39.91ms │     no change │
└──────────────┴────────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 59016.87ms │
│ Total Time (short-and-optmize)   │ 59335.58ms │
│ Average Time (HEAD)              │  1372.49ms │
│ Average Time (short-and-optmize) │  1379.90ms │
│ Queries Faster                   │          3 │
│ Queries Slower                   │          1 │
│ Queries with No Change           │         39 │
└──────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ short-and-optmize ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 126.24ms │          123.04ms │     no change │
│ QQuery 2     │  24.28ms │           24.40ms │     no change │
│ QQuery 3     │  36.24ms │           37.03ms │     no change │
│ QQuery 4     │  20.48ms │           21.62ms │  1.06x slower │
│ QQuery 5     │  57.43ms │           57.95ms │     no change │
│ QQuery 6     │   8.32ms │           12.39ms │  1.49x slower │
│ QQuery 7     │ 108.70ms │          111.45ms │     no change │
│ QQuery 8     │  27.29ms │           27.24ms │     no change │
│ QQuery 9     │  64.45ms │           64.35ms │     no change │
│ QQuery 10    │  60.42ms │           59.08ms │     no change │
│ QQuery 11    │  13.32ms │           14.03ms │  1.05x slower │
│ QQuery 12    │  38.77ms │           46.44ms │  1.20x slower │
│ QQuery 13    │  28.91ms │           30.29ms │     no change │
│ QQuery 14    │   9.96ms │           10.23ms │     no change │
│ QQuery 15    │  25.93ms │           26.97ms │     no change │
│ QQuery 16    │  24.03ms │           23.96ms │     no change │
│ QQuery 17    │ 101.44ms │          100.13ms │     no change │
│ QQuery 18    │ 251.74ms │          245.84ms │     no change │
│ QQuery 19    │  30.50ms │           27.81ms │ +1.10x faster │
│ QQuery 20    │  40.06ms │           40.93ms │     no change │
│ QQuery 21    │ 176.83ms │          178.37ms │     no change │
│ QQuery 22    │  17.83ms │           17.84ms │     no change │
└──────────────┴──────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                │ 1293.17ms │
│ Total Time (short-and-optmize)   │ 1301.40ms │
│ Average Time (HEAD)              │   58.78ms │
│ Average Time (short-and-optmize) │   59.15ms │
│ Queries Faster                   │         1 │
│ Queries Slower                   │         4 │
│ Queries with No Change           │        17 │
└──────────────────────────────────┴───────────┘

The result met expectations. This optimization primarily reduces the subsequent filter computation load by making early decisions. This is useful when the filter computation is heavy and short-circuiting fails. But I believe there are at least three areas for further optimization:

  1. Not only considering the ratio of true_count, but also taking into account the data size of the batch.

  2. Striving to implement zero-copy for the lhs.

  3. Extending the approach to handle cases where lhs contains null values.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @acking-you -- I think this is looking really nice -- thank you and @kosiew . It is amazing to see this level of optimization

The only thing I think is needed in this PR is a few more tests for the pre_selection_scatter function and then it will be ready to go

I think we should plan to merge this once we create a datafusion 47 release

Again, really nice work and thank you


/// Based on the results calculated from the left side of the short-circuit operation,
/// if the proportion of `true` is less than 0.2 and the current operation is an `and`,
/// the `RecordBatch` will be filtered in advance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems reasonable to me

/// # Note
/// Perhaps it would be better to modify `left_result` directly without creating a copy?
/// In practice, `left_result` should have only one owner, so making changes should be safe.
/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could potentially use kernels like unary_mut to reuse the allocation: https://docs.rs/arrow/latest/arrow/compute/fn.unary_mut.html

However, I don't think that is currently supported in arrow-rs for BooleanArray -- we could add it upstream if we wanted to explore this optimization

However each BooleanArray is likely to have 8192 elements , so 1K bytes, where maybe the overhead is ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also think that copying 1KB of data is perfectly acceptable, as it should only incur a nanosecond-level overhead.

@alamb alamb added the performance Make DataFusion faster label Apr 15, 2025
kosiew added 14 commits April 17, 2025 00:48
- Delay evaluation of the right-hand side (RHS) unless necessary.
- Optimize short-circuiting for `Operator::And` and `Operator::Or` by checking LHS alone first.
- Introduce `get_short_circuit_result` function to determine short-circuit conditions based on LHS and RHS.
- Update tests to cover various short-circuit scenarios for both `AND` and `OR` operations.
…esult and update assertions

- Renamed the test function for clarity.
- Updated assertions to use get_short_circuit_result instead of check_short_circuit.
- Added additional test cases for AND and OR operations with expected results.
…lt function for null

- Updated AND and OR short-circuit conditions to only trigger when all values are either false or true, respectively, and there are no nulls in the array.
- Adjusted test case to reflect the change in expected output.
…d enhance documentation for get_short_circuit_result
kosiew and others added 25 commits April 17, 2025 00:48
…tors

- Renamed `arg` to `lhs` for clarity in the `get_short_circuit_result` function.
- Updated handling of Boolean data types to return `None` for null values.
- Simplified short-circuit checks for AND/OR operations by consolidating logic.
- Enhanced readability and maintainability of the code by restructuring match statements.
…Expr to optimize logical operations"

This reverts commit a62df47.
…luation in BinaryExpr

- Replaced the lazy evaluation of the right-hand side (RHS) with immediate evaluation based on short-circuiting logic.
- Introduced a new function `check_short_circuit` to determine if short-circuiting can be applied for logical operators.
- Updated the logic to return early for `Operator::And` and `Operator::Or` based on the evaluation of the left-hand side (LHS) and the conditions of the RHS.
- Improved clarity and efficiency of the short-circuit evaluation process by eliminating unnecessary evaluations.
…nction

- Simplified logic for AND/OR operations by prioritizing false/true counts to enhance performance.
- Updated documentation to reflect changes in array handling techniques.
…_short_circuit logic

- Introduced a new helper function `count_boolean_values` to count true and false values in a BooleanArray, improving readability and performance.
- Updated `check_short_circuit` to utilize the new helper function for counting, reducing redundant operations and enhancing clarity in the evaluation logic for AND/OR operations.
- Adjusted comments for better understanding of the short-circuiting conditions based on the new counting mechanism.
…ze check_short_circuit logic"

This reverts commit e2b9f77.
@acking-you
Copy link
Contributor Author

The only thing I think is needed in this PR is a few more tests for the pre_selection_scatter function and then it will be ready to go

done

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THank you again @acking-you and @kosiew -- really nice work

fn create_bool_array(bools: Vec<bool>) -> BooleanArray {
BooleanArray::from(bools.into_iter().map(Some).collect::<Vec<_>>())
}
// Test sparse left with interleaved true/false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

@alamb alamb merged commit 4818966 into apache:main Apr 17, 2025
27 checks passed
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
…ation (apache#15694)

* Enhance short-circuit evaluation for binary expressions

- Delay evaluation of the right-hand side (RHS) unless necessary.
- Optimize short-circuiting for `Operator::And` and `Operator::Or` by checking LHS alone first.
- Introduce `get_short_circuit_result` function to determine short-circuit conditions based on LHS and RHS.
- Update tests to cover various short-circuit scenarios for both `AND` and `OR` operations.

* refactor: rename test_check_short_circuit to test_get_short_circuit_result and update assertions

- Renamed the test function for clarity.
- Updated assertions to use get_short_circuit_result instead of check_short_circuit.
- Added additional test cases for AND and OR operations with expected results.

* fix: enhance short-circuit evaluation logic in get_short_circuit_result function for null

- Updated AND and OR short-circuit conditions to only trigger when all values are either false or true, respectively, and there are no nulls in the array.
- Adjusted test case to reflect the change in expected output.

* feat: add debug logging for binary expression evaluation and short-circuit checks

* fix: improve short-circuit evaluation logic in BinaryExpr to ensure RHS is only evaluated when necessary

* fix: restrict short-circuit evaluation to logical operators in get_short_circuit_result function

* add more println!("==> ");

* fix: remove duplicate data type checks for left and right operands in BinaryExpr evaluation

* feat: add debug prints for dictionary values and keys in binary expression tests

* Tests pass

* fix: remove redundant short-circuit evaluation check in BinaryExpr and enhance documentation for get_short_circuit_result

* refactor: remove unnecessary debug prints and streamline short-circuit evaluation in BinaryExpr

* test: enhance short-circuit evaluation tests for nullable and scalar values in BinaryExpr

* add benchmark

* refactor: improve short-circuit logic in BinaryExpr for logical operators

- Renamed `arg` to `lhs` for clarity in the `get_short_circuit_result` function.
- Updated handling of Boolean data types to return `None` for null values.
- Simplified short-circuit checks for AND/OR operations by consolidating logic.
- Enhanced readability and maintainability of the code by restructuring match statements.

* refactor: enhance short-circuit evaluation strategy in BinaryExpr to optimize logical operations

* Revert "refactor: enhance short-circuit evaluation strategy in BinaryExpr to optimize logical operations"

This reverts commit a62df47.

* bench: add benchmark for OR operation with all false values in short-circuit evaluation

* refactor: add ShortCircuitStrategy enum to optimize short-circuit evaluation in BinaryExpr

- Replaced the lazy evaluation of the right-hand side (RHS) with immediate evaluation based on short-circuiting logic.
- Introduced a new function `check_short_circuit` to determine if short-circuiting can be applied for logical operators.
- Updated the logic to return early for `Operator::And` and `Operator::Or` based on the evaluation of the left-hand side (LHS) and the conditions of the RHS.
- Improved clarity and efficiency of the short-circuit evaluation process by eliminating unnecessary evaluations.

* refactor: simplify short-circuit evaluation logic in check_short_circuit function

* datafusion_expr::lit as expr_lit

* refactor: optimize short-circuit evaluation in check_short_circuit function
- Simplified logic for AND/OR operations by prioritizing false/true counts to enhance performance.
- Updated documentation to reflect changes in array handling techniques.

* refactor: add count_boolean_values helper function and optimize check_short_circuit logic

- Introduced a new helper function `count_boolean_values` to count true and false values in a BooleanArray, improving readability and performance.
- Updated `check_short_circuit` to utilize the new helper function for counting, reducing redundant operations and enhancing clarity in the evaluation logic for AND/OR operations.
- Adjusted comments for better understanding of the short-circuiting conditions based on the new counting mechanism.

* Revert "refactor: add count_boolean_values helper function and optimize check_short_circuit logic"

This reverts commit e2b9f77.

* optimise evaluate

* optimise evaluate 2

* refactor  op:AND, lhs all false op:OR, lhs all true to be faster

* fix clippy warning

* refactor: optimize short-circuit evaluation logic in check_short_circuit function

* fix clippy warning

* add pre selection

* add some comments

* [WIP] fix pre-selection result

* fix: Error in calculating the ratio

* fix: Correct typo in pre-selection threshold constant and improve pre-selection scatter function documentation

* fix doctest error

* fix cargo doc

* fix cargo doc

* test: Add unit tests for pre_selection_scatter function

---------

Co-authored-by: Siew Kam Onn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Make DataFusion faster physical-expr Changes to the physical-expr crates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add more short-circuit optimization scenarios for OR and AND
4 participants