-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Apply pre-selection and computation skipping to short-circuit optimization #15694
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The error in
To illustrate this, I’ve drawn a diagram (it’s quite rough since I used a mouse, so I hope you don’t mind 😂): The current code only handles up to the second step, which is why the error occurs. If you use I’m currently working on implementing the fourth step to fix the issue, but it may take some time. |
The relevant bug fixes have been completed, and corresponding performance tests have been conducted. The results show that pre-selection has achieved significant gains! @Dandandan @alamb Compare the current optimization with the main branch using Performance Comparison of the AND Logic Group
Performance Comparison of the OR Logic Group
Possible next step(extend to nulls)Short-circuit optimization cannot be extended to nullsThe current short-circuit optimization is only applicable to cases without null values. However, based on the calculation principles of "and" and "or", if the left-hand side (lhs) evaluates to null, then the final result can only be determined by continuing to calculate the right-hand side (rhs). Therefore, optimization for this scenario is not feasible. Below is an example of a calculation where lhs is null: ❯ select null and true;
+------------------------+
| NULL AND Boolean(true) |
+------------------------+
| |
+------------------------+
1 row in set. Query took 0.000 seconds.
❯ select null and false;
+-------------------------+
| NULL AND Boolean(false) |
+-------------------------+
| false |
+-------------------------+
1 row in set. Query took 0.000 seconds.
❯ select null or false;
+------------------------+
| NULL OR Boolean(false) |
+------------------------+
| |
+------------------------+
1 row in set. Query took 0.000 seconds.
❯ select null or true;
+-----------------------+
| NULL OR Boolean(true) |
+-----------------------+
| true |
+-----------------------+
1 row in set. Query took 0.000 seconds. Pre-selection can be extended to include nullsAs I explained earlier: #15694 (comment), pre-selection can actually be extended to cover cases involving null values. However, one point needs to be confirmed: filter_record_batch will retain rows that are null.
Afterward, we only need to modify the fourth step of the pre-selection process mentioned earlier to complete the extension that supports nulls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@acking-you ,
pre_selection_scatter is a GREAT idea!!
/// Based on the results calculated from the left side of the short-circuit operation, | ||
/// if the proportion of `true` is less than 0.2 and the current operation is an `and`, | ||
/// the `RecordBatch` will be filtered in advance. | ||
const PRE_SELECTIO_THRESHOLD: f32 = 0.2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const PRE_SELECTIO_THRESHOLD: f32 = 0.2; | |
const PRE_SELECTION_THRESHOLD: f32 = 0.2; |
/// FIXME: Perhaps it would be better to modify `left_result` directly without creating a copy? | ||
/// In practice, `left_result` should have only one owner, so making changes should be safe. | ||
/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// FIXME: Perhaps it would be better to modify `left_result` directly without creating a copy? | |
/// In practice, `left_result` should have only one owner, so making changes should be safe. | |
/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`]. | |
/// FIXME: Perhaps it would be better to modify `left_result` directly without creating a copy? | |
/// In practice, `left_result` should have only one owner, so making changes should be safe. | |
/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`]. | |
/// Creates a new boolean array based on the evaluation of the right expression, | |
/// but only for positions where the left_result is true. | |
/// | |
/// This function is used for short-circuit evaluation optimization of logical AND operations: | |
/// - When left_result has few true values, we only evaluate the right expression for those positions | |
/// - Values are copied from right_array where left_result is true | |
/// - All other positions are filled with false values | |
/// | |
/// @param left_result Boolean array with selection mask (typically from left side of AND) | |
/// @param right_result Result of evaluating right side of expression (only for selected positions) | |
/// @return A combined ColumnarValue with values from right_result where left_result is true |
let right_array = if let ColumnarValue::Array(array) = right_result { | ||
array | ||
} else { | ||
return Ok(right_result); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Return scalar values as-is
if let ColumnarValue::Scalar(_) = right_result {
return Ok(right_result);
}
let right_array = right_result.into_array(left_result.len())?;
Maybe this reads easier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps changing it to match
would be better? Calling into_array(left_result.len())
might make people think that right_result
has the same length as left_result
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point.
let mut filled = 0; | ||
// keep track of current position we have in right boolean array | ||
let mut right_array_pos = 0; | ||
|
||
let mut result_array_builder = BooleanArray::builder(result_len); | ||
SlicesIterator::new(left_result).for_each(|(start, end)| { | ||
// the gap needs to be filled with false | ||
if start > filled { | ||
(filled..start).for_each(|_| result_array_builder.append_value(false)); | ||
} | ||
// fill with right_result values | ||
let len = end - start; | ||
right_boolean_array | ||
.slice(right_array_pos, len) | ||
.iter() | ||
.for_each(|v| result_array_builder.append_option(v)); | ||
|
||
right_array_pos += len; | ||
filled = end; | ||
}); | ||
// the remaining part is falsy | ||
if filled < result_len { | ||
(filled..result_len).for_each(|_| result_array_builder.append_value(false)); | ||
} | ||
let boolean_result = result_array_builder.finish(); | ||
|
||
Ok(ColumnarValue::Array(Arc::new(boolean_result))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an alternative approach using last_end, instead of filled.
last_end tracks the end of the last filled region.
// Build the resulting array efficiently
let mut result_array_builder = BooleanArray::builder(result_len);
let mut right_pos = 0;
// Process the array using iterator of slices (contiguous true regions)
let mut last_end = 0;
for (start, end) in SlicesIterator::new(left_result) {
// Fill the gap between last processed position and current slice start with false
result_array_builder.append_n(start - last_end, false);
// Copy values from right array for this slice
let slice_len = end - start;
right_boolean_array
.slice(right_pos, slice_len)
.iter()
.for_each(|v| result_array_builder.append_option(v));
right_pos += slice_len;
last_end = end;
}
// Fill any remaining positions with false
if last_end < result_len {
result_array_builder.append_n(result_len - last_end, false);
}
Ok(ColumnarValue::Array(Arc::new(result_array_builder.finish())))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
append_n
is indeed a more efficient choice. I checked its source code and found that it doesn't handle append_n(0, true)
separately to skip it. However, for append_n(0, false)
, it will be detected and skipped during self.advance
. Since we will only append false
, we don't need a separate check, but we should add a comment to clarify this behavior.
I continued to check the source code and found that its self.null_buffer_builder.append_n_non_nulls(additional)
will append true
, so we still need an if statement to check whether it is 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might also do something like this (set bits on left side based on set_indices
(pseudocode)
for (index, left_index) in left.values().set_indices().iter().enumerate() {
left_map.set_bit(left_map.get_bit(left_index) & right_map.get_bit(index)))
}
This avoids building a new map and only does some bit indexing / efficiently iterating the left side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might also do something like this (set bits on left side based on
set_indices
(pseudocode)
for (index, left_index) in left.values().set_indices().iter().enumerate() { left_map.set_bit(left_map.get_bit(left_index) & right_map.get_bit(index))) }
This avoids building a new map and only does some bit indexing / efficiently iterating the left side.
Yes, I mentioned this in the comments, but the immutability of Arc and BooleanArray makes this operation difficult to complete:
/// Perhaps it would be better to modify `left_result` directly without creating a copy?
/// In practice, `left_result` should have only one owner, so making changes should be safe.
/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possibility is copying the left buffer and then modifying it, this is faster than creating it from scratch using a builder :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possibility is copying the left buffer and then modifying it, this is faster than creating it from scratch using a builder :)
I think it should be about the same, because append won't allocate new memory—it just assigns values at the corresponding positions. There shouldn't be much difference compared to directly copying and then assigning values. This corresponds to the pre-allocation logic of the builder:
let mut result_array_builder = BooleanArray::builder(result_len);
| However, one point needs to be confirmed: filter_record_batch will retain rows that are null.
|
Very cool. |
If that's the case, we may need to do some hacky things to extend it to nulls. |
I tried running clickbench, and there wasn't a significant improvement. The optimization is still relatively selective about the cost of expression evaluation. For example, if the computation on the right involves only simple integer comparisons or short string comparisons, skipping a large amount of computation will not significantly improve performance.. New ideasOne potential area for significant improvement that I thought of is: heuristically replacing the computation process of The current state of this function is:
ProsSince the filtering process involves copying to produce a new batch, calculating a boolean array based on the predicate only requires a single copy. If filtering is performed during the predicate computation, multiple copies will occur. ConsIf the predicate is an AND expression, after the left side is computed, it may already have filtered out most of the batch and computed a new batch. However, the final batch generation is still based on the previous batch and the boolean array, which introduces unnecessary copy overhead. ConclusionConsidering both aspects, I believe we can strike a balance. For example, if it is discovered during the AND computation process that most of the batch can be filtered out, then early filtering can be performed as the final result. The amount of code involved in implementing this optimization may be substantial,and it would only apply to predicates in the form of |
🤖 |
🤖: Benchmark completed Details
|
The result met expectations. This optimization primarily reduces the subsequent filter computation load by making early decisions. This is useful when the filter computation is heavy and short-circuiting fails. But I believe there are at least three areas for further optimization:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @acking-you -- I think this is looking really nice -- thank you and @kosiew . It is amazing to see this level of optimization
The only thing I think is needed in this PR is a few more tests for the pre_selection_scatter
function and then it will be ready to go
I think we should plan to merge this once we create a datafusion 47 release
Again, really nice work and thank you
|
||
/// Based on the results calculated from the left side of the short-circuit operation, | ||
/// if the proportion of `true` is less than 0.2 and the current operation is an `and`, | ||
/// the `RecordBatch` will be filtered in advance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems reasonable to me
/// # Note | ||
/// Perhaps it would be better to modify `left_result` directly without creating a copy? | ||
/// In practice, `left_result` should have only one owner, so making changes should be safe. | ||
/// However, this is difficult to achieve under the immutable constraints of [`Arc`] and [`BooleanArray`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could potentially use kernels like unary_mut
to reuse the allocation: https://docs.rs/arrow/latest/arrow/compute/fn.unary_mut.html
However, I don't think that is currently supported in arrow-rs for BooleanArray
-- we could add it upstream if we wanted to explore this optimization
However each BooleanArray is likely to have 8192 elements , so 1K bytes, where maybe the overhead is ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I also think that copying 1KB of data is perfectly acceptable, as it should only incur a nanosecond-level overhead.
- Delay evaluation of the right-hand side (RHS) unless necessary. - Optimize short-circuiting for `Operator::And` and `Operator::Or` by checking LHS alone first. - Introduce `get_short_circuit_result` function to determine short-circuit conditions based on LHS and RHS. - Update tests to cover various short-circuit scenarios for both `AND` and `OR` operations.
…esult and update assertions - Renamed the test function for clarity. - Updated assertions to use get_short_circuit_result instead of check_short_circuit. - Added additional test cases for AND and OR operations with expected results.
…lt function for null - Updated AND and OR short-circuit conditions to only trigger when all values are either false or true, respectively, and there are no nulls in the array. - Adjusted test case to reflect the change in expected output.
…HS is only evaluated when necessary
…ort_circuit_result function
… BinaryExpr evaluation
…d enhance documentation for get_short_circuit_result
…t evaluation in BinaryExpr
…values in BinaryExpr
…tors - Renamed `arg` to `lhs` for clarity in the `get_short_circuit_result` function. - Updated handling of Boolean data types to return `None` for null values. - Simplified short-circuit checks for AND/OR operations by consolidating logic. - Enhanced readability and maintainability of the code by restructuring match statements.
…optimize logical operations
…Expr to optimize logical operations" This reverts commit a62df47.
…circuit evaluation
…luation in BinaryExpr - Replaced the lazy evaluation of the right-hand side (RHS) with immediate evaluation based on short-circuiting logic. - Introduced a new function `check_short_circuit` to determine if short-circuiting can be applied for logical operators. - Updated the logic to return early for `Operator::And` and `Operator::Or` based on the evaluation of the left-hand side (LHS) and the conditions of the RHS. - Improved clarity and efficiency of the short-circuit evaluation process by eliminating unnecessary evaluations.
…nction - Simplified logic for AND/OR operations by prioritizing false/true counts to enhance performance. - Updated documentation to reflect changes in array handling techniques.
…_short_circuit logic - Introduced a new helper function `count_boolean_values` to count true and false values in a BooleanArray, improving readability and performance. - Updated `check_short_circuit` to utilize the new helper function for counting, reducing redundant operations and enhancing clarity in the evaluation logic for AND/OR operations. - Adjusted comments for better understanding of the short-circuiting conditions based on the new counting mechanism.
…ze check_short_circuit logic" This reverts commit e2b9f77.
…-selection scatter function documentation
eb50121
to
983e81e
Compare
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THank you again @acking-you and @kosiew -- really nice work
fn create_bool_array(bools: Vec<bool>) -> BooleanArray { | ||
BooleanArray::from(bools.into_iter().map(Some).collect::<Vec<_>>()) | ||
} | ||
// Test sparse left with interleaved true/false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
…ation (apache#15694) * Enhance short-circuit evaluation for binary expressions - Delay evaluation of the right-hand side (RHS) unless necessary. - Optimize short-circuiting for `Operator::And` and `Operator::Or` by checking LHS alone first. - Introduce `get_short_circuit_result` function to determine short-circuit conditions based on LHS and RHS. - Update tests to cover various short-circuit scenarios for both `AND` and `OR` operations. * refactor: rename test_check_short_circuit to test_get_short_circuit_result and update assertions - Renamed the test function for clarity. - Updated assertions to use get_short_circuit_result instead of check_short_circuit. - Added additional test cases for AND and OR operations with expected results. * fix: enhance short-circuit evaluation logic in get_short_circuit_result function for null - Updated AND and OR short-circuit conditions to only trigger when all values are either false or true, respectively, and there are no nulls in the array. - Adjusted test case to reflect the change in expected output. * feat: add debug logging for binary expression evaluation and short-circuit checks * fix: improve short-circuit evaluation logic in BinaryExpr to ensure RHS is only evaluated when necessary * fix: restrict short-circuit evaluation to logical operators in get_short_circuit_result function * add more println!("==> "); * fix: remove duplicate data type checks for left and right operands in BinaryExpr evaluation * feat: add debug prints for dictionary values and keys in binary expression tests * Tests pass * fix: remove redundant short-circuit evaluation check in BinaryExpr and enhance documentation for get_short_circuit_result * refactor: remove unnecessary debug prints and streamline short-circuit evaluation in BinaryExpr * test: enhance short-circuit evaluation tests for nullable and scalar values in BinaryExpr * add benchmark * refactor: improve short-circuit logic in BinaryExpr for logical operators - Renamed `arg` to `lhs` for clarity in the `get_short_circuit_result` function. - Updated handling of Boolean data types to return `None` for null values. - Simplified short-circuit checks for AND/OR operations by consolidating logic. - Enhanced readability and maintainability of the code by restructuring match statements. * refactor: enhance short-circuit evaluation strategy in BinaryExpr to optimize logical operations * Revert "refactor: enhance short-circuit evaluation strategy in BinaryExpr to optimize logical operations" This reverts commit a62df47. * bench: add benchmark for OR operation with all false values in short-circuit evaluation * refactor: add ShortCircuitStrategy enum to optimize short-circuit evaluation in BinaryExpr - Replaced the lazy evaluation of the right-hand side (RHS) with immediate evaluation based on short-circuiting logic. - Introduced a new function `check_short_circuit` to determine if short-circuiting can be applied for logical operators. - Updated the logic to return early for `Operator::And` and `Operator::Or` based on the evaluation of the left-hand side (LHS) and the conditions of the RHS. - Improved clarity and efficiency of the short-circuit evaluation process by eliminating unnecessary evaluations. * refactor: simplify short-circuit evaluation logic in check_short_circuit function * datafusion_expr::lit as expr_lit * refactor: optimize short-circuit evaluation in check_short_circuit function - Simplified logic for AND/OR operations by prioritizing false/true counts to enhance performance. - Updated documentation to reflect changes in array handling techniques. * refactor: add count_boolean_values helper function and optimize check_short_circuit logic - Introduced a new helper function `count_boolean_values` to count true and false values in a BooleanArray, improving readability and performance. - Updated `check_short_circuit` to utilize the new helper function for counting, reducing redundant operations and enhancing clarity in the evaluation logic for AND/OR operations. - Adjusted comments for better understanding of the short-circuiting conditions based on the new counting mechanism. * Revert "refactor: add count_boolean_values helper function and optimize check_short_circuit logic" This reverts commit e2b9f77. * optimise evaluate * optimise evaluate 2 * refactor op:AND, lhs all false op:OR, lhs all true to be faster * fix clippy warning * refactor: optimize short-circuit evaluation logic in check_short_circuit function * fix clippy warning * add pre selection * add some comments * [WIP] fix pre-selection result * fix: Error in calculating the ratio * fix: Correct typo in pre-selection threshold constant and improve pre-selection scatter function documentation * fix doctest error * fix cargo doc * fix cargo doc * test: Add unit tests for pre_selection_scatter function --------- Co-authored-by: Siew Kam Onn <[email protected]>
Which issue does this PR close?
OR
andAND
#15636You can learn the reason behind this PR by reviewing the discussion in this issue: #15631 (comment)
Rationale for this change
Many thanks to @kosiew for doing a tremendous amount of work. Based on his PR #15648, I have made the following changes:
false_count
andtrue_count
can be unified asvalues().count_set_bits()
in non-null cases, which would eliminate two calls tonulls_count
and one call totrue_count
.and
operation. For more details on the principle, you can refer to: Improve the performance of early exit evaluation in binary_expr #15631 (comment).What changes are included in this PR?
Wonderful work done by kosiew: #15648
Discussion of the related issue for this PR: #15631 (comment)
Are these changes tested?
Are there any user-facing changes?