Improve the performance of early exit evaluation in binary_expr #15631
This might require manual SIMD for optimization, but that would increase the porting difficulty (as DuckDB notes). However, perhaps an alternative approach could be tried to make it easier for the compiler to optimize. If feasible, it also seems capable of improving the performance of related calls in the arrow-rs library.
Some exploration
In ClickHouse's filter implementation, there is a classic manual SIMD approach: code. The function loads multiple boolean values at once using SIMD instructions to increase the loop step.
For other cases, performance degrades to something similar to the non-SIMD path (the additional overhead being the preparation of the SIMD variables). If this approach is applied to checking whether a chunk of bits is all 1s or all 0s, it should incur almost no overhead (only requiring a comparison against 0 or an all-ones constant). At the same time, could DataFusion's filter process also be optimized using this method? Alternatively, could we find another form of vectorization that does not involve manual unrolling?
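A minimal sketch of that idea (a hypothetical helper, assuming the selection bitmap is available as packed u64 words; not DataFusion's actual code): one comparison classifies 64 rows at a time as all-false or all-true.
/// Classify each 64-row word of a packed boolean bitmap with one comparison.
fn classify_words(words: &[u64]) -> (usize, usize) {
    let mut all_false_words = 0;
    let mut all_true_words = 0;
    for &w in words {
        // A single comparison per 64 rows instead of a branch per row.
        if w == 0 {
            all_false_words += 1;
        } else if w == u64::MAX {
            all_true_words += 1;
        }
    }
    (all_false_words, all_true_words)
}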
Interesting! I think we probably can take some inspiration from the arrow-rs aggregate code, e.g. doing something like (?):
/// Counts the number of ones
pub fn count_ones(&self) -> usize {
// platform specific
const LANES: usize = 8;
let mut chunks = self.chunks.chunks_exact(LANES);
let mut sum: usize = 0;
chunks.by_ref().for_each(|chunk| {
let chunk: [u64; LANES] = chunk.try_into().unwrap();
for i in 0..LANES {
sum += chunk[i].count_ones() as usize;
}
});
let remainder = chunks.remainder();
for chunk in remainder {
sum += chunk.count_ones() as usize;
}
sum
}
(I didn't test whether this is faster.)
Maybe simply manually unrolling the loop to check 1024 bits at a time would let LLVM generate the best code 🤔
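For illustration, a sketch of that suggestion (untested; assumes the packed boolean buffer is available as a u64 slice): a branch-free inner loop over each 1024-bit block that LLVM can auto-vectorize, with an early-exit check only once per block.
/// Returns true if every bit is zero, checking 1024 bits per step.
fn all_zero_blocked(words: &[u64]) -> bool {
    const WORDS_PER_BLOCK: usize = 16; // 16 * 64 = 1024 bits
    let mut blocks = words.chunks_exact(WORDS_PER_BLOCK);
    for block in &mut blocks {
        // Branch-free OR-fold over one block; amenable to auto-vectorization.
        let acc = block.iter().fold(0u64, |acc, &w| acc | w);
        if acc != 0 {
            return false; // early exit, checked once per 1024 bits
        }
    }
    // Tail that did not fill a whole block.
    blocks.remainder().iter().all(|&w| w == 0)
}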
Would be good to compare it with a boolean version of this as well, like the sketch below, to see if it vectorizes better (and then maybe do it in certain chunks like @alamb suggests).
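Something along these lines could serve as the boolean comparison point (a sketch; count_ones_bools is a hypothetical name, nulls are ignored, and it relies on BooleanBuffer's bool iterator):
use arrow::array::BooleanArray;

/// Boolean-per-element version of count_ones, for comparison with the chunked one.
fn count_ones_bools(array: &BooleanArray) -> usize {
    // Iterate the values as plain bools and let the compiler decide how to vectorize.
    array.values().iter().map(|v| v as usize).sum()
}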
I have an idea that might improve the effectiveness of short-circuit optimization, though it seems to require changing how the batch is evaluated. The current issue with DataFusion's execution of left AND right is that both sides are evaluated against the full batch:
Current:
batch:[1,2,3,4] -> execute left -> bool array: [true,false,true,false]
batch:[1,2,3,4] -> execute right -> bool array: [true,true,false,false]
Might be better:
batch:[1,2,3,4] -> execute left -> bool array: [true,false,true,false] -> batch:[1,3]
batch:[1,3] -> execute right -> bool array: [true,false] -> batch:[1]
I tried implementing this process using evaluate_selection, but the performance regressed in many cases because its internal implementation requires copying to create a new batch. However, perhaps we could heuristically decide whether to pre-filter the batch. By the way, I recently looked into ClickHouse's execution logic for this as well. I don't know whether you think it's a good idea? @alamb @Dandandan
The following experiments and conclusions are outdated. For more details, please refer to: #15631 (comment)
My Practice
Relevant code implementation: code1 code2
Performance test results:
short_circuit/and/one_true_first
time: [479.48 µs 480.13 µs 480.96 µs]
change: [-5.8267% -4.4620% -3.2309%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
1 (1.00%) low mild
4 (4.00%) high mild
8 (8.00%) high severe
short_circuit/and/one_true_last
time: [480.46 µs 481.17 µs 482.05 µs]
change: [-6.2128% -5.0143% -3.9098%] (p = 0.00 < 0.05)
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
4 (4.00%) high mild
7 (7.00%) high severe
short_circuit/and/one_true_middle_left
time: [480.49 µs 481.32 µs 482.32 µs]
change: [-2.1930% -1.5907% -1.0753%] (p = 0.00 < 0.05)
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
5 (5.00%) high mild
6 (6.00%) high severe
short_circuit/and/one_true_middle_right
time: [480.96 µs 481.92 µs 483.15 µs]
change: [-5.7694% -3.1496% -1.1136%] (p = 0.00 < 0.05)
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
5 (5.00%) high mild
6 (6.00%) high severe
Conclusion
The ShortCircuitStrategy::PreSelection optimization did not have the effect I imagined, and it seems unnecessary. Early exit from computations may still be a better optimization approach.
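For illustration, the pre-selection flow described above could look roughly like the sketch below (not the actual implementation; eval_left / eval_right stand in for expression evaluation, and mapping the shorter result back to the original row positions is omitted):
use arrow::array::BooleanArray;
use arrow::compute::filter_record_batch;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Sketch of pre-selection: evaluate `right` only on rows where `left` was true.
fn and_with_preselection(
    batch: &RecordBatch,
    eval_left: impl Fn(&RecordBatch) -> BooleanArray,
    eval_right: impl Fn(&RecordBatch) -> BooleanArray,
) -> Result<BooleanArray, ArrowError> {
    let left = eval_left(batch);
    // The copying step: keep only the rows that can still pass the AND.
    let selected = filter_record_batch(batch, &left)?;
    // The result has selected.num_rows() values; a real implementation would
    // scatter them back into a full-length array aligned with `left`.
    Ok(eval_right(&selected))
}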
In my opinion, as long as the code is easy to understand, makes realistic benchmarks faster, and doesn't regress existing performance, it is a good thing to try.
My biggest concern with this type of optimization is that it will cause some queries to go faster and some to go slower -- that tradeoff is not great in my mind, as then we have to judge whether it is worth it. Given how many people use DataFusion, I think it likely that not all users have the same opinion.
I think it is a pretty good idea given that evaluation is so important.
To show the potential, I tested this yesterday: it reduces the execution time of short-circuiting the all-false / all-true cases by about 25% compared to true_count / false_count:
fn all_zero(array: &BooleanArray) -> bool {
// TODO: nulls
// match array.nulls() {
// Some(nulls) => {
// }
// None => {}
// }
// platform specific
let bit_chunks = array.values().bit_chunks();
let mut all_zero = true;
for i in bit_chunks {
all_zero &= i == 0;
}
all_zero
}
fn all_one(array: &BooleanArray) -> bool {
let bit_chunks = array.values().bit_chunks();
let mut all_one = true;
for i in bit_chunks {
all_one &= i == 0xFFFFFFFFFFFFFFFF;
}
all_one
}
Hi @Dandandan, I am getting failed tests with this code:
#[test]
fn test_all_one() -> Result<()> {
// Helper function to run tests and report failures
let run_test = |array: &BooleanArray, name: &str, expected: bool| -> Result<()> {
let result = all_one(array);
if result != expected {
println!(
"Test case '{}' failed: expected {}, got {}",
name, expected, result
);
println!("Array contents: {:?}", array);
}
assert_eq!(result, expected, "all_one failed for test case '{}'", name);
Ok(())
};
// Basic cases - uniform arrays
let all_one_array = BooleanArray::from(vec![true, true, true, true]);
run_test(&all_one_array, "all true array", true)?;
let all_zero_array = BooleanArray::from(vec![false, false, false, false]);
run_test(&all_zero_array, "all false array", false)?;
// Mixed values
let mixed_array = BooleanArray::from(vec![true, true, false, true]);
run_test(&mixed_array, "mixed array with one false", false)?;
// Edge cases
let empty_array = BooleanArray::from(vec![] as Vec<bool>);
run_test(&empty_array, "empty array", true)?;
let single_true = BooleanArray::from(vec![true]);
run_test(&single_true, "single true", true)?;
let single_false = BooleanArray::from(vec![false]);
run_test(&single_false, "single false", false)?;
// Arrays with nulls
let array_with_nulls: BooleanArray =
vec![Some(true), None, Some(true)].into_iter().collect();
run_test(&array_with_nulls, "nulls with true", true)?;
let nulls_with_false: BooleanArray =
vec![Some(true), Some(false), None].into_iter().collect();
run_test(&nulls_with_false, "nulls with false", false)?;
// Large arrays
let large_all_one = BooleanArray::from(vec![true; 128]);
run_test(&large_all_one, "large all true (128)", true)?;
// Test with a single false at different positions
let mut values = vec![true; 128];
values[63] = false; // Last bit in first chunk
let large_with_boundary_false = BooleanArray::from(values);
run_test(
&large_with_boundary_false,
"large with false at bit 63",
false,
)?;
let mut values = vec![true; 128];
values[64] = false; // First bit in second chunk
let large_with_second_chunk_false = BooleanArray::from(values);
run_test(
&large_with_second_chunk_false,
"large with false at bit 64",
false,
)?;
// Specific sizes that might trigger edge cases
let exact_chunk_size = BooleanArray::from(vec![true; 64]);
run_test(&exact_chunk_size, "exact chunk size (64)", true)?;
let just_under_chunk = BooleanArray::from(vec![true; 63]);
run_test(&just_under_chunk, "just under chunk size (63)", true)?;
let just_over_chunk = BooleanArray::from(vec![true; 65]);
run_test(&just_over_chunk, "just over chunk size (65)", true)?;
Ok(())
}
#[test]
fn test_all_zero() -> Result<()> {
// Helper function to run tests and report failures
let run_test = |array: &BooleanArray, name: &str, expected: bool| -> Result<()> {
let result = all_zero(array);
if result != expected {
println!(
"Test case '{}' failed: expected {}, got {}",
name, expected, result
);
println!("Array contents: {:?}", array);
}
assert_eq!(result, expected, "all_zero failed for test case '{}'", name);
Ok(())
};
// Basic cases - uniform arrays
let all_zero_array = BooleanArray::from(vec![false, false, false, false]);
run_test(&all_zero_array, "all false array", true)?;
let all_one_array = BooleanArray::from(vec![true, true, true, true]);
run_test(&all_one_array, "all true array", false)?;
// Mixed values
let mixed_array = BooleanArray::from(vec![false, false, true, false]);
run_test(&mixed_array, "mixed array with one true", false)?;
// Edge cases
let empty_array = BooleanArray::from(vec![] as Vec<bool>);
run_test(&empty_array, "empty array", true)?;
let single_false = BooleanArray::from(vec![false]);
run_test(&single_false, "single false", true)?;
let single_true = BooleanArray::from(vec![true]);
run_test(&single_true, "single true", false)?;
// Arrays with nulls
let array_with_nulls: BooleanArray =
vec![Some(false), None, Some(false)].into_iter().collect();
run_test(&array_with_nulls, "nulls with false", true)?;
let nulls_with_true: BooleanArray =
vec![Some(false), Some(true), None].into_iter().collect();
run_test(&nulls_with_true, "nulls with true", false)?;
// Large arrays
let large_all_zero = BooleanArray::from(vec![false; 128]);
run_test(&large_all_zero, "large all false (128)", true)?;
// Test with a single true at different positions
let mut values = vec![false; 128];
values[63] = true; // Last bit in first chunk
let large_with_boundary_true = BooleanArray::from(values);
run_test(
&large_with_boundary_true,
"large with true at bit 63",
false,
)?;
let mut values = vec![false; 128];
values[64] = true; // First bit in second chunk
let large_with_second_chunk_true = BooleanArray::from(values);
run_test(
&large_with_second_chunk_true,
"large with true at bit 64",
false,
)?;
// Specific sizes that might trigger edge cases
let exact_chunk_size = BooleanArray::from(vec![false; 64]);
run_test(&exact_chunk_size, "exact chunk size (64)", true)?;
let just_under_chunk = BooleanArray::from(vec![false; 63]);
run_test(&just_under_chunk, "just under chunk size (63)", true)?;
let just_over_chunk = BooleanArray::from(vec![false; 65]);
run_test(&just_over_chunk, "just over chunk size (65)", true)?;
Ok(())
}
As in the code above, but if you do that, you won't be able to do an early exit, and then there should be no difference from using true_count / false_count directly.
Indeed, it is not doing an early exit (it could be changed to do so for a chunk of values), but I benchmarked it as faster than true_count / false_count (it generates code that completes faster).
@acking-you the code needs to be extended to support nulls (you can take a look at the true_count implementation in arrow-rs to do this efficiently).
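For reference, a null-aware variant could mask each value word with the corresponding validity word, so that a null slot never counts as a set bit (a rough sketch in the spirit of arrow-rs's true_count, matching the expectations of the tests above; untested):
use arrow::array::{Array, BooleanArray};

/// Null-aware "no set bits" check: a slot only counts if it is both valid and true.
fn all_zero_null_aware(array: &BooleanArray) -> bool {
    let values = array.values().bit_chunks();
    match array.nulls() {
        None => values.iter().all(|v| v == 0) && values.remainder_bits() == 0,
        Some(nulls) => {
            let validity = nulls.inner().bit_chunks();
            // Mask out null slots before testing each 64-bit word.
            values.iter().zip(validity.iter()).all(|(v, n)| (v & n) == 0)
                && (values.remainder_bits() & validity.remainder_bits()) == 0
        }
    }
}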
I have an idea for doing an early exit without losing performance, and I'm trying it out. If it works, I'll post the code.
@Dandandan You're absolutely right. At this point, we should no longer focus on optimizing the early-return check itself.
My Attempt
I experimented with optimizing this code path and applied the optimization to the expression execution.
Other Optimization Scenarios
I realized that my earlier conclusions about the benefits of early selection execution were incorrect. After testing with the latest code changes (on a machine with an AMD 9950X CPU), the results are as follows: all scenarios that benefited from this optimization saw a reduction in execution time.
short_circuit/and/one_true_first
time: [188.36 µs 188.96 µs 189.55 µs]
change: [-58.870% -58.652% -58.420%] (p = 0.00 < 0.05)
Performance has improved.
short_circuit/and/one_true_last
time: [170.10 µs 171.04 µs 171.97 µs]
change: [-62.788% -62.520% -62.258%] (p = 0.00 < 0.05)
Performance has improved.
short_circuit/and/one_true_middle
time: [165.38 µs 165.97 µs 166.57 µs]
change: [-64.176% -63.968% -63.784%] (p = 0.00 < 0.05)
Performance has improved.
short_circuit/and/one_true_middle_left
time: [165.02 µs 165.48 µs 165.96 µs]
change: [-63.671% -63.458% -63.204%] (p = 0.00 < 0.05)
Performance has improved.
short_circuit/and/one_true_middle_right
time: [169.09 µs 169.61 µs 170.13 µs]
change: [-63.298% -63.089% -62.868%] (p = 0.00 < 0.05)
Performance has improved.
short_circuit/and/one_true_middle_right #2
time: [166.43 µs 167.12 µs 167.85 µs]
change: [-64.092% -63.916% -63.751%] (p = 0.00 < 0.05)
Performance has improved.
I further optimized the work done by @kosiew. Relevant PR: #15694 (currently, there are still some issues with it).
Is your feature request related to a problem or challenge?
@acking-you's wonderful PR #15462 adds short-circuiting to boolean operation evaluation, which makes evaluating some complex expressions much faster.
However, to do so it uses the count_ones function to check whether the arguments are all true / all false.
We have theorized it should be faster to simply check if there are any set bits in the array rather than using count_zeros; however, @acking-you found that this is not easy to do, as Rust generates very optimized code for count_ones.
You can see an example of this analysis in #15462 (comment)
Describe the solution you'd like
Now that we have benchmarks for binary_op, it would be great to see if we can optimize this codepath more.
Describe alternatives you've considered
Roughly speaking you can run the benchmarks with
And then profile that with your favorite tool
For example, you can use samply like this:
Additional context
No response