-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Add short circuit evaluation for AND
and OR
#15462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Some tests failed, so let me take a look at what exactly is going on. |
I think one issue is that the short-circuit logic is not handling cases where the the |
Thank you very much for your hint, it will be very helpful for me to fix these tests! |
After taking a closer look, in fact, the situation you mentioned does not actually lead to the short-circuit optimization logic. |
You're absolutely right, I got my logic wrong there. Embarrasing! |
It's okay. You've also taught me a lot. When I first started writing this, I really didn't consider the case of null |
Hello @alamb, the optimization SQL and documentation related to this PR have been completed, and all tests have passed. We may need to formally verify the performance, but I'm not quite sure how to do that (I can only run it locally). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @acking-you -- this looks really nice.
I think this needs some tests but otherwise it is looking quite nice.
Also, could you please add the new Q6 benchmark in a separate PR so I can more easily run my benchmark scripts before/after your code change?
|
||
### Q6: How many social shares meet complex multi-stage filtering criteria? | ||
**Question**: What is the count of sharing actions from iPhone mobile users on specific social networks, within common timezones, participating in seasonal campaigns, with high screen resolutions and closely matched UTM parameters? | ||
**Important Query Properties**: Simple filter with high-selectivity, Costly string matching, A large number of filters with high overhead are positioned relatively later in the process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr { | |||
fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> { | |||
use arrow::compute::kernels::numeric::*; | |||
|
|||
fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @acking-you -- this looks great
Is there any reason to have this function defined in the evaluate
method? I think we could just make it a normal function and reduce the nesting level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we find that this slows down some other performance we could also add some sort of heuristic check to calling false_count
/ true_count
-- like for example if the rhs arg is "complex" (not a Column for example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason to have this function defined in the evaluate method?
There was no particular reason. Maybe I couldn't find a suitable place to write it at the time, haha. Where do you think this function should be placed?
If we find that this slows down some other performance we could also add some sort of heuristic check to calling false_count / true_count -- like for example if the rhs arg is "complex" (not a Column for example)
I also agree that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved the function outside and added some comments.
Okey,I got it.Do you mean that Q6 and its related description in the current branch need to be completed with a separate PR? But this way, it seems that you would still need to cherry-pick Q6 to the corresponding branch when testing. |
027a772
to
5de0f36
Compare
…ultiple benchmarks (apache#14642) * Add support --mem-pool-type and --memory-limit options for all benchmarks * Add --sort-spill-reservation-bytes option
* Add unit tests to FFI_ExecutionPlan * Add unit tests for FFI table source * Add round trip tests for volatility * Add unit tests for FFI insert op * Simplify string generation in unit test Co-authored-by: Andrew Lamb <[email protected]> * Fix drop of borrowed value --------- Co-authored-by: Andrew Lamb <[email protected]>
…ultiple benchmarks (apache#14642) * Add support --mem-pool-type and --memory-limit options for all benchmarks * Add --sort-spill-reservation-bytes option
* Add unit tests to FFI_ExecutionPlan * Add unit tests for FFI table source * Add round trip tests for volatility * Add unit tests for FFI insert op * Simplify string generation in unit test Co-authored-by: Andrew Lamb <[email protected]> * Fix drop of borrowed value --------- Co-authored-by: Andrew Lamb <[email protected]>
5de0f36
to
575c3f3
Compare
match arg { | ||
ColumnarValue::Array(array) => { | ||
if let Ok(array) = as_boolean_array(&array) { | ||
return array.false_count() == array.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember this had some overhead (for calculating the counts) from a previous try.
I wonder if it helps to short optimize this expression (e.g. match until we get a chunk of the bitmap != 0
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it helps to short optimize this expression (e.g. match until we get a chunk of the bitmap != 0)
I think the overhead added here should be very small (the compiler optimization should work well), and the test results we discussed before were sometimes fast and sometimes slow (maybe noise).
Your suggestion of making an early judgment and returning false seems like a good idea, but I'm not sure if it will be effective.
The concern I have with this approach is that it requires adding an if
condition inside the for
loop, which will most likely disable the compiler's SIMD instruction optimization (I've encountered a similar situation before, and I had to manually unroll the SIMD...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way, we can use bool_and
(https://docs.rs/arrow/latest/arrow/compute/fn.bool_and.html) and bool_or
which operates on u64
values to test performance changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way, we can use
bool_and
(https://docs.rs/arrow/latest/arrow/compute/fn.bool_and.html) andbool_or
which operates onu64
values to test performance changes.
Thank you for your suggestion. I will try it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be overkill, but one could try a sampling approach: Run the loop with the early exit for the first few chunks, and then switch over to the unconditional loop.
Almost seems like something the compiler could automagically do...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be overkill, but one could try a sampling approach: Run the loop with the early exit for the first few chunks, and then switch over to the unconditional loop.
Thank you for your suggestion, but if we're only applying conditional checks to the first few blocks, then I feel this optimization might not be meaningful. If nearly all blocks can be filtered out by the preceding filter, the optimization will no longer be effective.
If we find that this slows down some other performance we could also add some sort of heuristic check to calling false_count / true_count -- like for example if the rhs arg is "complex" (not a Column for example)
I tend to agree with @alamb's point that if the overhead of verification is somewhat unacceptable, adopting some heuristic approaches would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked more carefully at bool_or and I do think it would be faster than this implementation on the case where there are some true values (as it stops as soon as it finds a single non zero): https://docs.rs/arrow/latest/arrow/compute/fn.bool_or.html
…ultiple benchmarks (apache#14642) * Add support --mem-pool-type and --memory-limit options for all benchmarks * Add --sort-spill-reservation-bytes option
* Add unit tests to FFI_ExecutionPlan * Add unit tests for FFI table source * Add round trip tests for volatility * Add unit tests for FFI insert op * Simplify string generation in unit test Co-authored-by: Andrew Lamb <[email protected]> * Fix drop of borrowed value --------- Co-authored-by: Andrew Lamb <[email protected]>
I sincerely apologize for the delay in updating this PR. I have now designed a detailed comparative test for the
It can be seen that when However, in other cases, using
# test true_count/false_count
TEST_BOOL_COUNT=1 cargo bench --bench boolean_op
# test bool_or/bool_and
cargo bench --bench boolean_op detail benchmark code: https://github.com/apache/datafusion/pull/15462/files#diff-8710f6b44dd74240d19e6fcdfbf971c034f59cd48c022e51b07ed876a8cc7c5e |
Thank you for your diligence @acking-you I think we should merge this PR in (with the count bits and the benchmarks) and file a follow on ticket to potentially improve the performance in some other way. I don't fully understand your performance results given that the two functions seem very similar -- maybe it has to do with option hanlding messing auto vectorization or something https://docs.rs/arrow-array/54.3.1/src/arrow_array/array/boolean_array.rs.html#160 |
Security audit failure is not related to this PR |
Yes, I was also quite surprised. The only difference in the implementation of these two functions might be whether or not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @acking-you
I think the code in this PR is ready to go -- can you please adjust the benchmark code and then I'll merge this one in and file a follow on PR to explore other potential improvements?
Thanks again!
cases | ||
} | ||
|
||
fn benchmark_boolean_ops(c: &mut Criterion) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please update these benchmarks for the performance of boolean expression evaluation? As of now, these are benchmarks for bool_and
and bool_or
?
In other words, make micro benchmarks for expressions like A AND B AND C AND D
similar to what you added to the clickbench extended suite that shows the performance of expressions where the short circuit code improves performance.
We can then use those benchmarks to try out different ways to implementing the short circuting logic
cases | ||
} | ||
|
||
fn benchmark_boolean_ops(c: &mut Criterion) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please update these benchmarks for the performance of boolean expression evaluation? As of now, these are benchmarks for bool_and
and bool_or
?
In other words, make micro benchmarks for expressions like A AND B AND C AND D
similar to what you added to the clickbench extended suite that shows the performance of expressions where the short circuit code improves performance.
We can then use those benchmarks to try out different ways to implementing the short circuting logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in other words, make micro benchmarks for expressions like A AND B AND C AND D similar to what you added to the clickbench extended suite that shows the performance of expressions where the short circuit code improves performance.
This is a good idea
How can this be resolved? |
I think the likely part of it getting slower is short-circuiting whenever a For this case it might be interesting to compare it with For |
There are ideas on there of how to resolve the isseu To be clear I don't think this is blocking this PR from merging The only thing I think is needed for this PR is to either
|
done |
I’m certain that the semantics of the RecordsRust code as follows: /// Scene 1: Counting all set bits (the number of 1s)
#[no_mangle]
#[inline(never)]
pub fn count_ones(ba: &[u8]) -> usize {
ba.iter().map(|x| x.count_ones() as usize).sum()
}
/// Scene 2: Check if a bit is set
#[no_mangle]
#[inline(never)]
pub fn find_any(ba: &[u8]) -> bool {
ba.iter().any(|&x| x != 0)
} Assembly code as follows: count_ones:
test rsi, rsi
je .LBB0_1
cmp rsi, 4
jae .LBB0_5
xor ecx, ecx
xor eax, eax
jmp .LBB0_8
.LBB0_1:
xor eax, eax
ret
.LBB0_5:
mov rcx, rsi
and rcx, -4
pxor xmm0, xmm0
xor eax, eax
movdqa xmm2, xmmword ptr [rip + .LCPI0_0]
movdqa xmm3, xmmword ptr [rip + .LCPI0_1]
movdqa xmm5, xmmword ptr [rip + .LCPI0_2]
pxor xmm4, xmm4
pxor xmm1, xmm1
.LBB0_6:
movzx edx, word ptr [rdi + rax]
movd xmm7, edx
movzx edx, word ptr [rdi + rax + 2]
movd xmm6, edx
punpcklbw xmm7, xmm0
punpcklwd xmm7, xmm0
punpckldq xmm7, xmm0
movdqa xmm8, xmm7
psrlw xmm8, 1
pand xmm8, xmm2
psubb xmm7, xmm8
movdqa xmm8, xmm7
pand xmm8, xmm3
psrlw xmm7, 2
pand xmm7, xmm3
paddb xmm7, xmm8
movdqa xmm8, xmm7
psrlw xmm8, 4
paddb xmm8, xmm7
pand xmm8, xmm5
psadbw xmm8, xmm0
paddq xmm4, xmm8
punpcklbw xmm6, xmm0
punpcklwd xmm6, xmm0
punpckldq xmm6, xmm0
movdqa xmm7, xmm6
psrlw xmm7, 1
pand xmm7, xmm2
psubb xmm6, xmm7
movdqa xmm7, xmm6
pand xmm7, xmm3
psrlw xmm6, 2
pand xmm6, xmm3
paddb xmm6, xmm7
movdqa xmm7, xmm6
psrlw xmm7, 4
paddb xmm7, xmm6
pand xmm7, xmm5
psadbw xmm7, xmm0
paddq xmm1, xmm7
add rax, 4
cmp rcx, rax
jne .LBB0_6
paddq xmm1, xmm4
pshufd xmm0, xmm1, 238
paddq xmm0, xmm1
movq rax, xmm0
cmp rcx, rsi
je .LBB0_2
.LBB0_8:
movzx edx, byte ptr [rdi + rcx]
imul edx, edx, 134480385
shr edx, 3
and edx, 286331153
imul edx, edx, 286331153
shr edx, 28
add rax, rdx
inc rcx
cmp rsi, rcx
jne .LBB0_8
.LBB0_2:
ret
find_any:
xor ecx, ecx
.LBB1_1:
mov rax, rcx
cmp rsi, rcx
je .LBB1_3
lea rcx, [rax + 1]
cmp byte ptr [rdi + rax], 0
je .LBB1_1
.LBB1_3:
cmp rsi, rax
setne al
ret Conclusion
|
Thanks again for the wonderful work @acking-you -- I also confirmed that without the short circuit optimization the benchmarks get much much slower
Let's merge this PR in and I will file a follow on ticket to track potentially improving performance even more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again everyone -- this is a good improvement. Also really nice analysis work by @acking-you
/// # test bool_or/bool_and | ||
/// cargo bench --bench binary_op -- boolean_ops | ||
/// ``` | ||
fn benchmark_boolean_ops(c: &mut Criterion) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since these are basically arrow benchmarks, I made a small follow on PR to propose removing them for now:
the extended tests appear to be failing after this PR: |
* [draft] add shot circuit in BinaryExpr * refactor: add check_short_circuit function * refactor: change if condition to match * feat: Add support for --mem-pool-type and --memory-limit options to multiple benchmarks (apache#14642) * Add support --mem-pool-type and --memory-limit options for all benchmarks * Add --sort-spill-reservation-bytes option * Chore/Add additional FFI unit tests (apache#14802) * Add unit tests to FFI_ExecutionPlan * Add unit tests for FFI table source * Add round trip tests for volatility * Add unit tests for FFI insert op * Simplify string generation in unit test Co-authored-by: Andrew Lamb <[email protected]> * Fix drop of borrowed value --------- Co-authored-by: Andrew Lamb <[email protected]> * Improve feature flag CI coverage `datafusion` and `datafusion-functions` (apache#15203) * add extend sql & docs * feat: Add support for --mem-pool-type and --memory-limit options to multiple benchmarks (apache#14642) * Add support --mem-pool-type and --memory-limit options for all benchmarks * Add --sort-spill-reservation-bytes option * Chore/Add additional FFI unit tests (apache#14802) * Add unit tests to FFI_ExecutionPlan * Add unit tests for FFI table source * Add round trip tests for volatility * Add unit tests for FFI insert op * Simplify string generation in unit test Co-authored-by: Andrew Lamb <[email protected]> * Fix drop of borrowed value --------- Co-authored-by: Andrew Lamb <[email protected]> * Improve feature flag CI coverage `datafusion` and `datafusion-functions` (apache#15203) * fix: incorrect false judgment * add test * separate q6 to new PR * feat: Add support for --mem-pool-type and --memory-limit options to multiple benchmarks (apache#14642) * Add support --mem-pool-type and --memory-limit options for all benchmarks * Add --sort-spill-reservation-bytes option * Chore/Add additional FFI unit tests (apache#14802) * Add unit tests to FFI_ExecutionPlan * Add unit tests for FFI table source * Add round trip tests for volatility * Add unit tests for FFI insert op * Simplify string generation in unit test Co-authored-by: Andrew Lamb <[email protected]> * Fix drop of borrowed value --------- Co-authored-by: Andrew Lamb <[email protected]> * Improve feature flag CI coverage `datafusion` and `datafusion-functions` (apache#15203) * feat: Add support for --mem-pool-type and --memory-limit options to multiple benchmarks (apache#14642) * Add support --mem-pool-type and --memory-limit options for all benchmarks * Add --sort-spill-reservation-bytes option * Chore/Add additional FFI unit tests (apache#14802) * Add unit tests to FFI_ExecutionPlan * Add unit tests for FFI table source * Add round trip tests for volatility * Add unit tests for FFI insert op * Simplify string generation in unit test Co-authored-by: Andrew Lamb <[email protected]> * Fix drop of borrowed value --------- Co-authored-by: Andrew Lamb <[email protected]> * Improve feature flag CI coverage `datafusion` and `datafusion-functions` (apache#15203) * add benchmark for boolean_op * fix cargo doc * add binary_op bench * Better comments --------- Co-authored-by: Kristin Cowalcijk <[email protected]> Co-authored-by: Tim Saucer <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
BinaryExpr
evaluate lacks optimization forOr
andAnd
scenarios #11212.Rationale for this change
What changes are included in this PR?
BinaryExpr
evaluate lacks optimization forOr
andAnd
scenarios #11212 (comment) .)Below is the performance comparison of running the extended SQL locally. It seems there is also some improvement in Q4 (maybe noise).
At the same time, while creating this SQL, I also discovered a bug — one of the filter caused a panic: #15461
Are these changes tested?
Are there any user-facing changes?