Add short circuit evaluation for AND and OR #15462


Merged
merged 24 commits into from
Apr 8, 2025

Conversation

acking-you
Contributor

@acking-you acking-you commented Mar 27, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

  1. Add an extended SQL statement targeting this optimization.
  2. Provide a description of this extended SQL.
  3. Using this optimization, the extended SQL achieves a 2x performance improvement. (If the strings in the filter are longer, the effect grows roughly linearly, up to 500x as reported in BinaryExpr evaluate lacks optimization for Or and And scenarios #11212 (comment).)

Below is the performance comparison of running the extended SQL locally. It seems there is also some improvement in Q4 (maybe noise).

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ add_short_circuit ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  964.48ms │         1006.79ms │     no change │
│ QQuery 1     │  409.89ms │          419.52ms │     no change │
│ QQuery 2     │  838.25ms │          868.74ms │     no change │
│ QQuery 3     │  408.15ms │          396.88ms │     no change │
│ QQuery 4     │ 1029.80ms │          783.40ms │ +1.31x faster │
│ QQuery 5     │ 9429.76ms │         8835.06ms │ +1.07x faster │
│ QQuery 6     │ 4096.47ms │         1382.42ms │ +2.96x faster │
└──────────────┴───────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                │ 17176.80ms │
│ Total Time (add_short_circuit)   │ 13692.80ms │
│ Average Time (main)              │  2453.83ms │
│ Average Time (add_short_circuit) │  1956.11ms │
│ Queries Faster                   │          3 │
│ Queries Slower                   │          0 │
│ Queries with No Change           │          4 │
└──────────────────────────────────┴────────────┘

While creating this SQL, I also discovered a bug — one of the filters caused a panic: #15461

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Mar 27, 2025
@acking-you
Contributor Author

Some tests failed, so let me take a look at what exactly is going on.

@ctsk
Contributor

ctsk commented Mar 28, 2025

I think one issue is that the short-circuit logic is not handling cases where the rhs contains NULLs. E.g. true OR NULL needs to evaluate to NULL

@acking-you
Contributor Author

I think one issue is that the short-circuit logic is not handling cases where the rhs contains NULLs. E.g. true OR NULL needs to evaluate to NULL

Thank you very much for your hint, it will be very helpful for me to fix these tests!

@acking-you
Contributor Author

I think one issue is that the short-circuit logic is not handling cases where the rhs contains NULLs. E.g. true OR NULL needs to evaluate to NULL

After taking a closer look, the situation you mentioned does not actually reach the short-circuit optimization logic.
The error I'm currently seeing comes from using true_count() == 0 to decide that the value is false, when in reality it could also be null.
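
To make the distinction concrete, here is a minimal sketch (an illustration using the arrow BooleanArray API, not code from the PR) of why true_count() == 0 is not the same as "all false" once nulls are involved:

```rust
use arrow::array::{Array, BooleanArray};

fn main() {
    // A null slot is neither true nor false, so true_count() == 0 does not
    // imply the array is all-false.
    let with_null = BooleanArray::from(vec![Some(false), None]);
    assert_eq!(with_null.true_count(), 0);
    assert_ne!(with_null.false_count(), with_null.len()); // not all-false

    // Only when every slot is a valid `false` can the AND short circuit fire.
    let all_false = BooleanArray::from(vec![Some(false), Some(false)]);
    assert_eq!(all_false.false_count(), all_false.len());
}
```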

@ctsk
Contributor

ctsk commented Mar 28, 2025

You're absolutely right, I got my logic wrong there. Embarrassing!

@acking-you
Contributor Author

You're absolutely right, I got my logic wrong there. Embarrassing!

It's okay. You've also taught me a lot. When I first started writing this, I really didn't consider the case of null.

@acking-you
Contributor Author

Hello @alamb, the optimization SQL and documentation related to this PR have been completed, and all tests have passed. We may need to formally verify the performance, but I'm not quite sure how to do that (I can only run it locally).

Contributor

@alamb alamb left a comment

Thanks @acking-you -- this looks really nice.

I think this needs some tests but otherwise it is looking quite nice.

Also, could you please add the new Q6 benchmark in a separate PR so I can more easily run my benchmark scripts before/after your code change?


### Q6: How many social shares meet complex multi-stage filtering criteria?
**Question**: What is the count of sharing actions from iPhone mobile users on specific social networks, within common timezones, participating in seasonal campaigns, with high screen resolutions and closely matched UTM parameters?
**Important Query Properties**: Simple filters with high selectivity, costly string matching, and a large number of high-overhead filters positioned relatively late in the expression
Contributor

👍

@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
use arrow::compute::kernels::numeric::*;

fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
Contributor

Thanks @acking-you -- this looks great

Is there any reason to have this function defined in the evaluate method? I think we could just make it a normal function and reduce the nesting level

Contributor

If we find that this slows down some other performance we could also add some sort of heuristic check to calling false_count / true_count -- like for example if the rhs arg is "complex" (not a Column for example)

Contributor Author

@acking-you acking-you Mar 29, 2025

Is there any reason to have this function defined in the evaluate method?

There was no particular reason. Maybe I couldn't find a suitable place to write it at the time, haha. Where do you think this function should be placed?

If we find that this slows down some other performance we could also add some sort of heuristic check to calling false_count / true_count -- like for example if the rhs arg is "complex" (not a Column for example)

I also agree with that.

Contributor Author

I've moved the function outside and added some comments.
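
For context, here is a sketch of what a free-standing helper along these lines could look like (the name and signature follow the diff above; the body is an illustration based on this discussion, not necessarily the merged code):

```rust
use arrow::array::Array;
use datafusion_common::cast::as_boolean_array;
use datafusion_common::ScalarValue;
use datafusion_expr::{ColumnarValue, Operator};

/// Returns true when `lhs` alone already determines the result of
/// `lhs <op> rhs`: all valid `false` for AND, all valid `true` for OR.
/// Nulls disqualify the short circuit because `NULL AND x` / `NULL OR x`
/// still depend on `x`.
fn check_short_circuit(lhs: &ColumnarValue, op: &Operator) -> bool {
    match (lhs, op) {
        (ColumnarValue::Array(array), Operator::And) => as_boolean_array(array)
            .map(|b| b.false_count() == b.len())
            .unwrap_or(false),
        (ColumnarValue::Array(array), Operator::Or) => as_boolean_array(array)
            .map(|b| b.true_count() == b.len())
            .unwrap_or(false),
        (ColumnarValue::Scalar(ScalarValue::Boolean(Some(v))), Operator::And) => !*v,
        (ColumnarValue::Scalar(ScalarValue::Boolean(Some(v))), Operator::Or) => *v,
        _ => false,
    }
}
```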

@alamb alamb changed the title Add short circuit Add short circuit evaluation for AND and OR Mar 28, 2025
@acking-you
Contributor Author

Also, could you please add the new Q6 benchmark in a separate PR so I can more easily run my benchmark scripts before/after your code change?

Okay, I got it. Do you mean that Q6 and its related description in the current branch should be moved to a separate PR? But in that case, it seems you would still need to cherry-pick Q6 onto the corresponding branch when testing.

acking-you and others added 12 commits March 31, 2025 10:51
…ultiple benchmarks (apache#14642)

* Add support --mem-pool-type and --memory-limit options for all benchmarks

* Add --sort-spill-reservation-bytes option
* Add unit tests to FFI_ExecutionPlan

* Add unit tests for FFI table source

* Add round trip tests for volatility

* Add unit tests for FFI insert op

* Simplify string generation in unit test

Co-authored-by: Andrew Lamb <[email protected]>

* Fix drop of borrowed value

---------

Co-authored-by: Andrew Lamb <[email protected]>
@acking-you
Contributor Author

Also, could you please add the new Q6 benchmark in a separate PR so I can more easily run my benchmark scripts before/after your code change?

I have successfully split Q6 into: #15500
Thank you very much for your CR. @alamb

match arg {
ColumnarValue::Array(array) => {
if let Ok(array) = as_boolean_array(&array) {
return array.false_count() == array.len();
Contributor

I remember this had some overhead (for calculating the counts) from a previous try.
I wonder if it helps to short-circuit this check (e.g. match until we get a chunk of the bitmap != 0)

Contributor Author

@acking-you acking-you Mar 31, 2025

I wonder if it helps to short-circuit this check (e.g. match until we get a chunk of the bitmap != 0)

I think the overhead added here should be very small (the compiler optimization should work well), and the test results we discussed before were sometimes fast and sometimes slow (maybe noise).

Your suggestion of checking early and returning as soon as the result is known seems like a good idea, but I'm not sure whether it will be effective.
My concern with this approach is that it requires adding an if condition inside the for loop, which will most likely prevent the compiler's SIMD auto-vectorization (I've encountered a similar situation before and had to unroll the SIMD manually...).

Contributor

@Dandandan Dandandan Mar 31, 2025

Either way, we can use bool_and (https://docs.rs/arrow/latest/arrow/compute/fn.bool_and.html) and bool_or which operates on u64 values to test performance changes.

Contributor Author

Either way, we can use bool_and (https://docs.rs/arrow/latest/arrow/compute/fn.bool_and.html) and bool_or which operates on u64 values to test performance changes.

Thank you for your suggestion. I will try it later.

Contributor

Might be overkill, but one could try a sampling approach: Run the loop with the early exit for the first few chunks, and then switch over to the unconditional loop.

Almost seems like something the compiler could automagically do...

Contributor Author

Might be overkill, but one could try a sampling approach: Run the loop with the early exit for the first few chunks, and then switch over to the unconditional loop.

Thank you for your suggestion, but if we only apply the conditional check to the first few chunks, I feel this optimization might not be meaningful. If nearly all blocks can be filtered out by the preceding filter, the optimization would no longer be effective.

If we find that this slows down some other performance we could also add some sort of heuristic check to calling false_count / true_count -- like for example if the rhs arg is "complex" (not a Column for example)

I tend to agree with @alamb's point that if the overhead of verification is somewhat unacceptable, adopting some heuristic approaches would be better.

Contributor

I looked more carefully at bool_or and I do think it would be faster than this implementation in the case where there are some true values (as it stops as soon as it finds a single non-zero chunk): https://docs.rs/arrow/latest/arrow/compute/fn.bool_or.html
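
As a concrete illustration of that alternative (a sketch of my own, not code from the PR), the all-false / all-true checks could be phrased with bool_or / bool_and, which scan 64-bit chunks and stop as soon as the answer is known:

```rust
use arrow::array::{Array, BooleanArray};
use arrow::compute::{bool_and, bool_or};

// AND short circuit: every slot is a valid `false`.
fn all_false(array: &BooleanArray) -> bool {
    array.null_count() == 0 && bool_or(array) == Some(false)
}

// OR short circuit: every slot is a valid `true`.
fn all_true(array: &BooleanArray) -> bool {
    array.null_count() == 0 && bool_and(array) == Some(true)
}
```

The trade-off discussed above still applies: the early exit helps when the deciding value shows up early and hurts when the whole array has to be scanned anyway.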

Kontinuation and others added 4 commits April 7, 2025 15:37
@acking-you
Contributor Author

I sincerely apologize for the delay in updating this PR. I have now designed a detailed comparative test for the bool_or/bool_and issue described by @Dandandan. The related results are as follows. I’m not sure what you think about these results — should we continue using true_count/false_count, or look for a better solution? @alamb @Dandandan

  1. I designed six distribution scenarios (code) for each case and compared true_count/false_count against bool_or/bool_and on the boolean array. The final results are as follows:

| Test Case | true_count/false_count | bool_or/bool_and | Change | Performance Impact |
| --- | --- | --- | --- | --- |
| all_false | 1.6593 µs - 1.6695 µs | 4.1013 µs - 4.1774 µs | +150.91% | 🚫 Regressed |
| one_true_first | 1.6726 µs - 1.6771 µs | 1.6885 ns - 1.7430 ns | -99.898% | ✅ Improved |
| one_true_last | 1.6663 µs - 1.6714 µs | 4.1096 µs - 4.1554 µs | +147.14% | 🚫 Regressed |
| one_true_middle | 1.6723 µs - 1.6819 µs | 2.1505 µs - 2.2117 µs | +30.180% | 🚫 Regressed |
| one_true_middle_left | 1.6672 µs - 1.6727 µs | 1.1088 µs - 1.1483 µs | -32.995% | ✅ Improved |
| one_true_middle_right | 1.6689 µs - 1.6741 µs | 3.1562 µs - 3.2521 µs | +93.762% | 🚫 Regressed |
| all_true | 1.6711 µs - 1.6747 µs | 4.2779 µs - 4.4291 µs | +155.35% | 🚫 Regressed |
| one_false_first | 1.6722 µs - 1.6782 µs | 1.6278 ns - 1.6360 ns | -99.903% | ✅ Improved |
| one_false_last | 1.6818 µs - 1.7512 µs | 4.2175 µs - 4.2930 µs | +153.50% | 🚫 Regressed |
| one_false_middle | 1.8437 µs - 1.9665 µs | 2.0575 µs - 2.0871 µs | +11.931% | 🚫 Regressed |
| one_false_middle_left | 2.0004 µs - 2.3194 µs | 1.0243 µs - 1.0398 µs | -57.059% | ✅ Improved |
| one_false_middle_right | 2.0770 µs - 2.2721 µs | 3.0275 µs - 3.0582 µs | +47.668% | 🚫 Regressed |

It can be seen that when the deciding true/false value is located at or to the left of the middle, bool_or/bool_and has a significant advantage, with roughly a 1000x lead when the value is in the first position.

However, in other cases, using true_count/false_count performs better, showing relatively stable behavior across various scenarios.

  2. The test can be reproduced with the following commands:
# test true_count/false_count 
TEST_BOOL_COUNT=1 cargo bench --bench boolean_op
# test bool_or/bool_and 
cargo bench --bench boolean_op

detail benchmark code: https://github.com/apache/datafusion/pull/15462/files#diff-8710f6b44dd74240d19e6fcdfbf971c034f59cd48c022e51b07ed876a8cc7c5e
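
For readers who want to reproduce the shape of this comparison without checking out the branch, a stand-alone criterion sketch along these lines captures the idea (array size, distribution, and names are illustrative, not the exact benchmark in the PR):

```rust
use arrow::array::BooleanArray;
use arrow::compute::bool_or;
use criterion::{criterion_group, criterion_main, Criterion};

/// Boolean array of `len` values with a single `true` at `pos`,
/// mirroring the "one_true_*" distribution scenarios above.
fn one_true_at(len: usize, pos: usize) -> BooleanArray {
    (0..len).map(|i| Some(i == pos)).collect()
}

fn benchmark(c: &mut Criterion) {
    // "one_true_last": the worst case for any early-exit strategy.
    let arr = one_true_at(8192, 8191);
    c.bench_function("true_count", |b| b.iter(|| arr.true_count() == 0));
    c.bench_function("bool_or", |b| b.iter(|| bool_or(&arr) == Some(true)));
}

criterion_group!(benches, benchmark);
criterion_main!(benches);
```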

@alamb alamb mentioned this pull request Apr 7, 2025
@alamb
Contributor

alamb commented Apr 7, 2025

The related results are as follows. I’m not sure what you think about these results — should we continue using true_count/false_count, or look for a better solution? @alamb @Dandandan

Thank you for your diligence @acking-you

I think we should merge this PR in (with the count bits and the benchmarks) and file a follow on ticket to potentially improve the performance in some other way.

I don't fully understand your performance results given that the two functions seem very similar -- maybe it has to do with option handling messing up auto-vectorization or something

https://docs.rs/arrow-array/54.3.1/src/arrow_array/array/boolean_array.rs.html#160
https://docs.rs/arrow-arith/54.3.1/src/arrow_arith/aggregate.rs.html#751

@alamb
Contributor

alamb commented Apr 7, 2025

Security audit failure is not related to this PR

@acking-you
Contributor Author

I don't fully understand your performance results given that the two functions seem very similar -- maybe it has to do with option handling messing up auto-vectorization or something

Yes, I was also quite surprised. The only difference in the implementation of these two functions might be whether or not find was used.

Contributor

@alamb alamb left a comment

Thank you so much @acking-you

I think the code in this PR is ready to go -- can you please adjust the benchmark code and then I'll merge this one in and file a follow on PR to explore other potential improvements?

Thanks again!

cases
}

fn benchmark_boolean_ops(c: &mut Criterion) {
Contributor

Can you please update these benchmarks to measure the performance of boolean expression evaluation? As of now, these are benchmarks for bool_and and bool_or.

In other words, make micro benchmarks for expressions like A AND B AND C AND D, similar to what you added to the clickbench extended suite, that show the performance of expressions where the short-circuit code improves performance.

We can then use those benchmarks to try out different ways of implementing the short-circuiting logic.


Contributor Author

In other words, make micro benchmarks for expressions like A AND B AND C AND D, similar to what you added to the clickbench extended suite, that show the performance of expressions where the short-circuit code improves performance.

This is a good idea.
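
As a rough sketch of what such an expression-level micro benchmark would time (my own illustration; the setup is what I would expect, not necessarily the benchmark code that ended up in the PR), the hot loop boils down to BinaryExpr::evaluate on a prepared RecordBatch:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, BooleanArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use datafusion_expr::Operator;
use datafusion_physical_expr::expressions::{BinaryExpr, Column};
use datafusion_physical_expr::PhysicalExpr;

fn main() -> datafusion_common::Result<()> {
    // Column `a` is all false, so `a AND b` can short-circuit without doing
    // any real work for `b`.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Boolean, false),
        Field::new("b", DataType::Boolean, false),
    ]));
    let a = BooleanArray::from(vec![false; 8192]);
    let b = BooleanArray::from(vec![true; 8192]);
    let columns: Vec<ArrayRef> = vec![Arc::new(a), Arc::new(b)];
    let batch = RecordBatch::try_new(schema, columns)?;

    // `a AND b` -- the expression a criterion benchmark would evaluate in its hot loop.
    let expr = BinaryExpr::new(
        Arc::new(Column::new("a", 0)),
        Operator::And,
        Arc::new(Column::new("b", 1)),
    );
    let result = expr.evaluate(&batch)?;
    println!("{} rows evaluated", result.into_array(batch.num_rows())?.len());
    Ok(())
}
```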

@acking-you
Contributor Author

Security audit failure is not related to this PR

How can this be resolved?

@Dandandan
Contributor

Dandandan commented Apr 7, 2025

The related results are as follows. I’m not sure what you think about these results — should we continue using true_count/false_count, or look for a better solution? @alamb @Dandandan

Thank you for your diligence @acking-you

I think we should merge this PR in (with the count bits and the benchmarks) and file a follow on ticket to potentially improve the performance in some other way.

I don't fully understand your performance results given that the two functions seem very similar -- maybe it has to do with option handling messing up auto-vectorization or something

https://docs.rs/arrow-array/54.3.1/src/arrow_array/array/boolean_array.rs.html#160 https://docs.rs/arrow-arith/54.3.1/src/arrow_arith/aggregate.rs.html#751

I think the likely reason it gets slower in some cases is the short-circuiting whenever a true is observed (having a branch for each item to check).

For this case it might be interesting to compare it with array.values().bit_chunks().iter_padded().fold(true, |acc, i: u64| i != 0 && acc) (e.g. fully evaluate the entire array) and see if it is any faster than the true_count implementation.

For max_boolean it might be interesting to see if we can make it faster by doing both: early exiting and generating more efficient code. Maybe it could evaluate per n values instead of per value, to benefit from faster code / branch prediction while still exiting early enough.
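
The branch-free full-scan variant mentioned above, sketched out (my own reading of the suggestion; it only looks at the values buffer and ignores the validity bitmap, so it compares the scan itself rather than being a drop-in replacement for the null-aware check):

```rust
use arrow::array::BooleanArray;

/// Full scan with no early exit: fold over u64 chunks of the values buffer,
/// keeping the loop branch-free and easy for the compiler to vectorize.
fn no_set_bits(array: &BooleanArray) -> bool {
    array
        .values()
        .bit_chunks()
        .iter_padded()
        .fold(true, |acc, chunk| acc & (chunk == 0))
}
```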

@alamb
Contributor

alamb commented Apr 7, 2025

Security audit failure is not related to this PR

How can this be resolved?

There are some ideas there on how to resolve the issue.

To be clear, I don't think this is blocking this PR from merging.

The only thing I think is needed for this PR is to either:

  1. remove the benchmark
  2. update the benchmark to actually test BinaryExpr evaluation time

@acking-you
Contributor Author

The only thing I think is needed for this PR is to either:

  1. remove the benchmark
  2. update the benchmark to actually test BinaryExpr evaluation time

done

@acking-you
Contributor Author

I think the likely reason it gets slower in some cases is the short-circuiting whenever a true is observed (having a branch for each item to check).

For this case it might be interesting to compare it with array.values().bit_chunks().iter_padded().fold(true, |acc, i: u64| i != 0 && acc) (e.g. fully evaluate the entire array) and see if it is any faster than the true_count implementation.

For max_boolean it might be interesting to see if we can make it faster by doing both: early exiting and generating more efficient code. Maybe it could evaluate per n values instead of per value, to benefit from faster code / branch prediction while still exiting early enough.

I'm certain that the early-exit semantics of the if statement (returning as soon as the condition is met) caused the compiler to abandon SIMD optimization. To test this, I conducted the following experiment:
I wrote two simple functions to simulate the two scenarios and then examined the optimized assembly code the compiler generated, to compare the differences.
You can view the assembly code generated for this experiment here: https://godbolt.org/z/63bEvYP4d.

Records

Rust code as follows:

/// Scene 1: Counting all set bits (the number of 1s)
#[no_mangle]
#[inline(never)]
pub fn count_ones(ba: &[u8]) -> usize {
    ba.iter().map(|x| x.count_ones() as usize).sum()
}

/// Scene 2: Check if a bit is set
#[no_mangle]
#[inline(never)]
pub fn find_any(ba: &[u8]) -> bool {
    ba.iter().any(|&x| x != 0)
}

Assembly code as follows:

count_ones:
        test    rsi, rsi
        je      .LBB0_1
        cmp     rsi, 4
        jae     .LBB0_5
        xor     ecx, ecx
        xor     eax, eax
        jmp     .LBB0_8
.LBB0_1:
        xor     eax, eax
        ret
.LBB0_5:
        mov     rcx, rsi
        and     rcx, -4
        pxor    xmm0, xmm0
        xor     eax, eax
        movdqa  xmm2, xmmword ptr [rip + .LCPI0_0]
        movdqa  xmm3, xmmword ptr [rip + .LCPI0_1]
        movdqa  xmm5, xmmword ptr [rip + .LCPI0_2]
        pxor    xmm4, xmm4
        pxor    xmm1, xmm1
.LBB0_6:
        movzx   edx, word ptr [rdi + rax]
        movd    xmm7, edx
        movzx   edx, word ptr [rdi + rax + 2]
        movd    xmm6, edx
        punpcklbw       xmm7, xmm0
        punpcklwd       xmm7, xmm0
        punpckldq       xmm7, xmm0
        movdqa  xmm8, xmm7
        psrlw   xmm8, 1
        pand    xmm8, xmm2
        psubb   xmm7, xmm8
        movdqa  xmm8, xmm7
        pand    xmm8, xmm3
        psrlw   xmm7, 2
        pand    xmm7, xmm3
        paddb   xmm7, xmm8
        movdqa  xmm8, xmm7
        psrlw   xmm8, 4
        paddb   xmm8, xmm7
        pand    xmm8, xmm5
        psadbw  xmm8, xmm0
        paddq   xmm4, xmm8
        punpcklbw       xmm6, xmm0
        punpcklwd       xmm6, xmm0
        punpckldq       xmm6, xmm0
        movdqa  xmm7, xmm6
        psrlw   xmm7, 1
        pand    xmm7, xmm2
        psubb   xmm6, xmm7
        movdqa  xmm7, xmm6
        pand    xmm7, xmm3
        psrlw   xmm6, 2
        pand    xmm6, xmm3
        paddb   xmm6, xmm7
        movdqa  xmm7, xmm6
        psrlw   xmm7, 4
        paddb   xmm7, xmm6
        pand    xmm7, xmm5
        psadbw  xmm7, xmm0
        paddq   xmm1, xmm7
        add     rax, 4
        cmp     rcx, rax
        jne     .LBB0_6
        paddq   xmm1, xmm4
        pshufd  xmm0, xmm1, 238
        paddq   xmm0, xmm1
        movq    rax, xmm0
        cmp     rcx, rsi
        je      .LBB0_2
.LBB0_8:
        movzx   edx, byte ptr [rdi + rcx]
        imul    edx, edx, 134480385
        shr     edx, 3
        and     edx, 286331153
        imul    edx, edx, 286331153
        shr     edx, 28
        add     rax, rdx
        inc     rcx
        cmp     rsi, rcx
        jne     .LBB0_8
.LBB0_2:
        ret

find_any:
        xor     ecx, ecx
.LBB1_1:
        mov     rax, rcx
        cmp     rsi, rcx
        je      .LBB1_3
        lea     rcx, [rax + 1]
        cmp     byte ptr [rdi + rax], 0
        je      .LBB1_1
.LBB1_3:
        cmp     rsi, rax
        setne   al
        ret

Conclusion

  1. Vectorization Optimization Differences:

    • The assembly code for count_ones utilizes SIMD instructions (such as movdqa, psrlw, and paddb), achieving automatic vectorization. The compiler packs four bytes (32 bits) of data into an XMM register for parallel processing, efficiently calculating the number of set bits through bitmasking and accumulation operations.
    • This does not change the asymptotic complexity, but it processes four bytes per iteration, significantly decreasing the number of loop iterations.
  2. Early Exit Hinders Optimization:

    • The find_any function must return as soon as it sees the first non-zero byte, which prevents the loop from being unrolled into fixed-step SIMD operations. The assembly code shows that it checks each byte individually (e.g., cmp byte ptr [rdi + rax], 0), so it cannot leverage parallel comparisons.
  3. Branch Prediction Penalty:

    • The loop in find_any contains conditional jumps (e.g., je .LBB1_1), posing a risk of branch prediction failure in every iteration. In contrast, the vectorized code of count_ones performs branch-free linear operations, which are better suited for modern CPU pipelines.
  4. Compiler Optimization Limitations:

    • The early exit semantics of any() cause the compiler to conservatively avoid vectorization, opting for the safest byte-by-byte checks. Conversely, the semantics of sum() allow the compiler to aggressively optimize.

@alamb
Contributor

alamb commented Apr 8, 2025

Thanks again for the wonderful work @acking-you --
I ran the benchmarks locally and indeed they show a non-trivial amount of time being spent in count_ones.

[Screenshot: profiler output showing time spent in count_ones]

I also confirmed that without the short circuit optimization the benchmarks get much much slower

Gnuplot not found, using plotters backend
bench_all_false_and     time:   [503.28 µs 505.29 µs 507.36 µs]
                        change: [+1711392% +1718962% +1727169%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

bench_all_true_or       time:   [504.57 µs 506.27 µs 508.10 µs]
                        change: [+1704732% +1713204% +1721665%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild

Let's merge this PR in and I will file a follow on ticket to track potentially improving performance even more

@alamb alamb merged commit 26adc8f into apache:main Apr 8, 2025
28 of 29 checks passed
@alamb
Contributor

alamb commented Apr 8, 2025

Contributor

@alamb alamb left a comment

Thanks again everyone -- this is a good improvement. Also really nice analysis work by @acking-you

/// # test bool_or/bool_and
/// cargo bench --bench binary_op -- boolean_ops
/// ```
fn benchmark_boolean_ops(c: &mut Criterion) {
Contributor

since these are basically arrow benchmarks, I made a small follow on PR to propose removing them for now:

@alamb
Contributor

alamb commented Apr 8, 2025

the extended tests appear to be failing after this PR:

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* [draft] add shot circuit in BinaryExpr

* refactor: add check_short_circuit function

* refactor: change if condition to match

* feat: Add support for --mem-pool-type and --memory-limit options to multiple benchmarks (apache#14642)

* Add support --mem-pool-type and --memory-limit options for all benchmarks

* Add --sort-spill-reservation-bytes option

* Chore/Add additional FFI unit tests (apache#14802)

* Add unit tests to FFI_ExecutionPlan

* Add unit tests for FFI table source

* Add round trip tests for volatility

* Add unit tests for FFI insert op

* Simplify string generation in unit test

Co-authored-by: Andrew Lamb <[email protected]>

* Fix drop of borrowed value

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Improve feature flag CI coverage `datafusion` and `datafusion-functions` (apache#15203)

* add extend sql & docs

* feat: Add support for --mem-pool-type and --memory-limit options to multiple benchmarks (apache#14642)

* Add support --mem-pool-type and --memory-limit options for all benchmarks

* Add --sort-spill-reservation-bytes option

* Chore/Add additional FFI unit tests (apache#14802)

* Add unit tests to FFI_ExecutionPlan

* Add unit tests for FFI table source

* Add round trip tests for volatility

* Add unit tests for FFI insert op

* Simplify string generation in unit test

Co-authored-by: Andrew Lamb <[email protected]>

* Fix drop of borrowed value

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Improve feature flag CI coverage `datafusion` and `datafusion-functions` (apache#15203)

* fix: incorrect false judgment

* add test

* separate q6 to new PR

* feat: Add support for --mem-pool-type and --memory-limit options to multiple benchmarks (apache#14642)

* Add support --mem-pool-type and --memory-limit options for all benchmarks

* Add --sort-spill-reservation-bytes option

* Chore/Add additional FFI unit tests (apache#14802)

* Add unit tests to FFI_ExecutionPlan

* Add unit tests for FFI table source

* Add round trip tests for volatility

* Add unit tests for FFI insert op

* Simplify string generation in unit test

Co-authored-by: Andrew Lamb <[email protected]>

* Fix drop of borrowed value

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Improve feature flag CI coverage `datafusion` and `datafusion-functions` (apache#15203)

* feat: Add support for --mem-pool-type and --memory-limit options to multiple benchmarks (apache#14642)

* Add support --mem-pool-type and --memory-limit options for all benchmarks

* Add --sort-spill-reservation-bytes option

* Chore/Add additional FFI unit tests (apache#14802)

* Add unit tests to FFI_ExecutionPlan

* Add unit tests for FFI table source

* Add round trip tests for volatility

* Add unit tests for FFI insert op

* Simplify string generation in unit test

Co-authored-by: Andrew Lamb <[email protected]>

* Fix drop of borrowed value

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Improve feature flag CI coverage `datafusion` and `datafusion-functions` (apache#15203)

* add benchmark for boolean_op

* fix cargo doc

* add binary_op bench

* Better comments

---------

Co-authored-by: Kristin Cowalcijk <[email protected]>
Co-authored-by: Tim Saucer <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Labels
performance (Make DataFusion faster), physical-expr (Changes to the physical-expr crates)

Successfully merging this pull request may close these issues.

BinaryExpr evaluate lacks optimization for Or and And scenarios
7 participants