Improve the performance of early exit evaluation in binary_expr #15631

Open
alamb opened this issue Apr 8, 2025 · 14 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Apr 8, 2025

Is your feature request related to a problem or challenge?

@acking-you's wonderful PR #15462 adds short-circuiting to boolean expression evaluation, which makes evaluating some complex expressions much faster.

However, to do so it uses the count_ones function to check whether the array is all true / all false:

[image: screenshot of the count_ones check in binary_expr]

We have theorized that it should be faster to simply check whether there are any set bits in the array rather than counting them all. However, @acking-you found that this is not easy to beat in practice, because Rust generates very well optimized code for count_ones.

You can see an example of this analysis in
#15462 (comment)

Describe the solution you'd like

Now that we have benchmarks for binary_op, it would be great to see if we can optimize this codepath further.

Describe alternatives you've considered

Roughly speaking you can run the benchmarks with

cargo bench --profile=profiling  --bench binary_op

And then profile that with your favorite tool

For example, you can use samply like this:

samply record   target/profiling/deps/binary_op-cce23ddc74cdfa3a --bench

Additional context

No response

@alamb alamb added the enhancement New feature or request label Apr 8, 2025
@acking-you
Contributor

This might require manual SIMD to optimize, which would increase the porting difficulty (as DuckDB notes). However, perhaps an alternative approach could be tried that makes it easier for the compiler to optimize. If feasible, it could also improve the performance of related calls in the arrow-rs library.

Some exploration

In ClickHouse's filter implementation, there is a classic manual SIMD implementation approach: code

The function loads multiple boolean mask values at once using SIMD instructions to increase the loop step size.
The best-case scenarios are:

  • The filter does not match, skipping to the next iteration.
  • The filter fully matches, copying multiple rows at once.

For other cases, the performance degrades to a handling method similar to when SIMD is not used (the additional overhead being the preparation of SIMD variables).

If this approach is applied to check whether the bits are all 1 or all 0, it should incur almost no overhead (only requiring a comparison with all-zeros or all-ones, e.g. 0x0000 or 0xFFFF).

At the same time, could DataFusion's filter process also be optimized using this method?

Alternatively, could we find another form of vectorization that does not involve manual unrolling?
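The ClickHouse-style fast paths described above can be sketched in safe Rust over a packed bitmask, without explicit SIMD intrinsics (the `filter_indices` name and the `&[u64]` chunk layout are illustrative assumptions, not ClickHouse's or arrow-rs's actual API):

```rust
/// Illustrative sketch of a ClickHouse-style filter loop over a packed
/// bitmask (one u64 = 64 selection bits). Names are hypothetical.
fn filter_indices(mask_chunks: &[u64], len: usize) -> Vec<usize> {
    let mut out = Vec::new();
    for (chunk_idx, &chunk) in mask_chunks.iter().enumerate() {
        let base = chunk_idx * 64;
        if base >= len {
            break;
        }
        let bits_in_chunk = (len - base).min(64);
        if chunk == 0 {
            // Fast path 1: nothing selected, skip the whole chunk.
            continue;
        }
        if chunk == u64::MAX && bits_in_chunk == 64 {
            // Fast path 2: everything selected, take all 64 rows at once.
            out.extend(base..base + 64);
            continue;
        }
        // Slow path: inspect individual bits.
        for bit in 0..bits_in_chunk {
            if chunk & (1u64 << bit) != 0 {
                out.push(base + bit);
            }
        }
    }
    out
}
```

The two whole-chunk comparisons (`== 0` and `== u64::MAX`) are the cheap checks described above; only mixed chunks fall back to per-bit work.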

@Dandandan
Contributor

Interesting!

I think we probably can take some inspiration from arrow-rs aggregate code, e.g. doing something like (?):

    /// Counts the number of ones
    pub fn count_ones(&self) -> usize {
        // platform specific
        const LANES: usize = 8;
        let mut chunks = self.chunks.chunks_exact(LANES);
        let mut sum: usize = 0;
        chunks.by_ref().for_each(|chunk| {
            let chunk: [u64; LANES] = chunk.try_into().unwrap();
            for i in 0..LANES {
                sum += chunk[i].count_ones() as usize;
            }
        });

        let remainder = chunks.remainder();
        for chunk in remainder {
            sum += chunk.count_ones() as usize;
        }

        sum
    }

(I didn't test whether this is faster.)
We could also define a faster version for this use case that doesn't count at all, but only returns whether the values are all true / all false, i.e. the same result as a boolean max / min.

@alamb
Contributor Author

alamb commented Apr 8, 2025

            sum += chunk[i].count_ones() as usize;

Maybe simply manually unrolling the loop to check 1024 bits at a time would let LLVM generate the best code.
Something like checking windows of 8 u64s at a time against 0x0, and exiting early if there is a non-zero value.

🤔
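A sketch of that idea over a plain `&[u64]` slice (a stand-in for the array's packed bit chunks, not the actual arrow-rs API; the window size of 8 `u64`s, i.e. 512 bits, is an assumption):

```rust
/// Check whether any bit is set, examining 8 u64s (512 bits) per
/// iteration and exiting early on the first window containing a one.
fn any_set(chunks: &[u64]) -> bool {
    const LANES: usize = 8;
    let mut iter = chunks.chunks_exact(LANES);
    for window in iter.by_ref() {
        // OR-ing the window lets LLVM keep everything in registers;
        // a single comparison then decides the early exit.
        let combined = window.iter().fold(0u64, |acc, &c| acc | c);
        if combined != 0 {
            return true;
        }
    }
    // Check any leftover chunks that didn't fill a full window.
    iter.remainder().iter().any(|&c| c != 0)
}
```

The per-window OR avoids a branch per `u64`, so the loop body stays branch-predictable while still exiting within one window of the first set bit.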

@Dandandan
Contributor

Dandandan commented Apr 8, 2025

It would be good to compare it with a boolean version as well, like the following, to see if it vectorizes better:

    pub fn all_zero(&self) -> bool {
        // platform specific
        const LANES: usize = 8;
        let mut chunks = self.chunks.chunks_exact(LANES);
        let mut all_zero = true;
        chunks.by_ref().for_each(|chunk| {
            let chunk: [u64; LANES] = chunk.try_into().unwrap();
            for i in 0..LANES {
                all_zero &= chunk[i] == 0;
            }
        });

        let remainder = chunks.remainder();
        for chunk in remainder {
            all_zero &= *chunk == 0;
        }

        all_zero
    }

(And then maybe do it in certain chunks like @alamb suggests)

@acking-you
Contributor

acking-you commented Apr 9, 2025

I have an idea that might improve the effectiveness of the short-circuit optimization, and it seems necessary to use false_count for the evaluation counting.

The current issue with DataFusion's execution of BinaryExpr:
After computing the left side, the result is not immediately used to filter the batch and reduce the input size for the right side. (Such pre-filtering is only valid for "and" operations.)
Example:

where left and right

Current:
batch:[1,2,3,4] -> execute left -> bool array: [true,false,true,false]
batch:[1,2,3,4] -> execute right -> bool array: [true,true,false,false]

Might be better:
batch:[1,2,3,4] -> execute left -> bool array: [true,false,true,false] -> batch:[1,3]
batch:[1,3] -> execute right -> bool array: [true,false] -> batch:[1]

I tried implementing this process using evaluate_selection, but the performance regressed in many cases because its internal implementation requires copying to create a new RecordBatch.

However, perhaps we could heuristically decide whether to pre-filter the RecordBatch based on false_count, for example, when false_count / array_len > 0.8.
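Such a heuristic could be as simple as the following sketch (the function name and the 0.8 threshold are illustrative, taken from the ratio above):

```rust
/// Hypothetical heuristic: pre-filter the RecordBatch with the
/// left-hand result only when it would remove most rows, so the cost
/// of copying into a new batch is paid back by a much cheaper
/// right-hand evaluation.
fn should_pre_filter(false_count: usize, array_len: usize) -> bool {
    const SELECTIVITY_THRESHOLD: f64 = 0.8;
    array_len > 0
        && (false_count as f64 / array_len as f64) > SELECTIVITY_THRESHOLD
}
```

Since `false_count` is already computed for the short-circuit check, the heuristic itself adds essentially no overhead; only the copy it triggers costs anything.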

By the way, I recently looked into ClickHouse's execution logic for BinaryOp. It immediately uses the result of each expression to filter and then proceeds to execute the next expression. Similarly, it involves copying, but it accelerates the checking and copying process using SIMD instructions. I also noticed that arrow-rs, which DataFusion uses, has a very efficient approach for this process: IterationStrategy.

What do you think of this idea? @alamb @Dandandan

The following experiments and conclusions are outdated. For more details, please refer to: #15631 (comment)

My Practice

Relevant code implementation: code1 code2

Performance test results:

  1. No improvement observed in clickbench.
  2. The following improvements were seen in binary_op benches:
short_circuit/and/one_true_first
                        time:   [479.48 µs 480.13 µs 480.96 µs]
                        change: [-5.8267% -4.4620% -3.2309%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  8 (8.00%) high severe

short_circuit/and/one_true_last
                        time:   [480.46 µs 481.17 µs 482.05 µs]
                        change: [-6.2128% -5.0143% -3.9098%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

short_circuit/and/one_true_middle_left
                        time:   [480.49 µs 481.32 µs 482.32 µs]
                        change: [-2.1930% -1.5907% -1.0753%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

short_circuit/and/one_true_middle_right
                        time:   [480.96 µs 481.92 µs 483.15 µs]
                        change: [-5.7694% -3.1496% -1.1136%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

Conclusion

ShortCircuitStrategy::PreSelection optimization did not have the effect I imagined, and it seems unnecessary.

Early exit from computations may still be a better optimization approach.

@alamb
Contributor Author

alamb commented Apr 10, 2025

ShortCircuitStrategy is a pretty neat idea

In my opinion, as long as the code is easy to understand, makes realistic benchmarks faster, and doesn't regress existing performance, it is a good thing to try.

My biggest concern with this type of optimization is that it will cause some queries to go faster and some to go slower. That tradeoff is not great in my mind, because then we have to judge whether it is worth it, and given how many people use DataFusion, it is likely that not all users have the same opinion.

@Dandandan
Contributor

What do you think of this idea? @alamb @Dandandan

I think it is a pretty good idea given that evaluation is so important.

@Dandandan
Contributor

Dandandan commented Apr 11, 2025

To show the potential: I tested this yesterday, and it reduces the execution time of short-circuiting in the all-false / all-true cases by about 25% compared to true_count / false_count:

fn all_zero(array: &BooleanArray) -> bool {
    // TODO: nulls
    // match array.nulls() {
    //     Some(nulls) => {
    //     }
    //     None => {}
    // }
    // platform specific
    // NOTE: this iterates only the full 64-bit chunks; the trailing
    // `len % 64` bits (BitChunks::remainder_bits) are not checked yet.
    let bit_chunks = array.values().bit_chunks();
    let mut all_zero = true;
    for i in bit_chunks {
        all_zero &= i == 0;
    }

    all_zero
}

fn all_one(array: &BooleanArray) -> bool {
    // NOTE: remainder bits are not checked here either.
    let bit_chunks = array.values().bit_chunks();
    let mut all_one = true;
    for i in bit_chunks {
        all_one &= i == 0xFFFFFFFFFFFFFFFF;
    }

    all_one
}
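Note that iterating `bit_chunks` only visits full 64-bit words, so arrays shorter than 64 bits (or the trailing `len % 64` bits) are never examined; arrow-rs exposes those via `BitChunks::remainder_bits` / `remainder_len`. A std-only sketch that also masks and checks the remainder might look like this (the function signatures are illustrative, not the arrow-rs API):

```rust
/// Check that every valid bit is zero. `chunks` are full 64-bit words;
/// `remainder_bits` holds the trailing `len % 64` valid bits in its
/// low positions (as arrow-rs BitChunks::remainder_bits does).
fn all_zero(chunks: &[u64], remainder_bits: u64, remainder_len: usize) -> bool {
    // Mask off the undefined high bits of the remainder word.
    let mask = if remainder_len == 0 {
        0
    } else {
        u64::MAX >> (64 - remainder_len)
    };
    chunks.iter().all(|&c| c == 0) && (remainder_bits & mask) == 0
}

/// Check that every valid bit is one, including the partial last word.
fn all_one(chunks: &[u64], remainder_bits: u64, remainder_len: usize) -> bool {
    let mask = if remainder_len == 0 {
        0
    } else {
        u64::MAX >> (64 - remainder_len)
    };
    chunks.iter().all(|&c| c == u64::MAX) && (remainder_bits & mask) == mask
}
```

With the remainder handled, a 4-element all-true array correctly returns `true` from `all_one` and `false` from `all_zero`.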

@kosiew
Contributor

kosiew commented Apr 12, 2025

hi @Dandandan

I am getting failed tests with

    #[test]
    fn test_all_one() -> Result<()> {
        // Helper function to run tests and report failures
        let run_test = |array: &BooleanArray, name: &str, expected: bool| -> Result<()> {
            let result = all_one(array);
            if result != expected {
                println!(
                    "Test case '{}' failed: expected {}, got {}",
                    name, expected, result
                );
                println!("Array contents: {:?}", array);
            }
            assert_eq!(result, expected, "all_one failed for test case '{}'", name);
            Ok(())
        };

        // Basic cases - uniform arrays
        let all_one_array = BooleanArray::from(vec![true, true, true, true]);
        run_test(&all_one_array, "all true array", true)?;

        let all_zero_array = BooleanArray::from(vec![false, false, false, false]);
        run_test(&all_zero_array, "all false array", false)?;

        // Mixed values
        let mixed_array = BooleanArray::from(vec![true, true, false, true]);
        run_test(&mixed_array, "mixed array with one false", false)?;

        // Edge cases
        let empty_array = BooleanArray::from(vec![] as Vec<bool>);
        run_test(&empty_array, "empty array", true)?;

        let single_true = BooleanArray::from(vec![true]);
        run_test(&single_true, "single true", true)?;

        let single_false = BooleanArray::from(vec![false]);
        run_test(&single_false, "single false", false)?;

        // Arrays with nulls
        let array_with_nulls: BooleanArray =
            vec![Some(true), None, Some(true)].into_iter().collect();
        run_test(&array_with_nulls, "nulls with true", true)?;

        let nulls_with_false: BooleanArray =
            vec![Some(true), Some(false), None].into_iter().collect();
        run_test(&nulls_with_false, "nulls with false", false)?;

        // Large arrays
        let large_all_one = BooleanArray::from(vec![true; 128]);
        run_test(&large_all_one, "large all true (128)", true)?;

        // Test with a single false at different positions
        let mut values = vec![true; 128];
        values[63] = false; // Last bit in first chunk
        let large_with_boundary_false = BooleanArray::from(values);
        run_test(
            &large_with_boundary_false,
            "large with false at bit 63",
            false,
        )?;

        let mut values = vec![true; 128];
        values[64] = false; // First bit in second chunk
        let large_with_second_chunk_false = BooleanArray::from(values);
        run_test(
            &large_with_second_chunk_false,
            "large with false at bit 64",
            false,
        )?;

        // Specific sizes that might trigger edge cases
        let exact_chunk_size = BooleanArray::from(vec![true; 64]);
        run_test(&exact_chunk_size, "exact chunk size (64)", true)?;

        let just_under_chunk = BooleanArray::from(vec![true; 63]);
        run_test(&just_under_chunk, "just under chunk size (63)", true)?;

        let just_over_chunk = BooleanArray::from(vec![true; 65]);
        run_test(&just_over_chunk, "just over chunk size (65)", true)?;

        Ok(())
    }

    #[test]
    fn test_all_zero() -> Result<()> {
        // Helper function to run tests and report failures
        let run_test = |array: &BooleanArray, name: &str, expected: bool| -> Result<()> {
            let result = all_zero(array);
            if result != expected {
                println!(
                    "Test case '{}' failed: expected {}, got {}",
                    name, expected, result
                );
                println!("Array contents: {:?}", array);
            }
            assert_eq!(result, expected, "all_zero failed for test case '{}'", name);
            Ok(())
        };

        // Basic cases - uniform arrays
        let all_zero_array = BooleanArray::from(vec![false, false, false, false]);
        run_test(&all_zero_array, "all false array", true)?;

        let all_one_array = BooleanArray::from(vec![true, true, true, true]);
        run_test(&all_one_array, "all true array", false)?;

        // Mixed values
        let mixed_array = BooleanArray::from(vec![false, false, true, false]);
        run_test(&mixed_array, "mixed array with one true", false)?;

        // Edge cases
        let empty_array = BooleanArray::from(vec![] as Vec<bool>);
        run_test(&empty_array, "empty array", true)?;

        let single_false = BooleanArray::from(vec![false]);
        run_test(&single_false, "single false", true)?;

        let single_true = BooleanArray::from(vec![true]);
        run_test(&single_true, "single true", false)?;

        // Arrays with nulls
        let array_with_nulls: BooleanArray =
            vec![Some(false), None, Some(false)].into_iter().collect();
        run_test(&array_with_nulls, "nulls with false", true)?;

        let nulls_with_true: BooleanArray =
            vec![Some(false), Some(true), None].into_iter().collect();
        run_test(&nulls_with_true, "nulls with true", false)?;

        // Large arrays
        let large_all_zero = BooleanArray::from(vec![false; 128]);
        run_test(&large_all_zero, "large all false (128)", true)?;

        // Test with a single true at different positions
        let mut values = vec![false; 128];
        values[63] = true; // Last bit in first chunk
        let large_with_boundary_true = BooleanArray::from(values);
        run_test(
            &large_with_boundary_true,
            "large with true at bit 63",
            false,
        )?;

        let mut values = vec![false; 128];
        values[64] = true; // First bit in second chunk
        let large_with_second_chunk_true = BooleanArray::from(values);
        run_test(
            &large_with_second_chunk_true,
            "large with true at bit 64",
            false,
        )?;

        // Specific sizes that might trigger edge cases
        let exact_chunk_size = BooleanArray::from(vec![false; 64]);
        run_test(&exact_chunk_size, "exact chunk size (64)", true)?;

        let just_under_chunk = BooleanArray::from(vec![false; 63]);
        run_test(&just_under_chunk, "just under chunk size (63)", true)?;

        let just_over_chunk = BooleanArray::from(vec![false; 65]);
        run_test(&just_over_chunk, "just over chunk size (65)", true)?;

        Ok(())
    }
---- utils::tests::test_all_zero stdout ----
Test case 'all true array' failed: expected false, got true
Array contents: BooleanArray
[
  true,
  true,
  true,
  true,
]

...

---- utils::tests::test_all_one stdout ----
Test case 'all false array' failed: expected false, got true
Array contents: BooleanArray
[
  false,
  false,
  false,
  false,
]

@acking-you
Contributor

let mut all_zero = true;
for i in bit_chunks {
    all_zero &= i == 0;
}

all_zero

As in the above code: if you do that, you can't exit early, and there should be no difference from using true_count/false_count directly.

@Dandandan
Contributor

let mut all_zero = true;
for i in bit_chunks {
    all_zero &= i == 0;
}

all_zero

As in the above code: if you do that, you can't exit early, and there should be no difference from using true_count/false_count directly.

Indeed it is not doing an early exit (it could be changed to do so per chunk of values), but I benchmarked it to be faster than true_count / false_count (it generates code that completes faster).

@Dandandan
Contributor

Dandandan commented Apr 12, 2025

@acking-you the code needs to be extended to support nulls (you can take a look at the true_count implementation in arrow-rs to do this efficiently).

@acking-you
Contributor

@acking-you the code needs to be extended to support nulls (you can take a look at the true_count implementation in arrow-rs to do this efficiently).

I have an idea for an early exit that doesn't lose performance, and I'm trying it out. If it works, I'll post the code.

@acking-you
Contributor

@acking-you the code needs to be extended to support nulls (you can take a look at the true_count implementation in arrow-rs to do this efficiently).

I have an idea for an early exit that doesn't lose performance, and I'm trying it out. If it works, I'll post the code.

@Dandandan You're absolutely right. At this point, we should no longer focus on optimizing the early return for the true_count calculation. Instead, we should consider extending support to new optimization scenarios, such as handling nulls and other cases. Here’s why:

My Attempt

I experimented with optimizing the true_count calculation using SIMD instructions, performing 256-bit comparisons and computations while incorporating early returns. This significantly reduced the overhead of checking boolean arrays in non-short-circuit scenarios (with a performance boost of up to 6x in extreme cases). However, in short-circuit scenarios like all false or all true, it still performed about 20% slower than the existing false_count/true_count implementation.

When I applied this optimization to the execution of BinaryExpr and benchmarked the entire BinaryExpr, I observed a performance regression. The reasons are as follows:

  1. The computation of true_count for boolean arrays is only on the order of 60 ns, and even with early-exit optimizations it remains at the 10 ns level (in extreme cases).
  2. When BinaryExpr does not trigger short-circuiting, its execution time is on the order of 400 µs (depending on the complexity of the expression), so optimizing the true_count calculation for boolean arrays is negligible in this context.
  3. When short-circuiting occurs, the runtime of BinaryExpr is on par with the boolean-array computation itself; in this case, calling true_count remains the optimal solution.

Other Optimization Scenarios

I realized that my earlier conclusions about the benefits of early selection execution were incorrect. After testing with the latest code changes (on a machine with an AMD 9950x CPU), the results are as follows: all scenarios that benefited from this optimization saw a reduction in execution time from 450us to 170us. This is a significant improvement, and it did not slow down any other cases!

short_circuit/and/one_true_first
                        time:   [188.36 µs 188.96 µs 189.55 µs]
                        change: [-58.870% -58.652% -58.420%] (p = 0.00 < 0.05)
                        Performance has improved.

short_circuit/and/one_true_last
                        time:   [170.10 µs 171.04 µs 171.97 µs]
                        change: [-62.788% -62.520% -62.258%] (p = 0.00 < 0.05)
                        Performance has improved.

short_circuit/and/one_true_middle
                        time:   [165.38 µs 165.97 µs 166.57 µs]
                        change: [-64.176% -63.968% -63.784%] (p = 0.00 < 0.05)
                        Performance has improved.

short_circuit/and/one_true_middle_left
                        time:   [165.02 µs 165.48 µs 165.96 µs]
                        change: [-63.671% -63.458% -63.204%] (p = 0.00 < 0.05)
                        Performance has improved.

short_circuit/and/one_true_middle_right
                        time:   [169.09 µs 169.61 µs 170.13 µs]
                        change: [-63.298% -63.089% -62.868%] (p = 0.00 < 0.05)
                        Performance has improved.

short_circuit/and/one_true_middle_right #2
                        time:   [166.43 µs 167.12 µs 167.85 µs]
                        change: [-64.092% -63.916% -63.751%] (p = 0.00 < 0.05)
                        Performance has improved.

I further optimized the work done by @kosiew. Relevant PR: #15694 (currently, there are still some issues with cargo test, and I'm working on fixing them).
