
BinaryExpr evaluate lacks optimization for Or and And scenarios #11212

Closed

acking-you opened this issue Jul 2, 2024 · 20 comments · Fixed by #15462
Labels
bug Something isn't working performance Make DataFusion faster

Comments

@acking-you
Contributor

Describe the bug

As shown in the code link, BinaryExpr always evaluates both the left and right operands, without considering the possibility that left might already be false and op is And, or that left might be true and op is Or.

For example, in this complex filtering statement, unnecessary calculations will be performed:

SELECT 
    customer_id,
    first_name,
    last_name,
    email,
    total_purchases
FROM 
    customers
WHERE 
    1 = 0  -- This condition is always false
    AND (
        (total_purchases > 1000 AND EXTRACT(YEAR FROM registration_date) > 2018)
        OR (LOWER(email) LIKE '%@gmail.com' AND loyalty_points > 500)
        OR (
            LENGTH(first_name) + LENGTH(last_name) > 15
            AND SUBSTR(last_name, 1, 1) IN ('M', 'N', 'O', 'P')
            AND DATE_PART('month', last_purchase_date) BETWEEN 6 AND 8
        )
    )
    AND (
        customer_id % 3 = 0
        OR EXISTS (
            SELECT 1 
            FROM orders 
            WHERE orders.customer_id = customers.customer_id
            AND orders.order_status = 'Completed'
            AND orders.order_total > 250
        )
    );

To Reproduce

When evaluating AND or OR expressions

Expected behavior

When an And expression is evaluated and the left side is already all false, computation of the right side should stop.
When an Or expression is evaluated and the left side is already all true, computation of the right side should stop.
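
Sketched below are the two decision conditions (hypothetical helper functions, not DataFusion's actual code; under SQL three-valued logic, false AND NULL is false and true OR NULL is true, so a null-free, all-false left side decides an AND and a null-free, all-true left side decides an OR):

use arrow::array::{Array, BooleanArray};

/// Hypothetical guard: `lhs AND rhs` is already decided (all false) when
/// `lhs` is null-free and contains no true values, so `rhs` can be skipped.
fn lhs_decides_and(lhs: &BooleanArray) -> bool {
    lhs.null_count() == 0 && lhs.true_count() == 0
}

/// Hypothetical guard: `lhs OR rhs` is already decided (all true) when
/// `lhs` is null-free and contains no false values, so `rhs` can be skipped.
fn lhs_decides_or(lhs: &BooleanArray) -> bool {
    lhs.null_count() == 0 && lhs.false_count() == 0
}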

Additional context

No

@acking-you acking-you added the bug Something isn't working label Jul 2, 2024
@Dandandan
Contributor

Could you give an example where this would make a difference? We have an optimization pass SimplifyExpressions which already takes care of true and false constants in AND and OR during planning.

@alamb
Contributor

alamb commented Jul 2, 2024

possibility that left might already be false and op is And, or that left might be true and op is Or.

In general I think there is a tradeoff between doing short-circuiting (what I think this ticket is describing) and having to check whether each row should be evaluated

So for a predicate like (a = 5) AND (b = 10) it is very likely faster to evaluate (a = 5) and (b = 10) with very tight loops / SIMD instructions and then && the resulting booleans together than it would be to evaluate (a = 5) and then have a loop that checks the result for each row when evaluating b = 10.

However, for an example like (a = 5) AND (very_expensive_udf(b) = 10) it may well be faster to do short-circuiting execution

Note DataFusion already does short-circuiting when evaluating CASE:

let return_type = self.data_type(&batch.schema())?;
let expr = self.expr.as_ref().unwrap();
let base_value = expr.evaluate(batch)?;
let base_value = base_value.into_array(batch.num_rows())?;
let base_nulls = is_null(base_value.as_ref())?;

// start with nulls as default output
let mut current_value = new_null_array(&return_type, batch.num_rows());
// We only consider non-null values while comparing with whens
let mut remainder = not(&base_nulls)?;
for i in 0..self.when_then_expr.len() {
    let when_value = self.when_then_expr[i]
        .0
        .evaluate_selection(batch, &remainder)?;
    let when_value = when_value.into_array(batch.num_rows())?;
    // build boolean array representing which rows match the "when" value
    let when_match = eq(&when_value, &base_value)?;
    // Treat nulls as false
    let when_match = match when_match.null_count() {
        0 => Cow::Borrowed(&when_match),
        _ => Cow::Owned(prep_null_mask_filter(&when_match)),
    };
    // Make sure we only consider rows that have not been matched yet
    let when_match = and(&when_match, &remainder)?;
    // When no rows available for when clause, skip then clause
    if when_match.true_count() == 0 {
        continue;
    }

    let then_value = self.when_then_expr[i]
        .1
        .evaluate_selection(batch, &when_match)?;

    current_value = match then_value {
        ColumnarValue::Scalar(ScalarValue::Null) => {
            nullif(current_value.as_ref(), &when_match)?
        }
        ColumnarValue::Scalar(then_value) => {
            zip(&when_match, &then_value.to_scalar()?, &current_value)?
        }
        ColumnarValue::Array(then_value) => {
            zip(&when_match, &then_value, &current_value)?
        }
    };

    remainder = and_not(&remainder, &when_match)?;
}

if let Some(e) = &self.else_expr {
    // keep `else_expr`'s data type and return type consistent
    let expr = try_cast(e.clone(), &batch.schema(), return_type.clone())
        .unwrap_or_else(|_| e.clone());
    // null and unmatched tuples should be assigned else value
    remainder = or(&base_nulls, &remainder)?;
    let else_ = expr
        .evaluate_selection(batch, &remainder)?
        .into_array(batch.num_rows())?;
    current_value = zip(&remainder, &else_, &current_value)?;
}

Ok(ColumnarValue::Array(current_value))

So TL;DR: I think this would be an interesting optimization to explore, and as @Dandandan notes, finding some benchmarks where it matters is likely a good first step.

@acking-you
Contributor Author

Could you give an example where this would make a difference? We have an optimization pass SimplifyExpressions which already takes care of true and false constants in AND and OR during planning.

My apologies, my example might not have been the best illustration because cases with constants are indeed optimized by the rule. What I'm trying to get at is that there's currently unnecessary computation happening within BinaryExpr (perhaps in many cases, just calculating the left side would be sufficient to determine the result).

@jayzhan211
Contributor

What I'm trying to get at is that there's currently unnecessary computation happening within BinaryExpr (perhaps in many cases, just calculating the left side would be sufficient to determine the result).

I would be surprised if there were any; could you provide an example that shows the optimizer failing to do what you expect?

Here is the optimizer rule that skips the right-hand side if the left is false:

// false AND A --> false (even if A is null)
Expr::BinaryExpr(BinaryExpr {
    left,
    op: And,
    right: _,
}) if is_false(&left) => Transformed::yes(*left),

@acking-you
Contributor Author

What I'm trying to get at is that there's currently unnecessary computation happening within BinaryExpr (perhaps in many cases, just calculating the left side would be sufficient to determine the result).

I would be surprised if there were any; could you provide an example that shows the optimizer failing to do what you expect?

Here is the optimizer rule that skips the right-hand side if the left is false:

// false AND A --> false (even if A is null)
Expr::BinaryExpr(BinaryExpr {
    left,
    op: And,
    right: _,
}) if is_false(&left) => Transformed::yes(*left),

What I'm trying to describe is probably something like this.

SELECT 
    customer_id,
    first_name,
    last_name,
    email,
    total_purchases
FROM 
    customers
WHERE 
    some_udf(first_name)  -- if this condition is mostly false at runtime.
    AND (
        (total_purchases > 1000 AND EXTRACT(YEAR FROM registration_date) > 2018)
        OR (LOWER(email) LIKE '%@gmail.com' AND loyalty_points > 500)
        OR (
            LENGTH(first_name) + LENGTH(last_name) > 15
            AND SUBSTR(last_name, 1, 1) IN ('M', 'N', 'O', 'P')
            AND DATE_PART('month', last_purchase_date) BETWEEN 6 AND 8
        )
    )
    AND (
        customer_id % 3 = 0
        OR EXISTS (
            SELECT 1 
            FROM orders 
            WHERE orders.customer_id = customers.customer_id
            AND orders.order_status = 'Completed'
            AND orders.order_total > 250
        )
    );

@acking-you
Contributor Author

acking-you commented Jul 3, 2024

possibility that left might already be false and op is And, or that left might be true and op is Or.

In general I think there is a tradeoff between doing short-circuiting (what I think this ticket is describing) and having to check whether each row should be evaluated

So for a predicate like (a = 5) AND (b = 10) it is very likely faster to evaluate (a = 5) and (b = 10) with very tight loops / SIMD instructions and then && the resulting booleans together than it would be to evaluate (a = 5) and then have a loop that checks the result for each row when evaluating b = 10.

However, for an example like (a = 5) AND (very_expensive_udf(b) = 10) it may well be faster to do short-circuiting execution

Note DataFusion already does short-circuiting when evaluating CASE.

So TL;DR: I think this would be an interesting optimization to explore, and as @Dandandan notes, finding some benchmarks where it matters is likely a good first step.

Adding short-circuiting after the left calculation alone resulted in performance improvements in my local TPCH tests.
commit link

Here are the test results for my local TPCH run (on my M3 Pro chip with 36GB of RAM):

Comparing main and add_short_circuit
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ add_short_circuit ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  89.54ms │           90.81ms │     no change │
│ QQuery 2     │  24.09ms │           20.56ms │ +1.17x faster │
│ QQuery 3     │  37.53ms │           38.00ms │     no change │
│ QQuery 4     │  23.99ms │           23.73ms │     no change │
│ QQuery 5     │  61.97ms │           54.52ms │ +1.14x faster │
│ QQuery 6     │  21.62ms │           20.98ms │     no change │
│ QQuery 7     │  75.46ms │           75.51ms │     no change │
│ QQuery 8     │  49.76ms │           50.09ms │     no change │
│ QQuery 9     │  68.53ms │           68.13ms │     no change │
│ QQuery 10    │  61.78ms │           60.36ms │     no change │
│ QQuery 11    │  13.16ms │           13.55ms │     no change │
│ QQuery 12    │  34.62ms │           34.35ms │     no change │
│ QQuery 13    │  41.50ms │           42.54ms │     no change │
│ QQuery 14    │  32.07ms │           30.29ms │ +1.06x faster │
│ QQuery 15    │  43.47ms │           43.82ms │     no change │
│ QQuery 16    │  18.59ms │           16.59ms │ +1.12x faster │
│ QQuery 17    │  97.63ms │           95.67ms │     no change │
│ QQuery 18    │ 115.07ms │          111.64ms │     no change │
│ QQuery 19    │  53.83ms │           57.37ms │  1.07x slower │
│ QQuery 20    │  45.42ms │           44.09ms │     no change │
│ QQuery 21    │  82.97ms │           82.69ms │     no change │
│ QQuery 22    │  16.46ms │           13.59ms │ +1.21x faster │
└──────────────┴──────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                │ 1109.06ms │
│ Total Time (add_short_circuit)   │ 1088.88ms │
│ Average Time (main)              │   50.41ms │
│ Average Time (add_short_circuit) │   49.49ms │
│ Queries Faster                   │         5 │
│ Queries Slower                   │         1 │
│ Queries with No Change           │        16 │
└──────────────────────────────────┴───────────┘

@Dandandan
Contributor

Thanks for providing some benchmarks / examples @acking-you!

I think the missing optimizations seem to be:

  • Checking for array.true_count() to be all zero / all true. I think this is an optimization that might be better to implement in the arrow-rs and/or (and_kleene, or_kleene) kernels as a special case instead of in DataFusion.
  • UDFs that are Immutable/non-volatile are executed during optimization by the constant evaluator, so the extra optimization will only apply to volatile UDFs.
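
A sketch of what the kernel fast case from the first bullet might look like (a hypothetical wrapper around the existing and_kleene kernel, not actual arrow-rs code; and_kleene_fast is a stand-in name):

use arrow::array::{Array, BooleanArray};
use arrow::compute::and_kleene;
use arrow::error::ArrowError;

/// Hypothetical Kleene-AND wrapper with a fast case: under Kleene logic,
/// `false AND x` is false even when `x` is null, so a null-free, all-false
/// input already equals the final result and the bitwise AND can be skipped.
fn and_kleene_fast(
    lhs: &BooleanArray,
    rhs: &BooleanArray,
) -> Result<BooleanArray, ArrowError> {
    for side in [lhs, rhs] {
        if side.null_count() == 0 && side.true_count() == 0 {
            // Arrow arrays are cheap to clone: the underlying buffers are shared.
            return Ok(side.clone());
        }
    }
    and_kleene(lhs, rhs)
}

Note that true_count itself scans the whole bitmap, so this only wins when the avoided kernel work costs more than that scan (see the discussion further down).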

@acking-you
Contributor Author

Checking for array.true_count() to be all zero / all true. I think this is an optimization that might be better to implement in the arrow-rs and/or (and_kleene, or_kleene) kernels as a special case instead of in DataFusion.

I think the root cause is that left and right are calculated first, and then further calculations are performed based on op. The specific code is in the link below:
https://github.com/apache/datafusion/blob/main/datafusion/physical-expr/src/expressions/binary.rs#L260-L261

To make changes here, you'll need to modify the DataFusion code.
Possible Solutions: commit link

@jayzhan211
Contributor

jayzhan211 commented Jul 3, 2024

Thanks for providing some benchmarks / examples @acking-you!

I think the missing optimizations seem to be:

  • Checking for array.true_count() to be all zero / all true. I think this is an optimization that might be better to implement in the arrow-rs and/or (and_kleene, or_kleene) kernels as a special case instead of in DataFusion.
  • UDFs that are Immutable/non-volatile are executed during optimization by the constant evaluator, so the extra optimization will only apply to volatile UDFs.

I'm not sure if modifying the and/or Kleene logic helps, since it needs two boolean arrays; in this case we want to short-circuit when the left-hand side of an AND clause is a boolean array that is all false.

I think we might need to extend the expression simplification logic for the array case in Simplifier 🤔 It's not a good idea to compute ColumnarValue::Array in const_evaluator.

@Dandandan
Contributor

I'm not sure if modifying the and/or Kleene logic helps, since it needs two boolean arrays; in this case we want to short-circuit when the left-hand side of an AND clause is a boolean array that is all false.

In the kernel we can add a fast case as well, just as @acking-you has done in his code. The benefit is that it will apply anywhere that uses and / and_kleene / or / or_kleene, rather than in BinaryExpr alone.

@jayzhan211
Contributor

jayzhan211 commented Jul 3, 2024

I'm not sure if modifying the and/or Kleene logic helps, since it needs two boolean arrays; in this case we want to short-circuit when the left-hand side of an AND clause is a boolean array that is all false.

In the kernel we can add a fast case as well, just as @acking-you has done in his code. The benefit is that it will apply anywhere that uses and / and_kleene / or / or_kleene, rather than in BinaryExpr alone.

I agree that we can also optimize the Kleene kernels so we can return the result early.
But besides that, we also need to find a way to avoid the computation inside DataFusion.

The current code is like:

let left: BooleanArray = expr.evaluate()
let right: BooleanArray = expr.evaluate()
let result = and_kleene(left, right) // we can optimize this too!

But ideally we could avoid the computation of right:

let left: BooleanArray = expr.evaluate()
// left is always false
skip right evaluation // right evaluation is possibly expensive
skip and_kleene
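
Concretely, a minimal sketch of that ideal shape (hypothetical and heavily simplified; eval_logical and LogicalOp are stand-in names, the closures stand in for operand evaluation, and the real change would live inside BinaryExpr::evaluate):

use arrow::array::{Array, BooleanArray};
use arrow::compute::{and_kleene, or_kleene};
use arrow::error::ArrowError;

enum LogicalOp {
    And,
    Or,
}

/// Evaluate `lhs <op> rhs` lazily, skipping `rhs` (and the kernel call)
/// whenever `lhs` alone already determines the result.
fn eval_logical(
    op: LogicalOp,
    eval_lhs: impl FnOnce() -> BooleanArray,
    eval_rhs: impl FnOnce() -> BooleanArray, // possibly expensive
) -> Result<BooleanArray, ArrowError> {
    let lhs = eval_lhs();
    let decided = match op {
        // A null-free, all-false left side decides an AND.
        LogicalOp::And => lhs.null_count() == 0 && lhs.true_count() == 0,
        // A null-free, all-true left side decides an OR.
        LogicalOp::Or => lhs.null_count() == 0 && lhs.false_count() == 0,
    };
    if decided {
        return Ok(lhs); // eval_rhs is never called
    }
    let rhs = eval_rhs();
    match op {
        LogicalOp::And => and_kleene(&lhs, &rhs),
        LogicalOp::Or => or_kleene(&lhs, &rhs),
    }
}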

#11247 indeed does a good job, but it seems like specialized hack code; I'm not sure it is the best solution 🤔
I feel like this solution optimizes queries where most of the false values come from the left-hand side.
For queries like a AND b AND false, where the mostly-false predicate is on the right-hand side, the optimization here does not help; it is highly correlated with the benchmark we run, i.e. TPCH in your case.

@Dandandan
Contributor

Hm, I see where you're going: there is some more optimization possible if we add it to BinaryExpr as well.

@alamb
Contributor

alamb commented Jul 5, 2024

Hm, I see where you're going: there is some more optimization possible if we add it to BinaryExpr as well.

I believe @Dandandan filed #11262 to discuss this idea further

@alamb
Contributor

alamb commented Jul 5, 2024

I did some more looking into the idea of using true_count in boolean kernels in arrow to speed things up (e.g. special casing when BooleanArray::true_count and BooleanArray::row_count were the same to avoid actually computing the binary result).

However, I concluded that since BooleanArray::true_count has to check all the values of the array anyway, it might be nearly as fast to just do the binary operation (which is already quite fast) as it would be to count the true/false values:

https://docs.rs/arrow-array/52.0.0/src/arrow_array/array/boolean_array.rs.html#142

@acking-you
Contributor Author

acking-you commented Mar 26, 2025

@alamb I sincerely apologize for not revisiting this issue or pushing forward with that PR for such a long time. However, this optimization has led to significant performance improvements in one of our internal use cases: nearly 100 times faster. Recently, seeing the community working on runtime filters brought this issue to mind again. I'm truly grateful for the community's friendliness and active engagement.

Application Scenarios

Scenario 1

I attempted to reproduce our usage scenario using the hits dataset from ClickBench:

SELECT count(*)
FROM hits
WHERE 
    -- Core filtering criteria: In most cases, the data can be filtered down to 0 records.
    "UserID" > 7350907328996404000
    AND "URL" = 'http://smeshariki.ru/recipes/search/cuZ29vZ2xlLzcwODthZHdvcmRz&page/make=43;1334313116'
    
    -- Other filters
    AND "EventTime" BETWEEN 
        EXTRACT(epoch FROM '2014-03-23 00:00:00'::timestamp)::BIGINT 
        AND 
        EXTRACT(epoch FROM '2014-04-22 23:59:59'::timestamp)::BIGINT
    AND (("Age" BETWEEN 18 AND 35 AND "Sex" = 'female') 
         OR ("Income" > 50000 AND ("Interests" & 128) = 128))
    AND (
        ("IsMobile" = 1 AND "MobilePhoneModel" LIKE 'iPhone%') 
        OR ("IsMobile" = 0 AND "ResolutionWidth" >= 1920)
    )
    AND "IsDownload" = 0
    AND "IsNotBounce" = 1
    AND "DontCountHits" = 0
    AND split_part(split_part("URL", 'make=', 2), ';', 1) = '43'
    AND "UTMSource" = 'google_ads'
    AND "UTMCampaign" = 'spring_promo'
    AND "HTTPError" = CAST(200 AS SMALLINT)
    AND "JavascriptEnable" = 1
    AND "CookieEnable" = 1
    AND EXTRACT(hour FROM to_timestamp("EventTime")) BETWEEN 14 AND 18
    AND (
        regexp_match("UserAgent", 'Chrome/(9[0-9]|10[0-2])') IS NOT NULL
        OR regexp_match("UserAgent", 'Safari/[6-8]\.') IS NOT NULL
    )
    AND "SocialSourceNetworkID" IN (CAST(5 AS SMALLINT),CAST(12 AS SMALLINT))
    AND "SocialAction" = 'share'
    AND (
        ("ClientTimeZone" BETWEEN -5 AND 5 AND "BrowserCountry" <> 'RU')
        OR ("ClientTimeZone" NOT BETWEEN -5 AND 5 AND "BrowserCountry" = 'RU')
    )
    AND (
        strpos("OriginalURL", 'utm_id=') > 0 
        OR "OpenstatCampaignID" IS NOT NULL
    )
    AND "Robotness" < CAST(0.3 AS FLOAT)
    AND "IsArtifical" = 0
    AND "IsEvent" = 1
    AND "IsParameter" = 0;

Compared to the main branch without short-circuit optimization, this SQL achieves a nearly 500x improvement:

Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ add_short_circuit ┃          Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ QQuery 0     │ 5024.39ms │           10.16ms │ +494.40x faster │
└──────────────┴───────────┴───────────────────┴─────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                │ 5024.39ms │
│ Total Time (add_short_circuit)   │   10.16ms │
│ Average Time (main)              │ 5024.39ms │
│ Average Time (add_short_circuit) │   10.16ms │
│ Queries Faster                   │         1 │
│ Queries Slower                   │         0 │
│ Queries with No Change           │         0 │
└──────────────────────────────────┴───────────┘

Scenario 2

I also thought of a scenario where WHERE ... IN (subquery) could be used:

SELECT count(*) FROM hits WHERE "UserID" > 7350909328996404000 and  "URL" in (select "URL" from hits where "URL" like '%tt%');

However, this scenario does not benefit, because the statement is actually executed as a join:

LogicalPlan:
Projection: count(Int64(1)) AS count(*)
└─ Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
   └─ Projection
      └─ LeftSemi Join: hits.URL = __correlated_sq_1.URL
         ├─ Projection: hits.URL
         │  └─ Filter: hits.UserID > Int64(9050909328996404000)
         │     └─ TableScan: hits
         │        (projection=[UserID, URL], filter=UserID > 9e18)
         └─ SubqueryAlias: __correlated_sq_1
            └─ Filter: hits.URL LIKE "%tt%"
               └─ TableScan: hits
                  (projection=[URL], filter=URL contains "tt")

PhysicalPlan:
ProjectionExec: count(*) 
└─ AggregateExec (Final)
   └─ CoalescePartitionsExec
      └─ AggregateExec (Partial)
         └─ ProjectionExec
            └─ CoalesceBatchesExec
               └─ HashJoinExec (LeftSemi)
                  ├─ RepartitionExec (Hash)
                  │  └─ CoalesceBatchesExec
                  │     └─ FilterExec (UserID > 9e18)
                  │        └─ DataSourceExec
                  │           (parquet/hits, 12 partitions)
                  │           [projection: UserID, URL]
                  │           [pruning: UserID_max > 9e18]
                  └─ RepartitionExec (Hash)
                     └─ CoalesceBatchesExec
                        └─ FilterExec (URL LIKE "%tt%")
                           └─ DataSourceExec
                              (parquet/hits, 12 partitions)
                              [projection: URL]
                              [predicate: URL contains "tt"]

This SQL query performs extremely slowly with or without this optimization. I believe this scenario should benefit from the Dynamic Join Predicates optimization, as mentioned by @alamb in #7955, referencing the DuckDB blog.

Why TPCH or ClickBench Did Not Improve

TPCH

I went through each of the test SQL statements in TPCH and found that most of them focus on joins rather than point queries. Additionally, the predicates are computationally cheap, so short-circuiting yields no significant benefit, as illustrated by the following SQL:

select
    sum(l_extendedprice * l_discount) as revenue
from
    lineitem
where
        l_shipdate >= date '1994-01-01'
  and l_shipdate < date '1995-01-01'
  and l_discount between 0.06 - 0.01 and 0.06 + 0.01
  and l_quantity < 24;

Each predicate is a simple comparison or addition on numbers, which leaves no profit for short-circuiting.

ClickBench

Not many queries in ClickBench have more than one predicate; those that do include Q21, Q22, Q36, Q37, Q38, Q39, Q40, Q41, and Q42:

SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "URL" <> '' GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10;

SELECT "Title", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "Title" <> '' GROUP BY "Title" ORDER BY PageViews DESC LIMIT 10;

SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "IsLink" <> 0 AND "IsDownload" = 0 GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;

SELECT "TraficSourceID", "SearchEngineID", "AdvEngineID", CASE WHEN ("SearchEngineID" = 0 AND "AdvEngineID" = 0) THEN "Referer" ELSE '' END AS Src, "URL" AS Dst, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 GROUP BY "TraficSourceID", "SearchEngineID", "AdvEngineID", Src, Dst ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "TraficSourceID" IN (-1, 6) AND "RefererHash" = 3594120000172545465 GROUP BY "URLHash", "EventDate"::INT::DATE ORDER BY PageViews DESC LIMIT 10 OFFSET 100;

SELECT "WindowClientWidth", "WindowClientHeight", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "DontCountHits" = 0 AND "URLHash" = 2868770270353813622 GROUP BY "WindowClientWidth", "WindowClientHeight" ORDER BY PageViews DESC LIMIT 10 OFFSET 10000;

SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-14' AND "EventDate"::INT::DATE <= '2013-07-15' AND "IsRefresh" = 0 AND "DontCountHits" = 0 GROUP BY DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) ORDER BY DATE_TRUNC('minute', M) LIMIT 10 OFFSET 1000;

From the above SQLs, we can observe that:

  1. Most of them are simple predicate conditions.
  2. The slightly more complex predicate conditions are followed by LIMIT 10, and their computational cost is still relatively low (especially compared to scenarios with heavy string matching).

Summary

Scenarios that benefit from short-circuit optimization:

  1. The initial filter can filter out most of the data.
  2. Subsequent filters involve time-consuming computations (e.g., operations that mostly involve matching strings longer than 1k characters, or predicates with a large number of string manipulations).

@jayzhan211
Contributor

If we can optimize the specialized query you mentioned without slowing down other queries, it would be nice to have it.

@alamb
Contributor

alamb commented Mar 26, 2025

Thank you for bringing this up again @acking-you

If we can optimize the specialized query you mentioned without slowing down other queries, it would be nice to have it.

I agree with @jayzhan211 -- if we can improve performance in your scenario without slowing down other queries this is a great thing to do

I attempted to reproduce our usage scenario using the hits dataset from ClickBench:

Nice! Thank you for this analysis

Perhaps as a first step you could add this query (or perhaps a derivative one) to the "extended" clickbench suite (that is DataFusion specific but covers some important cases): https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench#extended-queries

The next step would be to make a PR with the optimization -- we can then use the query in the first step to verify performance is improved and also verify other queries are not slowed down.

@alamb alamb added the performance Make DataFusion faster label Mar 26, 2025
@acking-you
Contributor Author

Thank you for your guidance and advice @alamb .
I will try to work on these later today (I might be a bit busy right now).

@alamb
Contributor

alamb commented Mar 27, 2025

Thanks @acking-you -- I think the benchmark is the most important thing at first

@acking-you
Contributor Author

Thank you very much for your reply, @alamb. Here are some updates on this issue:

  1. I have added the extended SQL in this PR Add short circuit evaluation for AND and OR #15462; you can check the details there: SQL link
  2. The test results of the local extended SQL are as follows:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ add_short_circuit ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  964.48ms │         1006.79ms │     no change │
│ QQuery 1     │  409.89ms │          419.52ms │     no change │
│ QQuery 2     │  838.25ms │          868.74ms │     no change │
│ QQuery 3     │  408.15ms │          396.88ms │     no change │
│ QQuery 4     │ 1029.80ms │          783.40ms │ +1.31x faster │
│ QQuery 5     │ 9429.76ms │         8835.06ms │ +1.07x faster │
│ QQuery 6     │ 4096.47ms │         1382.42ms │ +2.96x faster │
└──────────────┴───────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                │ 17176.80ms │
│ Total Time (add_short_circuit)   │ 13692.80ms │
│ Average Time (main)              │  2453.83ms │
│ Average Time (add_short_circuit) │  1956.11ms │
│ Queries Faster                   │          3 │
│ Queries Slower                   │          0 │
│ Queries with No Change           │          4 │
└──────────────────────────────────┴────────────┘

My question: Is there a pipeline in DataFusion that can automatically perform benchmark comparisons?

  3. I'm currently encountering some issues mentioned in Add short circuit evaluation for AND and OR #15462 (comment) and can't pass the tests yet. I'm trying to fix them.
