BinaryExpr `evaluate` lacks optimization for `Or` and `And` scenarios #11212

Comments
Could you give an example where this would make a difference? We have an optimization pass for this.
In general I think there is a tradeoff between doing short-circuiting (what I think this ticket is describing) and having to check whether each row should be evaluated, so whether it pays off depends on the predicate.

Note DataFusion already does short-circuiting when evaluating CASE expressions:

datafusion/datafusion/physical-expr/src/expressions/case.rs Lines 125 to 187 in 4f4cd81
So, TL;DR: I think this would be an interesting optimization to explore, and as @Dandandan notes, finding some benchmarks where it matters is likely a good first step.
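To illustrate the CASE short-circuiting mentioned above: conceptually, each THEN expression is evaluated only against the rows its WHEN predicate selected, rather than against the whole batch. A minimal sketch of that idea using arrow's `filter` kernel (an illustration only, not the actual case.rs code):

```rust
use arrow::array::{Array, BooleanArray, Int32Array};
use arrow::compute::filter;

fn main() {
    // Rows where the WHEN predicate matched.
    let when_matched = BooleanArray::from(vec![true, false, true, false]);
    let input = Int32Array::from(vec![1, 2, 3, 4]);

    // Evaluate the THEN expression only on the matching rows instead of
    // all of them -- the same row-level short-circuiting idea.
    let remaining = filter(&input, &when_matched).unwrap();
    assert_eq!(remaining.len(), 2); // THEN runs on 2 of 4 rows
}
```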
My apologies, my example might not have been the best illustration, because cases with constants are indeed optimized by the rule. What I'm trying to get at is that there's currently unnecessary computation happening within `BinaryExpr::evaluate`.
I would be surprised if there is any; could you provide an example that shows the optimizer failing to do what you expect? Here is the optimizer rule that skips the right-hand side if the left is false:

datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs Lines 848 to 853 in 3421b52
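For context, here is a toy sketch of the kind of plan-time rewrite that rule performs (the `Expr` enum and names are invented for illustration; the real rule lives in expr_simplifier.rs). The fold is valid even under Kleene semantics because `false AND NULL = false`:

```rust
// Toy expression tree, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(bool),
    Column(String),
    And(Box<Expr>, Box<Expr>),
}

fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::And(left, right) => match simplify(*left) {
            // `false AND x` is always `false` (even if x is NULL),
            // so the right side disappears from the plan entirely.
            Expr::Literal(false) => Expr::Literal(false),
            l => Expr::And(Box::new(l), Box::new(simplify(*right))),
        },
        other => other,
    }
}

fn main() {
    let e = Expr::And(
        Box::new(Expr::Literal(false)),
        Box::new(Expr::Column("expensive_udf(x)".into())),
    );
    assert_eq!(simplify(e), Expr::Literal(false));
}
```

The limitation the rest of this thread turns on: this rewrite only fires when the left side folds to a constant at plan time, not when it merely turns out to be all-false at runtime.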
What I'm trying to describe is probably something like this:

```sql
SELECT
customer_id,
first_name,
last_name,
email,
total_purchases
FROM
customers
WHERE
some_udf(first_name) -- if this condition is mostly false at runtime.
AND (
(total_purchases > 1000 AND EXTRACT(YEAR FROM registration_date) > 2018)
OR (LOWER(email) LIKE '%@gmail.com' AND loyalty_points > 500)
OR (
LENGTH(first_name) + LENGTH(last_name) > 15
AND SUBSTR(last_name, 1, 1) IN ('M', 'N', 'O', 'P')
AND DATE_PART('month', last_purchase_date) BETWEEN 6 AND 8
)
)
AND (
customer_id % 3 = 0
OR EXISTS (
SELECT 1
FROM orders
WHERE orders.customer_id = customers.customer_id
AND orders.order_status = 'Completed'
AND orders.order_total > 250
)
);
```
Adding short-circuiting after the left-hand calculation alone resulted in performance improvements in my local TPCH tests. Here are the results from my run (on my M3 Pro chip with 36 GB of RAM):
Thanks for providing some benchmarks / examples @acking-you! I think the missing optimizations seem to be:
I think the root cause is that `BinaryExpr::evaluate` computes both operands eagerly. To make changes here, you'll need to modify the `BinaryExpr::evaluate` implementation.
I'm not sure if modifying `BinaryExpr` is the best approach.
In the kernel we can add a fast case as well, just as @acking-you has done in his code. The benefit of this is that it will apply to any place that uses the kernels.
I agree that we can also optimize the Kleene kernel, so we can return the result early. The current code is like:

```rust
let left: BooleanArray = expr.evaluate();
let right: BooleanArray = expr.evaluate();
let result = and_kleene(left, right); // we can optimize this too!
```

But ideally we could avoid the computation of the right side entirely:

```rust
let left: BooleanArray = expr.evaluate();
// left is always false:
// skip right evaluation (it may be expensive)
// skip and_kleene
```

#11247 indeed did a good job, but it seems like specialized, hacky code; I'm not sure it is the best solution 🤔
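A runnable sketch of that desired flow using arrow's public kernels (the `all_false` helper is hypothetical, not an existing DataFusion function). The null check matters: under Kleene semantics `NULL AND x` still depends on `x`, so the shortcut is only safe when every left value is a non-null `false`:

```rust
use arrow::array::{Array, BooleanArray};
use arrow::compute::and_kleene;

/// Hypothetical helper: `left AND right` can skip `right` only when
/// every value of `left` is a non-null `false`.
fn all_false(left: &BooleanArray) -> bool {
    left.null_count() == 0 && left.true_count() == 0
}

fn main() {
    let left = BooleanArray::from(vec![false, false, false]);
    // Imagine `right` is the result of an expensive expression that we
    // would rather not evaluate at all.
    let right = BooleanArray::from(vec![true, false, true]);

    let result = if all_false(&left) {
        left.clone() // short circuit: the AND result is just `left`
    } else {
        and_kleene(&left, &right).unwrap()
    };
    assert_eq!(result.true_count(), 0);
}
```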
Hm, I get where you're going with this; there is some more optimization possible if we add it to `BinaryExpr` as well.
I believe @Dandandan filed #11262 to discuss this idea further.
I did some more looking into the idea of using the `BooleanArray` true/false counts for this. However, I concluded that the tradeoff hinges on how `BooleanArray` implements them: https://docs.rs/arrow-array/52.0.0/src/arrow_array/array/boolean_array.rs.html#142
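For what it's worth, the all-false / all-true probe itself is cheap: `true_count` and `false_count` are essentially pop-counts over the packed value and validity bitmaps, so checking a batch costs far less than evaluating another expression over it. A small illustration (assuming the arrow crate):

```rust
use arrow::array::{Array, BooleanArray};

fn main() {
    let mask = BooleanArray::from(vec![false; 8192]);
    // Both checks are bitmap pop-counts, not per-row branches.
    assert_eq!(mask.true_count(), 0);
    assert_eq!(mask.null_count(), 0);
    // => `mask AND <anything>` is all false; the right side can be skipped.
}
```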
@alamb I sincerely apologize for not revisiting this issue or pushing forward with that PR for such a long time. However, this optimization has led to significant performance improvements in one of our internal use cases (nearly 100 times faster). Recently, seeing the community working on runtime filters, this issue came to mind again. I'm truly grateful for the community's friendliness and active engagement.

Application Scenarios

Scenario 1

I attempted to reproduce our usage scenario using the hits dataset from ClickBench:

```sql
SELECT count(*)
FROM hits
WHERE
-- Core filtering criteria: In most cases, the data can be filtered down to 0 records.
"UserID" > 7350907328996404000
AND "URL" = 'http://smeshariki.ru/recipes/search/cuZ29vZ2xlLzcwODthZHdvcmRz&page/make=43;1334313116'
-- Other filters
AND "EventTime" BETWEEN
EXTRACT(epoch FROM '2014-03-23 00:00:00'::timestamp)::BIGINT
AND
EXTRACT(epoch FROM '2014-04-22 23:59:59'::timestamp)::BIGINT
AND (("Age" BETWEEN 18 AND 35 AND "Sex" = 'female')
OR ("Income" > 50000 AND ("Interests" & 128) = 128))
AND (
("IsMobile" = 1 AND "MobilePhoneModel" LIKE 'iPhone%')
OR ("IsMobile" = 0 AND "ResolutionWidth" >= 1920)
)
AND "IsDownload" = 0
AND "IsNotBounce" = 1
AND "DontCountHits" = 0
AND split_part(split_part("URL", 'make=', 2), ';', 1) = '43'
AND "UTMSource" = 'google_ads'
AND "UTMCampaign" = 'spring_promo'
AND "HTTPError" = CAST(200 AS SMALLINT)
AND "JavascriptEnable" = 1
AND "CookieEnable" = 1
AND EXTRACT(hour FROM to_timestamp("EventTime")) BETWEEN 14 AND 18
AND (
regexp_match("UserAgent", 'Chrome/(9[0-9]|10[0-2])') IS NOT NULL
OR regexp_match("UserAgent", 'Safari/[6-8]\.') IS NOT NULL
)
AND "SocialSourceNetworkID" IN (CAST(5 AS SMALLINT),CAST(12 AS SMALLINT))
AND "SocialAction" = 'share'
AND (
("ClientTimeZone" BETWEEN -5 AND 5 AND "BrowserCountry" <> 'RU')
OR ("ClientTimeZone" NOT BETWEEN -5 AND 5 AND "BrowserCountry" = 'RU')
)
AND (
strpos("OriginalURL", 'utm_id=') > 0
OR "OpenstatCampaignID" IS NOT NULL
)
AND "Robotness" < CAST(0.3 AS FLOAT)
AND "IsArtifical" = 0
AND "IsEvent" = 1
AND "IsParameter" = 0; The performance difference of this SQL compared to the main branch without short-circuit optimization is as follows, achieving a nearly 500X improvement:
Scenario 2

I also thought of a scenario like the following:

```sql
SELECT count(*) FROM hits WHERE "UserID" > 7350909328996404000 and "URL" in (select "URL" from hits where "URL" like '%tt%');
```

However, this scenario will not see a performance improvement, because the statement is actually executed as a join:
This SQL query performs extremely slowly with or without this optimization. I believe this scenario should benefit from the optimization of dynamic join predicates, as mentioned by @alamb in #7955, referencing the DuckDB blog.

Why TPCH or ClickBench Did Not Improve

TPCH

I went through each of the test SQL statements in TPCH and found that most of them focus on joins rather than point queries. Additionally, the computational load of the predicate conditions does not yield significant benefits, as illustrated by the following SQL:

```sql
select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= date '1994-01-01'
and l_shipdate < date '1995-01-01'
and l_discount between 0.06 - 0.01 and 0.06 + 0.01
and l_quantity < 24;
```

Each predicate is a simple comparison or addition on numbers, so short-circuiting yields no profit here.

ClickBench

There are not many SQLs with more than one predicate in ClickBench; the relevant ones are Q21, Q22, Q36, Q37, Q38, Q39, Q40, Q41, and Q42:

```sql
SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "URL" <> '' GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10;
SELECT "Title", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "Title" <> '' GROUP BY "Title" ORDER BY PageViews DESC LIMIT 10;
SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "IsLink" <> 0 AND "IsDownload" = 0 GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
SELECT "TraficSourceID", "SearchEngineID", "AdvEngineID", CASE WHEN ("SearchEngineID" = 0 AND "AdvEngineID" = 0) THEN "Referer" ELSE '' END AS Src, "URL" AS Dst, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 GROUP BY "TraficSourceID", "SearchEngineID", "AdvEngineID", Src, Dst ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "TraficSourceID" IN (-1, 6) AND "RefererHash" = 3594120000172545465 GROUP BY "URLHash", "EventDate"::INT::DATE ORDER BY PageViews DESC LIMIT 10 OFFSET 100;
SELECT "WindowClientWidth", "WindowClientHeight", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "DontCountHits" = 0 AND "URLHash" = 2868770270353813622 GROUP BY "WindowClientWidth", "WindowClientHeight" ORDER BY PageViews DESC LIMIT 10 OFFSET 10000;
SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-14' AND "EventDate"::INT::DATE <= '2013-07-15' AND "IsRefresh" = 0 AND "DontCountHits" = 0 GROUP BY DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) ORDER BY DATE_TRUNC('minute', M) LIMIT 10 OFFSET 1000;
```

From the above SQLs, we can observe that:
Summary

Scenarios that benefit from short-circuit optimization:
If we can optimize the specialized query you mentioned without slowing down other queries, it would be nice to have it.
Thank you for bringing this up again @acking-you
I agree with @jayzhan211 -- if we can improve performance in your scenario without slowing down other queries, this is a great thing to do.
Nice! Thank you for this analysis.

Perhaps as a first step you could add this query (or perhaps a derivative one) to the "extended" ClickBench suite (which is DataFusion specific but covers some important cases): https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench#extended-queries

The next step would be to make a PR with the optimization; we can then use the query from the first step to verify performance is improved, and also verify that other queries are not slowed down.
Thank you for your guidance and advice @alamb.
Thanks @acking-you -- I think the benchmark is the most important thing at first.
Thank you very much for your reply. Here are some updates on this issue, @alamb:
My question: Is there a pipeline in DataFusion that can automatically perform benchmark comparisons?
Describe the bug
As shown in the code link, `BinaryExpr` directly calculates the `left` and `right` operands without considering the possibility that `left` might already be `false` while `op` is `And`, or that `left` might be `true` while `op` is `Or`.

For example, in this complex filtering statement, unnecessary calculations will be performed:
To Reproduce
When evaluating `AND` or `OR` expressions, both operands are always computed.
Expected behavior
When an `And` expression is executed and the left side evaluates to all `false`, stop and skip computing the right side.
When an `Or` expression is executed and the left side evaluates to all `true`, likewise return the result without computing the right side.
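A hedged sketch of that expected behavior as a single decision function (the `Op` enum and `can_skip_right` are invented names, not DataFusion API). The null check is required because Kleene `NULL AND x` / `NULL OR x` still depend on `x`:

```rust
use arrow::array::{Array, BooleanArray};

// Invented names, for illustration only.
enum Op {
    And,
    Or,
}

/// Given the already-evaluated left operand, decide whether the right
/// operand can be skipped entirely.
fn can_skip_right(left: &BooleanArray, op: &Op) -> bool {
    // Any null on the left blocks the shortcut under Kleene semantics.
    if left.null_count() != 0 {
        return false;
    }
    match op {
        Op::And => left.true_count() == 0, // left is all false
        Op::Or => left.false_count() == 0, // left is all true
    }
}

fn main() {
    let all_false = BooleanArray::from(vec![false, false]);
    let all_true = BooleanArray::from(vec![true, true]);
    assert!(can_skip_right(&all_false, &Op::And)); // result is all false
    assert!(can_skip_right(&all_true, &Op::Or)); // result is all true
    assert!(!can_skip_right(&all_true, &Op::And)); // must evaluate right
}
```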
Additional context
No