feat: infer equal columns from the query #168

Gun9niR · 2024-04-25T21:51:48Z

Background

An investigation of JOB revealed why optd estimated a cardinality of 1 for all queries: the selectivities of redundant predicates were all being applied. A set of predicates contains redundant predicates if some of them are implied by the others. For example, consider query 1a.

SELECT mc.note AS production_note,
       t.title AS movie_title,
       t.production_year AS movie_year
FROM company_type AS ct,
     info_type AS it,
     movie_companies AS mc,
     movie_info_idx AS mi_idx,
     title AS t
WHERE ct.kind = 'production companies'
  AND it.info = 'top 250 rank'
  AND mc.note NOT LIKE '%(as Metro-Goldwyn-Mayer Pictures)%'
  AND (mc.note LIKE '%(co-production)%'
       OR mc.note LIKE '%(presents)%')
  AND ct.id = mc.company_type_id
  AND t.id = mc.movie_id
  AND t.id = mi_idx.movie_id
  AND mc.movie_id = mi_idx.movie_id
  AND it.id = mi_idx.info_type_id;

In the join plan node with t as one of the input tables, the predicates would contain both t.id = mc.movie_id and t.id = mi_idx.movie_id. But since there's already mc.movie_id = mi_idx.movie_id, one of these three predicates is redundant.

Goal

Given a set of predicates P that define the equality of N columns, we want to pick the N - 1 most selective predicates P' and remove the rest.

Implementation

Identifying Equal Columns

We identify the sets of equal columns in the GroupColumnRefs logical property using union-find, because this lets us reuse the base table column ref logic to determine which columns are equal.
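As a rough illustration of the union-find idea (a Python sketch, not the Rust code in this PR), each equality predicate merges the sets of its two columns. The `DisjointSets` class and `equal_column_sets` helper are hypothetical names, and columns are modeled as plain strings rather than column refs.

```python
class DisjointSets:
    """Minimal union-find over hashable items (here: column names)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen items as their own singleton set.
        self.parent.setdefault(x, x)
        # Walk to the root, halving the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def groups(self):
        out = {}
        for x in list(self.parent):
            out.setdefault(self.find(x), set()).add(x)
        return list(out.values())


def equal_column_sets(eq_predicates):
    """Group columns that are transitively equal under the given predicates."""
    ds = DisjointSets()
    for left, right in eq_predicates:
        ds.union(left, right)
    return ds.groups()


# Equality predicates from query 1a, as (left column, right column) pairs.
preds_1a = [
    ("ct.id", "mc.company_type_id"),
    ("t.id", "mc.movie_id"),
    ("t.id", "mi_idx.movie_id"),
    ("mc.movie_id", "mi_idx.movie_id"),
    ("it.id", "mi_idx.info_type_id"),
]
sets_1a = equal_column_sets(preds_1a)
```

For query 1a this yields three sets, one of which is {t.id, mc.movie_id, mi_idx.movie_id}: three columns tied together by three predicates, so one of those predicates is redundant.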

❗ However, BaseTableColumnRef does not handle table aliases, so if t1 and t2 are both aliases for t, t1.a = t2.b and t1.b = t2.a will be treated as the same predicate.
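The alias caveat can be reproduced in a small Python sketch. It assumes that column refs are resolved to their base table and that an equality is compared as an unordered pair of columns; this is an assumption for illustration, not a description of how BaseTableColumnRef is implemented. `ALIASES` and `to_base_pred` are hypothetical.

```python
# Hypothetical alias map: t1 and t2 both refer to base table t.
ALIASES = {"t1": "t", "t2": "t"}


def to_base_pred(pred, aliases):
    """Rewrite an equality between two 'table.col' strings onto base tables.

    The result is an unordered pair, so a = b and b = a compare equal.
    """
    def base(col):
        table, name = col.split(".")
        return f"{aliases.get(table, table)}.{name}"

    return frozenset(base(c) for c in pred)


# With aliases resolved away, both predicates become {t.a, t.b}.
p_a = to_base_pred(("t1.a", "t2.b"), ALIASES)
p_b = to_base_pred(("t1.b", "t2.a"), ALIASES)
collapsed = p_a == p_b

# Keeping the aliases (empty alias map) keeps the predicates distinct.
distinct = to_base_pred(("t1.a", "t2.b"), {}) != to_base_pred(("t1.b", "t2.a"), {})
```

Under these assumptions `collapsed` is true: once the alias is dropped, the two self-join predicates are indistinguishable.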

Computing Selectivity for `P'`

The difficulty is that we don't get to see all the predicates at once. Instead, we see them gradually as we move up the plan tree, and a child may already have picked a predicate that is not in P'. For example, in the query above, even if mc.movie_id = mi_idx.movie_id (denoted p3) is the least selective predicate and thus should have been eliminated, it will still be picked by the join of mc and mi_idx, because it is the only predicate that join sees.

Therefore, the two predicates involving t (denote t.id = mc.movie_id as p1 and t.id = mi_idx.movie_id as p2) should introduce a "selectivity adjustment factor". More specifically, if we see p1 first, we can just pick it, since p1 and p3 are not redundant with each other. When we see p2, we need to pick two of the three predicates, but p3 and p1 have already been picked. So instead of simply multiplying the total selectivity by p2's selectivity, p2 contributes an adjustment factor computed as MSP({p1, p2, p3}) / MSP({p1, p3}), where MSP is the product of the selectivities of the Most Selective Predicates that define the equality of the columns. MSP is computed using Kruskal's algorithm.
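A minimal Python sketch of MSP and the adjustment factor follows. It treats columns as graph nodes and predicates as edges weighted by selectivity, and runs Kruskal's algorithm in ascending order of selectivity. The function name `msp` and all selectivity values are made up for illustration and are not taken from the PR.

```python
def msp(predicates):
    """Product of the selectivities of the most selective predicates that
    still connect every column (a minimum spanning forest, via Kruskal).

    predicates: iterable of (left_col, right_col, selectivity).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    product = 1.0
    # Lower selectivity means more selective, so consider those edges first.
    for left, right, sel in sorted(predicates, key=lambda p: p[2]):
        root_l, root_r = find(left), find(right)
        if root_l != root_r:  # not already implied by the predicates kept so far
            parent[root_l] = root_r
            product *= sel
    return product


# Made-up selectivities for the three predicates of query 1a.
p1 = ("t.id", "mc.movie_id", 0.01)
p2 = ("t.id", "mi_idx.movie_id", 0.02)
p3 = ("mc.movie_id", "mi_idx.movie_id", 0.1)

already_applied = msp([p1, p3])        # both kept: 0.01 * 0.1
target = msp([p1, p2, p3])             # keeps p1 and p2, skips p3
adjustment = target / already_applied  # factor contributed when p2 is seen
```

With these numbers the factor is about 0.2, which equals p2's selectivity divided by p3's: multiplying by it cancels the previously applied p3 and applies p2 instead.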

This idea can be generalized to more predicates.
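One possible reading of the generalization, sketched in Python under the same assumptions as before (hypothetical names, made-up selectivities): each newly seen predicate contributes MSP(seen so far plus the new predicate) / MSP(seen so far). The factors then telescope, so their product equals the MSP of the full predicate set regardless of what the children picked earlier.

```python
import math


def msp(predicates):
    """Kruskal over (left_col, right_col, selectivity) edges, most selective first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    product = 1.0
    for left, right, sel in sorted(predicates, key=lambda p: p[2]):
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            parent[root_l] = root_r
            product *= sel
    return product


def adjustment_factors(arrival_order):
    """Factor contributed by each predicate, in the order the plan tree reveals them."""
    seen, factors = [], []
    for pred in arrival_order:
        before = msp(seen)  # msp([]) is 1.0
        seen.append(pred)
        factors.append(msp(seen) / before)
    return factors


# Four columns made equal by five predicates (selectivities are made up).
arrivals = [
    ("a", "b", 0.5),
    ("c", "d", 0.2),
    ("b", "c", 0.1),
    ("a", "d", 0.05),
    ("a", "c", 0.4),
]
factors = adjustment_factors(arrivals)
total = math.prod(factors)
```

Here the last predicate is fully redundant and contributes a factor of 1, while the fourth contributes a factor below 1 because it displaces a less selective predicate that was picked earlier.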

Results

This feature significantly improves q-error on the JOB benchmark. Before this PR, the estimated cardinalities for all queries were just 1.

wangpatrick57

Really clean code overall. I left comments about possible bugs in core functionality, additional test coverage, and places where code comments would help readability. If the bugs are real, we should fix them before merging. The code comments should be quick to add, so I think you should add them in this PR as well. The test coverage can be done in a future PR since this PR is already so big.

optd-datafusion-repr/Cargo.toml

optd-datafusion-repr/src/cost/base_cost.rs

optd-datafusion-repr/src/cost/base_cost/join.rs

optd-datafusion-repr/src/properties/column_ref.rs

wangpatrick57

LGTM

Gun9niR added 3 commits April 25, 2024 14:35

refactor column ref

aa003c2

refactor BaseTableColumnRef

b1ed9a5

feat: infer eq columns for inner and cross join

1a29aaa

Gun9niR marked this pull request as draft April 25, 2024 21:51

Gun9niR added 8 commits April 25, 2024 19:22

feat: keep both input and output correlation

ff9ce90

add ut for eq column set, add join skeleton

7303a55

integrate eq columns to compute_cost

07a3d13

fix: add ut and fix bug

b97abc7

fix perftest print message for JOB

0bc8d6b

comments

0a2be18

fix line wrap

6058a0b

fix ut

fe5e70e

Gun9niR requested a review from wangpatrick57 April 26, 2024 07:20

wangpatrick57 requested changes Apr 26, 2024

Gun9niR added 7 commits April 26, 2024 12:39

address some of the comments

d76b9e4

update mst comments

1e6dca5

update redundant predicates comment

53c6046

update comment

7631f5b

update side effect

5a9799d

check disjoint set only has one set

9f64dc4

check disjoint set has N columns

5505729

Gun9niR requested a review from wangpatrick57 April 26, 2024 17:51

Gun9niR marked this pull request as ready for review April 26, 2024 18:24

wangpatrick57 approved these changes Apr 26, 2024

wangpatrick57 merged commit bd8bbe0 into main Apr 26, 2024
1 check passed

wangpatrick57 deleted the zhidong/eq-column branch April 26, 2024 20:01
