feat: infer equal columns from the query (#168)
## Background

An investigation of JOB (the Join Order Benchmark) revealed why optd
estimated a cardinality of 1 for all queries: the selectivities of
redundant predicates were all multiplied into the estimate. A set of
predicates contains redundant predicates if some of them can be derived
from the others. For example, consider query 1a.

```sql
SELECT mc.note AS production_note,
       t.title AS movie_title,
       t.production_year AS movie_year
FROM company_type AS ct,
     info_type AS it,
     movie_companies AS mc,
     movie_info_idx AS mi_idx,
     title AS t
WHERE ct.kind = 'production companies'
  AND it.info = 'top 250 rank'
  AND mc.note NOT LIKE '%(as Metro-Goldwyn-Mayer Pictures)%'
  AND (mc.note LIKE '%(co-production)%'
       OR mc.note LIKE '%(presents)%')
  AND ct.id = mc.company_type_id
  AND t.id = mc.movie_id
  AND t.id = mi_idx.movie_id
  AND mc.movie_id = mi_idx.movie_id
  AND it.id = mi_idx.info_type_id;
```

In the join plan node with `t` as one of the input tables, the
predicates would contain both `t.id = mc.movie_id` and `t.id =
mi_idx.movie_id`. But since there's already `mc.movie_id =
mi_idx.movie_id`, one of these three predicates is redundant.

## Goal

Given a set of predicates `P` that define the equality of `N`
columns, we want to pick the `N - 1` most selective predicates `P'` that
still define the equality of all `N` columns, and remove the rest.

## Implementation

### Identifying Equal Columns

We identify the set of predicates, and the columns they equate, in the
`GroupColumnRefs` logical property using union-find, because this lets
us reuse the base table column ref logic to determine which columns are
equal.
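
As a rough illustration of the union-find idea, the sketch below feeds the three `movie_id` predicates of query 1a into a small disjoint-set structure and reports which ones add new information. `DisjointSets` and `classify_predicates` are hypothetical names written for this example; they are not the `union-find` crate the commit depends on, nor the actual `GroupColumnRefs` code.

```rust
use std::collections::HashMap;

/// Minimal union-find (disjoint-set) with path compression.
/// Hypothetical stand-in, not the crate used by the commit.
struct DisjointSets {
    parent: Vec<usize>,
}

impl DisjointSets {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root; // path compression
        }
        self.parent[x]
    }

    /// Merges the sets of `a` and `b`.
    /// Returns false if they were already in the same set.
    fn union(&mut self, a: usize, b: usize) -> bool {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false;
        }
        self.parent[ra] = rb;
        true
    }
}

/// For each equality predicate (in order), returns true if it equates two
/// columns that were not already known to be equal, false if it is redundant.
fn classify_predicates(preds: &[(&str, &str)]) -> Vec<bool> {
    // Assign a dense integer id to every column name.
    let mut ids: HashMap<&str, usize> = HashMap::new();
    for &(l, r) in preds {
        for col in [l, r] {
            let next = ids.len();
            ids.entry(col).or_insert(next);
        }
    }
    let mut sets = DisjointSets::new(ids.len());
    preds.iter().map(|&(l, r)| sets.union(ids[l], ids[r])).collect()
}

fn main() {
    let preds = [
        ("t.id", "mc.movie_id"),
        ("t.id", "mi_idx.movie_id"),
        ("mc.movie_id", "mi_idx.movie_id"),
    ];
    for (p, is_new) in preds.iter().zip(classify_predicates(&preds)) {
        let label = if is_new { "new" } else { "redundant" };
        println!("{} = {}: {}", p.0, p.1, label);
    }
}
```

Which predicate gets flagged depends only on the order in which the predicates are seen, so this detects that a redundancy exists but does not decide which predicate to drop.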

❗ ***However, `BaseTableColumnRef` does not handle table aliases, so if
`t1` and `t2` are both aliases for `t`, `t1.a = t2.b` and `t1.b = t2.a`
will be treated as the same predicate.***
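
The alias caveat can be reproduced with a small sketch. The struct below is a hypothetical mirror of `BaseTableColumnRef` that assumes it stores only a table name and a column index (the two fields visible in the diff); `resolve` and `normalize` are helpers invented for this example.

```rust
/// Hypothetical mirror of a base-table column reference: it records the
/// underlying table and column index, but not the alias used in the query.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct BaseTableColumnRef {
    table: String,
    col_idx: usize,
}

/// Resolves an aliased column to its base-table reference.
/// The alias is dropped, which is the source of the ambiguity.
fn resolve(_alias: &str, table: &str, col_idx: usize) -> BaseTableColumnRef {
    BaseTableColumnRef { table: table.to_string(), col_idx }
}

/// Order-insensitive form of an equality predicate (assumed for this example).
fn normalize(
    l: BaseTableColumnRef,
    r: BaseTableColumnRef,
) -> (BaseTableColumnRef, BaseTableColumnRef) {
    if l <= r { (l, r) } else { (r, l) }
}

fn main() {
    // `t1` and `t2` both alias table `t`; column `a` is index 0, `b` is index 1.
    let p1 = normalize(resolve("t1", "t", 0), resolve("t2", "t", 1)); // t1.a = t2.b
    let p2 = normalize(resolve("t1", "t", 1), resolve("t2", "t", 0)); // t1.b = t2.a

    // Both predicates collapse to "t.a = t.b" once the aliases are gone.
    assert_eq!(p1, p2);
    println!("{:?}", p1);
}
```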

### Computing Selectivity for `P'`

The difficulty is that we don't get to see all the predicates at once.
Instead, we see them gradually as we move up the plan tree, and a child
might have already picked a predicate that is not in `P'`. E.g. in the
above example, even if `mc.movie_id = mi_idx.movie_id` (denoted `p3`) is
the least selective predicate and thus should have been eliminated, it
will be picked by the join of `mc` and `mi_idx` because it's the only
predicate that join sees.

Therefore, the two predicates involving `t` (denote `t.id = mc.movie_id`
as `p1` and `t.id = mi_idx.movie_id` as `p2`) should introduce a
"selectivity adjustment factor". More specifically, if we see `p1` first,
we can just pick it, since `p1` and `p3` are not redundant. When we see
`p2`, we need to pick two of the three predicates, but `p3` and `p1`
have already been picked. So instead of multiplying the total
selectivity by `p2`'s selectivity, `p2` contributes an adjustment factor
computed as `MSP({p1, p2, p3}) / MSP({p1, p3})`, where `MSP` is the
multiplied selectivities of the Most Selective Predicates that define
the equality of the columns. `MSP` is computed using Kruskal's
algorithm.

This idea can be generalized to more predicates.
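
The adjustment factor can be sketched as follows. This is a minimal illustration, not the code in this commit: `EqPred`, `msp`, and `adjustment_factor` are hypothetical names, the selectivity values are made up, and "most selective" is assumed to mean the lowest selectivity value, so Kruskal's algorithm processes predicates in ascending order of selectivity.

```rust
/// An equality predicate between two columns (by index) with its selectivity.
#[derive(Clone, Copy)]
struct EqPred {
    left: usize,
    right: usize,
    sel: f64,
}

/// Union-find lookup with path compression.
fn find(parent: &mut [usize], x: usize) -> usize {
    let mut root = x;
    while parent[root] != root {
        root = parent[root];
    }
    let mut cur = x;
    while parent[cur] != root {
        let next = parent[cur];
        parent[cur] = root;
        cur = next;
    }
    root
}

/// Multiplied selectivity of the most selective predicates that still connect
/// every column the predicates mention (Kruskal over edges sorted by
/// ascending selectivity).
fn msp(num_cols: usize, preds: &[EqPred]) -> f64 {
    let mut sorted = preds.to_vec();
    sorted.sort_by(|a, b| a.sel.total_cmp(&b.sel));
    let mut parent: Vec<usize> = (0..num_cols).collect();
    let mut product = 1.0;
    for p in sorted {
        let (a, b) = (find(&mut parent, p.left), find(&mut parent, p.right));
        if a != b {
            // The predicate joins two components, so it is kept.
            parent[a] = b;
            product *= p.sel;
        }
    }
    product
}

/// Factor by which a newly seen predicate scales the selectivity that the
/// already-picked predicates have contributed so far.
fn adjustment_factor(num_cols: usize, picked: &[EqPred], new_pred: EqPred) -> f64 {
    let mut all = picked.to_vec();
    all.push(new_pred);
    msp(num_cols, &all) / msp(num_cols, picked)
}

fn main() {
    // Columns: 0 = t.id, 1 = mc.movie_id, 2 = mi_idx.movie_id.
    // Selectivities are illustrative only.
    let p1 = EqPred { left: 0, right: 1, sel: 0.25 };
    let p2 = EqPred { left: 0, right: 2, sel: 0.125 };
    let p3 = EqPred { left: 1, right: 2, sel: 0.5 };

    // p3 was picked by the child join and p1 was picked next; now p2 arrives.
    let factor = adjustment_factor(3, &[p1, p3], p2);
    println!("adjustment factor for p2: {factor}");
}
```

With these numbers, `MSP({p1, p2, p3})` keeps `p2` and `p1` and drops `p3`, so the factor for `p2` is `(0.125 * 0.25) / (0.25 * 0.5) = 0.25`. Applying the same function when `p1` arrives after `p3` gives `p1`'s own selectivity, which matches the "just pick it" case described above.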

## Results

This feature significantly improves q-error for the JOB benchmark.
Before this PR, the estimated cardinalities for all queries were just
`1`.
Gun9niR authored Apr 26, 2024
1 parent cabce13 commit bd8bbe0
Showing 10 changed files with 1,036 additions and 289 deletions.
6 changes: 6 additions & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions optd-datafusion-repr/Cargo.toml
@@ -25,3 +25,4 @@ assert_approx_eq = "1.1.0"
serde = { version = "1.0", features = ["derive"] }
serde_with = {version = "3.7.0", features = ["json"]}
bincode = "1.3.3"
union-find = { git = "https://github.com/Gun9niR/union-find-rs.git", rev = "794821514f7daefcbb8d5f38ef04e62fc18b5665" }
45 changes: 42 additions & 3 deletions optd-datafusion-repr/src/cost/base_cost.rs
@@ -4,7 +4,10 @@ mod join;
mod limit;
pub(crate) mod stats;

use crate::{plan_nodes::OptRelNodeTyp, properties::column_ref::ColumnRef};
use crate::{
plan_nodes::OptRelNodeTyp,
properties::column_ref::{BaseTableColumnRef, ColumnRef},
};
use itertools::Itertools;
use optd_core::{
cascades::{CascadesOptimizer, RelNodeContext},
@@ -207,7 +210,7 @@ impl<
&self,
col_ref: &ColumnRef,
) -> Option<&ColumnCombValueStats<M, D>> {
if let ColumnRef::BaseTableColumnRef { table, col_idx } = col_ref {
if let ColumnRef::BaseTableColumnRef(BaseTableColumnRef { table, col_idx }) = col_ref {
self.get_column_comb_stats(table, &[*col_idx])
} else {
None
@@ -314,6 +317,7 @@ mod tests {

pub const TABLE1_NAME: &str = "table1";
pub const TABLE2_NAME: &str = "table2";
pub const TABLE3_NAME: &str = "table3";

// one column is sufficient for all filter selectivity tests
pub fn create_one_column_cost_model(per_column_stats: TestPerColumnStats) -> TestOptCostModel {
@@ -327,7 +331,7 @@ mod tests {
)
}

/// Two columns is sufficient for all join selectivity tests
/// Create a cost model with two columns, one for each table. Each column has 100 values.
pub fn create_two_table_cost_model(
tbl1_per_column_stats: TestPerColumnStats,
tbl2_per_column_stats: TestPerColumnStats,
@@ -340,6 +344,41 @@ mod tests {
)
}

/// Create a cost model with three columns, one for each table. Each column has 100 values.
pub fn create_three_table_cost_model(
tbl1_per_column_stats: TestPerColumnStats,
tbl2_per_column_stats: TestPerColumnStats,
tbl3_per_column_stats: TestPerColumnStats,
) -> TestOptCostModel {
OptCostModel::new(
vec![
(
String::from(TABLE1_NAME),
TableStats::new(
100,
vec![(vec![0], tbl1_per_column_stats)].into_iter().collect(),
),
),
(
String::from(TABLE2_NAME),
TableStats::new(
100,
vec![(vec![0], tbl2_per_column_stats)].into_iter().collect(),
),
),
(
String::from(TABLE3_NAME),
TableStats::new(
100,
vec![(vec![0], tbl3_per_column_stats)].into_iter().collect(),
),
),
]
.into_iter()
.collect(),
)
}

/// We need custom row counts because some join algorithms rely on the row cnt
pub fn create_two_table_cost_model_custom_row_cnts(
tbl1_per_column_stats: TestPerColumnStats,
15 changes: 8 additions & 7 deletions optd-datafusion-repr/src/cost/base_cost/agg.rs
@@ -8,12 +8,12 @@ use optd_core::{
use serde::{de::DeserializeOwned, Serialize};

use crate::{
cost::{
base_cost::stats::{Distribution, MostCommonValues},
base_cost::DEFAULT_NUM_DISTINCT,
cost::base_cost::{
stats::{Distribution, MostCommonValues},
DEFAULT_NUM_DISTINCT,
},
plan_nodes::{ExprList, OptRelNode, OptRelNodeTyp},
properties::column_ref::{ColumnRef, ColumnRefPropertyBuilder},
properties::column_ref::{BaseTableColumnRef, ColumnRef, ColumnRefPropertyBuilder},
};

use super::{OptCostModel, DEFAULT_UNK_SEL};
@@ -61,13 +61,14 @@ impl<
} else {
// Multiply the n-distinct of all the group by columns.
// TODO: improve with multi-dimensional n-distinct
let base_table_col_refs = optimizer
let group_col_refs = optimizer
.get_property_by_group::<ColumnRefPropertyBuilder>(context.group_id, 1);
base_table_col_refs
group_col_refs
.column_refs()
.iter()
.take(group_by.len())
.map(|col_ref| match col_ref {
ColumnRef::BaseTableColumnRef { table, col_idx } => {
ColumnRef::BaseTableColumnRef(BaseTableColumnRef { table, col_idx }) => {
let table_stats = self.per_table_stats_map.get(table);
let column_stats = table_stats.and_then(|table_stats| {
table_stats.column_comb_stats.get(&vec![*col_idx])