Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support uses of BACK that cause correlated references: setup decorrelation handling #251

Open
wants to merge 17 commits into
base: kian/correlated_backref_2
Choose a base branch
from

Conversation

knassre-bodo
Copy link
Contributor

@knassre-bodo knassre-bodo commented Feb 6, 2025

Child PR of #232 that is part of addressing #141. Adds the handling of de-correlation as a post-processing step of the hybrid tree before relational conversion starts. Handles the singular & aggregate cases as described in the issue, and handles the semi-singular and semi-aggregate cases in the same manner as their non-semi equivalents (for now), saving the optimized variants for another day.

These remaining still need to be addressed in the next follow up:

  • TPC-H Q22: error during relational conversion that likely means a bug happened during hybrid conversion or hybrid decorrelation
  • Correl 15: error during decorrelation causing over-excessive column pruning
  • Correl 18: tbd bug in decorrelation
  • Correl 19: bug in name resolution handling that causes the wrong term to be used when certain property names collide

Comment on lines +83 to +94

@abstractmethod
def equals(self, other: object) -> bool:
"""
Returns whether this operator is equal to another operator.
"""

def __eq__(self, other: object) -> bool:
return self.equals(other)

def __hash__(self) -> int:
return hash(repr(self))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was missing before and was causing all manner of havoc until I added these in. I'm a bit flabbergasted that this never raised a problem until now, and only on one test that was using LOWER but couldn't match it (these operators get used as dictionary keys). I'll chalk it up to one of those "sometimes bad things work in Python until they don't" moments 🤷

@knassre-bodo
Copy link
Contributor Author

knassre-bodo commented Feb 9, 2025

As of this stage, all queries are passing except these 3 (which seem like high priority bugs to fix):

  • TPC-H Q22: error during relational conversion that likely means a bug happened during hybrid conversion or hybrid decorrelation
  • Correl 3: error during decorrelation causing over-excessive column pruning
  • Correl 15: not being decorrelated

SCAN(table=tpch.CUSTOMER, columns={'key': c_custkey, 'nation_key': c_nationkey})
AGGREGATE(keys={'key': key}, aggregations={'agg_0': SUM(value)})
PROJECT(columns={'key': key, 'value': extended_price * 1:int64 - discount})
FILTER(condition=name_15 == name, columns={'discount': discount, 'extended_price': extended_price, 'key': key})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice here, the filter name_9 == corr10.name has been replaced with name_15 == name, where name_15 refers to the supplier nation name & name refers to the customer nation name. Previously, corr10.name was required since the customer natior name was on the left side of the join, and the supplier nation name was on the right side. But now they are both on the right side, so no correlation is required.

FILTER(condition=order_date >= datetime.date(1994, 1, 1):date & order_date < datetime.date(1995, 1, 1):date, columns={'key_5': key_5, 'nation_key': nation_key})
JOIN(conditions=[t0.key == t1.customer_key], types=['inner'], columns={'key_5': t1.key, 'nation_key': t0.nation_key, 'order_date': t1.order_date})
SCAN(table=tpch.CUSTOMER, columns={'key': c_custkey, 'nation_key': c_nationkey})
AGGREGATE(keys={'key': key}, aggregations={'agg_0': SUM(value)})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aggregation keys now changed. Previously we could aggregate on the customer's nation_key value, but now we aggregate on the nation's key value. In this case those are the same thing, but the distinction matters because now we need to guarantee that the join between the left side & right side is a 1:1 zippering without any warping of cardinality.

@knassre-bodo knassre-bodo marked this pull request as ready for review February 10, 2025 20:18
Comment on lines +187 to +188
if levels == 0:
return self
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a useful trick to have rather than having to case on this behavior elsewhere, so x.shift_back(0) returns x.

@@ -895,6 +897,7 @@ def __init__(
self._is_connection_root: bool = is_connection_root
self._agg_keys: list[HybridExpr] | None = None
self._join_keys: list[tuple[HybridExpr, HybridExpr]] | None = None
self._correlated_children: set[int] = set()
Copy link
Contributor Author

@knassre-bodo knassre-bodo Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This set of indices indicates which of the elements in _children contain a correlated reference specifically to self. This is important because only those children should have the de-correlation procedure run on them, and only if they are of a certain type.

Comment on lines -1603 to -1606
case HybridRefExpr():
parent_result = HybridBackRefExpr(
parent_result.expr.name, 1, parent_result.typ
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This weird extra-shifting trick is no longer needed, and was actually incorrect.

Comment on lines +1631 to +1632
if not isinstance(parent_result, HybridCorrelExpr):
parent_tree.correlated_children.add(len(parent_tree.children))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how we know which children of the hybrid tree are correlated. If parent_result is a correl expr, it means that the correlated reference is actually to the tree containing parent_tree.

@@ -1,8 +1,10 @@
ROOT(columns=[('name', name), ('n_prefix_nations', n_prefix_nations)], orderings=[(ordering_1):asc_first])
PROJECT(columns={'n_prefix_nations': n_prefix_nations, 'name': name, 'ordering_1': name})
PROJECT(columns={'n_prefix_nations': DEFAULT_TO(agg_0, 0:int64), 'name': name})
JOIN(conditions=[t0.key == t1.region_key], types=['left'], columns={'agg_0': t1.agg_0, 'name': t0.name}, correl_name='corr1')
JOIN(conditions=[t0.key == t1.key], types=['left'], columns={'agg_0': t1.agg_0, 'name': t0.name})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a "simple" example of how the de-correaltion works.

  • The left subtree (region) became a part of the right subtree (nation) before the correlated reference (the filter) occurred. This results in selecting region twice, but that's an efficiency problem for another day.
  • The join is now on the uniquness keys of the left subtree: the region's .key property.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant