-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support uses of BACK that cause correlated references: setup decorrelation handling #251
base: kian/correlated_backref_2
Are you sure you want to change the base?
Support uses of BACK that cause correlated references: setup decorrelation handling #251
Conversation
… correl queries 1/2/3/8/15
|
||
@abstractmethod | ||
def equals(self, other: object) -> bool: | ||
""" | ||
Returns whether this operator is equal to another operator. | ||
""" | ||
|
||
def __eq__(self, other: object) -> bool: | ||
return self.equals(other) | ||
|
||
def __hash__(self) -> int: | ||
return hash(repr(self)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was missing before and was causing all manner of havoc until I added these in. I'm a bit flabbergasted that this never raised a problem until now, and only on one test that was using LOWER
but couldn't match it (these operators get used as dictionary keys). I'll chalk it up to one of those "sometimes bad things work in Python until they don't" moments 🤷
|
SCAN(table=tpch.CUSTOMER, columns={'key': c_custkey, 'nation_key': c_nationkey}) | ||
AGGREGATE(keys={'key': key}, aggregations={'agg_0': SUM(value)}) | ||
PROJECT(columns={'key': key, 'value': extended_price * 1:int64 - discount}) | ||
FILTER(condition=name_15 == name, columns={'discount': discount, 'extended_price': extended_price, 'key': key}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice here, the filter name_9 == corr10.name
has been replaced with name_15 == name
, where name_15
refers to the supplier nation name & name
refers to the customer nation name. Previously, corr10.name
was required since the customer natior name was on the left side of the join, and the supplier nation name was on the right side. But now they are both on the right side, so no correlation is required.
FILTER(condition=order_date >= datetime.date(1994, 1, 1):date & order_date < datetime.date(1995, 1, 1):date, columns={'key_5': key_5, 'nation_key': nation_key}) | ||
JOIN(conditions=[t0.key == t1.customer_key], types=['inner'], columns={'key_5': t1.key, 'nation_key': t0.nation_key, 'order_date': t1.order_date}) | ||
SCAN(table=tpch.CUSTOMER, columns={'key': c_custkey, 'nation_key': c_nationkey}) | ||
AGGREGATE(keys={'key': key}, aggregations={'agg_0': SUM(value)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The aggregation keys now changed. Previously we could aggregate on the customer's nation_key value, but now we aggregate on the nation's key value. In this case those are the same thing, but the distinction matters because now we need to guarantee that the join between the left side & right side is a 1:1 zippering without any warping of cardinality.
if levels == 0: | ||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a useful trick to have rather than having to case on this behavior elsewhere, so x.shift_back(0)
returns x
.
@@ -895,6 +897,7 @@ def __init__( | |||
self._is_connection_root: bool = is_connection_root | |||
self._agg_keys: list[HybridExpr] | None = None | |||
self._join_keys: list[tuple[HybridExpr, HybridExpr]] | None = None | |||
self._correlated_children: set[int] = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This set of indices indicates which of the elements in _children
contain a correlated reference specifically to self
. This is important because only those children should have the de-correlation procedure run on them, and only if they are of a certain type.
case HybridRefExpr(): | ||
parent_result = HybridBackRefExpr( | ||
parent_result.expr.name, 1, parent_result.typ | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This weird extra-shifting trick is no longer needed, and was actually incorrect.
if not isinstance(parent_result, HybridCorrelExpr): | ||
parent_tree.correlated_children.add(len(parent_tree.children)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how we know which children of the hybrid tree are correlated. If parent_result
is a correl expr, it means that the correlated reference is actually to the tree containing parent_tree
.
@@ -1,8 +1,10 @@ | |||
ROOT(columns=[('name', name), ('n_prefix_nations', n_prefix_nations)], orderings=[(ordering_1):asc_first]) | |||
PROJECT(columns={'n_prefix_nations': n_prefix_nations, 'name': name, 'ordering_1': name}) | |||
PROJECT(columns={'n_prefix_nations': DEFAULT_TO(agg_0, 0:int64), 'name': name}) | |||
JOIN(conditions=[t0.key == t1.region_key], types=['left'], columns={'agg_0': t1.agg_0, 'name': t0.name}, correl_name='corr1') | |||
JOIN(conditions=[t0.key == t1.key], types=['left'], columns={'agg_0': t1.agg_0, 'name': t0.name}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a "simple" example of how the de-correaltion works.
- The left subtree (region) became a part of the right subtree (nation) before the correlated reference (the filter) occurred. This results in selecting region twice, but that's an efficiency problem for another day.
- The join is now on the uniquness keys of the left subtree: the region's
.key
property.
Child PR of #232 that is part of addressing #141. Adds the handling of de-correlation as a post-processing step of the hybrid tree before relational conversion starts. Handles the singular & aggregate cases as described in the issue, and handles the semi-singular and semi-aggregate cases in the same manner as their non-semi equivalents (for now), saving the optimized variants for another day.
These remaining still need to be addressed in the next follow up: