-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support uses of BACK that cause correlated references during hybrid/relational/SQLGlot conversion #232
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: Hadia Ahmed <[email protected]>
Revision 2 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 3 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 4 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 5 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 6 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 7 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 8 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 9 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 10 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 11 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 12 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 13 Co-authored-by: Hadia Ahmed <[email protected]>
@@ -0,0 +1,21 @@ | |||
ROOT(columns=[('N_NAME', N_NAME), ('REVENUE', REVENUE)], orderings=[(ordering_1):desc_last]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correlation of this query: finds lines where the supplier & customer are from the same nation.
- The join in question is the one with the name
corr10
. - The left side has all of the nations whose region is asia
- The right side aggregates all of the lineitems from customers in that region.
- Important filter is
name_9 == corr10.name
(meaning that the name of the supplier from the right side is the same as the name of the nation from the left side)
@@ -0,0 +1,21 @@ | |||
ROOT(columns=[('S_NAME', S_NAME), ('NUMWAIT', NUMWAIT)], orderings=[(ordering_1):desc_last, (ordering_2):asc_first]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correlation of this query: goes from lineitem (L1) -> order -> lineitem (L2) to find all instances where a lineitem is in an order with multiple suppliers but that specific supplier's lineitem entries in the order are the only ones that are late, so it needs to know if the supplier key from L1 and L2 are different to determine if two lineitems in the same order have different suppliers.
- First important join in question is the one with the name
corr5
.- The left side has L1 & order, filtered to only include late lines
- The right side has L2.
- Important filter is
supplier_key != corr5.supplier_key
, used to semi-join L1 and L2 when there is a matching L2 entry that does not have the same supplier key as L1.
- Second important join in question is the one with the name
corr6
.- The left side has the result of the first correlate.
- The right side has another L2.
- Important filter is
supplier_key != corr6.supplier_key
, used to anti-join L1 and L2 when there is a matching L2 entry that does not have the same supplier key as L1 (as well as some extra properties).
@@ -0,0 +1,18 @@ | |||
ROOT(columns=[('CNTRY_CODE', CNTRY_CODE), ('NUM_CUSTS', NUM_CUSTS), ('TOTACCTBAL', TOTACCTBAL)], orderings=[]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correlation of this query: first derives a global average of the selected customers' account balances, then partitions the customers but only the ones who are above the global average.
- Important join in question is the one with the name
corr1
. - The left side has the global aggregation
- The right side has the selected customers who have been aggregated.
- Inside the RHS, before the aggregation happens, the selected customers get further filtered to only include the ones where
acctbal > corr1.avg_balance
def visit_correlated_reference( | ||
self, correlated_reference: CorrelatedReference | ||
) -> None: | ||
raise NotImplementedError("TODO") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The next PR in the stack deals with this part: #234
pydough/logger/logger.py
Outdated
@@ -26,17 +26,30 @@ def get_logger( | |||
`logging.Logger` : Configured logger instance. | |||
""" | |||
logger: logging.Logger = logging.getLogger(name) | |||
level_env: str = os.getenv("PYDOUGH_LOG_LEVEL") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatter & mypy were acting up here, not sure way.
Class for HybridExpr terms that are expressions from a parent hybrid tree | ||
rather than an ancestor, which requires a correlated reference. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it needs to go out several levels, then it will be HybridCorrelExpr(HybridCorrelExpr(...))
@@ -1269,108 +1295,7 @@ def populate_children( | |||
for con_typ in reference_types: | |||
connection_type = connection_type.reconcile_connection_types(con_typ) | |||
child_idx_mapping[child_idx] = hybrid.add_child(subtree, connection_type) | |||
|
|||
def make_hybrid_agg_expr( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API is no longer needed/useful; can merge with make_hybrid_expr
& have an argument for if we are inside of an aggregation.
# Do special casing for operators that an have collection | ||
# arguments. | ||
# TODO: (gh #148) handle collection-level NDISTINCT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO already existed, just got moved around.
raise NotImplementedError( | ||
f"TODO: support converting {expr.__class__.__name__} in aggregations" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO already existed, just got moved around.
partition_by = ( | ||
node.child_access.ancestor_context.starting_predecessor | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary for cases like this:
a = PARTITION(data, name="cluster", by=(x,y))
b = a.WHERE(COUNT(cluster) > 10)
c = b(n=AVG(cluster.z))
d = c.cluster.WHERE(z > BACK(1).n)
In this case, whenever dealing with c
or d
, we need partition_by
to refer to a
, even though node.child_access.ancestor_context
can refer to b
or c
. That's why .starting_prdecessor
is used (keeps calling .predecessor
until we get back to a
).
Resolves #141. See issue for more details. This PR deals with the hybrid & relational conversion, including the creation of new types of hybrid/relational nodes to express correlation. Child PRs deal with the rest of the issue: