Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support uses of BACK that cause correlated references during hybrid/relational/SQLGlot conversion #232

Open
wants to merge 107 commits into
base: main
Choose a base branch
from

Conversation

knassre-bodo
Copy link
Contributor

@knassre-bodo knassre-bodo commented Jan 23, 2025

Resolves #141. See issue for more details. This PR deals with the hybrid & relational conversion, including the creation of new types of hybrid/relational nodes to express correlation. Child PRs deal with the rest of the issue:

knassre-bodo and others added 30 commits January 13, 2025 11:59
Revision 2

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 3

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 4

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 5

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 6

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 7

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 8

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 9

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 10

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 11

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 12

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 13

Co-authored-by: Hadia Ahmed <[email protected]>
@@ -0,0 +1,21 @@
ROOT(columns=[('N_NAME', N_NAME), ('REVENUE', REVENUE)], orderings=[(ordering_1):desc_last])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correlation of this query: finds lines where the supplier & customer are from the same nation.

  • The join in question is the one with the name corr10.
  • The left side has all of the nations whose region is asia
  • The right side aggregates all of the lineitems from customers in that region.
  • Important filter is name_9 == corr10.name (meaning that the name of the supplier from the right side is the same as the name of the nation from the left side)

@@ -0,0 +1,21 @@
ROOT(columns=[('S_NAME', S_NAME), ('NUMWAIT', NUMWAIT)], orderings=[(ordering_1):desc_last, (ordering_2):asc_first])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correlation of this query: goes from lineitem (L1) -> order -> lineitem (L2) to find all instances where a lineitem is in an order with multiple suppliers but that specific supplier's lineitem entries in the order are the only ones that are late, so it needs to know if the supplier key from L1 and L2 are different to determine if two lineitems in the same order have different suppliers.

  • First important join in question is the one with the name corr5.
    • The left side has L1 & order, filtered to only include late lines
    • The right side has L2.
    • Important filter is supplier_key != corr5.supplier_key, used to semi-join L1 and L2 when there is a matching L2 entry that does not have the same supplier key as L1.
  • Second important join in question is the one with the name corr6.
    • The left side has the result of the first correlate.
    • The right side has another L2.
    • Important filter is supplier_key != corr6.supplier_key, used to anti-join L1 and L2 when there is a matching L2 entry that does not have the same supplier key as L1 (as well as some extra properties).

@@ -0,0 +1,18 @@
ROOT(columns=[('CNTRY_CODE', CNTRY_CODE), ('NUM_CUSTS', NUM_CUSTS), ('TOTACCTBAL', TOTACCTBAL)], orderings=[])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correlation of this query: first derives a global average of the selected customers' account balances, then partitions the customers but only the ones who are above the global average.

  • Important join in question is the one with the name corr1.
  • The left side has the global aggregation
  • The right side has the selected customers who have been aggregated.
  • Inside the RHS, before the aggregation happens, the selected customers get further filtered to only include the ones where acctbal > corr1.avg_balance

Comment on lines +137 to +140
def visit_correlated_reference(
self, correlated_reference: CorrelatedReference
) -> None:
raise NotImplementedError("TODO")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next PR in the stack deals with this part: #234

@@ -26,17 +26,30 @@ def get_logger(
`logging.Logger` : Configured logger instance.
"""
logger: logging.Logger = logging.getLogger(name)
level_env: str = os.getenv("PYDOUGH_LOG_LEVEL")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatter & mypy were acting up here, not sure way.

Comment on lines +234 to +235
Class for HybridExpr terms that are expressions from a parent hybrid tree
rather than an ancestor, which requires a correlated reference.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it needs to go out several levels, then it will be HybridCorrelExpr(HybridCorrelExpr(...))

@@ -1269,108 +1295,7 @@ def populate_children(
for con_typ in reference_types:
connection_type = connection_type.reconcile_connection_types(con_typ)
child_idx_mapping[child_idx] = hybrid.add_child(subtree, connection_type)

def make_hybrid_agg_expr(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is no longer needed/useful; can merge with make_hybrid_expr & have an argument for if we are inside of an aggregation.

Comment on lines +1702 to +1704
# Do special casing for operators that an have collection
# arguments.
# TODO: (gh #148) handle collection-level NDISTINCT
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO already existed, just got moved around.

Comment on lines +1507 to +1509
raise NotImplementedError(
f"TODO: support converting {expr.__class__.__name__} in aggregations"
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO already existed, just got moved around.

Comment on lines +1925 to +1927
partition_by = (
node.child_access.ancestor_context.starting_predecessor
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary for cases like this:

a = PARTITION(data, name="cluster", by=(x,y))
b = a.WHERE(COUNT(cluster) > 10)
c = b(n=AVG(cluster.z)) 
d = c.cluster.WHERE(z > BACK(1).n)

In this case, whenever dealing with c or d, we need partition_by to refer to a, even though node.child_access.ancestor_context can refer to b or c. That's why .starting_prdecessor is used (keeps calling .predecessor until we get back to a).

@knassre-bodo knassre-bodo marked this pull request as ready for review February 10, 2025 20:18
@knassre-bodo knassre-bodo requested review from vineetg3, hadia206 and amullerbodo and removed request for vineetg3 February 10, 2025 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support correlated backreferences in QDAG to Relational conversion
1 participant