Support uses of BACK that cause correlated references during hybrid/relational/SQLGlot conversion #232

knassre-bodo · 2025-01-23T18:07:04Z

Resolves #141. See issue for more details. This PR deals with the hybrid & relational conversion, including the creation of new types of hybrid/relational nodes to express correlation. Child PRs deal with the rest of the issue:

Support uses of BACK that cause correlated references: SQLGlot conversion #234
Support uses of BACK that cause correlated references: setup decorrelation handling #251
Support uses of BACK that cause correlated references: fix remaining decorrelation edge cases #254

Co-authored-by: Hadia Ahmed <[email protected]>


knassre-bodo · 2025-02-06T05:58:54Z

tests/test_plan_refsols/tpch_q5.txt

@@ -0,0 +1,21 @@
+ROOT(columns=[('N_NAME', N_NAME), ('REVENUE', REVENUE)], orderings=[(ordering_1):desc_last])


Correlation of this query: finds lines where the supplier & customer are from the same nation.

The join in question is the one with the name corr10.

The left side has all of the nations whose region is Asia.

The right side aggregates all of the lineitems from customers in that region.

The important filter is name_9 == corr10.name, meaning that the name of the supplier's nation from the right side is the same as the name of the nation from the left side.
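To make the shape of this correlation concrete, here is a rough plain-Python sketch. It is not the generated plan: the rows, the field names, and the `right_side` helper are all made up for illustration, and it assumes each line is tied to the left-side nation through its customer. The point is only that the right side is evaluated per left row and its filter reads a value (`corr["name"]`, standing in for `corr10.name`) from that row.

```python
# Toy data; every row and field name here is hypothetical.
nations = [
    {"name": "CHINA", "region": "ASIA"},
    {"name": "JAPAN", "region": "ASIA"},
    {"name": "FRANCE", "region": "EUROPE"},
]
# Assumption: each line already carries its customer's and supplier's nation.
lines = [
    {"customer_nation": "CHINA", "supplier_nation": "CHINA", "revenue": 100.0},
    {"customer_nation": "CHINA", "supplier_nation": "JAPAN", "revenue": 40.0},
    {"customer_nation": "JAPAN", "supplier_nation": "JAPAN", "revenue": 70.0},
    {"customer_nation": "JAPAN", "supplier_nation": "JAPAN", "revenue": 5.0},
]


def right_side(corr):
    """Aggregate the lines for one left row; `corr` stands in for corr10."""
    selected = [
        line
        for line in lines
        if line["customer_nation"] == corr["name"]
        # Correlated filter, analogous to name_9 == corr10.name.
        and line["supplier_nation"] == corr["name"]
    ]
    return sum(line["revenue"] for line in selected)


# Left side: nations whose region is Asia.
left_side = [n for n in nations if n["region"] == "ASIA"]
result = {n["name"]: right_side(n) for n in left_side}
print(result)
```

The line from a CHINA customer with a JAPAN supplier is dropped by the correlated filter, which is the behavior the join named corr10 needs to preserve.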

knassre-bodo · 2025-02-06T06:05:44Z

tests/test_plan_refsols/tpch_q21.txt

@@ -0,0 +1,21 @@
+ROOT(columns=[('S_NAME', S_NAME), ('NUMWAIT', NUMWAIT)], orderings=[(ordering_1):desc_last, (ordering_2):asc_first])


Correlation of this query: goes from lineitem (L1) -> order -> lineitem (L2) to find every lineitem that belongs to an order with multiple suppliers where that lineitem's supplier is the only one whose entries in the order are late. To decide whether two lineitems in the same order have different suppliers, the query needs to compare the supplier keys of L1 and L2.

The first important join in question is the one with the name corr5.

The left side has L1 & order, filtered to only include late lines.

The right side has L2.

Important filter is supplier_key != corr5.supplier_key, used to semi-join L1 and L2 when there is a matching L2 entry that does not have the same supplier key as L1.

The second important join in question is the one with the name corr6.

The left side has the result of the first correlate.

The right side has another L2.

Important filter is supplier_key != corr6.supplier_key, used to anti-join L1 and L2 when there is a matching L2 entry that does not have the same supplier key as L1 (as well as some extra properties).
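A rough plain-Python sketch of the two correlated joins, for anyone trying to follow the plan. Everything here is hypothetical (the rows, the field names, the `other_supplier_lines` helper), and it assumes the "extra properties" on the anti-join amount to the L2 line also being late, as in the standard TPC-H Q21 definition.

```python
# Toy lineitems; all rows and field names are made up for illustration.
lineitems = [
    {"order_key": 1, "supplier_key": 10, "late": True},
    {"order_key": 1, "supplier_key": 20, "late": False},
    {"order_key": 2, "supplier_key": 10, "late": True},
    {"order_key": 2, "supplier_key": 30, "late": True},
    {"order_key": 3, "supplier_key": 10, "late": True},
]


def other_supplier_lines(corr, only_late):
    """L2 rows in the same order as `corr` (an L1 row) from a different supplier."""
    return [
        l2
        for l2 in lineitems
        if l2["order_key"] == corr["order_key"]
        # Correlated filter, analogous to supplier_key != corr5.supplier_key.
        and l2["supplier_key"] != corr["supplier_key"]
        and (l2["late"] or not only_late)
    ]


# Left side of the first join: L1 restricted to late lines.
l1 = [line for line in lineitems if line["late"]]

# Semi-join (corr5): keep L1 rows whose order has a line from another supplier.
after_semi = [row for row in l1 if other_supplier_lines(row, only_late=False)]

# Anti-join (corr6): drop L1 rows where another supplier in the order was also late.
after_anti = [row for row in after_semi if not other_supplier_lines(row, only_late=True)]

print([(r["order_key"], r["supplier_key"]) for r in after_anti])
```

In this toy data only order 1 / supplier 10 survives: order 3 has a single supplier (fails the semi-join) and order 2 has two late suppliers (fails the anti-join).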

knassre-bodo · 2025-02-06T06:09:51Z

tests/test_plan_refsols/tpch_q22.txt

@@ -0,0 +1,18 @@
+ROOT(columns=[('CNTRY_CODE', CNTRY_CODE), ('NUM_CUSTS', NUM_CUSTS), ('TOTACCTBAL', TOTACCTBAL)], orderings=[])


Correlation of this query: first derives a global average of the selected customers' account balances, then partitions the customers, keeping only those whose balance is above the global average.

The important join in question is the one with the name corr1.

The left side has the global aggregation.

The right side has the selected customers who have been aggregated.

Inside the RHS, before the aggregation happens, the selected customers get further filtered to only include the ones where acctbal > corr1.avg_balance.
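The same idea as a plain-Python sketch, in case it helps when reading the plan. The rows and field names are invented, and the sketch simplifies the real query (for example, it averages over all toy customers rather than applying the query's own selection rules). The left side is a single-row global aggregate, and the right side reads `avg_balance` from it before grouping.

```python
from collections import defaultdict

# Toy "selected customers"; all values are hypothetical.
customers = [
    {"cntry_code": "13", "acctbal": 100.0},
    {"cntry_code": "13", "acctbal": 500.0},
    {"cntry_code": "31", "acctbal": 900.0},
    {"cntry_code": "31", "acctbal": 300.0},
]

# Left side: one row holding the global aggregation.
global_agg = {"avg_balance": sum(c["acctbal"] for c in customers) / len(customers)}


def right_side(corr):
    """Filter by the correlated average, then aggregate per country code."""
    groups = defaultdict(lambda: {"NUM_CUSTS": 0, "TOTACCTBAL": 0.0})
    for cust in customers:
        # Correlated filter, analogous to acctbal > corr1.avg_balance.
        if cust["acctbal"] > corr["avg_balance"]:
            group = groups[cust["cntry_code"]]
            group["NUM_CUSTS"] += 1
            group["TOTACCTBAL"] += cust["acctbal"]
    return dict(groups)


result = right_side(global_agg)
print(result)
```

The filter has to run before the per-country aggregation, which is why the reference to the left side ends up inside the RHS instead of in the join condition.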

knassre-bodo · 2025-02-06T06:14:34Z

pydough/sqlglot/sqlglot_relational_expression_visitor.py

+    def visit_correlated_reference(
+        self, correlated_reference: CorrelatedReference
+    ) -> None:
+        raise NotImplementedError("TODO")


The next PR in the stack deals with this part: #234

knassre-bodo · 2025-02-06T06:23:00Z

pydough/logger/logger.py

@@ -26,17 +26,30 @@ def get_logger(
        `logging.Logger` : Configured logger instance.
    """
    logger: logging.Logger = logging.getLogger(name)
-    level_env: str = os.getenv("PYDOUGH_LOG_LEVEL")


Formatter & mypy were acting up here, not sure why.
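One plausible cause, offered as a guess and not a diagnosis: `os.getenv` is typed as returning `str | None`, so annotating its result as plain `str` is something mypy rejects. Below is a minimal sketch of a version that type-checks. The `resolve_log_level` helper and the `logging.INFO` default are hypothetical and are not taken from this PR.

```python
import logging
import os
from typing import Optional


def resolve_log_level(default: int = logging.INFO) -> int:
    """Map PYDOUGH_LOG_LEVEL to a logging level (hypothetical helper)."""
    # os.getenv returns None when the variable is unset, hence Optional[str].
    level_env: Optional[str] = os.getenv("PYDOUGH_LOG_LEVEL")
    if level_env is None:
        return default
    # Look up names such as "DEBUG" or "INFO" on the logging module.
    level = getattr(logging, level_env.strip().upper(), None)
    # Fall back to the default for unknown or non-level names.
    return level if isinstance(level, int) else default
```

With `PYDOUGH_LOG_LEVEL=debug` this returns `logging.DEBUG`; an unset or unrecognized value falls back to the default.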

knassre-bodo · 2025-02-06T06:23:53Z