
Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method #256

Open · wants to merge 24 commits into base: main
Conversation

@knassre-bodo (Contributor) commented Feb 11, 2025:

Resolves #255. See the issue for more details about the goals of this drastic overhaul. The propagation of aliases from parent to child, implicitly creating BACK references, is referred to as down-streaming. The vast majority of the changes are updates to documentation, notebooks, and unit tests to align with the new semantics. The collection equivalents of back-references were deleted, as they are no longer needed and were never fully supported to begin with.
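As a rough mental model of down-streaming (a hypothetical sketch in plain Python, not PyDough's actual implementation), each child context can be pictured as layering its own terms on top of its parent's, so an alias defined upstream is visible downstream without an explicit BACK(n) reference:

```python
from collections import ChainMap

# Hypothetical illustration: the child context chains onto the parent's
# term mapping, so lookups fall through to ancestor-defined aliases.
parent_terms = {"container": "parts.container"}
child_terms = ChainMap({"quantity": "lines.quantity"}, parent_terms)

local = child_terms["quantity"]       # defined in the child itself
inherited = child_terms["container"]  # "down-streamed" from the parent
```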

from collections.abc import Callable

import pytest
from test_utils import (
graph_fetcher,
)
from tpch_test_functions import (
@knassre-bodo (Contributor, Author) commented Feb 12, 2025:

Switched to using these instead of having the duplicates lying around, since I was getting tired of updating the duplicates.

@knassre-bodo knassre-bodo changed the title Overhaul BACK and CALC to use downstreaming of aliases Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method Feb 13, 2025
Comment on lines -20 to -23
lps_back_lines_impl,
lps_back_lines_price_impl,
lps_back_supplier_impl,
lps_back_supplier_name_impl,
@knassre-bodo (Contributor, Author):

These examples were no longer valid.

@knassre-bodo knassre-bodo marked this pull request as ready for review February 13, 2025 08:25
result = People(
result = People.CALCULATE(
@knassre-bodo (Contributor, Author) commented Feb 13, 2025:

We need to ensure all of these got updated, in all of our documentation, notebooks, tests, etc.

@@ -101,7 +101,7 @@
"source": [
@knassre-bodo (Contributor, Author) commented Feb 13, 2025:

At least 1 reviewer should re-run all 5 notebooks to confirm they behave as expected.

" O_ORDERPRIORITY=order_priority,\n",
" ORDER_COUNT=COUNT(o),\n",
").ORDER_BY(O_ORDERPRIORITY.ASC())\n",
"pydough.to_df(output)"
]
},
{
"cell_type": "markdown",
@knassre-bodo (Contributor, Author):

Also added in 5/21/22 since, by the time this is merged, correlation will have already been added.

@@ -875,21 +875,23 @@ def rel_translation(
@staticmethod
def preprocess_root(
node: PyDoughCollectionQDAG,
output_cols: list[tuple[str, str]] | None,
@knassre-bodo (Contributor, Author):

Uses output_cols as an override of the normal behavior.

rel_expr = output.expressions[hybrid_expr]
ordered_columns.append((original_name, rel_expr))
ordered_columns.sort(key=lambda col: node.get_expression_position(col[0]))
if columns is None:
@knassre-bodo (Contributor, Author):

Same here.

@@ -66,6 +66,47 @@ def _load_session_info(
return metadata, config, database, bindings


def _load_column_selection(kwargs: dict[str, object]) -> list[tuple[str, str]] | None:
@knassre-bodo (Contributor, Author):

Some of the testing changes were specifically to make sure this kwarg works as expected (most tests just pass in None).

Comment on lines 426 to +430
for name in collection.calc_terms:
# Skip columns that are overloaded with a name from an ancestor,
# since they should not be used.
if name in collection.ancestral_mapping:
continue
@knassre-bodo (Contributor, Author):

See the `case Reference():` branch further down for where this comes into play. Skipping these names now prevents unnecessarily including an ancestral term until it is specifically requested (otherwise, we would get tons of unnecessary correlations).
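A minimal Python miniature of the skip in the loop above (the term names and the mapping's shape are hypothetical, chosen only to mirror the snippet):

```python
# Drop any calc term whose name is shadowed by a down-streamed ancestral
# alias; it is only pulled in later if a Reference() actually asks for it.
calc_terms = ["name", "container", "quantity"]
ancestral_mapping = {"container": 1}  # term name -> levels up (assumed shape)

kept = [term for term in calc_terms if term not in ancestral_mapping]
```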

hybrid.add_successor(successor_hybrid)
self.populate_children(successor_hybrid, node, child_ref_mapping)
partition_child_idx: int = child_ref_mapping[0]
back_exprs: dict[str, HybridExpr] = {}
@knassre-bodo (Contributor, Author):

The trick here with PARTITION_BY is that we need all of the ancestral terms that would be lost by the change in ancestry to be flushed downward into the current child context, so that all of them remain accessible later if needed. This does not cause correlations, since it only propagates down-streamed terms within the single vertical spine of the child node to the Partition.

E.g. consider the following snippet:

selected_lines = parts.CALCULATE(
  container
).supply_records.WHERE(
  supplier.nation.region.name == "EUROPE"
).lines
global_info = TPCH(global_avg_qty=AVG(selected_lines.quantity))
result = global_info.PARTITION(
  selected_lines, name="lines", by=container
).CALCULATE(
  container_avg_qty
).lines.WHERE(
  (quantity > global_avg_qty) & 
  (quantity > container_avg_qty)
)

The container term, made available for down-streaming, is never used by selected_lines. However, when it is partitioned, we first flush it downward into the terms of selected_lines so it can be used as an implicit back-reference to the original ancestry. Then, when the ancestry is changed, lines now gains access to global_avg_qty and container_avg_qty from its new ancestry. However, we do not want global_avg_qty to be similarly flushed (even though global_info is part of the original ancestry of selected_lines) because that will cause an unnecessary correlated reference. We can deal with global_avg_qty by just relying on the new ancestry to provide it for us, which is why line 1852 checks to see if it is part of the new ancestry.

@@ -474,23 +478,6 @@ def explain_unqualified(node: UnqualifiedNode, verbose: bool) -> str:
"\nThe collection does not have any terms that can be included in a result if it is executed."
)

# Identify the number of BACK levels that are accessible
@knassre-bodo (Contributor, Author):

This behavior no longer exists.

@@ -80,13 +79,7 @@ def collection_in_context_string(
Returns:
The desired string representation of context and collection combined.
"""
if isinstance(collection, BackReferenceCollection):
@knassre-bodo (Contributor, Author):

This behavior no longer exists.

@@ -69,8 +68,8 @@ def collection_in_context_string(
"""
Converts a collection in the context of another collection into a single
string in a way that elides back collection references. For example,
if the context is A.B.C.D, and the collection is BACK(2).E.F, the result
@knassre-bodo (Contributor, Author):

This behavior no longer exists.

Comment on lines -114 to -116
# Build a back reference collection node
# Equivalent PyDough code: `BACK(1).subcollection`
back_reference_collection_node = builder.build_back_reference_collection(table_collection, "subcollection", 1)
@knassre-bodo (Contributor, Author):

This behavior no longer exists.

Successfully merging this pull request may close these issues.

Overhaul the way back-references and CALC work to eliminate the need for BACK syntax