Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method #256
base: main
Conversation
from collections.abc import Callable

import pytest
from test_utils import (
    graph_fetcher,
)
from tpch_test_functions import (
Switched to using these instead of having the duplicates lying around, since I was getting tired of updating the duplicates.
lps_back_lines_impl,
lps_back_lines_price_impl,
lps_back_supplier_impl,
lps_back_supplier_name_impl,
These examples were no longer valid
- result = People(
+ result = People.CALCULATE(
We need to ensure all of these got updated, in all of our documentation, notebooks, tests, etc.
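For reference, the change being checked for is the move from the implicit call syntax to the explicit method. A minimal sketch based on this diff (the keyword terms are hypothetical):

# Old syntax, no longer supported:
result = People(first=first_name, last=last_name)

# New syntax:
result = People.CALCULATE(first=first_name, last=last_name)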
@@ -101,7 +101,7 @@
"source": [
At least 1 reviewer should re-run all 5 notebooks to confirm they behave as expected.
" O_ORDERPRIORITY=order_priority,\n", | ||
" ORDER_COUNT=COUNT(o),\n", | ||
").ORDER_BY(O_ORDERPRIORITY.ASC())\n", | ||
"pydough.to_df(output)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", |
Also added in queries 5, 21, and 22 since, by the time this is merged, correlation will have already been added.
@@ -875,21 +875,23 @@ def rel_translation(
    @staticmethod
    def preprocess_root(
        node: PyDoughCollectionQDAG,
        output_cols: list[tuple[str, str]] | None,
Uses output_cols as an override of the normal behavior.
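As a rough sketch of the intent (hypothetical call sites; the tuples pair an output alias with the name of the expression it selects):

# Normal behavior: derive the output columns from the node itself.
preprocess_root(node, None)

# Override: expose exactly these expressions, under these aliases.
preprocess_root(node, [("order_key", "key"), ("priority", "order_priority")])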
rel_expr = output.expressions[hybrid_expr]
ordered_columns.append((original_name, rel_expr))
ordered_columns.sort(key=lambda col: node.get_expression_position(col[0]))
if columns is None:
Same here.
@@ -66,6 +66,47 @@ def _load_session_info(
    return metadata, config, database, bindings


def _load_column_selection(kwargs: dict[str, object]) -> list[tuple[str, str]] | None:
Some of the testing changes were specifically to make sure this kwarg works as expected (most tests just pass in None).
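For illustration, the kwarg presumably gets exercised along these lines (a sketch only; the exact accepted shapes are whatever _load_column_selection validates):

# Most tests pass no column selection at all, i.e. columns=None:
pydough.to_df(result)

# The new kwarg narrows and renames the output columns. A plausible call,
# assuming output aliases map to existing term names:
pydough.to_df(result, columns={"priority": "order_priority", "n": "order_count"})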
for name in collection.calc_terms:
    # Skip columns that are overloaded with a name from an ancestor,
    # since they should not be used.
    if name in collection.ancestral_mapping:
        continue
See case Reference(): further down below for where this comes into play. Skipping them now allows us to avoid including an ancestral term until it is specifically requested (otherwise, we'll get tons of unnecessary correlations).
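Concretely, the goal is that a down-streamed name only materializes in a child when a descendant actually references it (a hypothetical TPC-H-style snippet):

# nation_name is down-streamed but never referenced by suppliers, so it
# should not be flushed into the child (avoiding a correlated column):
unused = nations.CALCULATE(nation_name=name).suppliers.CALCULATE(key)

# Only an explicit reference pulls the ancestral term down:
used = nations.CALCULATE(nation_name=name).suppliers.CALCULATE(key, nation_name)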
hybrid.add_successor(successor_hybrid)
self.populate_children(successor_hybrid, node, child_ref_mapping)
partition_child_idx: int = child_ref_mapping[0]
back_exprs: dict[str, HybridExpr] = {}
The trick here with PARTITION_BY is that we need all of the ancestral terms that would be lost by the change-in-ancestry to be flushed downward into the current child context, so that all of them are later accessible if needed. This does not cause correlations since it is only propagating down-streamed terms from within the single vertical spine of the child node to the Partition. E.g. consider the following snippet:
selected_lines = parts.CALCULATE(
    container
).supply_records.WHERE(
    supplier.nation.region.name == "EUROPE"
).lines
global_info = TPCH.CALCULATE(global_avg_qty=AVG(selected_lines.quantity))
result = global_info.PARTITION(
    selected_lines, name="lines", by=container
).CALCULATE(
    container_avg_qty=AVG(lines.quantity)
).lines.WHERE(
    (quantity > global_avg_qty) &
    (quantity > container_avg_qty)
)
The container term, made available for down-streaming, is never used by selected_lines. However, when it is partitioned, we first flush it downward into the terms of selected_lines so it can be used as an implicit back-reference to the original ancestry. Then, when the ancestry is changed, lines now gains access to global_avg_qty and container_avg_qty from its new ancestry. However, we do not want global_avg_qty to be similarly flushed (even though global_info is part of the original ancestry of selected_lines) because that will cause an unnecessary correlated reference. We can deal with global_avg_qty by just relying on the new ancestry to provide it for us, which is why line 1852 checks to see if it is part of the new ancestry.
@@ -474,23 +478,6 @@ def explain_unqualified(node: UnqualifiedNode, verbose: bool) -> str:
        "\nThe collection does not have any terms that can be included in a result if it is executed."
    )

    # Identify the number of BACK levels that are accessible
This behavior no longer exists.
@@ -80,13 +79,7 @@ def collection_in_context_string(
    Returns:
        The desired string representation of context and collection combined.
    """
    if isinstance(collection, BackReferenceCollection):
This behavior no longer exists.
@@ -69,8 +68,8 @@ def collection_in_context_string(
    """
    Converts a collection in the context of another collection into a single
    string in a way that elides back collection references. For example,
    if the context is A.B.C.D, and the collection is BACK(2).E.F, the result
This behavior no longer exists.
# Build a back reference collection node
# Equivalent PyDough code: `BACK(1).subcollection`
back_reference_collection_node = builder.build_back_reference_collection(table_collection, "subcollection", 1)
This behavior no longer exists.
Resolves #255. See the issue for more details about the goals of this drastic overhaul. The propagation of aliases from parent to child, implicitly creating BACK references, is referred to as down-streaming. The vast majority of the changes are updates to documentation, notebooks, and unit tests to align with the new semantics. The collection equivalents of back-references were deleted since they are no longer needed, and weren't fully supported to begin with.
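To illustrate the before/after of the overhaul, a hypothetical example on a TPC-H-style graph (the field names are assumptions):

# Before: explicit back-references reach up to an ancestor's terms.
result = nations.customers.CALCULATE(
    nation_name=BACK(1).name,
    customer_key=key,
)

# After: the ancestor down-streams its aliases, and descendants simply
# reference them by name.
result = nations.CALCULATE(nation_name=name).customers.CALCULATE(
    nation_name,
    customer_key=key,
)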