Overhaul the way back-references and CALC work to eliminate the need for BACK syntax #255

knassre-bodo · 2025-02-10T22:53:46Z

The goal of this issue is to transform the user-facing syntax for calcs & back-references to eliminate the need to keep track of BACK reference index levels; instead, once a term is defined inside of a calc, it is available to all descendants using that name (but there are no collisions allowed). Also, the implicit parentheses-based syntax is replaced with an explicit CALCULATE method, and the e2e APIs to_sql and to_df are augmented to allow refinement of column selection/names.

Consider the following PyDough snippet that lists every region/nation name combination, alphabetized by nation name:

result = regions.nations(
  region_name=BACK(1).name, nation_name=name
).ORDER_BY(
  nation_name.ASC()
)
return pydough.to_df(result)

Under the new syntax, this would become as follows:

result = regions.CALCULATE(
  region_name=name
).nations.ORDER_BY(
  name.ASC()
)
return pydough.to_df(result, columns={"region_name": "region_name", "nation_name", "name"})

# Alternative:
result = regions.CALCULATE(
  region_name=name
).nations.CALCULATE(
  nation_name=name
).ORDER_BY(
  nation_name.ASC()
)
return pydough.to_df(result, columns=["region_name", "nation_name"])

The explicit rules are:

CALCULATE replaces the ()-based CALC, which will still exist for now (but will become deprecated and eventually removed).
When a name n is defined in a CALCULATE, all references to n in a descendant of the current context are implied to be back-references. Any terms not included in the CALCULATE cannot be accessed downstream.
The name n cannot be already used by the current context, any ancestor of the current context, or any descendant of the current context. If this happens, an exception is raised.
to_sql and to_df can take in a columns argument, which can either be:
- A list of strings of column names to include, in the order they are included
- A dictionary mapping strings of column names in the final output to the name of the column from the PyDough collection that it should be sourced from.
If columns is not included, the previous behavior (most recent calc) is used.
BACK will still exist for now (but will become deprecated and eventually removed).

It is alright to define a name x in an ancestor that also exists in a descendant, so long as the descendent NEVER uses that name.

For instance, the following two snippets would be legal:

# Even from inside region, name refers to its own name because
# the definition that name=Nations.name hasn't happened YET
Nations(name, region_name=region.name)

Even though customers has a property name, which could be confused with

BACK(1).name, it is never used by customers, so no problem occurs.

Nations(name=name)(name, num_customers=COUNT(customers))


But the following snippet would be illegal (since from inside region, name has two meanings):
```py
Nations(name)(name, region_name=region.name)

So it would have to be redone as this (then change to_df to rename things)

Nations(nation_name=name)(nation_name, region_name=region.name)

Under the hood, post-qualification, BACK references will still exist, but this new syntax eliminates the need for the user/LLM to need to know how that works. Instead, they can recall that once they've defined a term in a CALCULATE clause, it is accessible to all descendants under that unique name. All QDAG colleciton nodes will help keep track of their ancestor references which get passed from predecessors/ancestors.

Down the line, window functions will also need to be refactored to get rid of their levels= syntax. Presumably we could do this in two phases:

Ensure every level has a name (change how name arg works for PARTITION so it also gives the partition itself a name)
Allow the levels arugment replacement to invoke these names (e.g. per="region" means partition by the uniquness keys of the ancestor named "region")
Figure out a way to have this name resolution work unambiguously when there are duplicate named ancestors (e.g. lines.order.lines.supplier has 2 ancestors named lines -> perhaps allow lines:1 and lines:2 to refer to the 1st & 2nd ancestor with that name?

Extensive changes to test cases, readmes, spec docs, etc. will be required to implement this change, which is why the effort is so high.

The text was updated successfully, but these errors were encountered:

knassre-bodo added documentation Improvements or additions to documentation effort - high major issue that will require multiple steps or complex design enhancement New feature or request user feature Adding a new user-facing feature/functionality labels Feb 10, 2025

knassre-bodo linked a pull request Feb 11, 2025 that will close this issue

Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method #256

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overhaul the way back-references and CALC work to eliminate the need for BACK syntax #255

Overhaul the way back-references and CALC work to eliminate the need for BACK syntax #255

knassre-bodo commented Feb 10, 2025 •

edited

Loading

Overhaul the way back-references and CALC work to eliminate the need for BACK syntax #255

Overhaul the way back-references and CALC work to eliminate the need for BACK syntax #255

Comments

knassre-bodo commented Feb 10, 2025 • edited Loading

Even though customers has a property name, which could be confused with

BACK(1).name, it is never used by customers, so no problem occurs.

knassre-bodo commented Feb 10, 2025 •

edited

Loading