Overhaul the way back-references and CALC work to eliminate the need for BACK syntax #255
Labels: documentation (improvements or additions to documentation); effort - high (major issue that will require multiple steps or complex design); enhancement (new feature or request); user feature (adding a new user-facing feature/functionality)
The goal of this issue is to transform the user-facing syntax for calcs & back-references to eliminate the need to keep track of `BACK` reference index levels; instead, once a term is defined inside of a calc, it is available to all descendants using that name (but there are no collisions allowed). Also, the implicit parentheses-based syntax is replaced with an explicit `CALCULATE` method, and the e2e APIs `to_sql` and `to_df` are augmented to allow refinement of column selection/names.

Consider the following PyDough snippet that lists every region/nation name combination, alphabetized by nation name:
Under the new syntax, this would become as follows:
The explicit rules are:
- If `n` is defined in a `CALCULATE`, all references to `n` in a descendant of the current context are implied to be back-references. Any terms not included in the `CALCULATE` cannot be accessed downstream.
- `n` cannot already be used by the current context, any ancestor of the current context, or any descendant of the current context. If this happens, an exception is raised.
- `to_sql` and `to_df` can take in a `columns` argument, which can either be:
- If `columns` is not included, the previous behavior (most recent calc) is used.

It is alright to define a name `x` in an ancestor that also exists in a descendant, so long as the descendant NEVER uses that name. For instance, the following two snippets would be legal:
Even though `customers` has a property `name`, which could be confused with `BACK(1).name`, it is never used by `customers`, so no problem occurs.
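The visibility and collision rules above can be modelled in a few lines of plain Python. This is only a rough sketch of one reading of those rules, not PyDough's implementation: `Context`, `NameCollisionError`, `calculate`, `child`, and `resolve` are hypothetical names invented for the illustration, and the model tracks names only (no expressions). It assumes a collision is reported either when a `CALCULATE` reuses a name an ancestor already defined, or when a context references a name that is both its own property and an inherited term.

```python
from __future__ import annotations


class NameCollisionError(Exception):
    """A name is defined twice, or is ambiguous between a context and an ancestor."""


class Context:
    """Hypothetical model of one step in a collection chain (not a PyDough class)."""

    def __init__(self, name, properties, parent=None):
        self.name = name
        self.properties = set(properties)  # the collection's own property names
        self.parent = parent
        self.calc_terms = set()            # names defined by CALCULATE in this context

    def ancestors(self):
        """Yield (levels_up, ancestor) pairs, nearest ancestor first."""
        level, node = 1, self.parent
        while node is not None:
            yield level, node
            level, node = level + 1, node.parent

    def calculate(self, *names):
        """Define terms here; reusing a name an ancestor defined is a collision."""
        for n in names:
            for _, anc in self.ancestors():
                if n in anc.calc_terms:
                    raise NameCollisionError(f"{n!r} is already defined in {anc.name}")
            self.calc_terms.add(n)
        return self

    def child(self, name, properties):
        """Step down into a sub-collection whose parent is this context."""
        return Context(name, properties, parent=self)

    def resolve(self, n):
        """Return 0 for a local name, or k if `n` is an implied BACK(k) reference."""
        inherited = [lvl for lvl, anc in self.ancestors() if n in anc.calc_terms]
        local = n in self.properties or n in self.calc_terms
        if local and inherited:
            raise NameCollisionError(f"{n!r} is ambiguous in {self.name}")
        if local:
            return 0
        if inherited:
            return inherited[0]
        raise NameError(f"{n!r} was not passed down by any CALCULATE")


# Usage: property sets here are made up for the example.
regions = Context("Regions", {"name", "key"}).calculate("region_name")
nations = regions.child("nations", {"name", "key"}).calculate("nation_name")
customers = nations.child("customers", {"name", "acctbal"})

print(customers.resolve("region_name"))  # 2, i.e. an implied BACK(2) reference
print(customers.resolve("name"))         # 0, the customers' own property
```

In this model `customers.resolve("key")` raises `NameError`, because `key` was never placed in a `CALCULATE` by an ancestor, and a name shared between an ancestor term and a descendant property only fails if the descendant actually references it.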
Nations(name=name)(name, num_customers=COUNT(customers))
So it would have to be redone as this (then change to_df to rename things)
Under the hood, post-qualification, `BACK` references will still exist, but this new syntax eliminates the need for the user/LLM to know how that works. Instead, they can recall that once they've defined a term in a `CALCULATE` clause, it is accessible to all descendants under that unique name. All QDAG collection nodes will help keep track of their ancestor references, which get passed from predecessors/ancestors.

Down the line, window functions will also need to be refactored to get rid of their `levels=` syntax. Presumably we could do this in two phases:
- `PARTITION` so it also gives the partition itself a name)
- `per="region"` means partition by the uniqueness keys of the ancestor named `"region"`)
- `lines.order.lines.supplier` has 2 ancestors named `lines` -> perhaps allow `lines:1` and `lines:2` to refer to the 1st & 2nd ancestor with that name?

Extensive changes to test cases, readmes, spec docs, etc. will be required to implement this change, which is why the effort is so high.
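To make the `per=` idea for window functions concrete, here is a minimal sketch of how an ancestor name such as `"region"` or `"lines:2"` might be turned into a number of levels to step up. The function `resolve_per` is hypothetical, and the sketch assumes that the `:k` suffix counts from the nearest ancestor with that name; the proposal above leaves that ordering open, so treat it as one possible choice rather than the intended behavior.

```python
def resolve_per(per: str, ancestor_names: list[str]) -> int:
    """Return how many levels up the ancestor named by `per` sits.

    `ancestor_names` lists the ancestors of the current context, nearest first.
    `per` is either "name" or "name:k" (the k-th ancestor with that name).
    ASSUMPTION: k counts from the nearest ancestor.
    """
    name, sep, index = per.partition(":")
    # Levels (1 = direct parent) of every ancestor carrying the requested name.
    matches = [lvl for lvl, anc in enumerate(ancestor_names, start=1) if anc == name]
    if not matches:
        raise ValueError(f"no ancestor named {name!r}")
    if not sep:
        if len(matches) > 1:
            raise ValueError(f"{name!r} is ambiguous; use {name}:1 to {name}:{len(matches)}")
        return matches[0]
    k = int(index)
    if not 1 <= k <= len(matches):
        raise ValueError(f"{name!r} has only {len(matches)} matching ancestor(s)")
    return matches[k - 1]


# Ancestors of `supplier` in lines.order.lines.supplier, nearest first.
ancestors = ["lines", "order", "lines"]
print(resolve_per("lines:1", ancestors))  # 1
print(resolve_per("lines:2", ancestors))  # 3
print(resolve_per("order", ancestors))    # 2
```

Rejecting a bare `"lines"` when two ancestors share the name keeps the lookup explicit, mirroring the no-collisions rule for `CALCULATE` terms.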