-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split column pruner into two phases #1501
base: p--cte--05
Are you sure you want to change the base?
Conversation
Currently, the column pruner checks the columns that are needed in each `SELECT` statement and generates the pruned SQL in a single pass. For better readability and easier modification, this splits the column pruner into two phases. First, the SQL nodes are traversed to figure out which columns are required and which can be pruned. Then, the SQL nodes are rewritten with the pruned columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this logic looks great!
I had a bit of trouble reading the code (sorry for the slow review - that's why!), but I think this was only due to the naming of some of the classes / variables / etc. I've left some suggestions to improve readability, and most all of them are just related to naming.
f"SQL, but this is a bug and should be investigated." | ||
) | ||
return node | ||
|
||
pruned_select_columns = tuple( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tangential, but I've frequently read this code and found this variable name confusing (pruned_select_columns
). We frequently refer to "pruned columns" when we mean the ones that have been removed, but in this case we mean the columns that have been kept. I think the word pruned can technically be used both ways, but it typically is used to refer to what has been removed. Can we change this to a more clear variable name?
logger = logging.getLogger(__name__) | ||
|
||
|
||
class SqlTagRequiredColumnAliasesVisitor(SqlQueryPlanNodeVisitor[None]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this class, could we change the name to something like DetermineRequiredColumnAliasesVisitor
or RequiredColumnAliasesDeterminer
?
self._column_alias_tagger = tagged_column_alias_set | ||
|
||
def _search_for_expressions( | ||
self, select_node: SqlSelectStatementNode, pruned_select_columns: Tuple[SqlSelectColumn, ...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same concern re: the name pruned_select_columns
here
logger = logging.getLogger(__name__) | ||
|
||
|
||
class TaggedColumnAliasSet: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it very unintuitive to understand what you meant by "tag" in this whole PR. I would recommend changing that word to something else more clear everywhere it's used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this class specifically - it feels like the name implies a simple dataclass / storage object. I would recommend changing the name to something like ColumnAliasCollector
or SqlNodeColumnAliasLinker
.
) source_0 | ||
""" | ||
|
||
def __init__(self, tagged_column_alias_set: TaggedColumnAliasSet) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would help with readability to change this __init__
function a bit. Alone, it's not clear what this tagged_column_alias_set
represents.
I think it would be more clear if we moved the logic for building the initial TaggedColumnAliasSet
into here instead of doing that outside and passing it in. Something like this:
def __init__(self, node: SqlSelectStatementNode) -> None:
"""Collect all column aliases currently used in the node."""
self._column_alias_tagger = ColumnAliasSet()
self._column_alias_tagger.collect_all_aliases_in_node(node)
return | ||
|
||
def visit_select_statement_node(self, node: SqlSelectStatementNode) -> None: # noqa: D102 | ||
# Based on column aliases that are tagged in this SELECT statement, tag corresponding column aliases in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a docstring?
if select_column.column_alias in updated_required_column_aliases_in_this_node | ||
) | ||
|
||
# TODO: don't prune columns used in join condition! Tricky to derive since the join condition can be any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should be updated to say something like "tag columns used in join condition" since the pruning doesn't happen here
return | ||
|
||
# Create a mapping from the source alias to the column aliases needed from the corresponding source. | ||
source_alias_to_required_column_alias: Dict[str, Set[str]] = defaultdict(set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this to source_alias_to_required_column_aliases
(plural aliases
)?
) | ||
# TODO: Handle CTEs parent nodes. | ||
|
||
# For all string columns, assume that they are needed from all sources since we don't have a table alias |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SqlStringExpressions
aren't intended to ever reference parent nodes, right? Is this handling more intended to be "just in case", i.e., if some future dev uses a SqlStringExpression
inappropriately?
for parent_node in node.parent_nodes: | ||
self._column_alias_tagger.tag_alias(parent_node, column_alias) | ||
|
||
# Same with unqualified column references - it's hard to tell which source it came from, so it's safest to say |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was noticing the limitations of this class the other day. Can we just remove SqlColumnAliasReferenceExpression
altogether and replace uses with SqlColumnReferenceExpression
? It looks like this class is only used in the SqlRewritingSubQueryReducerVisitor
. Maybe there is some context I'm missing as to why it's necessary.
Currently, the column pruner checks the columns that are needed in each
SELECT
statement and generates the pruned SQL in a single pass. For better readability and easier modification, this splits the column pruner into two phases.First, the SQL nodes are traversed to figure out which columns are required and which can be pruned. Then, the SQL nodes are rewritten with the pruned columns.
The logic in
SqlTagRequiredColumnAliasesVisitor
has been copied from the original implementation.