Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split column pruner into two phases #1501

Open
wants to merge 1 commit into
base: p--cte--05
Choose a base branch
from
Open

Split column pruner into two phases #1501

wants to merge 1 commit into from

Conversation

plypaul
Copy link
Contributor

@plypaul plypaul commented Nov 4, 2024

Currently, the column pruner checks the columns that are needed in each SELECT statement and generates the pruned SQL in a single pass. For better readability and easier modification, this splits the column pruner into two phases.

First, the SQL nodes are traversed to figure out which columns are required and which can be pruned. Then, the SQL nodes are rewritten with the pruned columns.

The logic in SqlTagRequiredColumnAliasesVisitor has been copied from the original implementation.

Currently, the column pruner checks the columns that are needed in each `SELECT`
statement and generates the pruned SQL in a single pass. For better readability
and easier modification, this splits the column pruner into two phases.

First, the SQL nodes are traversed to figure out which columns are required and
which can be pruned. Then, the SQL nodes are rewritten with the pruned columns.
@cla-bot cla-bot bot added the cla:yes label Nov 4, 2024
@plypaul plypaul marked this pull request as ready for review November 4, 2024 17:07
Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this logic looks great!
I had a bit of trouble reading the code (sorry for the slow review - that's why!), but I think this was only due to the naming of some of the classes / variables / etc. I've left some suggestions to improve readability, and most all of them are just related to naming.

f"SQL, but this is a bug and should be investigated."
)
return node

pruned_select_columns = tuple(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tangential, but I've frequently read this code and found this variable name confusing (pruned_select_columns). We frequently refer to "pruned columns" when we mean the ones that have been removed, but in this case we mean the columns that have been kept. I think the word pruned can technically be used both ways, but it typically is used to refer to what has been removed. Can we change this to a more clear variable name?

logger = logging.getLogger(__name__)


class SqlTagRequiredColumnAliasesVisitor(SqlQueryPlanNodeVisitor[None]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this class, could we change the name to something like DetermineRequiredColumnAliasesVisitor or RequiredColumnAliasesDeterminer?

self._column_alias_tagger = tagged_column_alias_set

def _search_for_expressions(
self, select_node: SqlSelectStatementNode, pruned_select_columns: Tuple[SqlSelectColumn, ...]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern re: the name pruned_select_columns here

logger = logging.getLogger(__name__)


class TaggedColumnAliasSet:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it very unintuitive to understand what you meant by "tag" in this whole PR. I would recommend changing that word to something else more clear everywhere it's used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this class specifically - it feels like the name implies a simple dataclass / storage object. I would recommend changing the name to something like ColumnAliasCollector or SqlNodeColumnAliasLinker.

) source_0
"""

def __init__(self, tagged_column_alias_set: TaggedColumnAliasSet) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would help with readability to change this __init__ function a bit. Alone, it's not clear what this tagged_column_alias_set represents.
I think it would be more clear if we moved the logic for building the initial TaggedColumnAliasSet into here instead of doing that outside and passing it in. Something like this:

def __init__(self, node: SqlSelectStatementNode) -> None:
    """Collect all column aliases currently used in the node."""
    self._column_alias_tagger = ColumnAliasSet()
    self._column_alias_tagger.collect_all_aliases_in_node(node)

return

def visit_select_statement_node(self, node: SqlSelectStatementNode) -> None: # noqa: D102
# Based on column aliases that are tagged in this SELECT statement, tag corresponding column aliases in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a docstring?

if select_column.column_alias in updated_required_column_aliases_in_this_node
)

# TODO: don't prune columns used in join condition! Tricky to derive since the join condition can be any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should be updated to say something like "tag columns used in join condition" since the pruning doesn't happen here

return

# Create a mapping from the source alias to the column aliases needed from the corresponding source.
source_alias_to_required_column_alias: Dict[str, Set[str]] = defaultdict(set)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to source_alias_to_required_column_aliases (plural aliases)?

)
# TODO: Handle CTEs parent nodes.

# For all string columns, assume that they are needed from all sources since we don't have a table alias
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SqlStringExpressions aren't intended to ever reference parent nodes, right? Is this handling more intended to be "just in case", i.e., if some future dev uses a SqlStringExpression inappropriately?

for parent_node in node.parent_nodes:
self._column_alias_tagger.tag_alias(parent_node, column_alias)

# Same with unqualified column references - it's hard to tell which source it came from, so it's safest to say
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was noticing the limitations of this class the other day. Can we just remove SqlColumnAliasReferenceExpression altogether and replace uses with SqlColumnReferenceExpression? It looks like this class is only used in the SqlRewritingSubQueryReducerVisitor. Maybe there is some context I'm missing as to why it's necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants