Split column pruner into two phases #1501

plypaul · 2024-11-04T16:56:12Z

Currently, the column pruner checks the columns that are needed in each SELECT statement and generates the pruned SQL in a single pass. For better readability and easier modification, this splits the column pruner into two phases.

First, the SQL nodes are traversed to figure out which columns are required and which can be pruned. Then, the SQL nodes are rewritten with the pruned columns.

The logic in SqlTagRequiredColumnAliasesVisitor has been copied from the original implementation.
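
To illustrate the two-phase split described above, here is a minimal sketch. It does not use the MetricFlow classes: `SelectNode`, `find_required_aliases`, and `rewrite_with_required_aliases` are hypothetical names, and the model assumes each SELECT has at most one parent and that every column reads at most one parent column.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class SelectNode:
    """Hypothetical stand-in for a SELECT statement with at most one parent."""

    node_id: str
    # Pairs of (output column alias, parent column alias it reads or None).
    columns: Tuple[Tuple[str, Optional[str]], ...]
    parent: Optional[SelectNode] = None


def find_required_aliases(root: SelectNode) -> Dict[str, Set[str]]:
    """Phase 1: walk from the outermost SELECT inward and record required aliases."""
    # Every column of the outermost SELECT is part of the result, so all are required.
    required: Dict[str, Set[str]] = {root.node_id: {alias for alias, _ in root.columns}}
    node = root
    while node.parent is not None:
        needed_here = required[node.node_id]
        # A parent column is required only if a required column in this node reads it.
        required[node.parent.node_id] = {
            source for alias, source in node.columns if alias in needed_here and source is not None
        }
        node = node.parent
    return required


def rewrite_with_required_aliases(node: SelectNode, required: Dict[str, Set[str]]) -> SelectNode:
    """Phase 2: rebuild each node, keeping only the columns recorded as required."""
    kept = tuple((alias, source) for alias, source in node.columns if alias in required[node.node_id])
    parent = rewrite_with_required_aliases(node.parent, required) if node.parent is not None else None
    return SelectNode(node_id=node.node_id, columns=kept, parent=parent)
```

Keeping the first phase free of any rewriting means the set of required aliases can be inspected or tested on its own before any SQL is regenerated.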

courtneyholcomb

Overall, this logic looks great!
I had a bit of trouble reading the code (sorry for the slow review - that's why!), but I think this was only due to the naming of some of the classes / variables / etc. I've left some suggestions to improve readability, and almost all of them are just related to naming.

courtneyholcomb · 2024-11-05T05:02:35Z

metricflow/sql/optimizer/column_pruner.py

+                f"SQL, but this is a bug and should be investigated."
+            )
+            return node
+
        pruned_select_columns = tuple(


This is tangential, but I've frequently read this code and found this variable name confusing (pruned_select_columns). We frequently refer to "pruned columns" when we mean the ones that have been removed, but in this case we mean the columns that have been kept. I think the word pruned can technically be used both ways, but it typically is used to refer to what has been removed. Can we change this to a more clear variable name?
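
One possible way to make the distinction explicit is to name the kept and removed columns separately. This is only a sketch: `split_select_columns`, `retained_select_columns`, and `removed_select_columns` are hypothetical names and are not taken from the PR.

```python
from typing import Sequence, Set, Tuple


def split_select_columns(
    column_aliases: Sequence[str], required_column_aliases: Set[str]
) -> Tuple[Tuple[str, ...], Tuple[str, ...]]:
    """Return (retained, removed) column aliases, preserving the original order."""
    # Columns that survive pruning and stay in the SELECT statement.
    retained_select_columns = tuple(alias for alias in column_aliases if alias in required_column_aliases)
    # Columns that are dropped from the SELECT statement.
    removed_select_columns = tuple(alias for alias in column_aliases if alias not in required_column_aliases)
    return retained_select_columns, removed_select_columns
```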

courtneyholcomb · 2024-11-05T21:59:43Z

metricflow/sql/optimizer/tag_required_column_aliases.py

+logger = logging.getLogger(__name__)
+
+
+class SqlTagRequiredColumnAliasesVisitor(SqlQueryPlanNodeVisitor[None]):


For this class, could we change the name to something like DetermineRequiredColumnAliasesVisitor or RequiredColumnAliasesDeterminer?

courtneyholcomb · 2024-11-05T22:43:26Z

metricflow/sql/optimizer/tag_required_column_aliases.py

+        self._column_alias_tagger = tagged_column_alias_set
+
+    def _search_for_expressions(
+        self, select_node: SqlSelectStatementNode, pruned_select_columns: Tuple[SqlSelectColumn, ...]


Same concern re: the name pruned_select_columns here

courtneyholcomb · 2024-11-05T23:07:16Z

metricflow/sql/optimizer/tag_column_aliases.py

+logger = logging.getLogger(__name__)
+
+
+class TaggedColumnAliasSet:


I found it very unintuitive to understand what you meant by "tag" in this whole PR. I would recommend changing that word to something else more clear everywhere it's used.

For this class specifically - it feels like the name implies a simple dataclass / storage object. I would recommend changing the name to something like ColumnAliasCollector or SqlNodeColumnAliasLinker.

courtneyholcomb · 2024-11-05T23:19:58Z

metricflow/sql/optimizer/tag_required_column_aliases.py

+        ) source_0
+    """
+
+    def __init__(self, tagged_column_alias_set: TaggedColumnAliasSet) -> None:


I think it would help with readability to change this __init__ function a bit. Alone, it's not clear what this tagged_column_alias_set represents.
I think it would be more clear if we moved the logic for building the initial TaggedColumnAliasSet into here instead of doing that outside and passing it in. Something like this:

    def __init__(self, node: SqlSelectStatementNode) -> None:
        """Collect all column aliases currently used in the node."""
        self._column_alias_tagger = ColumnAliasSet()
        self._column_alias_tagger.collect_all_aliases_in_node(node)

courtneyholcomb · 2024-11-05T23:33:09Z

metricflow/sql/optimizer/tag_required_column_aliases.py

+        return
+
+    def visit_select_statement_node(self, node: SqlSelectStatementNode) -> None:  # noqa: D102
+        # Based on column aliases that are tagged in this SELECT statement, tag corresponding column aliases in


Should this be a docstring?

courtneyholcomb · 2024-11-05T23:40:52Z

metricflow/sql/optimizer/tag_required_column_aliases.py

+            if select_column.column_alias in updated_required_column_aliases_in_this_node
+        )
+
+        # TODO: don't prune columns used in join condition! Tricky to derive since the join condition can be any


This comment should be updated to say something like "tag columns used in join condition" since the pruning doesn't happen here

courtneyholcomb · 2024-11-05T23:44:23Z

metricflow/sql/optimizer/tag_required_column_aliases.py

+            return
+
+        # Create a mapping from the source alias to the column aliases needed from the corresponding source.
+        source_alias_to_required_column_alias: Dict[str, Set[str]] = defaultdict(set)


Can we rename this to source_alias_to_required_column_aliases (plural aliases)?
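
For context, a mapping like this could be built roughly as follows. The plural name reflects that each source alias maps to a set of column aliases. The function `group_required_column_aliases` and its `(source_alias, column_alias)` input pairs are assumptions made for illustration, not the code in the PR.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def group_required_column_aliases(column_references: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (source_alias, column_alias) pairs by source alias."""
    source_alias_to_required_column_aliases: Dict[str, Set[str]] = defaultdict(set)
    for source_alias, column_alias in column_references:
        # Each source alias accumulates every column alias that is needed from it.
        source_alias_to_required_column_aliases[source_alias].add(column_alias)
    # Convert to a plain dict so that later lookups do not silently create empty entries.
    return dict(source_alias_to_required_column_aliases)
```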

courtneyholcomb · 2024-11-05T23:49:08Z

metricflow/sql/optimizer/tag_required_column_aliases.py

+                )
+        # TODO: Handle CTEs parent nodes.
+
+        # For all string columns, assume that they are needed from all sources since we don't have a table alias


SqlStringExpressions aren't intended to ever reference parent nodes, right? Is this handling more intended to be "just in case", i.e., if some future dev uses a SqlStringExpression inappropriately?

courtneyholcomb · 2024-11-05T23:53:19Z

metricflow/sql/optimizer/tag_required_column_aliases.py

+                    for parent_node in node.parent_nodes:
+                        self._column_alias_tagger.tag_alias(parent_node, column_alias)
+
+        # Same with unqualified column references - it's hard to tell which source it came from, so it's safest to say


I was noticing the limitations of this class the other day. Can we just remove SqlColumnAliasReferenceExpression altogether and replace uses with SqlColumnReferenceExpression? It looks like this class is only used in the SqlRewritingSubQueryReducerVisitor. Maybe there is some context I'm missing as to why it's necessary.
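
The conservative handling in the diff above (an unqualified alias is assumed to be needed from every parent) can be sketched like this. `RequiredAliasTracker` and `mark_unqualified_reference` are hypothetical stand-ins, and nodes are identified by plain strings rather than SQL node objects.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set


class RequiredAliasTracker:
    """Hypothetical tracker of which column aliases are required in each node."""

    def __init__(self) -> None:
        self._required: Dict[str, Set[str]] = defaultdict(set)

    def mark_required(self, node_id: str, column_alias: str) -> None:
        self._required[node_id].add(column_alias)

    def required_aliases(self, node_id: str) -> FrozenSet[str]:
        # Use .get() so that a lookup does not create an entry for an unknown node.
        return frozenset(self._required.get(node_id, set()))


def mark_unqualified_reference(
    tracker: RequiredAliasTracker, parent_node_ids: Iterable[str], column_alias: str
) -> None:
    """Without a table alias the source is unknown, so mark the alias in every parent."""
    for parent_node_id in parent_node_ids:
        tracker.mark_required(parent_node_id, column_alias)
```

The trade-off is that some columns may be kept unnecessarily, but a column that is actually referenced is never removed.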

plypaul added the Skip Changelog label Nov 4, 2024

cla-bot bot added the cla:yes label Nov 4, 2024

plypaul marked this pull request as ready for review November 4, 2024 17:07

courtneyholcomb reviewed Nov 6, 2024
