-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add additional functions to PyDough #197
Conversation
…upport in PyDough with SQLGlot conversion [RUN CI]
if int(step.this) != 1: | ||
raise NotImplementedError( | ||
"SLICE function currently only supports a step of 1" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, does not support reverse order or skipping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because those can't be supported with the basic SQL SUBSTRING
function (which is ANSI).
For more dialect-specific bindings we can support some other cases. E.g. s[-n:]
becomes RIGHT(s, n)
and s[a:b:-1]
, but only for some dialects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the goal of the slice functionality? Match Python builtin? Numpy indexing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python builtin (e.g. Customers(prefix=name[:2]
gets the first 2 characters of every customer's name).
if int(start.this) < 0: | ||
raise NotImplementedError( | ||
"SLICE function currently only supports non-negative start indices" | ||
) | ||
start_idx = int(start.this) | ||
if isinstance(stop, sqlglot_expressions.Literal): | ||
if int(stop.this) < 0: | ||
raise NotImplementedError( | ||
"SLICE function currently only supports non-negative stop indices" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, does not support negative indexing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall code makes sense and doing is what it's saying.
Most of my comments are questions and minor suggestions.
"KEEP_IF", False, RequireNumArgs(2), SelectArgumentType(0) | ||
) | ||
JOIN_STRINGS = ExpressionFunctionOperator( | ||
"JOIN_STRINGS", False, RequireMinArgs(1), ConstantType(StringType()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this accept only 1 arg the delimiter only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It requires at least one argument (the class has a docstring)
Returns: | ||
The SQLGlot case expression equivalent to the `KEEP_IF` call. | ||
""" | ||
return sqlglot_expressions.Case().when(sql_glot_args[1], sql_glot_args[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the else
condition (i.e. null
value) case happens automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. You don't need the else
in SQL. You can just write CASE WHEN A THEN B END
(its implied that there is a ELSE NULL
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@knassre-bodo Can you feed this into convert_iff_case
instead? If we want to avoid case statements that seems like the most sustainable way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wdym by avoid case statements? convert_iff_case
will always create one, it's just that the bindings will sometimes toggle that function on/off.
if int(step.this) != 1: | ||
raise NotImplementedError( | ||
"SLICE function currently only supports a step of 1" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, why?
case (_, None): | ||
assert start_idx is not None | ||
return sqlglot_expressions.Substring( | ||
this=sql_glot_args[0], start=sqlglot_expressions.convert(start_idx + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why start_idx + 1
and not just start_idx
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because SUBSTRING
in SQL is 1-indexed, not 0-indexed.
case (None, None): | ||
raise sql_glot_args[0] | ||
case (_, None): | ||
assert start_idx is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? I thought case (None, None)
will handle that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion is so mypy
can infer that it is non-null (apparently the pattern matching isn't fully sufficient for that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @knassre-bodo! See my suggestions.
@@ -79,10 +86,15 @@ These functions must be called on singular data as a function. | |||
|
|||
- `IFF`: if the first argument is true returns the second argument, otherwise returns the third argument. | |||
- `DEFAULT_TO`: returns the first of its arguments that is non-null. | |||
- `PRESENT`: returns True if the argument is non-null. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should open a backlog ticket to see if we want to "standardize" any of these names before open source publication. Calling this PRESENT
instead of IS_NOT_NULL
may not be useful and may just increase the learning barrier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will open that issue. I'm going for human readable names that sound closer to how the question will be phrased (to help the LLM), e.g.:
- How many people don't have a middle name (absent)
- What is the percentage of patients at the hospital have been diagnosed (present)
- How much money is saved by users in each region if all taxes are removed when the shipping date coincides with a holiday (keep_if)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# (e.g. a COUNT(*)), arbitrarily mark one of them as used. | ||
# TODO: (gh #196) optimize this functionality so it doesn't keep an | ||
# unnecessary column. | ||
if isinstance(node, Aggregate) and len(found_identifiers) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this from the other PR? If so maybe we can merge main back into this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it is, will re-merge.
@@ -37,7 +37,7 @@ def apply_parens(expression: SQLGlotExpression) -> SQLGlotExpression: | |||
Returns: | |||
The expression, wrapped in parentheses if necessary. | |||
""" | |||
if isinstance(expression, (Binary, Concat)): | |||
if isinstance(expression, (Binary, Concat, Is, Case)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sustainability maybe we should make this a global variable set/tuple and then this could become isinstance(expression, PAREN_EXPRESSIONS)
. Not necessary at this stage, but just a thought in case its ever reused. Feel free to ignore this suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No no... I think this makes sense.
Returns: | ||
The SQLGlot case expression equivalent to the `KEEP_IF` call. | ||
""" | ||
return sqlglot_expressions.Case().when(sql_glot_args[1], sql_glot_args[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@knassre-bodo Can you feed this into convert_iff_case
instead? If we want to avoid case statements that seems like the most sustainable way.
if int(step.this) != 1: | ||
raise NotImplementedError( | ||
"SLICE function currently only supports a step of 1" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the goal of the slice functionality? Match Python builtin? Numpy indexing?
Adds the following functions:
PRESENT
,ABSENT
,KEEP_IF
,ROUND
,JOIN_STRINGS
, as well implementing SQLGlot conversion support for string slicing. Part of the larger effort for #156.