Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional functions to PyDough #197

Merged
merged 10 commits into from
Jan 10, 2025
Merged

Conversation

knassre-bodo
Copy link
Contributor

Adds the following functions: PRESENT, ABSENT, KEEP_IF, ROUND, JOIN_STRINGS, as well implementing SQLGlot conversion support for string slicing. Part of the larger effort for #156.

@knassre-bodo knassre-bodo changed the base branch from main to kian/count_star_bug January 9, 2025 18:46
Comment on lines +266 to +269
if int(step.this) != 1:
raise NotImplementedError(
"SLICE function currently only supports a step of 1"
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, does not support reverse order or skipping

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, why?

Copy link
Contributor Author

@knassre-bodo knassre-bodo Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because those can't be supported with the basic SQL SUBSTRING function (which is ANSI).

For more dialect-specific bindings we can support some other cases. E.g. s[-n:] becomes RIGHT(s, n) and s[a:b:-1], but only for some dialects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the goal of the slice functionality? Match Python builtin? Numpy indexing?

Copy link
Contributor Author

@knassre-bodo knassre-bodo Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python builtin (e.g. Customers(prefix=name[:2] gets the first 2 characters of every customer's name).

Comment on lines +254 to +263
if int(start.this) < 0:
raise NotImplementedError(
"SLICE function currently only supports non-negative start indices"
)
start_idx = int(start.this)
if isinstance(stop, sqlglot_expressions.Literal):
if int(stop.this) < 0:
raise NotImplementedError(
"SLICE function currently only supports non-negative stop indices"
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, does not support negative indexing

Copy link

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall code makes sense and doing is what it's saying.
Most of my comments are questions and minor suggestions.

pydough/pydough_operators/__init__.py Outdated Show resolved Hide resolved
"KEEP_IF", False, RequireNumArgs(2), SelectArgumentType(0)
)
JOIN_STRINGS = ExpressionFunctionOperator(
"JOIN_STRINGS", False, RequireMinArgs(1), ConstantType(StringType())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this accept only 1 arg the delimiter only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It requires at least one argument (the class has a docstring)

pydough/sqlglot/transform_bindings.py Outdated Show resolved Hide resolved
Returns:
The SQLGlot case expression equivalent to the `KEEP_IF` call.
"""
return sqlglot_expressions.Case().when(sql_glot_args[1], sql_glot_args[0])
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the else condition (i.e. null value) case happens automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. You don't need the else in SQL. You can just write CASE WHEN A THEN B END (its implied that there is a ELSE NULL).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@knassre-bodo Can you feed this into convert_iff_case instead? If we want to avoid case statements that seems like the most sustainable way.

Copy link
Contributor Author

@knassre-bodo knassre-bodo Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wdym by avoid case statements? convert_iff_case will always create one, it's just that the bindings will sometimes toggle that function on/off.

pydough/sqlglot/transform_bindings.py Show resolved Hide resolved
Comment on lines +266 to +269
if int(step.this) != 1:
raise NotImplementedError(
"SLICE function currently only supports a step of 1"
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, why?

case (_, None):
assert start_idx is not None
return sqlglot_expressions.Substring(
this=sql_glot_args[0], start=sqlglot_expressions.convert(start_idx + 1)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why start_idx + 1 and not just start_idx?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because SUBSTRING in SQL is 1-indexed, not 0-indexed.

case (None, None):
raise sql_glot_args[0]
case (_, None):
assert start_idx is not None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? I thought case (None, None) will handle that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion is so mypy can infer that it is non-null (apparently the pattern matching isn't fully sufficient for that).

pydough/sqlglot/transform_bindings.py Outdated Show resolved Hide resolved
tests/simple_pydough_functions.py Show resolved Hide resolved
Base automatically changed from kian/count_star_bug to main January 10, 2025 14:24
Copy link
Contributor

@njriasan njriasan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @knassre-bodo! See my suggestions.

@@ -79,10 +86,15 @@ These functions must be called on singular data as a function.

- `IFF`: if the first argument is true returns the second argument, otherwise returns the third argument.
- `DEFAULT_TO`: returns the first of its arguments that is non-null.
- `PRESENT`: returns True if the argument is non-null.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should open a backlog ticket to see if we want to "standardize" any of these names before open source publication. Calling this PRESENT instead of IS_NOT_NULL may not be useful and may just increase the learning barrier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will open that issue. I'm going for human readable names that sound closer to how the question will be phrased (to help the LLM), e.g.:

  • How many people don't have a middle name (absent)
  • What is the percentage of patients at the hospital have been diagnosed (present)
  • How much money is saved by users in each region if all taxes are removed when the shipping date coincides with a holiday (keep_if)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# (e.g. a COUNT(*)), arbitrarily mark one of them as used.
# TODO: (gh #196) optimize this functionality so it doesn't keep an
# unnecessary column.
if isinstance(node, Aggregate) and len(found_identifiers) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this from the other PR? If so maybe we can merge main back into this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it is, will re-merge.

@@ -37,7 +37,7 @@ def apply_parens(expression: SQLGlotExpression) -> SQLGlotExpression:
Returns:
The expression, wrapped in parentheses if necessary.
"""
if isinstance(expression, (Binary, Concat)):
if isinstance(expression, (Binary, Concat, Is, Case)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sustainability maybe we should make this a global variable set/tuple and then this could become isinstance(expression, PAREN_EXPRESSIONS). Not necessary at this stage, but just a thought in case its ever reused. Feel free to ignore this suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No no... I think this makes sense.

Returns:
The SQLGlot case expression equivalent to the `KEEP_IF` call.
"""
return sqlglot_expressions.Case().when(sql_glot_args[1], sql_glot_args[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@knassre-bodo Can you feed this into convert_iff_case instead? If we want to avoid case statements that seems like the most sustainable way.

Comment on lines +266 to +269
if int(step.this) != 1:
raise NotImplementedError(
"SLICE function currently only supports a step of 1"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the goal of the slice functionality? Match Python builtin? Numpy indexing?

pydough/sqlglot/transform_bindings.py Show resolved Hide resolved
@knassre-bodo knassre-bodo merged commit d1a796f into main Jan 10, 2025
4 checks passed
@knassre-bodo knassre-bodo deleted the kian/mandatory_functions branch January 10, 2025 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants