-
Notifications
You must be signed in to change notification settings - Fork 109
Add Window Functions for use with function builder #808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Window Functions for use with function builder #808
Conversation
Note for when you rebase: I yanked deprecate Expr::display_name from #802. I must have pinned the deps a little later than what was released because the upstream change wasn't included. |
I think the window functions are now all good to go. I've updated the documentation. Next up is the aggregate functions, but most of them have a jump start already. I'll rebase on main once the DF41 updates get merged |
35547c8
to
dff27ad
Compare
Since this is already getting a bit long, I'd like to merge it with just the window functions and deal with the aggregate functions in a different PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a blocking thought but still worth considering.
The Builder
pattern in rust is mostly a workaround for not having default / keyword arguments.
So I wonder if the more pythonic API to expose would be
f.rank().agg(partition_by=..., order_by=...)
class Expr:
def agg(self, ** partition_by=None, order_by=None):
...
The 2 things that can cause build
to fail are
- Not using a sort expression in
order_by
- Not using a function that can be a Window or Aggregate
Can we prevent such misuse from our python wrappers?
.order_by(col('"Attack"').sort(ascending=True)) | ||
.build() | ||
.alias("rank"), | ||
).sort(col('"Type 1"').sort(), col('"Attack"').sort()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any requirement or constraint on the final sort
and the partition_by / order_by
args to the window function?
Or is this just for a nicer, more interpretable output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is, but it's an annoyance I've been thinking about fixing. I've moved this to draft so I can tighten that up.
python/datafusion/functions.py
Outdated
) | ||
|
||
return Expr( | ||
f.approx_percentile_cont( | ||
expression.expr, percentile.expr, distinct=distinct, num_centroids=num_centroids.expr | ||
expression.expr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this a black
formatting change?
If so, is our CI not catching these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was auto applied, so I'm not completely sure what happened.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it looks like we are not using ci/scripts/python_lint.sh
which I'm guessing we were at one point. Instead the ruff checks happen in .github/workflows/build.yml
. This only runs ruff check
and not ruff format
but pre-commit does run ruff format
. That's why these changes are happening in my commits but not failing CI.
src/functions.rs
Outdated
builder = builder.null_treatment(null_treatment.map(DFNullTreatment::from)); | ||
|
||
Ok(builder.build()?.into()) | ||
pub fn last_value(expr: PyExpr) -> PyExpr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything preventing also updating of first_value
?
IDK if you have used Maybe I should do a deeper comparison by writing their examples in |
That's a great suggestion about using the defaults instead. I'll convert this over to that approach. It will be might tighter interface. |
… functions. Also pass in scalar pyarrow values so we can set a range other than a uint
…ctions without calling window()
…tment so remove from user facing
…dow functions and what their impacts are
616a748
to
070b595
Compare
Which issue does this PR close?
This addresses part of #780 but we need a follow on PR to handle the aggregates.
Rationale for this change
In DataFusion 41 we have added the ability to create builders for the parameters for window and aggregate functions. These are more expressive when it comes to setting the window frame, ordering, partition, and null handling. This MR exposes this functionality on the python side.
What changes are included in this PR?
Added functions
Also updated the window function page of the online documentation to include links to these, examples of how to use them, and a description of how to set the additional parameters.
Updated the unit tests.
Are there any user-facing changes?
There are no breaking changes, but the old functions.window has been marked as deprecated. We can change this if needed because it is currently the only way to use aggregate functions as window functions. That is to be addressed in #833