Commit 0fc0895

Update user documentation on how to pass parameters for different window functions and what their impacts are
1 parent 67638ad commit 0fc0895

1 file changed

+105
-15
lines changed

docs/source/user-guide/common-operations/windows.rst

@@ -57,6 +57,10 @@ previous row in the DataFrame.
5757
Setting Parameters
5858
------------------
5959

60+
61+
Ordering
62+
^^^^^^^^
63+
6064
You can control the order in which rows are processed by window functions by providing
6165
a list of ``order_by`` functions for the ``order_by`` parameter.
6266

@@ -66,28 +70,114 @@ a list of ``order_by`` functions for the ``order_by`` parameter.
6670
col('"Name"'),
6771
col('"Attack"'),
6872
col('"Type 1"'),
69-
f.rank()
70-
.partition_by(col('"Type 1"'))
71-
.order_by(col('"Attack"').sort(ascending=True))
72-
.build()
73-
.alias("rank"),
74-
).sort(col('"Type 1"').sort(), col('"Attack"').sort())
73+
f.rank(
74+
partition_by=[col('"Type 1"')],
75+
order_by=[col('"Attack"').sort(ascending=True)],
76+
).alias("rank"),
77+
).sort(col('"Type 1"'), col('"Attack"'))
78+
79+
Partitions
80+
^^^^^^^^^^
81+
82+
A window function can take a list of ``partition_by`` columns similar to an
83+
:ref:`Aggregation Function<aggregation>`. This will cause the window values to be evaluated
84+
independently for each of the partitions. In the example above, we found the rank of each
85+
Pokemon within its ``Type 1`` partition. We can see the first couple of rows of each partition if we do
86+
the following:
87+
88+
.. ipython:: python
89+
90+
df.select(
91+
col('"Name"'),
92+
col('"Attack"'),
93+
col('"Type 1"'),
94+
f.rank(
95+
partition_by=[col('"Type 1"')],
96+
order_by=[col('"Attack"').sort(ascending=True)],
97+
).alias("rank"),
98+
).filter(col("rank") < lit(3)).sort(col('"Type 1"'), col("rank"))
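To make the effect of ``partition_by`` concrete, here is a minimal plain-Python sketch of ranking rows independently inside each partition. It only illustrates the semantics and is not how DataFusion implements ``rank``; the helper ``rank_within_partitions`` and the sample rows are hypothetical.

```python
from collections import defaultdict


def rank_within_partitions(rows, partition_key, order_key):
    """Return {row index: rank}, ranking rows independently per partition.

    Hypothetical helper for illustration only. Ties share a rank and the
    next distinct value skips ahead, as SQL rank() does.
    """
    partitions = defaultdict(list)
    for idx, row in enumerate(rows):
        partitions[row[partition_key]].append(idx)

    ranks = {}
    for indices in partitions.values():
        ordered = sorted(indices, key=lambda i: rows[i][order_key])
        for position, idx in enumerate(ordered, start=1):
            prev = ordered[position - 2] if position > 1 else None
            if prev is not None and rows[prev][order_key] == rows[idx][order_key]:
                ranks[idx] = ranks[prev]  # tie: reuse the previous row's rank
            else:
                ranks[idx] = position  # position in the partition, so gaps follow ties
    return ranks


# Made-up sample data, not taken from the Pokemon dataset.
rows = [
    {"name": "a", "type": "Bug", "attack": 20},
    {"name": "b", "type": "Bug", "attack": 20},
    {"name": "c", "type": "Bug", "attack": 35},
    {"name": "d", "type": "Fire", "attack": 52},
    {"name": "e", "type": "Fire", "attack": 40},
]
ranks = rank_within_partitions(rows, "type", "attack")
```

Each partition restarts its ranking at 1, which is why the filter ``rank < 3`` in the example keeps the lowest-ranked rows of every type rather than of the whole table.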
99+
100+
Window Frame
101+
^^^^^^^^^^^^
102+
103+
When using aggregate functions, the Window Frame defines the rows over which the function operates.
104+
If you do not specify a Window Frame, the frame will be set depending on the following
105+
criteria.
106+
107+
* If an ``order_by`` clause is set, the default window frame is defined as the rows between
108+
unbounded preceding and the current row.
109+
* If an ``order_by`` is not set, the default frame is defined as the rows between unbounded
110+
preceding and unbounded following (the entire partition).
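The two default frames can be sketched in plain Python with a sum. This is an illustration of the rule above, not DataFusion code; the helper ``windowed_sum`` is hypothetical, and the sample values are distinct so that ties in the ordering column do not come into play.

```python
def windowed_sum(values, has_order_by):
    """Sum over the default window frame (hypothetical helper).

    With an ordering, each row sees everything from the start of the
    partition up to itself, giving a running total. Without an ordering,
    each row sees the entire partition.
    """
    if has_order_by:
        running, out = 0, []
        for value in sorted(values):
            running += value  # frame: unbounded preceding .. current row
            out.append(running)
        return out
    total = sum(values)  # frame: unbounded preceding .. unbounded following
    return [total] * len(values)


running_totals = windowed_sum([3, 1, 2], has_order_by=True)
partition_totals = windowed_sum([3, 1, 2], has_order_by=False)
```

The same aggregate therefore yields a running total when ordered and a single repeated partition total when unordered.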
111+
112+
Window Frames are defined by three parameters: unit type, starting bound, and ending bound.
113+
114+
The unit types available are:
75115

76-
Window Functions can be configured using a builder approach to set a few parameters.
77-
To create a builder you simply need to call any one of these functions
116+
* Rows: The starting and ending boundaries are defined by the number of rows relative to the
117+
current row.
118+
* Range: When using Range, the ``order_by`` clause must have exactly one term. The boundaries
119+
are defined by how close the rows are to the value of the expression in the ``order_by``
120+
parameter.
121+
* Groups: A "group" is the set of all rows that have equivalent values for all terms in the
122+
``order_by`` clause.
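The difference between the three unit types is easier to see on a small example. The sketch below is plain Python and only models the descriptions in the list above; ``frame_indices`` is a hypothetical helper, and it assumes a single ascending ordering column with non-negative offsets.

```python
def frame_indices(order_values, current, unit, preceding, following):
    """Indices of the rows in the frame of row `current` (hypothetical helper).

    `order_values` must already be sorted ascending. `preceding` and
    `following` are non-negative offsets interpreted according to `unit`.
    """
    n = len(order_values)
    if unit == "rows":
        # Offsets count physical rows relative to the current row.
        lo = max(0, current - preceding)
        hi = min(n - 1, current + following)
        return list(range(lo, hi + 1))
    if unit == "range":
        # Offsets are distances from the current row's ordering value.
        value = order_values[current]
        return [
            i for i, v in enumerate(order_values)
            if value - preceding <= v <= value + following
        ]
    if unit == "groups":
        # Rows with equal ordering values form one peer group.
        group_ids, group = [], -1
        for i, v in enumerate(order_values):
            if i == 0 or v != order_values[i - 1]:
                group += 1
            group_ids.append(group)
        g = group_ids[current]
        return [
            i for i, gid in enumerate(group_ids)
            if g - preceding <= gid <= g + following
        ]
    raise ValueError(f"unknown unit: {unit}")


speeds = [10, 20, 20, 25, 40]  # made-up, already sorted
rows_frame = frame_indices(speeds, 3, "rows", 1, 0)      # one row back
range_frame = frame_indices(speeds, 3, "range", 15, 0)   # values in [10, 25]
groups_frame = frame_indices(speeds, 3, "groups", 1, 0)  # one peer group back
```

For the row with speed 25, "rows" reaches back one physical row, "range" reaches back to every row whose speed is within 15, and "groups" reaches back over the whole group of rows that share speed 20.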
78123

79-
- :py:func:`datafusion.expr.Expr.order_by` to set the window ordering.
80-
- :py:func:`datafusion.expr.Expr.null_treatment` to set how ``null`` values should be handled.
81-
- :py:func:`datafusion.expr.Expr.partition_by` to set the partitions for processing.
82-
- :py:func:`datafusion.expr.Expr.window_frame` to set boundary of operation.
124+
In this example we perform a "rolling average" of the speed of the current Pokemon and the
125+
two preceding rows.
83126

84-
After these parameters are set, you must call ``build()`` on the resultant object to get an
85-
expression as shown in the example above.
127+
.. ipython:: python
128+
129+
from datafusion.expr import WindowFrame
130+
131+
df.select(
132+
col('"Name"'),
133+
col('"Speed"'),
134+
f.window("avg",
135+
[col('"Speed"')],
136+
order_by=[col('"Speed"')],
137+
window_frame=WindowFrame("rows", 2, 0)
138+
).alias("Previous Speed")
139+
)
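As a cross-check on what a "rows" frame of two preceding rows through the current row computes, here is a minimal plain-Python sketch. It is an illustration only, not DataFusion code; ``rolling_average`` is a hypothetical helper and the speeds are made up.

```python
def rolling_average(values, preceding=2):
    """Average of each row and up to `preceding` rows before it.

    Hypothetical helper: rows are ordered ascending first, mirroring an
    ordering on the averaged column. Rows near the start of the partition
    simply have a shorter frame.
    """
    ordered = sorted(values)
    out = []
    for i in range(len(ordered)):
        frame = ordered[max(0, i - preceding): i + 1]
        out.append(sum(frame) / len(frame))
    return out


averages = rolling_average([45, 60, 80, 100])
```

Note that the first two rows average over one and two values respectively, because fewer than two preceding rows exist for them.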
140+
141+
Null Treatment
142+
^^^^^^^^^^^^^^
143+
144+
When using aggregate functions as window functions, it is often useful to specify how null values
145+
should be treated. In order to do this you need to use the builder function. In future releases
146+
we expect this to be simplified in the interface.
147+
148+
One common usage for handling nulls is the case where you want to find the last value up to the
149+
current row. In the following example we demonstrate how setting the null treatment to ignore
150+
nulls will fill in with the value of the most recent non-null row. To do this, we also will set
151+
the window frame so that we only process up to the current row.
152+
153+
In this example, we filter down to one specific type of Pokemon that does have some entries in
154+
its ``Type 2`` column that are null.
155+
156+
.. ipython:: python
157+
158+
from datafusion.common import NullTreatment
159+
160+
df.filter(col('"Type 1"') == lit("Bug")).select(
161+
'"Name"',
162+
'"Type 2"',
163+
f.window("last_value", [col('"Type 2"')])
164+
.window_frame(WindowFrame("rows", None, 0))
165+
.order_by(col('"Speed"'))
166+
.null_treatment(NullTreatment.IGNORE_NULLS)
167+
.build()
168+
.alias("last_wo_null"),
169+
f.window("last_value", [col('"Type 2"')])
170+
.window_frame(WindowFrame("rows", None, 0))
171+
.order_by(col('"Speed"'))
172+
.null_treatment(NullTreatment.RESPECT_NULLS)
173+
.build()
174+
.alias("last_with_null")
175+
)
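The difference between the two null treatments can be sketched in plain Python, using ``None`` for null. This models ``last_value`` over a frame that runs from the start of the partition to the current row; it is an illustration only, and ``last_value`` here is a hypothetical helper rather than the DataFusion function.

```python
def last_value(values, ignore_nulls):
    """last_value over 'unbounded preceding .. current row' (hypothetical helper).

    When `ignore_nulls` is True, a null row repeats the most recent
    non-null value. Otherwise the current row's own value is returned,
    null or not.
    """
    out, last_seen = [], None
    for value in values:
        if value is not None or not ignore_nulls:
            last_seen = value
        out.append(last_seen)
    return out


type2 = ["Poison", None, None, "Flying", None]  # made-up, already ordered
filled = last_value(type2, ignore_nulls=True)
kept = last_value(type2, ignore_nulls=False)
```

Ignoring nulls behaves like a forward fill, while respecting nulls returns the column unchanged. A null that appears before any non-null value stays null in both cases.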
86176
87177
Aggregate Functions
88178
-------------------
89179

90-
You can use any :ref:`Aggregation Function<aggregation>` as a window function. Currently
180+
You can use any :ref:`Aggregation Function<aggregation>` as a window function. Currently
91181
aggregate functions must use the deprecated
92182
:py:func:`datafusion.functions.window` API but this should be resolved in
93183
DataFusion 42.0 (`Issue Link <https://github.com/apache/datafusion-python/issues/833>`_). Here
