@@ -57,6 +57,10 @@ previous row in the DataFrame.
57
57
Setting Parameters
58
58
------------------
59
59
60
+
61
+ Ordering
62
+ ^^^^^^^^
63
+
60
64
You can control the order in which rows are processed by window functions by providing
61
65
a list of ``order_by`` functions for the ``order_by`` parameter.
62
66
@@ -66,28 +70,114 @@ a list of ``order_by`` functions for the ``order_by`` parameter.
66
70
col('"Name"'),
67
71
col('"Attack"'),
68
72
col('"Type 1"'),
69
- f.rank()
70
- .partition_by(col('"Type 1"'))
71
- .order_by(col('"Attack"').sort(ascending=True))
72
- .build()
73
- .alias("rank"),
74
- ).sort(col('"Type 1"').sort(), col('"Attack"').sort())
73
+ f.rank(
74
+ partition_by=[col('"Type 1"')],
75
+ order_by=[col('"Attack"').sort(ascending=True)],
76
+ ).alias("rank"),
77
+ ).sort(col('"Type 1"'), col('"Attack"'))
78
+
79
+ Partitions
80
+ ^^^^^^^^^^
81
+
82
+ A window function can take a list of ``partition_by`` columns, similar to an
83
+ :ref:`Aggregation Function<aggregation>`. This causes the window values to be evaluated
84
+ independently for each partition. In the example above, we found the rank of each
85
+ Pokemon within its ``Type 1`` partition. We can see the first couple of rows of each partition if we do
86
+ the following:
87
+
88
+ .. ipython:: python
89
+
90
+ df.select(
91
+ col('"Name"'),
92
+ col('"Attack"'),
93
+ col('"Type 1"'),
94
+ f.rank(
95
+ partition_by=[col('"Type 1"')],
96
+ order_by=[col('"Attack"').sort(ascending=True)],
97
+ ).alias("rank"),
98
+ ).filter(col("rank") < lit(3)).sort(col('"Type 1"'), col("rank"))
99
+
100
+ Window Frame
101
+ ^^^^^^^^^^^^
102
+
103
+ When using aggregate functions, the Window Frame defines the rows over which the function operates.
104
+ If you do not specify a Window Frame, the frame will be set depending on the following
105
+ criteria.
106
+
107
+ * If an ``order_by`` clause is set, the default window frame is defined as the rows between
108
+ unbounded preceding and the current row.
109
+ * If an ``order_by`` is not set, the default frame is defined as the rows between unbounded
110
+ preceding and unbounded following (the entire partition).
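The two defaults above can be illustrated with a plain-Python sketch. It does not call DataFusion; the ``values`` list is a made-up partition that is assumed to be already sorted by the ordering column and to contain no ties, so the result does not depend on how tied rows are treated.

```python
# Pure-Python illustration only; no DataFusion API is used here.
values = [10, 20, 30, 40]  # hypothetical partition, sorted by the order_by column

# With order_by: frame runs from the start of the partition to the current row,
# so a sum behaves like a running total.
running = [sum(values[: i + 1]) for i in range(len(values))]

# Without order_by: frame is the entire partition, so every row sees the same total.
whole = [sum(values)] * len(values)

print(running)  # [10, 30, 60, 100]
print(whole)    # [100, 100, 100, 100]
```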
111
+
112
+ Window Frames are defined by three parameters: unit type, starting bound, and ending bound.
113
+
114
+ The unit types available are:
75
115
76
- Window Functions can be configured using a builder approach to set a few parameters.
77
- To create a builder you simply need to call any one of these functions
116
+ * Rows: The starting and ending boundaries are defined by the number of rows relative to the
117
+ current row.
118
+ * Range: When using Range, the ``order_by`` clause must have exactly one term. The boundaries
119
+ are defined by how close the rows are to the value of the expression in the ``order_by``
120
+ parameter.
121
+ * Groups: A "group" is the set of all rows that have equivalent values for all terms in the
122
+ ``order_by`` clause.
78
123
79
- - :py:func:`datafusion.expr.Expr.order_by` to set the window ordering.
80
- - :py:func:`datafusion.expr.Expr.null_treatment` to set how ``null`` values should be handled.
81
- - :py:func:`datafusion.expr.Expr.partition_by` to set the partitions for processing.
82
- - :py:func:`datafusion.expr.Expr.window_frame` to set boundary of operation.
124
+ In this example we compute a "rolling average" of the speed over the current Pokemon and the
125
+ two preceding rows.
83
126
84
- After these parameters are set, you must call ``build()`` on the resultant object to get an
85
- expression as shown in the example above.
127
+ .. ipython:: python
128
+
129
+ from datafusion.expr import WindowFrame
130
+
131
+ df.select(
132
+ col('"Name"'),
133
+ col('"Speed"'),
134
+ f.window("avg",
135
+ [col('"Speed"')],
136
+ order_by=[col('"Speed"')],
137
+ window_frame=WindowFrame("rows", 2, 0)
138
+ ).alias("Previous Speed")
139
+ )
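To make the frame arithmetic concrete, here is a minimal plain-Python sketch of a rows-based rolling average. It assumes, as the text describes, that the frame spans the two preceding rows through the current row; ``rolling_avg`` and the ``speeds`` list are hypothetical names used only for illustration and are not part of DataFusion.

```python
def rolling_avg(values, preceding=2):
    """Average each row with up to `preceding` rows before it (rows-based frame)."""
    out = []
    for i in range(len(values)):
        # The frame is clipped at the start of the partition.
        start = max(0, i - preceding)
        frame = values[start : i + 1]
        out.append(sum(frame) / len(frame))
    return out


speeds = [5, 10, 15, 20, 25]  # made-up values, already sorted by speed
print(rolling_avg(speeds))  # [5.0, 7.5, 10.0, 15.0, 20.0]
```

Note that the first two rows average over fewer than three values because the frame cannot extend before the first row.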
140
+
141
+ Null Treatment
142
+ ^^^^^^^^^^^^^^
143
+
144
+ When using aggregate functions as window functions, it is often useful to specify how null values
145
+ should be treated. In order to do this you need to use the builder function. In future releases
146
+ we expect this to be simplified in the interface.
147
+
148
+ One common usage for handling nulls is the case where you want to find the last value up to the
149
+ current row. In the following example we demonstrate how setting the null treatment to ignore
150
+ nulls will fill in with the value of the most recent non-null row. To do this, we also will set
151
+ the window frame so that we only process up to the current row.
152
+
153
+ In this example, we filter down to one specific type of Pokemon that has some null entries in
154
+ its ``Type 2`` column.
155
+
156
+ .. ipython:: python
157
+
158
+ from datafusion.common import NullTreatment
159
+
160
+ df.filter(col('"Type 1"') == lit("Bug")).select(
161
+ '"Name"',
162
+ '"Type 2"',
163
+ f.window("last_value", [col('"Type 2"')])
164
+ .window_frame(WindowFrame("rows", None, 0))
165
+ .order_by(col('"Speed"'))
166
+ .null_treatment(NullTreatment.IGNORE_NULLS)
167
+ .build()
168
+ .alias("last_wo_null"),
169
+ f.window("last_value", [col('"Type 2"')])
170
+ .window_frame(WindowFrame("rows", None, 0))
171
+ .order_by(col('"Speed"'))
172
+ .null_treatment(NullTreatment.RESPECT_NULLS)
173
+ .build()
174
+ .alias("last_with_null")
175
+ )
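The difference between the two null treatments can be sketched in plain Python. This is only an illustration of the expected semantics for a frame that ends at the current row; ``last_value`` and the ``type2`` list are hypothetical and do not use the DataFusion API.

```python
def last_value(values, ignore_nulls):
    """Last value in a frame from the partition start up to the current row."""
    out = []
    last_seen = None
    for v in values:
        if ignore_nulls:
            # Carry forward the most recent non-null value.
            if v is not None:
                last_seen = v
            out.append(last_seen)
        else:
            # Respecting nulls: the last value in the frame is the current row itself.
            out.append(v)
    return out


type2 = ["Poison", None, None, "Flying", None]  # made-up column values
print(last_value(type2, ignore_nulls=True))   # ['Poison', 'Poison', 'Poison', 'Flying', 'Flying']
print(last_value(type2, ignore_nulls=False))  # ['Poison', None, None, 'Flying', None]
```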
86
176
87
177
Aggregate Functions
88
178
-------------------
89
179
90
- You can use any :ref:`Aggregation Function<aggregation>` as a window function. Currently
180
+ You can use any :ref:`Aggregation Function<aggregation>` as a window function. Currently
91
181
aggregate functions must use the deprecated
92
182
:py:func:`datafusion.functions.window` API but this should be resolved in
93
183
DataFusion 42.0 (`Issue Link <https://github.com/apache/datafusion-python/issues/833>`_). Here