.. specific language governing permissions and limitations
.. under the License.

+ .. _window_functions:
+
Window Functions
================

- In this section you will learn about window functions. A window function utilizes values from one or multiple rows to
- produce a result for each individual row, unlike an aggregate function that provides a single value for multiple rows.
+ In this section you will learn about window functions. A window function utilizes values from one or
+ multiple rows to produce a result for each individual row, unlike an aggregate function that
+ provides a single value for multiple rows.

- The functionality of window functions in DataFusion is supported by the dedicated :py:func:`~datafusion.functions.window` function.
+ The window functions are available in the :py:mod:`~datafusion.functions` module.

We'll use the pokemon dataset (from Ritchie Vink) in the following examples.

@@ -40,54 +43,176 @@ We'll use the pokemon dataset (from Ritchie Vink) in the following examples.
      ctx = SessionContext()
      df = ctx.read_csv("pokemon.csv")

- Here is an example that shows how to compare each pokemons’s attack power with the average attack power in its ``"Type 1"``
+ Here is an example that shows how you can compare each pokemon's speed to the speed of the
+ previous row in the DataFrame.

.. ipython:: python

      df.select(
          col('"Name"'),
-         col('"Attack"'),
-         f.alias(
-             f.window("avg", [col('"Attack"')], partition_by=[col('"Type 1"')]),
-             "Average Attack",
-         )
+         col('"Speed"'),
+         f.lag(col('"Speed"')).alias("Previous Speed")
      )

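To make the "previous row" idea concrete, here is a minimal plain-Python sketch of what a lag computes. It does not use DataFusion; the helper name ``lag_values`` and the sample speeds are hypothetical and exist only for illustration.

```python
from typing import List, Optional, Sequence


def lag_values(values: Sequence[int], offset: int = 1) -> List[Optional[int]]:
    """Return, for each position, the value `offset` rows earlier.

    Rows that have no earlier row at that distance get None.
    """
    return [values[i - offset] if i >= offset else None for i in range(len(values))]


# Hypothetical sample data, not taken from the pokemon dataset.
speeds = [45, 60, 80, 65]
previous = lag_values(speeds)

# Pair each speed with the speed of the row before it.
print(list(zip(speeds, previous)))
```

The first row has no predecessor, so its lagged value is empty, which mirrors the null you would expect in the first row of a lagged column.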
- You can also control the order in which rows are processed by window functions by providing
+ Setting Parameters
+ ------------------
+
+
+ Ordering
+ ^^^^^^^^
+
+ You can control the order in which rows are processed by window functions by providing
a list of ``order_by`` functions for the ``order_by`` parameter.

.. ipython:: python

      df.select(
          col('"Name"'),
          col('"Attack"'),
-         f.alias(
-             f.window(
-                 "rank",
-                 [],
-                 partition_by=[col('"Type 1"')],
-                 order_by=[f.order_by(col('"Attack"'))],
-             ),
-             "rank",
-         ),
+         col('"Type 1"'),
+         f.rank(
+             partition_by=[col('"Type 1"')],
+             order_by=[col('"Attack"').sort(ascending=True)],
+         ).alias("rank"),
+     ).sort(col('"Type 1"'), col('"Attack"'))
+
+ Partitions
+ ^^^^^^^^^^
+
+ A window function can take a list of ``partition_by`` columns similar to an
+ :ref:`Aggregation Function<aggregation>`. This will cause the window values to be evaluated
+ independently for each of the partitions. In the example above, we found the rank of each
+ Pokemon within its ``Type 1`` partition. We can see the first couple of rows of each partition
+ if we do the following:
+
+ .. ipython:: python
+
+     df.select(
+         col('"Name"'),
+         col('"Attack"'),
+         col('"Type 1"'),
+         f.rank(
+             partition_by=[col('"Type 1"')],
+             order_by=[col('"Attack"').sort(ascending=True)],
+         ).alias("rank"),
+     ).filter(col("rank") < lit(3)).sort(col('"Type 1"'), col("rank"))
+
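The following plain-Python sketch illustrates what ranking inside partitions means. It is not how DataFusion evaluates the query; the helper ``rank_within_partitions`` and the sample rows are hypothetical, and names are assumed to be unique.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_within_partitions(rows: List[Tuple[str, str, int]]) -> Dict[str, int]:
    """Map each name to its rank by ascending attack inside its type partition.

    Tied rows share a rank and the next rank is skipped (1, 2, 2, 4).
    """
    partitions: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for name, type_1, attack in rows:
        partitions[type_1].append((name, attack))

    ranks: Dict[str, int] = {}
    for members in partitions.values():
        ordered = sorted(members, key=lambda member: member[1])
        for position, (name, attack) in enumerate(ordered, start=1):
            if position > 1 and attack == ordered[position - 2][1]:
                # Same attack as the previous row: reuse that row's rank.
                ranks[name] = ranks[ordered[position - 2][0]]
            else:
                ranks[name] = position
    return ranks


# Hypothetical sample rows: (name, type, attack).
rows = [
    ("a", "Bug", 30),
    ("b", "Bug", 20),
    ("c", "Bug", 30),
    ("d", "Bug", 40),
    ("e", "Fire", 52),
]
ranks = rank_within_partitions(rows)

# Keep only ranks 1 and 2 per partition, like filter(col("rank") < lit(3)).
top = {name: rank for name, rank in ranks.items() if rank < 3}
print(top)
```

Each partition starts counting from 1 again, which is why row ``e`` gets rank 1 even though its attack is the highest overall.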
+ Window Frame
+ ^^^^^^^^^^^^
+
+ When using aggregate functions, the Window Frame defines the rows over which the function operates.
+ If you do not specify a Window Frame, the frame will be set depending on the following
+ criteria.
+
+ * If an ``order_by`` clause is set, the default window frame is defined as the rows between
+   unbounded preceding and the current row.
+ * If an ``order_by`` is not set, the default frame is defined as the rows between unbounded
+   preceding and unbounded following (the entire partition).
+
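As a rough illustration of the two default frames, the plain-Python sketch below computes a sum over a single partition both ways. It assumes the rows are already in ``order_by`` order and that the ordering column has no ties; the sample values are hypothetical.

```python
from itertools import accumulate

# Hypothetical values from one partition, already sorted by the ordering column.
values = [10, 20, 30]

# With an ordering, the frame runs from the start of the partition to the
# current row, so a sum behaves like a running total.
running_total = list(accumulate(values))

# Without an ordering, the frame is the entire partition, so every row
# sees the same total.
partition_total = [sum(values)] * len(values)

print(running_total)
print(partition_total)
```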
+ Window Frames are defined by three parameters: unit type, starting bound, and ending bound.
+
+ The unit types available are:
+
+ * Rows: The starting and ending boundaries are defined by the number of rows relative to the
+   current row.
+ * Range: When using Range, the ``order_by`` clause must have exactly one term. The boundaries
+   are defined by how close the rows are to the value of the expression in the ``order_by``
+   parameter.
+ * Groups: A "group" is the set of all rows that have equivalent values for all terms in the
+   ``order_by`` clause.
+
+ In this example we perform a "rolling average" of the speed of the current Pokemon and the
+ two preceding rows.
+
+ .. ipython:: python
+
+     from datafusion.expr import WindowFrame
+
+     df.select(
+         col('"Name"'),
+         col('"Speed"'),
+         f.window("avg",
+             [col('"Speed"')],
+             order_by=[col('"Speed"')],
+             window_frame=WindowFrame("rows", 2, 0)
+         ).alias("Previous Speed")
+     )
+
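A frame of two preceding rows through the current row can be sketched in plain Python as a slice over the ordered values. This is only an illustration of the frame arithmetic, not DataFusion code; ``rolling_mean`` and the sample speeds are hypothetical.

```python
from typing import List, Sequence


def rolling_mean(values: Sequence[float], preceding: int = 2) -> List[float]:
    """Average each value with up to `preceding` rows before it."""
    result = []
    for i in range(len(values)):
        # The frame is clipped at the start of the data, so early rows
        # average over fewer than `preceding + 1` values.
        frame = values[max(0, i - preceding): i + 1]
        result.append(sum(frame) / len(frame))
    return result


# Hypothetical speeds, already sorted by the ordering column.
speeds = [30, 45, 60, 90]
print(rolling_mean(speeds))  # [30.0, 37.5, 45.0, 65.0]
```

The first two rows average over one and two values respectively, because the frame cannot reach before the first row.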
+ Null Treatment
+ ^^^^^^^^^^^^^^
+
+ When using aggregate functions as window functions, it is often useful to specify how null values
+ should be treated. In order to do this you need to use the builder function. In future releases
+ we expect this to be simplified in the interface.
+
+ One common usage for handling nulls is the case where you want to find the last value up to the
+ current row. In the following example we demonstrate how setting the null treatment to ignore
+ nulls will fill in with the value of the most recent non-null row. To do this, we also will set
+ the window frame so that we only process up to the current row.
+
+ In this example, we filter down to one specific type of Pokemon that does have some entries in
+ its ``Type 2`` column that are null.
+
+ .. ipython:: python
+
+     from datafusion.common import NullTreatment
+
+     df.filter(col('"Type 1"') == lit("Bug")).select(
+         '"Name"',
+         '"Type 2"',
+         f.window("last_value", [col('"Type 2"')])
+             .window_frame(WindowFrame("rows", None, 0))
+             .order_by(col('"Speed"'))
+             .null_treatment(NullTreatment.IGNORE_NULLS)
+             .build()
+             .alias("last_wo_null"),
+         f.window("last_value", [col('"Type 2"')])
+             .window_frame(WindowFrame("rows", None, 0))
+             .order_by(col('"Speed"'))
+             .null_treatment(NullTreatment.RESPECT_NULLS)
+             .build()
+             .alias("last_with_null")
+     )
+
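The difference between the two null treatments can be sketched in plain Python, with ``None`` standing in for null. The helper ``last_value`` below is hypothetical and only mimics a frame that ends at the current row; it is not the DataFusion function of the same name.

```python
from typing import List, Optional, Sequence


def last_value(values: Sequence[Optional[str]], ignore_nulls: bool) -> List[Optional[str]]:
    """Last value seen up to and including each row.

    When `ignore_nulls` is True, None entries are skipped, so the most
    recent non-null value is carried forward.
    """
    result = []
    last_seen = None
    for value in values:
        if value is not None or not ignore_nulls:
            last_seen = value
        result.append(last_seen)
    return result


# Hypothetical "Type 2" values, already sorted by the ordering column.
type_2 = ["Poison", None, None, "Flying", None]

print(last_value(type_2, ignore_nulls=True))
print(last_value(type_2, ignore_nulls=False))
```

Ignoring nulls fills the gaps with the most recent non-null entry, while respecting nulls simply returns the current row's value, null or not.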
+ Aggregate Functions
+ -------------------
+
+ You can use any :ref:`Aggregation Function<aggregation>` as a window function. Currently
+ aggregate functions must use the deprecated
+ :py:func:`datafusion.functions.window` API but this should be resolved in
+ DataFusion 42.0 (`Issue Link <https://github.com/apache/datafusion-python/issues/833>`_). Here
+ is an example that shows how to compare each Pokemon's attack power with the average attack
+ power in its ``"Type 1"`` using the :py:func:`datafusion.functions.avg` function.
+
+ .. ipython:: python
+     :okwarning:
+
+     df.select(
+         col('"Name"'),
+         col('"Attack"'),
+         col('"Type 1"'),
+         f.window("avg", [col('"Attack"')])
+             .partition_by(col('"Type 1"'))
+             .build()
+             .alias("Average Attack"),
      )

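To show how an aggregate used as a window function differs from a plain aggregation, here is a plain-Python sketch that keeps one output row per input row while attaching the partition average. The helper ``average_per_partition`` and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def average_per_partition(
    rows: List[Tuple[str, str, int]]
) -> List[Tuple[str, int, str, float]]:
    """Attach the average attack of each row's type to that row."""
    attacks: Dict[str, List[int]] = defaultdict(list)
    for _, type_1, attack in rows:
        attacks[type_1].append(attack)
    averages = {type_1: mean(values) for type_1, values in attacks.items()}

    # Unlike a group-by, every input row is kept in the output.
    return [(name, attack, type_1, averages[type_1]) for name, type_1, attack in rows]


# Hypothetical sample rows: (name, type, attack).
rows = [("a", "Grass", 40), ("b", "Grass", 60), ("c", "Fire", 50)]
for row in average_per_partition(rows):
    print(row)
```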
+ Available Functions
+ -------------------
+
The possible window functions are:

1. Rank Functions
-     - rank
-     - dense_rank
-     - row_number
-     - ntile
+     - :py:func:`datafusion.functions.rank`
+     - :py:func:`datafusion.functions.dense_rank`
+     - :py:func:`datafusion.functions.ntile`
+     - :py:func:`datafusion.functions.row_number`

2. Analytical Functions
-     - cume_dist
-     - percent_rank
-     - lag
-     - lead
-     - first_value
-     - last_value
-     - nth_value
+     - :py:func:`datafusion.functions.cume_dist`
+     - :py:func:`datafusion.functions.percent_rank`
+     - :py:func:`datafusion.functions.lag`
+     - :py:func:`datafusion.functions.lead`

3. Aggregate Functions
-     - All aggregate functions can be used as window functions.
+     - All :ref:`Aggregation Functions<aggregation>` can be used as window functions.