Commit dff27ad

Working on documentation for window functions
1 parent 707eef6 commit dff27ad

3 files changed: +111 -88 lines changed

docs/source/user-guide/common-operations/aggregations.rst

+2
@@ -15,6 +15,8 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
+.. _aggregation:
+
 Aggregation
 ============
 

docs/source/user-guide/common-operations/windows.rst

+60-29
@@ -43,55 +43,86 @@ We'll use the pokemon dataset (from Ritchie Vink) in the following examples.
     ctx = SessionContext()
     df = ctx.read_csv("pokemon.csv")
 
-Here is an example that shows how to compare each pokemons’s attack power with the average attack
-power in its ``"Type 1"``
+Here is an example that shows how you can compare each pokemon's speed to the speed of the
+previous row in the DataFrame.
 
 .. ipython:: python
 
     df.select(
         col('"Name"'),
-        col('"Attack"'),
-        #f.alias(
-        #    f.window("avg", [col('"Attack"')], partition_by=[col('"Type 1"')]),
-        #    "Average Attack",
-        #)
+        col('"Speed"'),
+        f.lag(col('"Speed"')).alias("Previous Speed")
     )
 
-You can also control the order in which rows are processed by window functions by providing
+Setting Parameters
+------------------
+
+You can control the order in which rows are processed by window functions by providing
 a list of ``order_by`` functions for the ``order_by`` parameter.
 
 .. ipython:: python
 
     df.select(
         col('"Name"'),
         col('"Attack"'),
-        #f.alias(
-        #    f.window(
-        #        "rank",
-        #        [],
-        #        partition_by=[col('"Type 1"')],
-        #        order_by=[f.order_by(col('"Attack"'))],
-        #    ),
-        #    "rank",
-        #),
+        col('"Type 1"'),
+        f.rank()
+            .partition_by(col('"Type 1"'))
+            .order_by(col('"Attack"').sort(ascending=True))
+            .build()
+            .alias("rank"),
+    ).sort(col('"Type 1"').sort(), col('"Attack"').sort())
+
+Window Functions can be configured using a builder approach to set a few parameters.
+To create a builder you simply need to call any one of these functions:
+
+- :py:func:`datafusion.expr.Expr.order_by` to set the window ordering.
+- :py:func:`datafusion.expr.Expr.null_treatment` to set how ``null`` values should be handled.
+- :py:func:`datafusion.expr.Expr.partition_by` to set the partitions for processing.
+- :py:func:`datafusion.expr.Expr.window_frame` to set boundary of operation.
+
+After these parameters are set, you must call ``build()`` on the resultant object to get an
+expression as shown in the example above.
+
+Aggregate Functions
+-------------------
+
+You can use any :ref:`Aggregation Function<aggregation>` as a window function. Currently
+aggregate functions must use the deprecated
+:py:func:`datafusion.functions.window` API but this should be resolved in
+DataFusion 42.0 (`Issue Link <https://github.com/apache/datafusion-python/issues/833>`_). Here
+is an example that shows how to compare each pokemon's attack power with the average attack
+power in its ``"Type 1"`` using the :py:func:`datafusion.functions.avg` function.
+
+.. ipython:: python
+    :okwarning:
+
+    df.select(
+        col('"Name"'),
+        col('"Attack"'),
+        col('"Type 1"'),
+        f.window("avg", [col('"Attack"')])
+            .partition_by(col('"Type 1"'))
+            .build()
+            .alias("Average Attack"),
     )
 
+Available Functions
+-------------------
+
 The possible window functions are:
 
 1. Rank Functions
-    - rank
-    - dense_rank
-    - row_number
-    - ntile
+    - :py:func:`datafusion.functions.rank`
+    - :py:func:`datafusion.functions.dense_rank`
+    - :py:func:`datafusion.functions.ntile`
+    - :py:func:`datafusion.functions.row_number`
 
 2. Analytical Functions
-    - cume_dist
-    - percent_rank
-    - lag
-    - lead
-    - first_value
-    - last_value
-    - nth_value
+    - :py:func:`datafusion.functions.cume_dist`
+    - :py:func:`datafusion.functions.percent_rank`
+    - :py:func:`datafusion.functions.lag`
+    - :py:func:`datafusion.functions.lead`
 
 3. Aggregate Functions
-    - All aggregate functions can be used as window functions.
+    - All :ref:`Aggregation Functions<aggregation>` can be used as window functions.
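
Reviewer sketch (not part of the commit): the three documented examples (``lag``, ``rank`` with ``partition_by``/``order_by``, and ``avg`` over a partition) can be hard to picture without output. The pure-Python code below mimics what each one is expected to compute. It is only an illustration of the usual SQL window semantics and is not how DataFusion implements them. The sample rows and the helper names ``lag``, ``partitions``, ``rank_by_partition`` and ``average_by_partition`` are assumptions made up for this sketch; nothing is read from pokemon.csv.

```python
from collections import defaultdict
from statistics import mean

# Illustrative sample rows (assumed for this sketch, not read from pokemon.csv).
rows = [
    {"Name": "Bulbasaur", "Type 1": "Grass", "Attack": 49, "Speed": 45},
    {"Name": "Ivysaur", "Type 1": "Grass", "Attack": 62, "Speed": 60},
    {"Name": "Charmander", "Type 1": "Fire", "Attack": 52, "Speed": 65},
    {"Name": "Charmeleon", "Type 1": "Fire", "Attack": 64, "Speed": 80},
    {"Name": "Vulpix", "Type 1": "Fire", "Attack": 41, "Speed": 65},
]


def lag(values, offset=1, default=None):
    """Value found `offset` rows earlier in the current order, else `default`."""
    return [values[i - offset] if i >= offset else default for i in range(len(values))]


def partitions(rows, key):
    """Group rows by `key`, keeping partitions in first-seen order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def rank_by_partition(rows, partition_key, order_key):
    """Ascending rank of `order_key` inside each partition; ties share a rank."""
    ranks = {}
    for group in partitions(rows, partition_key).values():
        ordered = sorted(group, key=lambda r: r[order_key])
        for i, row in enumerate(ordered):
            if i > 0 and row[order_key] == ordered[i - 1][order_key]:
                ranks[row["Name"]] = ranks[ordered[i - 1]["Name"]]
            else:
                ranks[row["Name"]] = i + 1
    return ranks


def average_by_partition(rows, partition_key, value_key):
    """Attach the partition-wide mean of `value_key` to every row in the partition."""
    averages = {}
    for group in partitions(rows, partition_key).values():
        group_mean = mean(r[value_key] for r in group)
        for row in group:
            averages[row["Name"]] = group_mean
    return averages


previous_speed = lag([r["Speed"] for r in rows])                 # like f.lag(col('"Speed"'))
attack_rank = rank_by_partition(rows, "Type 1", "Attack")        # like f.rank() with partition/order
average_attack = average_by_partition(rows, "Type 1", "Attack")  # like the "avg" window
```

In this sketch the first row has no previous speed, so ``lag`` yields ``None`` for it, and every row of a partition receives the same average because no ordering or frame narrows the window.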

python/datafusion/functions.py

+49-59
@@ -259,6 +259,7 @@
     "dense_rank",
     "percent_rank",
     "cume_dist",
+    "ntile",
 ]
 
 
@@ -1816,18 +1817,16 @@ def rank() -> Expr:
     is an example of a dataframe with a window ordered by descending ``points`` and the
     associated rank.
 
-    You should set ``order_by`` to produce meaningful results.
+    You should set ``order_by`` to produce meaningful results::
 
-    ```
-    +--------+------+
-    | points | rank |
-    +--------+------+
-    | 100    | 1    |
-    | 100    | 1    |
-    | 50     | 3    |
-    | 25     | 4    |
-    +--------+------+
-    ```
+        +--------+------+
+        | points | rank |
+        +--------+------+
+        | 100    | 1    |
+        | 100    | 1    |
+        | 50     | 3    |
+        | 25     | 4    |
+        +--------+------+
 
     To set window function parameters use the window builder approach described in the
     ref:`_window_functions` online documentation.
@@ -1840,18 +1839,16 @@ def dense_rank() -> Expr:
 
     This window function is similar to :py:func:`rank` except that the returned values
     will be consecutive. Here is an example of a dataframe with a window ordered by
-    descending ``points`` and the associated dense rank.
+    descending ``points`` and the associated dense rank::
 
-    ```
-    +--------+------------+
-    | points | dense_rank |
-    +--------+------------+
-    | 100    | 1          |
-    | 100    | 1          |
-    | 50     | 2          |
-    | 25     | 3          |
-    +--------+------------+
-    ```
+        +--------+------------+
+        | points | dense_rank |
+        +--------+------------+
+        | 100    | 1          |
+        | 100    | 1          |
+        | 50     | 2          |
+        | 25     | 3          |
+        +--------+------------+
 
     To set window function parameters use the window builder approach described in the
     ref:`_window_functions` online documentation.
@@ -1865,18 +1862,16 @@ def percent_rank() -> Expr:
     This window function is similar to :py:func:`rank` except that the returned values
     are the percentage from 0.0 to 1.0 from first to last. Here is an example of a
     dataframe with a window ordered by descending ``points`` and the associated percent
-    rank.
+    rank::
 
-    ```
-    +--------+--------------+
-    | points | percent_rank |
-    +--------+--------------+
-    | 100    | 0.0          |
-    | 100    | 0.0          |
-    | 50     | 0.666667     |
-    | 25     | 1.0          |
-    +--------+--------------+
-    ```
+        +--------+--------------+
+        | points | percent_rank |
+        +--------+--------------+
+        | 100    | 0.0          |
+        | 100    | 0.0          |
+        | 50     | 0.666667     |
+        | 25     | 1.0          |
+        +--------+--------------+
 
     To set window function parameters use the window builder approach described in the
     ref:`_window_functions` online documentation.
@@ -1890,18 +1885,16 @@ def cume_dist() -> Expr:
     This window function is similar to :py:func:`rank` except that the returned values
     are the ratio of the row number to the total number of rows. Here is an example of a
     dataframe with a window ordered by descending ``points`` and the associated
-    cumulative distribution.
+    cumulative distribution::
 
-    ```
-    +--------+-----------+
-    | points | cume_dist |
-    +--------+-----------+
-    | 100    | 0.5       |
-    | 100    | 0.5       |
-    | 50     | 0.75      |
-    | 25     | 1.0       |
-    +--------+-----------+
-    ```
+        +--------+-----------+
+        | points | cume_dist |
+        +--------+-----------+
+        | 100    | 0.5       |
+        | 100    | 0.5       |
+        | 50     | 0.75      |
+        | 25     | 1.0       |
+        +--------+-----------+
 
     To set window function parameters use the window builder approach described in the
     ref:`_window_functions` online documentation.
@@ -1915,23 +1908,20 @@ def ntile(groups: int) -> Expr:
     This window function orders the window frame into a given number of groups based on
     the ordering criteria. It then returns which group the current row is assigned to.
     Here is an example of a dataframe with a window ordered by descending ``points``
-    and the associated n-tile function.
-
-    ```
-    +--------+-------+
-    | points | ntile |
-    +--------+-------+
-    | 120    | 1     |
-    | 100    | 1     |
-    | 80     | 2     |
-    | 60     | 2     |
-    | 40     | 3     |
-    | 20     | 3     |
-    +--------+-------+
-    ```
+    and the associated n-tile function::
+
+        +--------+-------+
+        | points | ntile |
+        +--------+-------+
+        | 120    | 1     |
+        | 100    | 1     |
+        | 80     | 2     |
+        | 60     | 2     |
+        | 40     | 3     |
+        | 20     | 3     |
+        +--------+-------+
 
     To set window function parameters use the window builder approach described in the
     ref:`_window_functions` online documentation.
     """
-    # Developer note: ntile only accepts literal values.
     return Expr(f.ntile(Expr.literal(groups).expr))
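
Reviewer sketch (not part of the commit): the docstring tables for ``rank``, ``dense_rank``, ``percent_rank``, ``cume_dist`` and ``ntile`` can be reproduced with a few lines of plain Python, which may help when checking the documented values. The functions below are a minimal sketch of the usual SQL definitions and take a list that is already in window order. They are not the DataFusion implementation, and the rule that ``ntile`` gives any leftover rows to the earliest groups is an assumption based on standard SQL behaviour.

```python
def rank(values):
    """1-based rank; tied values share a rank and the following rank is skipped."""
    out = []
    for i, value in enumerate(values):
        if i > 0 and value == values[i - 1]:
            out.append(out[-1])
        else:
            out.append(i + 1)
    return out


def dense_rank(values):
    """Like rank, but the ranks stay consecutive after ties."""
    out = []
    for i, value in enumerate(values):
        if i == 0:
            out.append(1)
        elif value == values[i - 1]:
            out.append(out[-1])
        else:
            out.append(out[-1] + 1)
    return out


def percent_rank(values):
    """(rank - 1) / (rows - 1), defined as 0.0 for a single-row window."""
    n = len(values)
    if n <= 1:
        return [0.0] * n
    return [(r - 1) / (n - 1) for r in rank(values)]


def cume_dist(values):
    """Fraction of rows that come at or before the current row, counting ties."""
    n = len(values)
    out = []
    for i, value in enumerate(values):
        last_peer = i
        while last_peer + 1 < n and values[last_peer + 1] == value:
            last_peer += 1
        out.append((last_peer + 1) / n)
    return out


def ntile(n_rows, groups):
    """Group number for each of n_rows ordered rows; earlier groups absorb leftovers."""
    base, extra = divmod(n_rows, groups)
    out = []
    for group in range(1, groups + 1):
        size = base + (1 if group <= extra else 0)
        out.extend([group] * size)
    return out


points = [100, 100, 50, 25]  # the values used in the docstring tables, descending
ranks = rank(points)
dense = dense_rank(points)
percent = [round(p, 6) for p in percent_rank(points)]
cumulative = cume_dist(points)
tiles = ntile(6, 3)  # six rows split into three groups, as in the ntile table
```

With these inputs the sketch gives the same values as the tables in the docstrings above.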
