Working on documentation for window functions

timsaucer · timsaucer · commit dff27ad551e2 · 2024-08-23T20:46:40.000-04:00
diff --git a/docs/source/user-guide/common-operations/aggregations.rst b/docs/source/user-guide/common-operations/aggregations.rst
@@ -15,6 +15,8 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
+.. _aggregation:
+
 Aggregation
 ============
 
diff --git a/docs/source/user-guide/common-operations/windows.rst b/docs/source/user-guide/common-operations/windows.rst
@@ -43,55 +43,86 @@ We'll use the pokemon dataset (from Ritchie Vink) in the following examples.
     ctx = SessionContext()
     df = ctx.read_csv("pokemon.csv")
 
-Here is an example that shows how to compare each pokemons’s attack power with the average attack
-power in its ``"Type 1"``
+Here is an example that shows how you can compare each pokemon's speed to the speed of the
+previous row in the DataFrame.
 
 .. ipython:: python
 
     df.select(
         col('"Name"'),
-        col('"Attack"'),
-        #f.alias(
-        #    f.window("avg", [col('"Attack"')], partition_by=[col('"Type 1"')]),
-        #    "Average Attack",
-        #)
+        col('"Speed"'),
+        f.lag(col('"Speed"')).alias("Previous Speed")
     )
 
-You can also control the order in which rows are processed by window functions by providing
+Setting Parameters
+------------------
+
+You can control the order in which rows are processed by window functions by providing
 a list of ``order_by`` functions for the ``order_by`` parameter.
 
 .. ipython:: python
 
     df.select(
         col('"Name"'),
         col('"Attack"'),
-        #f.alias(
-        #    f.window(
-        #        "rank",
-        #        [],
-        #        partition_by=[col('"Type 1"')],
-        #        order_by=[f.order_by(col('"Attack"'))],
-        #    ),
-        #    "rank",
-        #),
+        col('"Type 1"'),
+        f.rank()
+            .partition_by(col('"Type 1"'))
+            .order_by(col('"Attack"').sort(ascending=True))
+            .build()
+            .alias("rank"),
+    ).sort(col('"Type 1"').sort(), col('"Attack"').sort())
+
+Window functions can be configured using a builder approach to set a few parameters.
+To create a builder, call any one of these functions:
+
+- :py:func:`datafusion.expr.Expr.order_by` to set the window ordering.
+- :py:func:`datafusion.expr.Expr.null_treatment` to set how ``null`` values should be handled.
+- :py:func:`datafusion.expr.Expr.partition_by` to set the partitions for processing.
+- :py:func:`datafusion.expr.Expr.window_frame` to set the boundaries of the window frame.
+
+After these parameters are set, you must call ``build()`` on the resultant object to get an
+expression as shown in the example above.
+
+Aggregate Functions
+-------------------
+
+You can use any :ref:`Aggregation Function<aggregation>` as a window function. Currently,
+aggregate functions must use the deprecated
+:py:func:`datafusion.functions.window` API, but this should be resolved in
+DataFusion 42.0 (`Issue Link <https://github.com/apache/datafusion-python/issues/833>`_). Here
+is an example that shows how to compare each pokemon's attack power with the average attack
+power in its ``"Type 1"`` using the :py:func:`datafusion.functions.avg` function.
+
+.. ipython:: python
+    :okwarning:
+
+    df.select(
+        col('"Name"'),
+        col('"Attack"'),
+        col('"Type 1"'),
+        f.window("avg", [col('"Attack"')])
+            .partition_by(col('"Type 1"'))
+            .build()
+            .alias("Average Attack"),
     )
 
+Available Functions
+-------------------
+
 The possible window functions are:
 
 1. Rank Functions
-    - rank
-    - dense_rank
-    - row_number
-    - ntile
+    - :py:func:`datafusion.functions.rank`
+    - :py:func:`datafusion.functions.dense_rank`
+    - :py:func:`datafusion.functions.ntile`
+    - :py:func:`datafusion.functions.row_number`
 
 2. Analytical Functions
-    - cume_dist
-    - percent_rank
-    - lag
-    - lead
-    - first_value
-    - last_value
-    - nth_value
+    - :py:func:`datafusion.functions.cume_dist`
+    - :py:func:`datafusion.functions.percent_rank`
+    - :py:func:`datafusion.functions.lag`
+    - :py:func:`datafusion.functions.lead`
 
 3. Aggregate Functions
-    - All aggregate functions can be used as window functions.
+    - All :ref:`Aggregation Functions<aggregation>` can be used as window functions.
diff --git a/python/datafusion/functions.py b/python/datafusion/functions.py
@@ -259,6 +259,7 @@
     "dense_rank",
     "percent_rank",
     "cume_dist",
+    "ntile",
 ]
 
 
@@ -1816,18 +1817,16 @@ def rank() -> Expr:
     is an example of a dataframe with a window ordered by descending ``points`` and the
     associated rank.
 
-    You should set ``order_by`` to produce meaningful results.
+    You should set ``order_by`` to produce meaningful results::
 
-    ```
-    +--------+------+
-    | points | rank |
-    +--------+------+
-    | 100    | 1    |
-    | 100    | 1    |
-    | 50     | 3    |
-    | 25     | 4    |
-    +--------+------+
-    ```
+        +--------+------+
+        | points | rank |
+        +--------+------+
+        | 100    | 1    |
+        | 100    | 1    |
+        | 50     | 3    |
+        | 25     | 4    |
+        +--------+------+
 
     To set window function parameters use the window builder approach described in the
     ref:`_window_functions` online documentation.
@@ -1840,18 +1839,16 @@ def dense_rank() -> Expr:
 
     This window function is similar to :py:func:`rank` except that the returned values
     will be consecutive. Here is an example of a dataframe with a window ordered by
-    descending ``points`` and the associated dense rank.
+    descending ``points`` and the associated dense rank::
 
-    ```
-    +--------+------------+
-    | points | dense_rank |
-    +--------+------------+
-    | 100    | 1          |
-    | 100    | 1          |
-    | 50     | 2          |
-    | 25     | 3          |
-    +--------+------------+
-    ```
+        +--------+------------+
+        | points | dense_rank |
+        +--------+------------+
+        | 100    | 1          |
+        | 100    | 1          |
+        | 50     | 2          |
+        | 25     | 3          |
+        +--------+------------+
 
     To set window function parameters use the window builder approach described in the
     ref:`_window_functions` online documentation.
@@ -1865,18 +1862,16 @@ def percent_rank() -> Expr:
     This window function is similar to :py:func:`rank` except that the returned values
     are the percentage from 0.0 to 1.0 from first to last. Here is an example of a
     dataframe with a window ordered by descending ``points`` and the associated percent
-    rank.
+    rank::
 
-    ```
-    +--------+--------------+
-    | points | percent_rank |
-    +--------+--------------+
-    | 100    | 0.0          |
-    | 100    | 0.0          |
-    | 50     | 0.666667     |
-    | 25     | 1.0          |
-    +--------+--------------+
-    ```
+        +--------+--------------+
+        | points | percent_rank |
+        +--------+--------------+
+        | 100    | 0.0          |
+        | 100    | 0.0          |
+        | 50     | 0.666667     |
+        | 25     | 1.0          |
+        +--------+--------------+
 
     To set window function parameters use the window builder approach described in the
     ref:`_window_functions` online documentation.
@@ -1890,18 +1885,16 @@ def cume_dist() -> Expr:
     This window function is similar to :py:func:`rank` except that the returned values
     are the ratio of the row number to the total numebr of rows. Here is an example of a
     dataframe with a window ordered by descending ``points`` and the associated
-    cumulative distribution.
+    cumulative distribution::
 
-    ```
-    +--------+-----------+
-    | points | cume_dist |
-    +--------+-----------+
-    | 100    | 0.5       |
-    | 100    | 0.5       |
-    | 50     | 0.75      |
-    | 25     | 1.0       |
-    +--------+-----------+
-    ```
+        +--------+-----------+
+        | points | cume_dist |
+        +--------+-----------+
+        | 100    | 0.5       |
+        | 100    | 0.5       |
+        | 50     | 0.75      |
+        | 25     | 1.0       |
+        +--------+-----------+
 
     To set window function parameters use the window builder approach described in the
     ref:`_window_functions` online documentation.
@@ -1915,23 +1908,20 @@ def ntile(groups: int) -> Expr:
     This window function orders the window frame into a give number of groups based on
     the ordering criteria. It then returns which group the current row is assigned to.
     Here is an example of a dataframe with a window ordered by descending ``points``
-    and the associated n-tile function.
-
-    ```
-    +--------+-------+
-    | points | ntile |
-    +--------+-------+
-    | 120    | 1     |
-    | 100    | 1     |
-    | 80     | 2     |
-    | 60     | 2     |
-    | 40     | 3     |
-    | 20     | 3     |
-    +--------+-------+
-    ```
+    and the associated n-tile function::
+
+        +--------+-------+
+        | points | ntile |
+        +--------+-------+
+        | 120    | 1     |
+        | 100    | 1     |
+        | 80     | 2     |
+        | 60     | 2     |
+        | 40     | 3     |
+        | 20     | 3     |
+        +--------+-------+
 
     To set window function parameters use the window builder approach described in the
     ref:`_window_functions` online documentation.
     """
-    # Developer note: ntile only accepts literal values.
     return Expr(f.ntile(Expr.literal(groups).expr))
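
Review note: the docstring tables for ``rank``, ``dense_rank``, ``percent_rank``, ``cume_dist`` and ``ntile`` can be cross-checked with a small pure-Python sketch. This is not DataFusion code and does not call the library. The helper names below are hypothetical, and the formulas are the usual SQL definitions, assumed here to match DataFusion's behavior for a single partition ordered by descending ``points``.

```python
from typing import List


def rank(points: List[float]) -> List[int]:
    """Rank with gaps: tied rows share a rank and the next rank skips ahead."""
    ordered = sorted(points, reverse=True)
    # list.index returns the first position of a value, so peers share a rank.
    return [ordered.index(p) + 1 for p in ordered]


def dense_rank(points: List[float]) -> List[int]:
    """Rank without gaps: the returned values are consecutive."""
    ordered = sorted(points, reverse=True)
    distinct = sorted(set(points), reverse=True)
    return [distinct.index(p) + 1 for p in ordered]


def percent_rank(points: List[float]) -> List[float]:
    """(rank - 1) / (rows - 1), running from 0.0 to 1.0."""
    n = len(points)
    if n <= 1:
        return [0.0] * n
    return [(r - 1) / (n - 1) for r in rank(points)]


def cume_dist(points: List[float]) -> List[float]:
    """Fraction of rows at or before the current row, counting its peers."""
    ordered = sorted(points, reverse=True)
    n = len(ordered)
    return [sum(1 for other in ordered if other >= p) / n for p in ordered]


def ntile(points: List[float], groups: int) -> List[int]:
    """Split the ordered rows into `groups` buckets of near-equal size."""
    base, extra = divmod(len(points), groups)
    result: List[int] = []
    for bucket in range(1, groups + 1):
        # Earlier buckets absorb the remainder rows, one each.
        size = base + (1 if bucket <= extra else 0)
        result.extend([bucket] * size)
    return result


if __name__ == "__main__":
    sample = [100, 100, 50, 25]
    print(rank(sample))
    print(dense_rank(sample))
    print([round(v, 6) for v in percent_rank(sample)])
    print(cume_dist(sample))
    print(ntile([120, 100, 80, 60, 40, 20], 3))
```

Each helper returns values in the same descending ``points`` order as the docstring tables, so the printed lists can be compared with the table columns row by row.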