Aggregation
============

An aggregate or aggregation is a function where the values of multiple rows are processed together
to form a single summary value. For performing an aggregation, DataFusion provides the
:py:func:`~datafusion.dataframe.DataFrame.aggregate` method.

.. ipython:: python

    import urllib.request
    from datafusion import SessionContext
    from datafusion import col, lit
    from datafusion import functions as f

    urllib.request.urlretrieve(
        "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
        "pokemon.csv",
    )

    ctx = SessionContext()
    df = ctx.read_csv("pokemon.csv")

    col_type_1 = col('"Type 1"')
    col_type_2 = col('"Type 2"')
    col_speed = col('"Speed"')
    col_attack = col('"Attack"')

    df.aggregate([col_type_1], [
        f.approx_distinct(col_speed).alias("Count"),
        f.approx_median(col_speed).alias("Median Speed"),
        f.approx_percentile_cont(col_speed, 0.9).alias("90% Speed")])

When the :code:`group_by` list is empty the aggregation is done over the whole :class:`.DataFrame`.
For grouping, the :code:`group_by` list must contain at least one column.
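
For example, aggregating with an empty ``group_by`` list collapses the whole DataFrame into a
single summary row. Here is a small self-contained sketch that uses a made-up table instead of the
Pokemon data, so the expected values are easy to verify:

.. ipython:: python

    from datafusion import SessionContext, col
    from datafusion import functions as f

    demo_ctx = SessionContext()
    demo = demo_ctx.from_pydict({"speed": [45, 60, 80, 120]})

    # No grouping columns: the result is a single row summarizing all input rows
    demo.aggregate([], [
        f.min(col("speed")).alias("Min"),
        f.avg(col("speed")).alias("Avg"),
        f.max(col("speed")).alias("Max")])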

.. ipython:: python

    df.aggregate([col_type_1], [
        f.max(col_speed).alias("Max Speed"),
        f.avg(col_speed).alias("Avg Speed"),
        f.min(col_speed).alias("Min Speed")])

More than one column can be used for grouping.

.. ipython:: python

    df.aggregate([col_type_1, col_type_2], [
        f.max(col_speed).alias("Max Speed"),
        f.avg(col_speed).alias("Avg Speed"),
        f.min(col_speed).alias("Min Speed")])

Setting Parameters
------------------

Each of the built-in aggregate functions provides arguments for the parameters that affect its
operation. These parameters can also be set with a builder approach. When you use the builder, you
must call ``build()`` to finish. For example, these two expressions are equivalent.

.. ipython:: python

    first_1 = f.first_value(col("a"), order_by=[col("a")])
    first_2 = f.first_value(col("a")).order_by(col("a")).build()
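
Both forms evaluate to the same result. A minimal self-contained check (illustrative data rather
than the Pokemon set):

.. ipython:: python

    from datafusion import SessionContext, col
    from datafusion import functions as f

    builder_ctx = SessionContext()
    builder_df = builder_ctx.from_pydict({"a": [3, 1, 2]})

    # Both expressions take the first value of "a" after sorting by "a"
    builder_df.aggregate([], [
        f.first_value(col("a"), order_by=[col("a")]).alias("with_argument"),
        f.first_value(col("a")).order_by(col("a")).build().alias("with_builder")])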

Ordering
^^^^^^^^

You can control the order in which rows are processed by aggregate functions by providing a list
of sort expressions for the ``order_by`` parameter. In the following example, we sort the Pokemon
by their attack values in increasing order and take the first value, which gives us the Pokemon
with the smallest attack value in each ``Type 1``.

.. ipython:: python

    df.aggregate(
        [col('"Type 1"')],
        [f.first_value(
            col('"Name"'),
            order_by=[col('"Attack"').sort(ascending=True)]
        ).alias("Smallest Attack")
    ])

Distinct
^^^^^^^^

When you set the parameter ``distinct`` to ``True``, each unique value is evaluated only once.
Suppose we want to create an array of all of the ``Type 2`` values for each ``Type 1`` in our
Pokemon set. Since there will be many repeated entries of ``Type 2``, we only want one of each
distinct value.

.. ipython:: python

    df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True).alias("Type 2 List")])

In the output of the above we can see that there are some ``Type 1`` values for which the
``Type 2`` entry is ``null``. In practice, we probably want to filter those out. We can do this in
two ways. First, we can filter out the DataFrame rows that have no ``Type 2``; if we do this, some
``Type 1`` entries may be removed entirely. Second, we can use the ``filter`` argument described
below.

.. ipython:: python

    df.filter(col_type_2.is_not_null()).aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True).alias("Type 2 List")])

    df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True, filter=col_type_2.is_not_null()).alias("Type 2 List")])

Which approach you take should depend on your use case.

Null Treatment
^^^^^^^^^^^^^^

This option allows you to either respect or ignore null values.

A common use is finding the first value within a partition. By setting the null treatment to
ignore nulls, we can find the first non-null value in our partition.

.. ipython:: python

    from datafusion.common import NullTreatment

    df.aggregate([col_type_1], [
        f.first_value(
            col_type_2,
            order_by=[col_attack],
            null_treatment=NullTreatment.RESPECT_NULLS
        ).alias("Lowest Attack Type 2")])

    df.aggregate([col_type_1], [
        f.first_value(
            col_type_2,
            order_by=[col_attack],
            null_treatment=NullTreatment.IGNORE_NULLS
        ).alias("Lowest Attack Type 2")])

Filter
^^^^^^

The ``filter`` option restricts which rows are evaluated by an aggregate function, without
filtering rows from the rest of the DataFrame. The examples above show how this can be used to
exclude rows from a single aggregate expression. Filter takes a single boolean expression.

Suppose we want to find the average speed of only those Pokemon that have low attack values.

.. ipython:: python

    df.aggregate([col_type_1], [
        f.avg(col_speed).alias("Avg Speed All"),
        f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])

Aggregate Functions
-------------------

The available aggregate functions are:

1. Comparison Functions

   - :py:func:`datafusion.functions.min`
   - :py:func:`datafusion.functions.max`

2. Math Functions

   - :py:func:`datafusion.functions.sum`
   - :py:func:`datafusion.functions.avg`
   - :py:func:`datafusion.functions.median`

3. Array Functions

   - :py:func:`datafusion.functions.array_agg`

4. Logical Functions

   - :py:func:`datafusion.functions.bit_and`
   - :py:func:`datafusion.functions.bit_or`
   - :py:func:`datafusion.functions.bit_xor`
   - :py:func:`datafusion.functions.bool_and`
   - :py:func:`datafusion.functions.bool_or`

5. Statistical Functions

   - :py:func:`datafusion.functions.count`
   - :py:func:`datafusion.functions.corr`
   - :py:func:`datafusion.functions.covar_samp`
   - :py:func:`datafusion.functions.covar_pop`
   - :py:func:`datafusion.functions.stddev`
   - :py:func:`datafusion.functions.stddev_pop`
   - :py:func:`datafusion.functions.var_samp`
   - :py:func:`datafusion.functions.var_pop`

6. Linear Regression Functions

   - :py:func:`datafusion.functions.regr_count`
   - :py:func:`datafusion.functions.regr_slope`
   - :py:func:`datafusion.functions.regr_intercept`
   - :py:func:`datafusion.functions.regr_r2`
   - :py:func:`datafusion.functions.regr_avgx`
   - :py:func:`datafusion.functions.regr_avgy`
   - :py:func:`datafusion.functions.regr_sxx`
   - :py:func:`datafusion.functions.regr_syy`
   - :py:func:`datafusion.functions.regr_sxy`

7. Positional Functions

   - :py:func:`datafusion.functions.first_value`
   - :py:func:`datafusion.functions.last_value`
   - :py:func:`datafusion.functions.nth_value`

8. String Functions

   - :py:func:`datafusion.functions.string_agg`

9. Approximation Functions

   - :py:func:`datafusion.functions.approx_distinct`
   - :py:func:`datafusion.functions.approx_median`
   - :py:func:`datafusion.functions.approx_percentile_cont`
   - :py:func:`datafusion.functions.approx_percentile_cont_with_weight`
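
As a quick illustration of a few of the statistical functions above, here is a self-contained
sketch with made-up numbers; the ``y`` column is an exact multiple of ``x``, so the correlation
is 1.0:

.. ipython:: python

    from datafusion import SessionContext, col
    from datafusion import functions as f

    stats_ctx = SessionContext()
    stats_df = stats_ctx.from_pydict({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

    stats_df.aggregate([], [
        f.count(col("x")).alias("count"),
        f.median(col("x")).alias("median"),
        f.corr(col("y"), col("x")).alias("corr")])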