docs(python): Expand NumPy section in the user guide with ufunc info #13392

Closed
wants to merge 10 commits into from
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -2,6 +2,7 @@ pandas
pyarrow
graphviz
matplotlib
numba
seaborn
plotly
altair
17 changes: 17 additions & 0 deletions docs/src/python/user-guide/expressions/numba-example.py
@@ -0,0 +1,17 @@
import polars as pl
import numba as nb

df = pl.DataFrame({"a": [10, 9, 8, 7]})


@nb.guvectorize([(nb.int64[:], nb.int64, nb.int64[:])], "(n),()->(n)")
def cum_sum_reset(x, y, res):
    res[0] = x[0]
    for i in range(1, x.shape[0]):
        res[i] = x[i] + res[i - 1]
        if res[i] >= y:
            res[i] = x[i]


out = df.select(cum_sum_reset(pl.all(), 5))
print(out)
Comment on lines +1 to +17
Collaborator

I'm confused by this example - for these particular numbers, they're all above the threshold, so wouldn't the result just be the same as the input? maybe run it with cum_sum_reset(pl.all(), 30)?
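To make the point about the threshold concrete, here is a minimal pure-Python sketch that mirrors the loop in the kernel above. The helper name `cum_sum_reset_py` is hypothetical and is not part of the PR; it assumes the same reset rule (restart from the current value once the running total reaches the threshold) and does not involve Numba or Polars.

```python
def cum_sum_reset_py(values, threshold):
    """Pure-Python mirror of the loop in the guvectorize kernel (hypothetical helper)."""
    res = [values[0]]
    for x in values[1:]:
        total = x + res[-1]
        # Restart from the current value once the running total reaches the threshold.
        res.append(x if total >= threshold else total)
    return res


data = [10, 9, 8, 7]
low = cum_sum_reset_py(data, 5)    # every running total is >= 5, so each step resets
high = cum_sum_reset_py(data, 30)  # only the last running total (34) reaches 30
print(low)   # [10, 9, 8, 7]
print(high)  # [10, 19, 27, 7]
```

With a threshold of 5 the output equals the input, while a threshold of 30 shows both the accumulation and one reset.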

22 changes: 19 additions & 3 deletions docs/user-guide/expressions/numpy.md
@@ -1,9 +1,9 @@
# Numpy
# Numpy ufuncs

Polars expressions support NumPy [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html). See [here](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs)
for a list on all supported numpy functions.
for a list of all supported NumPy functions. Additionally, SciPy offers a wide range of ufuncs. Specifically, the [scipy.special](https://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special) namespace has ufunc versions of many (possibly most) of the functions available under `scipy.stats`.

This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operation through the NumPy API.
This means that if a function is not provided by Polars, we can use NumPy and still get fast columnar operations through the NumPy API. A ufunc has a hook that diverts its own execution when one of its inputs is an instance of a class with the [`__array_ufunc__`](https://numpy.org/doc/stable/reference/arrays.classes.html#special-attributes-and-methods) method. The Polars `Expr` class has this method, which allows ufuncs to be called directly in a context (`select`, `with_columns`, `agg`) with the relevant expressions as input. This syntax extends even to functions with multiple inputs.
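
The hook described above can be illustrated with a minimal NumPy-only sketch. The `Recorder` class is a hypothetical stand-in, not Polars code: it only shows that NumPy hands control to `__array_ufunc__` when an input defines it, for both one-input and two-input ufuncs.

```python
import numpy as np


class Recorder:
    """Hypothetical stand-in for an expression-like object (not Polars code)."""

    def __init__(self, values):
        self.values = np.asarray(values)
        self.calls = []

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # NumPy calls this method instead of computing the result itself.
        self.calls.append((ufunc.__name__, method))
        # Unwrap any Recorder inputs, then run the ufunc on the raw arrays.
        raw = [x.values if isinstance(x, Recorder) else x for x in inputs]
        return getattr(ufunc, method)(*raw, **kwargs)


rec = Recorder([1.0, 2.0, 3.0])
single = np.exp(rec)        # one-input ufunc
double = np.add(rec, 10.0)  # two-input ufunc goes through the same hook
print(rec.calls)
```

The object receiving the call decides what to do with the ufunc, which is the opportunity a library can use to run its own code first.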

### Example

@@ -13,10 +13,26 @@ This means that if a function is not provided by Polars, we can use NumPy and we
--8<-- "python/user-guide/expressions/numpy-example.py"
```

## Numba

[Numba](https://numba.pydata.org/) is an open source JIT compiler that allows you to create your own ufuncs entirely within Python. The key is to use the [@guvectorize](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-guvectorize-decorator) decorator. One popular use case is conditional cumulative functions. For example, suppose you want to take a cumulative sum but have it reset whenever it reaches a threshold.

### Example

{{code_block('user-guide/expressions/numpy-example',api_functions=['DataFrame'])}}

```python exec="on" result="text" session="user-guide/numpy"
--8<-- "python/user-guide/expressions/numba-example.py"
```

### Interoperability

Polars `Series` have support for NumPy universal functions (ufuncs). Element-wise functions such as `np.exp()`, `np.cos()`, `np.divide()`, etc. all work with almost zero overhead.

However, as a Polars-specific remark: missing values are stored in a separate bitmask and are not visible to NumPy. This can lead to a window function or a `np.convolve()` giving flawed or incomplete results.

Convert a Polars `Series` to a NumPy array with the `.to_numpy()` method. Missing values will be replaced by `np.nan` during the conversion.
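
As a rough illustration of why missing values matter for windowed operations, here is a minimal NumPy-only sketch. It assumes a missing value has already become `np.nan` after conversion, as described above; the array values and the two-element kernel are made up for illustration.

```python
import numpy as np

# Assumption: the second entry was a missing value that became NaN on conversion.
values = np.array([1.0, np.nan, 3.0, 4.0])

# A two-element moving sum; every window that touches the NaN is NaN as well.
window_sum = np.convolve(values, np.ones(2), mode="valid")
print(window_sum)

# NaN-aware functions such as np.nansum skip the placeholder instead.
total = np.nansum(values)
print(total)
```

Only the last window avoids the placeholder, so a single missing value affects two of the three results.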

### Note on Performance

The speed of ufuncs comes from being vectorized and compiled. That said, there's no inherent benefit in using ufuncs just to avoid `map_batches`. As mentioned above, ufuncs use a hook that gives Polars the opportunity to run its own code before the ufunc is executed. In that way, Polars is still executing the ufunc with `map_batches`.
Collaborator

true, but map_batches has is_elementwise=False by default, and so will do the expected thing in group-by / over

maybe, rather than avoiding map_batches, map_batches should be the primary way the ufuncs are taught?

Collaborator Author

@MarcoGorelli

It turns out that all the numpy and scipy ufuncs are element-wise in the sense that they aren't aggregations. Where this becomes important is if someone wants to do a sum and then the np.expm1 function. The ufunc array hook will do is_elementwise=true. If it's a numba ufunc then it won't.

Whether people should be taught to not use the hook is more philosophical imo. I suppose there's slightly less parsing in that case so would be technically more performant but by using it, it's a nice syntax, imo.

3 changes: 1 addition & 2 deletions docs/user-guide/expressions/user-defined-functions.md
@@ -18,8 +18,7 @@ These functions have an important distinction in how they operate and consequent
A `map_batches` passes the `Series` backed by the `expression` as is.

`map_batches` follows the same rules in both the `select` and the `group_by` context, this will
mean that the `Series` represents a column in a `DataFrame`. Note that in the `group_by` context, that column is not yet
aggregated!
mean that the `Series` represents a column in a `DataFrame`. To be clear, **using a `group_by` or `over` with `map_batches` will return results as though there was no group at all.**
Collaborator

@MarcoGorelli MarcoGorelli Mar 20, 2024

this isn't true anymore, is it?

In [18]: df.with_columns(result=pl.col('b').map_batches(lambda x: np.cumsum(x)).over('a'))
Out[18]:
shape: (3, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ result │
│ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ 1   ┆ 4   ┆ 4      │
│ 1   ┆ 5   ┆ 9      │
│ 2   ┆ 6   ┆ 6      │
└─────┴─────┴────────┘

Collaborator Author

You're right. I've lost quite a bit of steam on this.


Use cases for `map_batches` are for instance passing the `Series` in an expression to a third party library. Below we show how
we could use `map_batches` to pass an expression column to a neural network model.
1 change: 1 addition & 0 deletions py-polars/docs/requirements-docs.txt
@@ -3,6 +3,7 @@
numpy
pandas
pyarrow
numba

hypothesis==6.97.4

1 change: 1 addition & 0 deletions py-polars/requirements-dev.txt
@@ -60,6 +60,7 @@ hypothesis==6.97.4
pytest==8.0.0
pytest-cov==4.1.0
pytest-xdist==3.5.0
numba

# Need moto.server to mock s3fs - see: https://github.com/aio-libs/aiobotocore/issues/755
moto[s3]==5.0.0