docs(python): Expand NumPy section in the user guide with ufunc info (#13392)
Changes from all commits: a510ad1, 904a999, 1812f7f, 8a29f25, ce36292, 583bd83, b5ff692, db28fec, 67141a9, 56e443c
@@ -2,6 +2,7 @@ pandas
 pyarrow
 graphviz
 matplotlib
+numba
 seaborn
 plotly
 altair
@@ -0,0 +1,17 @@
+import polars as pl
+import numba as nb
+
+df = pl.DataFrame({"a": [10, 9, 8, 7]})
+
+
+@nb.guvectorize([(nb.int64[:], nb.int64, nb.int64[:])], "(n),()->(n)")
+def cum_sum_reset(x, y, res):
+    res[0] = x[0]
+    for i in range(1, x.shape[0]):
+        res[i] = x[i] + res[i - 1]
+        if res[i] >= y:
+            res[i] = x[i]
+
+
+out = df.select(cum_sum_reset(pl.all(), 5))
+print(out)
@@ -1,9 +1,9 @@
-# Numpy
+# NumPy ufuncs

 Polars expressions support NumPy [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html). See [here](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs)
-for a list on all supported numpy functions.
+for a list of all supported NumPy functions. Additionally, SciPy offers a wide range of ufuncs; in particular, the [scipy.special](https://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special) namespace has ufunc versions of many (possibly most) of the functions available under `scipy.stats`.

-This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operation through the NumPy API.
+This means that if a function is not provided by Polars, we can use NumPy and still have fast columnar operations through the NumPy API. ufuncs have a hook that diverts their execution when one of the inputs is an instance of a class with an [`__array_ufunc__`](https://numpy.org/doc/stable/reference/arrays.classes.html#special-attributes-and-methods) method. The Polars `Expr` class has this method, which allows ufuncs to be used directly in a context (`select`, `with_columns`, `agg`) with the relevant expressions as inputs. This syntax extends to functions with multiple inputs.
 ### Example

@@ -13,10 +13,26 @@ This means that if a function is not provided by Polars, we can use NumPy and we
 --8<-- "python/user-guide/expressions/numpy-example.py"
 ```

+## Numba
+
+[Numba](https://numba.pydata.org/) is an open-source JIT compiler that allows you to create your own ufuncs entirely within Python. The key is the [@guvectorize](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-guvectorize-decorator) decorator. One popular use case is conditional cumulative functions. For example, suppose you want to take a cumulative sum that resets whenever it reaches a threshold.
+
+### Example
+
+{{code_block('user-guide/expressions/numba-example',api_functions=['DataFrame'])}}
+
+```python exec="on" result="text" session="user-guide/numpy"
+--8<-- "python/user-guide/expressions/numba-example.py"
+```
|
||
+### Interoperability
+
+Polars `Series` support NumPy universal functions (ufuncs). Element-wise functions such as `np.exp()`, `np.cos()`, `np.divide()`, etc. all work with almost zero overhead.
+
+However, as a Polars-specific remark: missing values are stored in a separate bitmask and are not visible to NumPy. This can lead to a window function or `np.convolve()` giving flawed or incomplete results.
+
+Convert a Polars `Series` to a NumPy array with the `.to_numpy()` method. Missing values will be replaced by `np.nan` during the conversion.
+### Note on Performance
+
+The speed of ufuncs comes from being vectorized and compiled. That said, there is no inherent benefit in using ufuncs just to avoid `map_batches`. As mentioned above, ufuncs use a hook that gives Polars the opportunity to run its own code before the ufunc is executed; in that way, Polars is still executing the ufunc with `map_batches`.
**Review comment:** true, but maybe, rather than avoiding

**Reply:** It turns out that all the NumPy and SciPy ufuncs are element-wise in the sense that they aren't aggregations. Where this becomes important is if someone wants to do a sum and then the `np.expm1` function. The `__array_ufunc__` hook will set `is_elementwise=True`; if it's a numba ufunc then it won't. Whether people should be taught to not use the hook is more philosophical, imo. I suppose there's slightly less parsing in that case, so it would be technically more performant, but using the hook is a nice syntax, imo.
@@ -18,8 +18,7 @@ These functions have an important distinction in how they operate and consequent

 A `map_batches` passes the `Series` backed by the `expression` as is.

-`map_batches` follows the same rules in both the `select` and the `group_by` context, this will
-mean that the `Series` represents a column in a `DataFrame`. Note that in the `group_by` context, that column is not yet
-aggregated!
+`map_batches` follows the same rules in both the `select` and the `group_by` context; this will
+mean that the `Series` represents a column in a `DataFrame`. To be clear, **using a `group_by` or `over` with `map_batches` will return results as though there was no group at all.**
**Review comment:** this isn't true anymore, is it?

```
In [18]: df.with_columns(result=pl.col('b').map_batches(lambda x: np.cumsum(x)).over('a'))
Out[18]:
shape: (3, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ result │
│ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ 1   ┆ 4   ┆ 4      │
│ 1   ┆ 5   ┆ 9      │
│ 2   ┆ 6   ┆ 6      │
└─────┴─────┴────────┘
```

**Reply:** You're right. I've lost quite a bit of steam on this.
 Use cases for `map_batches` are for instance passing the `Series` in an expression to a third-party library. Below we show how
 we could use `map_batches` to pass an expression column to a neural network model.
@@ -3,6 +3,7 @@
 numpy
 pandas
 pyarrow
+numba

 hypothesis==6.97.4
**Review comment:** I'm confused by this example - for these particular numbers, they're all above the threshold, so wouldn't the result just be the same as the input? Maybe run it with `cum_sum_reset(pl.all(), 30)`?
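To check the reviewer's point, here is the same reset logic as a plain-Python sketch (no numba, so it runs anywhere; `cum_sum_reset` mirrors the guvectorized kernel from the example file):

```python
def cum_sum_reset(xs, threshold):
    # Running sum that restarts from the current element once the
    # running total reaches the threshold, mirroring the numba kernel.
    res = [xs[0]]
    for x in xs[1:]:
        total = x + res[-1]
        res.append(x if total >= threshold else total)
    return res

# With threshold 5 every partial sum reaches the threshold, so the
# output equals the input, as the reviewer observed.
print(cum_sum_reset([10, 9, 8, 7], 5))   # [10, 9, 8, 7]

# With threshold 30 the reset triggers only at the last element.
print(cum_sum_reset([10, 9, 8, 7], 30))  # [10, 19, 27, 7]
```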