
feat: Series.hist #1859

Open · camriddell wants to merge 17 commits into main
Conversation

camriddell (Contributor)

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Narwhals expressions do not yet allow returning a DataFrame (or a struct), so .hist is implemented only at the Series level to stay consistent with the rest of the API.

The PyArrow implementation can likely be streamlined a bit more, so a review on that section would be appreciated.
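To make the proposed semantics concrete for reviewers, here is a plain-Python sketch of the explicit-`bins` path (a hypothetical helper, not the PR's code): intervals are right-closed, `breakpoint` is each bin's right edge, and `category` is its string label. The out-of-range handling is an assumption based on polars' behavior.

```python
def hist_with_bins(values, bins):
    # Sketch of Series.hist with explicit, monotonically increasing bins.
    # A value v lands in bin i when bins[i] < v <= bins[i + 1]; values
    # outside [bins[0], bins[-1]] are dropped (assumed, matching polars).
    counts = [
        sum(lo < v <= hi for v in values) for lo, hi in zip(bins, bins[1:])
    ]
    return {
        "breakpoint": list(bins[1:]),
        "category": [f"({lo}, {hi}]" for lo, hi in zip(bins, bins[1:])],
        "count": counts,
    }

print(hist_with_bins([1, 2, 3], [0, 1.5, 3]))
```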

@MarcoGorelli (Member)

wow, amazing!

> .hist is only implemented at the Series level to be compliant with the rest of the API.

agree, good design decision here. I think it would be quite awkward to add Expr.hist because:

  • struct dtype is not supported by default in pandas (and not at all before pandas 2.0; maybe 1.5, but not earlier)
  • it's a length-changing expression which doesn't make sense to aggregate (e.g. nw.col('a').unique().len() makes sense, but nw.col('a').hist().struct.field('value').sum() doesn't seem useful), so supporting this for pyspark / duckdb / ibis could be a real issue. Ibis does seem to have bucket, but there are no examples and I have no idea what it does

Hi @mscolnick - just wanted to check that Series.hist would still be useful to you?

@FBruzzesi (Member) left a comment

Thanks a ton @camriddell! I left a couple of comments and will go through the arrow implementation in more detail later today πŸš€

Review threads (outdated, resolved): narwhals/_pandas_like/series.py, narwhals/_arrow/series.py
@FBruzzesi added the enhancement (New feature or request) label on Jan 24, 2025
@mscolnick (Contributor)

awesome work @camriddell! yes, @MarcoGorelli, this is definitely valuable for us (marimo) and we would use it right away; thank you all for the efforts.

@FBruzzesi (Member) left a comment

Hey @camriddell, I still haven't wrapped my head around the arrow implementation, but I left some feedback on the other backends! I hope to get to the arrow part by the end of this week.

Comment on lines 4927 to 4929
if bins is None and bin_count is None:
msg = "must provide one of `bin_count` or `bins`"
raise InvalidOperationError(msg)
Member:

I couldn't find it in the docs, but if they are both None then 10 bins are used? At least that's what happens if I simply run it with no input

@camriddell (Contributor, Author), Jan 31, 2025:

Good catch, I made an assumption based on the default argument values of polars...hist(...).

Admittedly this behavior feels quite surprising:

>>> pl.Series([1,2,3]).hist(None, bin_count=None)
shape: (10, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ breakpoint ┆ category     ┆ count β”‚
β”‚ ---        ┆ ---          ┆ ---   β”‚
β”‚ f64        ┆ cat          ┆ u32   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ═══════║
β”‚ 1.2        ┆ (0.998, 1.2] ┆ 1     β”‚
β”‚ 1.4        ┆ (1.2, 1.4]   ┆ 0     β”‚
β”‚ 1.6        ┆ (1.4, 1.6]   ┆ 0     β”‚
β”‚ 1.8        ┆ (1.6, 1.8]   ┆ 0     β”‚
β”‚ 2.0        ┆ (1.8, 2.0]   ┆ 1     β”‚
β”‚ 2.2        ┆ (2.0, 2.2]   ┆ 0     β”‚
β”‚ 2.4        ┆ (2.2, 2.4]   ┆ 0     β”‚
β”‚ 2.6        ┆ (2.4, 2.6]   ┆ 0     β”‚
β”‚ 2.8        ┆ (2.6, 2.8]   ┆ 0     β”‚
β”‚ 3.0        ┆ (2.8, 3.0]   ┆ 1     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

Comment on lines 4930 to 4932
if bins is not None and bin_count is not None:
msg = "can only provide one of `bin_count` or `bins`"
raise InvalidOperationError(msg)
Member:

Polars raises a ComputeError for this case; we don't have that exception implemented yet, but this would be a good opportunity to add it and raise it here

Comment on lines 1036 to 1040
bins: list[float | int] | None = None,
*,
bin_count: int | None = None,
include_category: bool = True,
include_breakpoint: bool = True,
Member:

Remove default values in compliant implementations

Suggested change (drop the defaults):

bins: list[float | int] | None,
*,
bin_count: int | None,
include_category: bool,
include_breakpoint: bool,

ns = self.__native_namespace__()
data: dict[str, Sequence[int | float | str]]

if bin_count is not None and bin_count == 0:
Member:

I think it would be enough to check for equality with zero? (bins is implied to be None, and bin_count not None)

Suggested change:

if bin_count == 0:

Comment on lines 1062 to 1066
result = (
ns.cut(self._native_series, bins=bins if bin_count is None else bin_count)
.value_counts()
.sort_index()
)
Member:

TIL: we might just be able to use Series.value_counts with bins and sort specification:

Suggested change:

result = self._native_series.value_counts(
    bins=bins if bin_count is None else bin_count,
    sort=False,
)

and the rest of the logic to get breakpoint and category if needed.

Edit: I didn't check whether the option is available for cudf and modin. If it isn't, we might keep a separate path for those using what you currently implemented

@camriddell (Contributor, Author):

Yep, I had this break in Modin when passing .value_counts(..., sort=False). I also wasn't certain how backwards compatible the bins= parameter was, so I opted for an explicit cut. However, taking a quick look, it seems pandas has allowed .value_counts(bins=...) for a while now, so this would probably work if you think it is worth changing.

@camriddell (Contributor, Author):

Actually, going back through the code and tests: if one uses .value_counts directly, pandas seems to adjust the passed-in bins (which feels like incorrect behavior).

>>> import pandas as pd
>>> pd.__version__
'2.2.3'
>>> pd.Series([0, 1, 2, 3, 4, 5, 6]).value_counts(bins=[0.0, 2.5, 5.0, float('inf')], sort=False)
(-0.001, 2.5]    3 # this should be (0, 2.5] as it was explicitly defined in the `bins` argument
(2.5, 5.0]       3
(5.0, inf]       1
Name: count, dtype: int64
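Stripping pandas out of the picture, a strict right-closed count over the literal breaks (a small illustrative snippet, not library code) excludes the value 0 from (0.0, 2.5], whereas pandas' widened (-0.001, 2.5] bin captures it; that is where the extra count comes from:

```python
import math

data = [0, 1, 2, 3, 4, 5, 6]
breaks = [0.0, 2.5, 5.0, math.inf]

# Take the breaks literally: v lands in bin i when breaks[i] < v <= breaks[i+1].
# 0 satisfies no interval (0.0 < 0 is false), so it is dropped rather than
# swept into a silently widened first bin.
strict = [sum(lo < v <= hi for v in data) for lo, hi in zip(breaks, breaks[1:])]
print(strict)  # [2, 3, 1]; pandas reported counts 3, 3, 1
```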

Comment on lines 1010 to 1014
bins: list[float | int] | None = None,
*,
bin_count: int | None = None,
include_category: bool = True,
include_breakpoint: bool = True,
Member:

Suggested change (drop the defaults):

bins: list[float | int] | None,
*,
bin_count: int | None,
include_category: bool,
include_breakpoint: bool,

Comment on lines 421 to 425
bins: list[float | int] | None = None,
*,
bin_count: int | None = None,
include_category: bool = True,
include_breakpoint: bool = True,
Member:

And one more:

Suggested change (drop the defaults):

bins: list[float | int] | None,
*,
bin_count: int | None,
include_category: bool,
include_breakpoint: bool,

Comment on lines 430 to 435
# check for monotonicity, polars<1.0 does not do this.
if bins is not None:
for i in range(1, len(bins)):
if bins[i - 1] >= bins[i]:
msg = "bins must increase monotonically"
raise InvalidOperationError(msg)
Member:

Should we check this at the Narwhals level? Also, polars raises a ComputeError instead

@camriddell (Contributor, Author), Jan 31, 2025:

I wanted to leave this behavior up to each backend in case they have a fastpath for non-list types (though this is just speculation).

I am definitely open to moving this logic up a level so we can guarantee a uniform error for all backends. Would you like me to make this move?


Alternatively as an attempt to preserve fastpaths, I could check

if hasattr(bins, 'diff'): # use .diff
else: # python for-loop

Though this may be a bit of a "premature optimization" case.
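For reference, a Narwhals-level version of the check could stay a simple pairwise scan; a sketch (using ValueError as a stand-in for the library's exception type):

```python
def check_bins_monotonic(bins):
    # Reject breakpoints that do not strictly increase. A plain O(n)
    # pairwise scan; no backend fastpath needed for typical bin counts.
    if bins is not None and any(lo >= hi for lo, hi in zip(bins, bins[1:])):
        msg = "bins must increase monotonically"
        raise ValueError(msg)

check_bins_monotonic([0.0, 2.5, 5.0])  # ok: strictly increasing
check_bins_monotonic(None)             # ok: nothing to validate
```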

Review thread (outdated, resolved): narwhals/_polars/series.py
- move monotonicity check to narwhals level
- remove default arguments from compliant implementations
- add ComputeError; use in hist
- correct polars version specific hist behaviors
@camriddell (Contributor, Author)

Found a new breaking change, bisected to the polars 1.5 → 1.6 boundary, for bin_count=.... I'll manually calculate the bins for backwards compat.

❯ python -c 'import polars as pl; print(pl.__version__); print(pl.Series([1,2,3]).hist(bin_count=3))'
1.6.0
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ breakpoint ┆ category             ┆ count β”‚
β”‚ ---        ┆ ---                  ┆ ---   β”‚
β”‚ f64        ┆ cat                  ┆ u32   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════════════β•ͺ═══════║
β”‚ 1.666667   ┆ (0.998, 1.666667]    ┆ 1     β”‚
β”‚ 2.333333   ┆ (1.666667, 2.333333] ┆ 1     β”‚
β”‚ 3.0        ┆ (2.333333, 3.0]      ┆ 1     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
❯ python -c 'import polars as pl; print(pl.__version__); print(pl.Series([1,2,3]).hist(bin_count=3))'
1.5.0
shape: (4, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ breakpoint ┆ category             ┆ count β”‚
β”‚ ---        ┆ ---                  ┆ ---   β”‚
β”‚ f64        ┆ cat                  ┆ u32   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════════════β•ͺ═══════║
β”‚ 0.0        ┆ (-inf, 0.0]          ┆ 0     β”‚
β”‚ 1.333333   ┆ (0.0, 1.333333]      ┆ 1     β”‚
β”‚ 2.666667   ┆ (1.333333, 2.666667] ┆ 1     β”‚
β”‚ inf        ┆ (2.666667, inf]      ┆ 1     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
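For the manual fallback, one way to reproduce the polars >= 1.6 layout could look like this sketch; the 0.1%-of-range nudge on the lowest edge is inferred from the printed (0.998, ...] interval above, not read from polars source:

```python
def uniform_bin_edges(values, bin_count):
    # Equal-width edges over [min, max], computed as offsets from the
    # minimum rather than by repeatedly adding a width.
    lo, hi = min(values), max(values)
    span = hi - lo
    edges = [lo + span * i / bin_count for i in range(bin_count + 1)]
    # Nudge the lowest edge down (0.1% of the range, inferred from the
    # output above) so the minimum lands in the first right-closed bin.
    edges[0] = lo - span * 0.001
    return edges

print(uniform_bin_edges([1, 2, 3], 3))
# roughly [0.998, 1.666667, 2.333333, 3.0], matching the 1.6.0 output above
```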

@FBruzzesi (Member)

Amazing discoveries @camriddell , I am so sorry you need to take care of all these edge cases πŸ™ˆ

@MarcoGorelli (Member)

awesome, thanks both!

CI logs show

FAILED tests/series_only/hist_test.py::test_hist_count[pyarrow-False-True-params1] - pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.DoubleScalar: -0.006> with type pyarrow.lib.DoubleScalar: did not recognize Python value type when inferring an Arrow data type

if it's too much bother to fix, we could just set a minimum pyarrow version for this feature, no big deal

@MarcoGorelli (Member) left a comment

Very impressive, thanks @camriddell !

I think we just need a little docstring example, then this should be good to go? I can add it later

Review threads (outdated, resolved): narwhals/_arrow/series.py (two threads)
@MarcoGorelli (Member)

I took a closer look and have a couple more comments, sorry

For .hist() without arguments, it looks like pyarrow produces different results, is this expected?

In [25]: nw.from_native(pd.Series([1, 3, 8, 8, 2, 1, 3], name='a'), allow_series=True).hist()
Out[25]: 
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
|        Narwhals DataFrame        |
|----------------------------------|
|   breakpoint      category  count|
|0         1.7  (0.993, 1.7]      2|
|1         2.4    (1.7, 2.4]      1|
|2         3.1    (2.4, 3.1]      2|
|3         3.8    (3.1, 3.8]      0|
|4         4.5    (3.8, 4.5]      0|
|5         5.2    (4.5, 5.2]      0|
|6         5.9    (5.2, 5.9]      0|
|7         6.6    (5.9, 6.6]      0|
|8         7.3    (6.6, 7.3]      0|
|9         8.0    (7.3, 8.0]      2|
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

In [26]: nw.from_native(pa.chunked_array([pl.Series(values=[1, 3, 8, 8, 2, 1, 3], name='a').to_arrow()]), allow_series=True).hist().to_polars()
    ...: 
Out[26]: 
shape: (10, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ breakpoint ┆ category                        ┆ count β”‚
β”‚ ---        ┆ ---                             ┆ ---   β”‚
β”‚ f64        ┆ str                             ┆ i64   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════════════════════════β•ͺ═══════║
β”‚ 0.7        ┆ (-0.007, 0.7]                   ┆ 2     β”‚
β”‚ 1.4        ┆ (0.7, 1.4]                      ┆ 1     β”‚
β”‚ 2.1        ┆ (1.4, 2.0999999999999996]       ┆ 2     β”‚
β”‚ 2.8        ┆ (2.0999999999999996, 2.8]       ┆ 0     β”‚
β”‚ 3.5        ┆ (2.8, 3.5]                      ┆ 0     β”‚
β”‚ 4.2        ┆ (3.5, 4.2]                      ┆ 0     β”‚
β”‚ 4.9        ┆ (4.199999999999999, 4.89999999… ┆ 0     β”‚
β”‚ 5.6        ┆ (4.8999999999999995, 5.6]       ┆ 0     β”‚
β”‚ 6.3        ┆ (5.6, 6.3]                      ┆ 0     β”‚
β”‚ 7.0        ┆ (6.3, 7.0]                      ┆ 2     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

Second, I'm not totally sure about the category column being a categorical in Polars in the first place (pola-rs/polars#18645). Do we need it at all, though? We could start by not including that argument and column, and add it back later if necessary

@camriddell (Contributor, Author)

> For .hist() without arguments, it looks like pyarrow produces different results, is this expected?
Yep, I know the culprit here (I forgot to add the minimum value back in when performing a bin calculation). I'll add this as a property test as well.
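On the drifting labels like (1.4, 2.0999999999999996] in the pyarrow output: that is a classic floating-point accumulation artifact. A pyarrow-independent sketch of the difference between accumulating the width and scaling from the span:

```python
# For min=0, max=7 and 10 bins, the width is 0.7, which is not exactly
# representable in binary floating point.
n, span = 10, 7.0
width = span / n

# Edge built by repeated addition: rounding error accumulates.
accumulated = 0.0
for _ in range(3):
    accumulated += width

# Edge built by scaling from the span: one rounding step, no drift.
scaled = span * 3 / n

print(accumulated)  # 2.0999999999999996
print(scaled)       # 2.1
```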

> Second, I'm not totally sure about the categories column being a categorical in Polars in the first place. Do we need it at all, though?

If I understood some of the discussion correctly, aren't we also in trouble keeping the "breakpoint" column around? It seems like the new result may be just the counts as an integer, with the remaining information in a struct.

IMO the results would be more usable as a struct, but then we lose the closedness information: the string representation encodes four pieces of information (left open/closed, right open/closed, left value, right value), while a struct of this fashion (or a pandas.IntervalIndex) loses two of them. I'll think on this some more.

Labels: enhancement (New feature or request) · 4 participants