
feat: Series.hist #1859

Open · camriddell wants to merge 17 commits into main
Conversation

camriddell (Contributor)

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Narwhals expressions do not yet allow returning a DataFrame (or a struct), so .hist is implemented only at the Series level to stay consistent with the rest of the API.

The PyArrow implementation can likely be streamlined a bit more, so a review on that section would be appreciated.
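To make the proposed semantics concrete for reviewers, here is a plain-Python sketch of the explicit-`bins` path (a hypothetical helper, not the PR's code): intervals are right-closed, `breakpoint` is each bin's right edge, and `category` is its string label. The out-of-range handling is an assumption based on polars' behavior.

```python
def hist_with_bins(values, bins):
    # Sketch of Series.hist with explicit, monotonically increasing bins.
    # A value v lands in bin i when bins[i] < v <= bins[i + 1]; values
    # outside [bins[0], bins[-1]] are dropped (assumed, matching polars).
    counts = [
        sum(lo < v <= hi for v in values) for lo, hi in zip(bins, bins[1:])
    ]
    return {
        "breakpoint": list(bins[1:]),
        "category": [f"({lo}, {hi}]" for lo, hi in zip(bins, bins[1:])],
        "count": counts,
    }

print(hist_with_bins([1, 2, 3], [0, 1.5, 3]))
```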

@MarcoGorelli (Member)

wow, amazing!

> .hist is only implemented at the Series level to be compliant with the rest of the API.

agree, good design decision here. I think it would be quite awkward to add Expr.hist because:

  • struct dtype is not supported by default in pandas (and not at all before pandas 2.0; maybe 1.5, but not earlier)
  • it's a length-changing expression which doesn't make sense to aggregate (e.g. nw.col('a').unique().len() makes sense, but nw.col('a').hist().struct.field('value').sum() doesn't seem useful), so supporting this for pyspark / duckdb / ibis could be a real issue. Ibis does seem to have bucket, but there are no examples and I have no idea what it does

Hi @mscolnick - just wanted to check that Series.hist would still be useful to you?

@FBruzzesi (Member) left a comment

Thanks a ton @camriddell! I left a couple of comments and will go through the arrow implementation in more detail later today πŸš€

Review threads (outdated, resolved): narwhals/_pandas_like/series.py, narwhals/_arrow/series.py
@FBruzzesi added the enhancement (New feature or request) label on Jan 24, 2025
@mscolnick (Contributor)

awesome work @camriddell! yes, @MarcoGorelli, this is definitely valuable for us (marimo) and we would use it right away; thank you all for the efforts.

@FBruzzesi (Member) left a comment

Hey @camriddell, I still haven't wrapped my head around the arrow implementation, but I left some feedback on the other backends! I hope to get to the arrow part by the end of this week.

Comment on lines 4927 to 4929
if bins is None and bin_count is None:
msg = "must provide one of `bin_count` or `bins`"
raise InvalidOperationError(msg)
Member:

I couldn't find it in the docs, but if they are both None then 10 bins are used? At least that's what happens if I simply run it with no input

@camriddell (Contributor, Author), Jan 31, 2025:

Good catch, I made an assumption based on the default argument values of polars...hist(...).

Admittedly this behavior feels quite surprising:

>>> pl.Series([1,2,3]).hist(None, bin_count=None)
shape: (10, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ breakpoint ┆ category     ┆ count β”‚
β”‚ ---        ┆ ---          ┆ ---   β”‚
β”‚ f64        ┆ cat          ┆ u32   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ═══════║
β”‚ 1.2        ┆ (0.998, 1.2] ┆ 1     β”‚
β”‚ 1.4        ┆ (1.2, 1.4]   ┆ 0     β”‚
β”‚ 1.6        ┆ (1.4, 1.6]   ┆ 0     β”‚
β”‚ 1.8        ┆ (1.6, 1.8]   ┆ 0     β”‚
β”‚ 2.0        ┆ (1.8, 2.0]   ┆ 1     β”‚
β”‚ 2.2        ┆ (2.0, 2.2]   ┆ 0     β”‚
β”‚ 2.4        ┆ (2.2, 2.4]   ┆ 0     β”‚
β”‚ 2.6        ┆ (2.4, 2.6]   ┆ 0     β”‚
β”‚ 2.8        ┆ (2.6, 2.8]   ┆ 0     β”‚
β”‚ 3.0        ┆ (2.8, 3.0]   ┆ 1     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

Comment on lines 4930 to 4932
if bins is not None and bin_count is not None:
msg = "can only provide one of `bin_count` or `bins`"
raise InvalidOperationError(msg)
Member:

Polars raises a ComputeError for this case; we don't have that exception implemented yet, but this would be a good opportunity to add it and raise it here

Comment on lines 1036 to 1040
bins: list[float | int] | None = None,
*,
bin_count: int | None = None,
include_category: bool = True,
include_breakpoint: bool = True,
Member:

Remove default values in compliant implementations

Suggested change (drop the defaults):

bins: list[float | int] | None,
*,
bin_count: int | None,
include_category: bool,
include_breakpoint: bool,

ns = self.__native_namespace__()
data: dict[str, Sequence[int | float | str]]

if bin_count is not None and bin_count == 0:
Member:

I think it would be enough to check for equality with zero? (bins is implied to be None, and bin_count not None)

Suggested change:

if bin_count == 0:

Comment on lines 1062 to 1066
result = (
ns.cut(self._native_series, bins=bins if bin_count is None else bin_count)
.value_counts()
.sort_index()
)
Member:

TIL: we might just be able to use Series.value_counts with bins and sort specification:

Suggested change:

result = self._native_series.value_counts(
    bins=bins if bin_count is None else bin_count,
    sort=False,
)

and the rest of the logic to get breakpoint and category if needed.

Edit: I didn't check whether the option is available for cudf and modin. If it isn't, we might keep a separate path for those using what you currently implemented

@camriddell (Contributor, Author):

Yep, I had this break in Modin when passing .value_counts(..., sort=False). I also wasn't certain how backwards compatible the bins= parameter was, so I opted for an explicit cut. However, taking a quick look, it seems pandas has allowed .value_counts(bins=...) for a while now, so this would probably work if you think it is worth changing.

@camriddell (Contributor, Author):

Actually, going back through the code and tests: if one uses .value_counts directly, pandas seems to adjust the passed-in bins (which feels like incorrect behavior).

>>> import pandas as pd
>>> pd.__version__
'2.2.3'
>>> pd.Series([0, 1, 2, 3, 4, 5, 6]).value_counts(bins=[0.0, 2.5, 5.0, float('inf')], sort=False)
(-0.001, 2.5]    3 # this should be (0, 2.5] as it was explicitly defined in the `bins` argument
(2.5, 5.0]       3
(5.0, inf]       1
Name: count, dtype: int64
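Stripping pandas out of the picture, a strict right-closed count over the literal breaks (a small illustrative snippet, not library code) excludes the value 0 from (0.0, 2.5], whereas pandas' widened (-0.001, 2.5] bin captures it; that is where the extra count comes from:

```python
import math

data = [0, 1, 2, 3, 4, 5, 6]
breaks = [0.0, 2.5, 5.0, math.inf]

# Take the breaks literally: v lands in bin i when breaks[i] < v <= breaks[i+1].
# 0 satisfies no interval (0.0 < 0 is false), so it is dropped rather than
# swept into a silently widened first bin.
strict = [sum(lo < v <= hi for v in data) for lo, hi in zip(breaks, breaks[1:])]
print(strict)  # [2, 3, 1]; pandas reported counts 3, 3, 1
```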

Comment on lines 1010 to 1014
bins: list[float | int] | None = None,
*,
bin_count: int | None = None,
include_category: bool = True,
include_breakpoint: bool = True,
Member:

Suggested change (drop the defaults):

bins: list[float | int] | None,
*,
bin_count: int | None,
include_category: bool,
include_breakpoint: bool,

Comment on lines 421 to 425
bins: list[float | int] | None = None,
*,
bin_count: int | None = None,
include_category: bool = True,
include_breakpoint: bool = True,
Member:

And one more:

Suggested change (drop the defaults):

bins: list[float | int] | None,
*,
bin_count: int | None,
include_category: bool,
include_breakpoint: bool,

Comment on lines 430 to 435
# check for monotonicity, polars<1.0 does not do this.
if bins is not None:
for i in range(1, len(bins)):
if bins[i - 1] >= bins[i]:
msg = "bins must increase monotonically"
raise InvalidOperationError(msg)
Member:

Should we check this at the Narwhals level? Also, polars raises a ComputeError instead

@camriddell (Contributor, Author), Jan 31, 2025:

I wanted to leave this behavior up to each backend in case they have a fastpath for non-list types (though this is just speculation).

I am definitely open to moving this logic up a level so we can guarantee a uniform error for all backends. Would you like me to make this move?


Alternatively as an attempt to preserve fastpaths, I could check

if hasattr(bins, 'diff'): # use .diff
else: # python for-loop

Though this may be a bit of a "premature optimization" case.
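For reference, a Narwhals-level version of the check could stay a simple pairwise scan; a sketch (using ValueError as a stand-in for the library's exception type):

```python
def check_bins_monotonic(bins):
    # Reject breakpoints that do not strictly increase. A plain O(n)
    # pairwise scan; no backend fastpath needed for typical bin counts.
    if bins is not None and any(lo >= hi for lo, hi in zip(bins, bins[1:])):
        msg = "bins must increase monotonically"
        raise ValueError(msg)

check_bins_monotonic([0.0, 2.5, 5.0])  # ok: strictly increasing
check_bins_monotonic(None)             # ok: nothing to validate
```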

Review thread (outdated, resolved): narwhals/_polars/series.py
- move monotonicity check to narwhals level
- remove default arguments from compliant implementations
- add ComputeError; use in hist
- correct polars version specific hist behaviors
@camriddell (Contributor, Author)

Found a new breaking change, bisected to the polars 1.5 → 1.6 boundary, for bin_count=.... I'll manually calculate the bins for backwards compat.

❯ python -c 'import polars as pl; print(pl.__version__); print(pl.Series([1,2,3]).hist(bin_count=3))'
1.6.0
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ breakpoint ┆ category             ┆ count β”‚
β”‚ ---        ┆ ---                  ┆ ---   β”‚
β”‚ f64        ┆ cat                  ┆ u32   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════════════β•ͺ═══════║
β”‚ 1.666667   ┆ (0.998, 1.666667]    ┆ 1     β”‚
β”‚ 2.333333   ┆ (1.666667, 2.333333] ┆ 1     β”‚
β”‚ 3.0        ┆ (2.333333, 3.0]      ┆ 1     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
❯ python -c 'import polars as pl; print(pl.__version__); print(pl.Series([1,2,3]).hist(bin_count=3))'
1.5.0
shape: (4, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ breakpoint ┆ category             ┆ count β”‚
β”‚ ---        ┆ ---                  ┆ ---   β”‚
β”‚ f64        ┆ cat                  ┆ u32   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════════════β•ͺ═══════║
β”‚ 0.0        ┆ (-inf, 0.0]          ┆ 0     β”‚
β”‚ 1.333333   ┆ (0.0, 1.333333]      ┆ 1     β”‚
β”‚ 2.666667   ┆ (1.333333, 2.666667] ┆ 1     β”‚
β”‚ inf        ┆ (2.666667, inf]      ┆ 1     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
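For the manual fallback, one way to reproduce the polars >= 1.6 layout could look like this sketch; the 0.1%-of-range nudge on the lowest edge is inferred from the printed (0.998, ...] interval above, not read from polars source:

```python
def uniform_bin_edges(values, bin_count):
    # Equal-width edges over [min, max], computed as offsets from the
    # minimum rather than by repeatedly adding a width.
    lo, hi = min(values), max(values)
    span = hi - lo
    edges = [lo + span * i / bin_count for i in range(bin_count + 1)]
    # Nudge the lowest edge down (0.1% of the range, inferred from the
    # output above) so the minimum lands in the first right-closed bin.
    edges[0] = lo - span * 0.001
    return edges

print(uniform_bin_edges([1, 2, 3], 3))
# roughly [0.998, 1.666667, 2.333333, 3.0], matching the 1.6.0 output above
```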

@FBruzzesi (Member)

Amazing discoveries @camriddell , I am so sorry you need to take care of all these edge cases πŸ™ˆ

@MarcoGorelli (Member)

awesome, thanks both!

CI logs show

FAILED tests/series_only/hist_test.py::test_hist_count[pyarrow-False-True-params1] - pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.DoubleScalar: -0.006> with type pyarrow.lib.DoubleScalar: did not recognize Python value type when inferring an Arrow data type

if it's too much bother to fix, we could just set a minimum pyarrow version for this feature, no big deal

@MarcoGorelli (Member) left a comment

Very impressive, thanks @camriddell !

I think we just need a little docstring example, then this should be good to go? I can add it later

Review threads (outdated, resolved): narwhals/_arrow/series.py (two threads)
@MarcoGorelli (Member)

I took a closer look and have a couple more comments, sorry

For .hist() without arguments, it looks like pyarrow produces different results, is this expected?

In [25]: nw.from_native(pd.Series([1, 3, 8, 8, 2, 1, 3], name='a'), allow_series=True).hist()
Out[25]: 
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
|        Narwhals DataFrame        |
|----------------------------------|
|   breakpoint      category  count|
|0         1.7  (0.993, 1.7]      2|
|1         2.4    (1.7, 2.4]      1|
|2         3.1    (2.4, 3.1]      2|
|3         3.8    (3.1, 3.8]      0|
|4         4.5    (3.8, 4.5]      0|
|5         5.2    (4.5, 5.2]      0|
|6         5.9    (5.2, 5.9]      0|
|7         6.6    (5.9, 6.6]      0|
|8         7.3    (6.6, 7.3]      0|
|9         8.0    (7.3, 8.0]      2|
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

In [26]: nw.from_native(pa.chunked_array([pl.Series(values=[1, 3, 8, 8, 2, 1, 3], name='a').to_arrow()]), allow_series=True).hist().to_polars()
    ...: 
Out[26]: 
shape: (10, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ breakpoint ┆ category                        ┆ count β”‚
β”‚ ---        ┆ ---                             ┆ ---   β”‚
β”‚ f64        ┆ str                             ┆ i64   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════════════════════════β•ͺ═══════║
β”‚ 0.7        ┆ (-0.007, 0.7]                   ┆ 2     β”‚
β”‚ 1.4        ┆ (0.7, 1.4]                      ┆ 1     β”‚
β”‚ 2.1        ┆ (1.4, 2.0999999999999996]       ┆ 2     β”‚
β”‚ 2.8        ┆ (2.0999999999999996, 2.8]       ┆ 0     β”‚
β”‚ 3.5        ┆ (2.8, 3.5]                      ┆ 0     β”‚
β”‚ 4.2        ┆ (3.5, 4.2]                      ┆ 0     β”‚
β”‚ 4.9        ┆ (4.199999999999999, 4.89999999… ┆ 0     β”‚
β”‚ 5.6        ┆ (4.8999999999999995, 5.6]       ┆ 0     β”‚
β”‚ 6.3        ┆ (5.6, 6.3]                      ┆ 0     β”‚
β”‚ 7.0        ┆ (6.3, 7.0]                      ┆ 2     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

Second, I'm not totally sure about the category column being a categorical in Polars in the first place (pola-rs/polars#18645). Do we need it at all, though? We could start by not including that argument and column, and add it back later if necessary

@camriddell (Contributor, Author)

> For .hist() without arguments, it looks like pyarrow produces different results, is this expected?
Yep, I know the culprit here (I forgot to add the minimum value back in when performing a bin calculation). I'll add this as a property test as well.
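On the drifting labels like (1.4, 2.0999999999999996] in the pyarrow output: that is a classic floating-point accumulation artifact. A pyarrow-independent sketch of the difference between accumulating the width and scaling from the span:

```python
# For min=0, max=7 and 10 bins, the width is 0.7, which is not exactly
# representable in binary floating point.
n, span = 10, 7.0
width = span / n

# Edge built by repeated addition: rounding error accumulates.
accumulated = 0.0
for _ in range(3):
    accumulated += width

# Edge built by scaling from the span: one rounding step, no drift.
scaled = span * 3 / n

print(accumulated)  # 2.0999999999999996
print(scaled)       # 2.1
```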

> Second, I'm not totally sure about the categories column being a categorical in Polars in the first place. Do we need it at all, though?

If I understood some of the discussion correctly, aren't we also in trouble keeping the "breakpoint" column around? It seems like the new result may be just the counts as an integer, with the remaining information in a struct.

IMO the results would be more usable as a struct, but then we lose the closedness information: the string representation encodes four pieces of information (left open/closed, right open/closed, left value, right value), while a struct of this fashion (or a pandas.IntervalIndex) loses two of them. I'll think on this some more.

Labels: enhancement (New feature or request) · 4 participants