-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Series.hist #1859
base: main
Are you sure you want to change the base?
feat: Series.hist #1859
Conversation
wow, amazing!
agree, good design decision here, I think it would be quite awkward to add Expr.hist because:
Hi @mscolnick - just wanted to check that Series.hist would still be useful to you? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a ton @camriddell ! I left a couple of comments and will go through the arrow implementation in more details later today π
awesome work @camriddell! yes, @MarcoGorelli this is definitely valuable for us (marimo) and would use it right away, thank you all for the efforts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @camriddell , I still didn't put my head around the arrow implementation. Yet I left some feedbacks on the other backends!
I hope to make it by end of this week to arrow
narwhals/series.py
Outdated
if bins is None and bin_count is None: | ||
msg = "must provide one of `bin_count` or `bins`" | ||
raise InvalidOperationError(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find in the docs, but if they are both None then 10 bins are used? At least that's what happens if I simply run it with no input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I made an assumption based on the default function values of polars...hist(...)
.
Admittedly this behavior feels quite surprising:
>>> pl.Series([1,2,3]).hist(None, bin_count=None)
shape: (10, 3)
ββββββββββββββ¬βββββββββββββββ¬ββββββββ
β breakpoint β category β count β
β --- β --- β --- β
β f64 β cat β u32 β
ββββββββββββββͺβββββββββββββββͺββββββββ‘
β 1.2 β (0.998, 1.2] β 1 β
β 1.4 β (1.2, 1.4] β 0 β
β 1.6 β (1.4, 1.6] β 0 β
β 1.8 β (1.6, 1.8] β 0 β
β 2.0 β (1.8, 2.0] β 1 β
β 2.2 β (2.0, 2.2] β 0 β
β 2.4 β (2.2, 2.4] β 0 β
β 2.6 β (2.4, 2.6] β 0 β
β 2.8 β (2.6, 2.8] β 0 β
β 3.0 β (2.8, 3.0] β 1 β
ββββββββββββββ΄βββββββββββββββ΄ββββββββ
narwhals/series.py
Outdated
if bins is not None and bin_count is not None: | ||
msg = "can only provide one of `bin_count` or `bins`" | ||
raise InvalidOperationError(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Polars raises a ComputeError
for this case, we still don't have it implemented, but it would be a good opportunity to add it and raise such exception
narwhals/_pandas_like/series.py
Outdated
bins: list[float | int] | None = None, | ||
*, | ||
bin_count: int | None = None, | ||
include_category: bool = True, | ||
include_breakpoint: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove default values in compliant implementations
bins: list[float | int] | None = None, | |
*, | |
bin_count: int | None = None, | |
include_category: bool = True, | |
include_breakpoint: bool = True, | |
bins: list[float | int] | None, | |
*, | |
bin_count: int | None, | |
include_category: bool, | |
include_breakpoint: bool, |
narwhals/_pandas_like/series.py
Outdated
ns = self.__native_namespace__() | ||
data: dict[str, Sequence[int | float | str]] | ||
|
||
if bin_count is not None and bin_count == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be enough to check for equality with zero? (bins
is implied to be None
, and bin_count
to be not None
)
if bin_count is not None and bin_count == 0: | |
if bin_count == 0: |
narwhals/_pandas_like/series.py
Outdated
result = ( | ||
ns.cut(self._native_series, bins=bins if bin_count is None else bin_count) | ||
.value_counts() | ||
.sort_index() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL: we might just be able to use Series.value_counts
with bins
and sort
specification:
result = ( | |
ns.cut(self._native_series, bins=bins if bin_count is None else bin_count) | |
.value_counts() | |
.sort_index() | |
) | |
result = ( | |
self._native_series | |
.value_counts( | |
bins=bins if bin_count is None else bin_count, | |
sort=False | |
) | |
) |
and the remaining of the logic to get breakpoint and category if needed.
Edit I didn't check if the option is available for cudf and modin. If it isn't, we might have a different path for those with what you currently implemented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I had this break in Modin when passing .value_counts(..., sort=False)
. Then I wasn't certain how backwards compatible the bins=
parameter so I opted to go with using an explicit cut
. However taking a quick look it seems that pandas
has allowed .value_counts(bins=...)
for a while now so this would probably work if you think it is worth changing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, going back through the code and tests- if one directly uses .value_counts
pandas seems to adjust the passed in bins (which feels like incorrect behavior).
>>> import pandas as pd
>>> pd.__version__
'2.2.3'
>>> pd.Series([0, 1, 2, 3, 4, 5, 6]).value_counts(bins=[0.0, 2.5, 5.0, float('inf')], sort=False)
(-0.001, 2.5] 3 # this should be (0, 2.5] as it was explicitly defined in the `bins` argument
(2.5, 5.0] 3
(5.0, inf] 1
Name: count, dtype: int64
narwhals/_arrow/series.py
Outdated
bins: list[float | int] | None = None, | ||
*, | ||
bin_count: int | None = None, | ||
include_category: bool = True, | ||
include_breakpoint: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bins: list[float | int] | None = None, | |
*, | |
bin_count: int | None = None, | |
include_category: bool = True, | |
include_breakpoint: bool = True, | |
bins: list[float | int] | None, | |
*, | |
bin_count: int | None, | |
include_category: bool, | |
include_breakpoint: bool, |
narwhals/_polars/series.py
Outdated
bins: list[float | int] | None = None, | ||
*, | ||
bin_count: int | None = None, | ||
include_category: bool = True, | ||
include_breakpoint: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And one more:
bins: list[float | int] | None = None, | |
*, | |
bin_count: int | None = None, | |
include_category: bool = True, | |
include_breakpoint: bool = True, | |
bins: list[float | int] | None, | |
*, | |
bin_count: int | None, | |
include_category: bool, | |
include_breakpoint: bool, |
narwhals/_polars/series.py
Outdated
# check for monotonicity, polars<1.0 does not do this. | ||
if bins is not None: | ||
for i in range(1, len(bins)): | ||
if bins[i - 1] >= bins[i]: | ||
msg = "bins must increase monotonically" | ||
raise InvalidOperationError(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check this at Narwhals level? Also polars raises a ComputeError
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to leave this behavior up to each backend in case they have a fastpath for non-list types (though this is just speculation).
I am definitely open to moving this logic into a more superficial level so we can guarantee a uniform error for all backends. Would you like me to do this move?
Alternatively as an attempt to preserve fastpaths, I could check
if hasattr(bins, 'diff'): # use .diff
else: # python for-loop
Though this may be a bit of a "pre-mature optimization" case.
- move monotonicity check to narwhals level - remove default arguments from compliant implementations - add ComputeError; use in hist - correct polars version specific hist behaviors
Found a new breaking change that is bisected on Polars version 1.5 for β― python -c 'import polars as pl; print(pl.__version__); print(pl.Series([1,2,3]).hist(bin_count=3))'
1.6.0
shape: (3, 3)
ββββββββββββββ¬βββββββββββββββββββββββ¬ββββββββ
β breakpoint β category β count β
β --- β --- β --- β
β f64 β cat β u32 β
ββββββββββββββͺβββββββββββββββββββββββͺββββββββ‘
β 1.666667 β (0.998, 1.666667] β 1 β
β 2.333333 β (1.666667, 2.333333] β 1 β
β 3.0 β (2.333333, 3.0] β 1 β
ββββββββββββββ΄βββββββββββββββββββββββ΄ββββββββ β― python -c 'import polars as pl; print(pl.__version__); print(pl.Series([1,2,3]).hist(bin_count=3))'
1.5.0
shape: (4, 3)
ββββββββββββββ¬βββββββββββββββββββββββ¬ββββββββ
β breakpoint β category β count β
β --- β --- β --- β
β f64 β cat β u32 β
ββββββββββββββͺβββββββββββββββββββββββͺββββββββ‘
β 0.0 β (-inf, 0.0] β 0 β
β 1.333333 β (0.0, 1.333333] β 1 β
β 2.666667 β (1.333333, 2.666667] β 1 β
β inf β (2.666667, inf] β 1 β
ββββββββββββββ΄βββββββββββββββββββββββ΄ββββββββ |
Amazing discoveries @camriddell , I am so sorry you need to take care of all these edge cases π |
awesome, thanks both! CI logs show
if it's too much bother to fix, we could just set a minimum pyarrow version for this feature, no big deal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very impressive, thanks @camriddell !
I think we just need a little docstring example, then this can be good to go? I can add this later
I took a closer look, and have a couple of more comments, sorry For
Second, I'm not totally sure about the categories column being a categorical in Polars in the first place (pola-rs/polars#18645). Do we need it all though? We could always start by just not including that argument and column, and can always add it back later if necessary |
Yep, I know the culprit here (forgot to add the minimum value back in when performing a bin calculation)- I'll add this as a property test as well.
If I understood some of the discussion going on- aren't we also in trouble by keeping the IMO, the results would be more usable as a struct, but then we lose the closedness information where the string representation encodes 4 pieces of information, left open/closed, right open/closed, left value, right value and a struct of this fashion (or |
What type of PR is this? (check all applicable)
Related issues
narwhals.Expr.hist
Β #1561Checklist
If you have comments or can explain your changes, please do so below
Narwhals Expressions do not yet allow one to return a DataFrame (or a struct), so
.hist
is only implemented at the Series level to be compliant with the rest of the API.The PyArrow implementation can likely be streamlined a bit more, so a review on that section would be appreciated.