-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement {Series,DataFrame}GroupBy fillna
methods
#8869
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pavithraes, this is looking great! The implementation is looking good to me, I think we could get some more coverage of the optional args, but otherwise I think this looks close.
I also notice that there are forwardfill
and backfill
aliases in pandas, we could think about adding those as well while we're at it.
601d509
to
eff9714
Compare
@ian-r-rose Thanks for the review!
Looks like I also noticed the Full tracebackddf.groupby("A").fillna(0, axis=1)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
545 try:
--> 546 return cls._AXIS_TO_AXIS_NUMBER[axis]
547 except KeyError:
KeyError: 1
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
1316 try:
-> 1317 path, res = self._choose_path(fast_path, slow_path, group)
1318 except TypeError:
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _choose_path(self, fast_path, slow_path, group)
1393 path = slow_path
-> 1394 res = slow_path(group)
1395
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in <lambda>(group)
1386 fast_path = lambda group: func(group, *args, **kwargs)
-> 1387 slow_path = lambda group: group.apply(
1388 lambda x: func(x, *args, **kwargs), axis=self.axis
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
8739 )
-> 8740 return op.apply()
8741
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply(self)
687
--> 688 return self.apply_standard()
689
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
811 def apply_standard(self):
--> 812 results, res_index = self.apply_series_generator()
813
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
827 # ignore SettingWithCopy here in case the user mutates
--> 828 results[i] = self.f(v)
829 if isinstance(results[i], ABCSeries):
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in <lambda>(x)
1387 slow_path = lambda group: group.apply(
-> 1388 lambda x: func(x, *args, **kwargs), axis=self.axis
1389 )
~/Developer/Dask/dask/dask/dataframe/groupby.py in _fillna_groups(self, groups, **kwargs)
1987 def _fillna_groups(self, groups, **kwargs):
-> 1988 return groups.fillna(**kwargs)
1989
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
310 )
--> 311 return func(*args, **kwargs)
312
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/series.py in fillna(self, value, method, axis, inplace, limit, downcast)
4815 ) -> Series | None:
-> 4816 return super().fillna(
4817 value=value,
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
6320 axis = 0
-> 6321 axis = self._get_axis_number(axis)
6322
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
547 except KeyError:
--> 548 raise ValueError(f"No axis named {axis} for object type {cls.__name__}")
549
ValueError: No axis named 1 for object type Series
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-4-f9b1a98697a0> in <module>
----> 1 ddf.groupby("A").fillna(0, axis=1)
~/Developer/Dask/dask/dask/dataframe/groupby.py in fillna(self, value, method, limit, axis)
1993 "groupby-fillna with value=dict/Series/DataFrame is currently not supported"
1994 )
-> 1995 meta = self._meta_nonempty.transform(
1996 self._fillna_groups, value=value, method=method, limit=limit, axis=axis
1997 )
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in transform(self, func, engine, engine_kwargs, *args, **kwargs)
1355 @Appender(_transform_template)
1356 def transform(self, func, *args, engine=None, engine_kwargs=None, **kwargs):
-> 1357 return self._transform(
1358 func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
1359 )
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in _transform(self, func, engine, engine_kwargs, *args, **kwargs)
1444
1445 if not isinstance(func, str):
-> 1446 return self._transform_general(func, *args, **kwargs)
1447
1448 elif func not in base.transform_kernel_allowlist:
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
1320 except ValueError as err:
1321 msg = "transform must return a scalar value for each group"
-> 1322 raise ValueError(msg) from err
1323
1324 if isinstance(res, Series):
ValueError: transform must return a scalar value for each group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pavithraes, this is looking close.
I did some experimenting with grouping by multiple columns, and ran into some troubles with transform
not liking empty groups very much, but apply
did just fine. I wonder if we should just simplify things by using apply
everywhere
Thanks for checking! I agree that we can use apply throughout. I'll also include a test for grouping by multiple columns. :) |
@pavithraes This change in upstream pandas might also affect what you are doing here. |
Thanks for the note! I just triggered an upstream build to check and it does seem to be affected. :/ |
@jrbourbeau Thanks for helping review this! To solve the upstream build issues, we need to add I think we can't add that right now because Dask's groupby-apply doesn't support the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rebased your PR on top of #8961 to see if it still works with the group_keys
changes, and it does 🎉 !
The bad news is: with the group_keys
changes I think there is one more thing we need to cover. I tried parameterizing your test_fillna
over group_keys=[True, False, None]
. Everything works on pandas 1.4.x
, but on 1.5.x
, group_keys=True
presents a problem: in that case, the apply
returns with extra multiindex columns for your grouped-by items.
I think the fix should be fairly straightforward, however. For the case where self.group_keys
is True
in the dd.GroupBy
object, we can do a final map_partitions
to df.droplevel(by)
. That is to say, we can easily drop the extra index columns at the end when group_keys
is True
. It's annoying, but doable
@ian-r-rose Thanks for the notes! I've pushed the |
I'm not sure why the macos-python-3.8 test is failing with: https://github.com/dask/dask/runs/6209266570?check_suite_focus=true#step:6:21560 :/ Maybe it's related to #8889 (comment)? Edit: Looks like it was a one-off. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to be slow in reviewing @pavithraes, and thanks for your persistence. This looks great! I have one minor comment that I would consider optional. But from my perspective, this is ready to go into our next release!
Co-authored-by: Ian Rose <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pavithraes!
@jrbourbeau I think this is ready to merge! Could you please help take a quick look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pavithraes and @ian-r-rose! There's a merge conflict that's popped up (apologies for the delayed response). @pavithraes would you mind fixing that conflict? Otherwise, this looks good to go
@jrbourbeau Thanks for taking a look! I've fixed the conflict. :) |
Thank you @pavithraes! |
Co-authored-by: Ian Rose <[email protected]>
ffill
#8708pre-commit run --all-files
TODO before merging: