Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#6367: Enable support for 'groupby.size()' in reshuffling groupby #6370

Merged
merged 3 commits into from
Aug 10, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 9 additions & 1 deletion modin/core/storage_formats/pandas/query_compiler.py
Original file line number Diff line number Diff line change
Expand Up @@ -3460,7 +3460,15 @@ def _groupby_shuffle(
original_agg_func = agg_func

def agg_func(grp, *args, **kwargs):
return agg_method(grp, original_agg_func, *args, **kwargs)
result = agg_method(grp, original_agg_func, *args, **kwargs)

# Convert Series to DataFrame
if result.ndim == 1:
result = result.to_frame(
MODIN_UNNAMED_SERIES_LABEL if result.name is None else result.name
)

return result

result = obj._modin_frame.groupby(
axis=axis,
Expand Down
12 changes: 2 additions & 10 deletions modin/pandas/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -963,16 +963,8 @@ def size(self):
idx_name=self._idx_name,
**self._kwargs,
).size()
work_object = type(self)(
self._df,
self._by,
self._axis,
drop=False,
Copy link
Collaborator Author

@dchigarev dchigarev Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicitly setting drop=False seems useless for the result of groupby.size(), all the tests are passing even without this tweak

idx_name=None,
**self._kwargs,
)
result = work_object._wrap_aggregation(
type(work_object._query_compiler).groupby_size,
result = self._wrap_aggregation(
type(self._query_compiler).groupby_size,
numeric_only=False,
)
if not isinstance(result, Series):
Expand Down
23 changes: 22 additions & 1 deletion modin/test/storage_formats/pandas/test_internals.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,13 +17,15 @@
test_data_values,
df_equals,
)
from modin.config import NPartitions, Engine, MinPartitionSize
from modin.config import NPartitions, Engine, MinPartitionSize, ExperimentalGroupbyImpl
from modin.distributed.dataframe.pandas import from_partitions
from modin.core.storage_formats.pandas.utils import split_result_of_axis_func_pandas
from modin.utils import try_cast_to_pandas

import numpy as np
import pandas
import pytest
import unittest.mock as mock

NPartitions.put(4)

Expand Down Expand Up @@ -1089,3 +1091,22 @@ def test_setitem_bool_preserve_dtypes():
# scalar as a col_loc
df.loc[indexer, "a"] = 2.0
assert df._query_compiler._modin_frame.has_materialized_dtypes


@pytest.mark.parametrize(
"modify_config", [{ExperimentalGroupbyImpl: True}], indirect=True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you pls explain how this works?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modin_config is a fixture that modifies the modin.config variables just for this specific test suit.

The inderect=True parameter allows parametrization of the fixture, so the modify_config would take the parameters specified in the list, otherwise it won't be able to receive those parameters

)
def test_groupby_size_shuffling(modify_config):
# verifies that 'groupby.size()' works with reshuffling implementation
# https://github.com/modin-project/modin/issues/6367
df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [3, 4, 5, 6]})
modin_frame = df._query_compiler._modin_frame

with mock.patch.object(
modin_frame,
"_apply_func_to_range_partitioning",
wraps=modin_frame._apply_func_to_range_partitioning,
) as shuffling_method:
try_cast_to_pandas(df.groupby("a").size())

shuffling_method.assert_called()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what assert are we expecting there?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test verifies that setting the ExperimentalGroupbyImpl to True will guarantee us to engage the reshuffling groupby implementation.

The patched method is the one that actually builds range-partitioning and executes functions over it, so if the code made it through to this method, we can assume that the reshuffling implementation was used, rather than the default MapReduce one.

Loading