-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#6367: Enable support for 'groupby.size()' in reshuffling groupby #6370
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,13 +17,15 @@ | |
test_data_values, | ||
df_equals, | ||
) | ||
from modin.config import NPartitions, Engine, MinPartitionSize | ||
from modin.config import NPartitions, Engine, MinPartitionSize, ExperimentalGroupbyImpl | ||
from modin.distributed.dataframe.pandas import from_partitions | ||
from modin.core.storage_formats.pandas.utils import split_result_of_axis_func_pandas | ||
from modin.utils import try_cast_to_pandas | ||
|
||
import numpy as np | ||
import pandas | ||
import pytest | ||
import unittest.mock as mock | ||
|
||
NPartitions.put(4) | ||
|
||
|
@@ -1089,3 +1091,22 @@ def test_setitem_bool_preserve_dtypes(): | |
# scalar as a col_loc | ||
df.loc[indexer, "a"] = 2.0 | ||
assert df._query_compiler._modin_frame.has_materialized_dtypes | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"modify_config", [{ExperimentalGroupbyImpl: True}], indirect=True | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. could you pls explain how this works? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The |
||
) | ||
def test_groupby_size_shuffling(modify_config): | ||
# verifies that 'groupby.size()' works with reshuffling implementation | ||
# https://github.com/modin-project/modin/issues/6367 | ||
df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [3, 4, 5, 6]}) | ||
modin_frame = df._query_compiler._modin_frame | ||
|
||
with mock.patch.object( | ||
modin_frame, | ||
"_apply_func_to_range_partitioning", | ||
wraps=modin_frame._apply_func_to_range_partitioning, | ||
) as shuffling_method: | ||
try_cast_to_pandas(df.groupby("a").size()) | ||
|
||
shuffling_method.assert_called() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what assert are we expecting there? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The test verifies that setting the The patched method is the one that actually builds range-partitioning and executes functions over it, so if the code made it through to this method, we can assume that the reshuffling implementation was used, rather than the default MapReduce one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explicitly setting
drop=False
seems useless for the result ofgroupby.size()
, all the tests are passing even without this tweak