-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#6367: Enable support for 'groupby.size()' in reshuffling groupby #6370
Conversation
self._df, | ||
self._by, | ||
self._axis, | ||
drop=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explicitly setting drop=False
seems useless for the result of groupby.size()
, all the tests are passing even without this tweak
…fling groupby Signed-off-by: Dmitry Chigarev <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Dmitry Chigarev <[email protected]>
|
||
|
||
@pytest.mark.parametrize( | ||
"modify_config", [{ExperimentalGroupbyImpl: True}], indirect=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you pls explain how this works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modin_config
is a fixture that modifies the modin.config
variables just for this specific test suit.
The inderect=True
parameter allows parametrization of the fixture, so the modify_config
would take the parameters specified in the list, otherwise it won't be able to receive those parameters
) as shuffling_method: | ||
try_cast_to_pandas(df.groupby("a").size()) | ||
|
||
shuffling_method.assert_called() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what assert are we expecting there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test verifies that setting the ExperimentalGroupbyImpl
to True will guarantee us to engage the reshuffling groupby implementation.
The patched method is the one that actually builds range-partitioning and executes functions over it, so if the code made it through to this method, we can assume that the reshuffling implementation was used, rather than the default MapReduce one.
@dchigarev please resolve conflicts |
@vnlitvinov do you have any other comments? |
…#6370) Signed-off-by: Dmitry Chigarev <[email protected]>
…#6370) Signed-off-by: Dmitry Chigarev <[email protected]>
…#6370) Signed-off-by: Dmitry Chigarev <[email protected]>
What do these changes do?
This PR removes
drop=False
parameter overwriting for the groupby object used ingroupby.size()
. The overwriting was originally introduced by this commit bb8b23d more than 3 years ago in an attempt to fixgroupby(as_index=False, ...).size()
. Nowadays, this case works perfectly fine even without thisdrop=False
overwriting.The problem that this overwriting had brought, is that it confused implementations on whether the 'by' columns actually came from the grouping frame or not (
drop=True
means that the grouping column came from the same frame anddrop=False
means that the 'by' columns are foreign).The experimental groupby implementation only allows for the 'by' columns to originate from the same frame, meaning that this
drop=False
overwriting never allowed us to rungroupby.size()
using the new implementation. This PR fixes that problem.modin/modin/core/storage_formats/pandas/query_compiler.py
Lines 3421 to 3425 in f24ba6f
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
groupby.size()
method #6367docs/development/architecture.rst
is up-to-date