Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#6367: Enable support for 'groupby.size()' in reshuffling groupby #6370

Merged
merged 3 commits into from
Aug 10, 2023

Conversation

dchigarev
Copy link
Collaborator

@dchigarev dchigarev commented Jul 7, 2023

What do these changes do?

This PR removes drop=False parameter overwriting for the groupby object used in groupby.size(). The overwriting was originally introduced by this commit bb8b23d more than 3 years ago in an attempt to fix groupby(as_index=False, ...).size(). Nowadays, this case works perfectly fine even without this drop=False overwriting.

The problem that this overwriting had brought, is that it confused implementations on whether the 'by' columns actually came from the grouping frame or not (drop=True means that the grouping column came from the same frame and drop=False means that the 'by' columns are foreign).

The experimental groupby implementation only allows for the 'by' columns to originate from the same frame, meaning that this drop=False overwriting never allowed us to run groupby.size() using the new implementation. This PR fixes that problem.

if not is_all_column_names:
raise NotImplementedError(
"Reshuffling groupby is only supported when grouping on a column(s) of the same frame. "
+ "https://github.com/modin-project/modin/issues/5926"
)

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves BUG: Reshuffling groupby doesn't work for groupby.size() method #6367
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

self._df,
self._by,
self._axis,
drop=False,
Copy link
Collaborator Author

@dchigarev dchigarev Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicitly setting drop=False seems useless for the result of groupby.size(), all the tests are passing even without this tweak

@dchigarev dchigarev marked this pull request as ready for review August 8, 2023 14:46
@dchigarev dchigarev requested a review from a team as a code owner August 8, 2023 14:46
anmyachev
anmyachev previously approved these changes Aug 8, 2023
Copy link
Collaborator

@anmyachev anmyachev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

anmyachev
anmyachev previously approved these changes Aug 8, 2023


@pytest.mark.parametrize(
"modify_config", [{ExperimentalGroupbyImpl: True}], indirect=True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you pls explain how this works?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modin_config is a fixture that modifies the modin.config variables just for this specific test suit.

The inderect=True parameter allows parametrization of the fixture, so the modify_config would take the parameters specified in the list, otherwise it won't be able to receive those parameters

) as shuffling_method:
try_cast_to_pandas(df.groupby("a").size())

shuffling_method.assert_called()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what assert are we expecting there?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test verifies that setting the ExperimentalGroupbyImpl to True will guarantee us to engage the reshuffling groupby implementation.

The patched method is the one that actually builds range-partitioning and executes functions over it, so if the code made it through to this method, we can assume that the reshuffling implementation was used, rather than the default MapReduce one.

@anmyachev
Copy link
Collaborator

@dchigarev please resolve conflicts

@anmyachev
Copy link
Collaborator

@vnlitvinov do you have any other comments?

@anmyachev anmyachev merged commit 3b3bba9 into modin-project:master Aug 10, 2023
37 checks passed
RehanSD pushed a commit that referenced this pull request Aug 22, 2023
RehanSD pushed a commit that referenced this pull request Aug 22, 2023
RehanSD pushed a commit that referenced this pull request Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Reshuffling groupby doesn't work for groupby.size() method
3 participants