Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF-#6398: Improved performance of list-like objects insertion into DataFrames #6476

Merged
merged 1 commit into from
Aug 24, 2023

Conversation

AndreyPavlenko
Copy link
Collaborator

@AndreyPavlenko AndreyPavlenko commented Aug 9, 2023

Wrap a list-like object into a single-column query compiler before the insertion.

Note: this PR does not cover the HDK backend. For HDK a separate PR is created #6412, but it's not ready yet due to the HDK issue intel/hdk#588.

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Improve performance of list-like objects insertion into DataFrames  #6398
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

)
except ImportError:
# No engine
...

Check notice

Code scanning / CodeQL

Statement has no effect Note test

This statement has no effect.
@AndreyPavlenko AndreyPavlenko marked this pull request as ready for review August 9, 2023 18:18
@AndreyPavlenko AndreyPavlenko requested a review from a team as a code owner August 9, 2023 18:18
Copy link
Collaborator

@anmyachev anmyachev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should fix #6399 first.

P.S. The name of the PR should start with PERF-.

modin/core/storage_formats/pandas/query_compiler.py Outdated Show resolved Hide resolved
@@ -2511,7 +2511,7 @@ def setitem_unhashable_key(df, value):
value = value.T.reshape(-1)
if len(self) > 0:
value = value[: len(self)]
if not isinstance(value, (Series, Categorical, np.ndarray)):
if not isinstance(value, (Series, Categorical, np.ndarray, list, range)):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list(value) makes a copy if value is already a list?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it makes an unnecessary copy.

):
item.add_marker(
pytest.mark.xfail(
reason="https://github.com/modin-project/modin/issues/6399"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't like to merge perf fixes that break current tests.

@AndreyPavlenko AndreyPavlenko changed the title FEAT-#6398: Improved performance of list-like objects insertion into DataFrames PERF-#6398: Improved performance of list-like objects insertion into DataFrames Aug 10, 2023
…sertion into DataFrames

Wrap a list-like object into a single-column query compiler before the insertion.

Signed-off-by: Andrey Pavlenko <[email protected]>
@@ -2922,6 +2924,7 @@ def _compute_duplicated(df): # pragma: no cover
# return a new one from here and let the front end handle the inplace
# update.
def insert(self, loc, column, value):
value = self._wrap_column_data(value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine as a temporary hack.

However, there's a TODO message a few lines below mentioning that the insertion of a non-distributed data can be speeded-up significantly by changing apply_full_axis... call to the apply_select_indices(item_to_distribute=value, ...)

# TODO: rework by passing list-like values to `apply_select_indices`
# as an item to distribute

This way, we wouldn't need to wrap the newly created dataframe but propagate the value directly to the partitions, which, in theory, should be quite fast.

Could you please quickly check the approach from TODO message makes sense? If it'll be too inefficient or too complex to implement, we can stick to the changes introduced by this PR

Comment on lines +27 to +28
"test_dataframe_dt_index[3s-both-DateCol-0]",
"test_dataframe_dt_index[3s-right-DateCol-0]",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that these test fail anytime the execution goes to this if-else branch

pandas_df[on] = pandas.date_range("22/06/1941", periods=12, freq="T")
modin_df[on] = pd.date_range("22/06/1941", periods=12, freq="T")

can we call .xfail() directly from there?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, but we will definitely forget to remove this call when the problem is fixed.

@anmyachev
Copy link
Collaborator

anmyachev commented Aug 22, 2023

This PR will affect (at least) the following cases:

We need to consider whether we should fix these cases first. Perhaps for a start, it will be enough to default to pandas according to some heuristic regarding accessing columns that can be in different partitions.

сс @dchigarev @YarShev

@dchigarev
Copy link
Collaborator

By the way, I believe that both #2511 and #3435 should work fine if the cfg.ExperimentalGroupbyImpl is enabled. Though, I don't think the experimental groupby is mature enough to engage it unconditionally on groupby.apply().

As a temp-fix, I think it might be reasonable to fallback on pandas for groupby.apply() in case there's more than one column partition. However, this would affect performance dramatically for all the .apply() calls, even if the applied function is simple and doesn't trigger the problem, which is quite sad, so I wouldn't agree nor disagree with this approach.

A perfect solution would be to finish the experimental groupby to a level when it would have satisfiable quality to transfer all the .apply() call to it before the release (for that we have to fix #6465 and finish 1 or 2 additional feats, not sure if it will fit in this release).

@Garra1980
Copy link
Collaborator

Fallback to pandas option looks bad to me

I suggest then to fix this one and plan #6465 for this release hopefully closing #2511 and #3435 and #6399 will be fixed later :)

@anmyachev anmyachev dismissed their stale review August 24, 2023 10:29

If #6465 can be the solution to the problem, then it is more rational if the final decision is yours @dchigarev, so you are more aware of how much more needs to be done there.

@Garra1980
Copy link
Collaborator

Let's merge then

@dchigarev dchigarev merged commit da385c9 into modin-project:master Aug 24, 2023
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of list-like objects insertion into DataFrames
4 participants