-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF-#6398: Improved performance of list-like objects insertion into DataFrames #6476
Conversation
ded13c3
to
3c04a06
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should fix #6399 first.
P.S. The name of the PR should start with PERF-
.
@@ -2511,7 +2511,7 @@ def setitem_unhashable_key(df, value): | |||
value = value.T.reshape(-1) | |||
if len(self) > 0: | |||
value = value[: len(self)] | |||
if not isinstance(value, (Series, Categorical, np.ndarray)): | |||
if not isinstance(value, (Series, Categorical, np.ndarray, list, range)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list(value)
makes a copy if value is already a list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it makes an unnecessary copy.
): | ||
item.add_marker( | ||
pytest.mark.xfail( | ||
reason="https://github.com/modin-project/modin/issues/6399" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't like to merge perf fixes that break current tests.
…sertion into DataFrames Wrap a list-like object into a single-column query compiler before the insertion. Signed-off-by: Andrey Pavlenko <[email protected]>
3c04a06
to
eb6ed27
Compare
@@ -2922,6 +2924,7 @@ def _compute_duplicated(df): # pragma: no cover | |||
# return a new one from here and let the front end handle the inplace | |||
# update. | |||
def insert(self, loc, column, value): | |||
value = self._wrap_column_data(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine as a temporary hack.
However, there's a TODO message a few lines below mentioning that the insertion of a non-distributed data can be speeded-up significantly by changing apply_full_axis...
call to the apply_select_indices(item_to_distribute=value, ...)
modin/modin/core/storage_formats/pandas/query_compiler.py
Lines 2945 to 2946 in 781fed8
# TODO: rework by passing list-like values to `apply_select_indices` | |
# as an item to distribute |
This way, we wouldn't need to wrap the newly created dataframe but propagate the value directly to the partitions, which, in theory, should be quite fast.
Could you please quickly check the approach from TODO message makes sense? If it'll be too inefficient or too complex to implement, we can stick to the changes introduced by this PR
"test_dataframe_dt_index[3s-both-DateCol-0]", | ||
"test_dataframe_dt_index[3s-right-DateCol-0]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems that these test fail anytime the execution goes to this if-else branch
modin/modin/pandas/test/test_rolling.py
Lines 162 to 163 in 781fed8
pandas_df[on] = pandas.date_range("22/06/1941", periods=12, freq="T") | |
modin_df[on] = pd.date_range("22/06/1941", periods=12, freq="T") |
can we call
.xfail()
directly from there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can, but we will definitely forget to remove this call when the problem is fixed.
This PR will affect (at least) the following cases:
We need to consider whether we should fix these cases first. Perhaps for a start, it will be enough to default to pandas according to some heuristic regarding accessing columns that can be in different partitions. |
By the way, I believe that both #2511 and #3435 should work fine if the As a temp-fix, I think it might be reasonable to fallback on pandas for A perfect solution would be to finish the experimental groupby to a level when it would have satisfiable quality to transfer all the |
If #6465 can be the solution to the problem, then it is more rational if the final decision is yours @dchigarev, so you are more aware of how much more needs to be done there.
Let's merge then |
Wrap a list-like object into a single-column query compiler before the insertion.
Note: this PR does not cover the HDK backend. For HDK a separate PR is created #6412, but it's not ready yet due to the HDK issue intel/hdk#588.
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date