-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF-#6398: Improved performance of list-like objects insertion into DataFrames #6476
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2511,7 +2511,7 @@ def setitem_unhashable_key(df, value): | |
value = value.T.reshape(-1) | ||
if len(self) > 0: | ||
value = value[: len(self)] | ||
if not isinstance(value, (Series, Categorical, np.ndarray)): | ||
if not isinstance(value, (Series, Categorical, np.ndarray, list, range)): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, it makes an unnecessary copy. |
||
value = list(value) | ||
|
||
if not self._query_compiler.lazy_execution and len(self.index) == 0: | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,37 @@ | ||||||
# Licensed to Modin Development Team under one or more contributor license agreements. | ||||||
# See the NOTICE file distributed with this work for additional information regarding | ||||||
# copyright ownership. The Modin Development Team licenses this file to you under the | ||||||
# Apache License, Version 2.0 (the "License"); you may not use this file except in | ||||||
# compliance with the License. You may obtain a copy of the License at | ||||||
# | ||||||
# http://www.apache.org/licenses/LICENSE-2.0 | ||||||
# | ||||||
# Unless required by applicable law or agreed to in writing, software distributed under | ||||||
# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF | ||||||
# ANY KIND, either express or implied. See the License for the specific language | ||||||
# governing permissions and limitations under the License. | ||||||
|
||||||
import pytest | ||||||
|
||||||
from modin.config import Engine, StorageFormat | ||||||
|
||||||
|
||||||
def pytest_collection_modifyitems(items): | ||||||
try: | ||||||
if ( | ||||||
Engine.get() in ("Ray", "Unidist", "Dask", "Python") | ||||||
and StorageFormat.get() != "Base" | ||||||
): | ||||||
for item in items: | ||||||
if item.name in ( | ||||||
"test_dataframe_dt_index[3s-both-DateCol-0]", | ||||||
"test_dataframe_dt_index[3s-right-DateCol-0]", | ||||||
Comment on lines
+27
to
+28
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it seems that these test fail anytime the execution goes to this if-else branch modin/modin/pandas/test/test_rolling.py Lines 162 to 163 in 781fed8
can we call .xfail() directly from there?
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can, but we will definitely forget to remove this call when the problem is fixed. |
||||||
): | ||||||
item.add_marker( | ||||||
pytest.mark.xfail( | ||||||
reason="https://github.com/modin-project/modin/issues/6399" | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wouldn't like to merge perf fixes that break current tests. |
||||||
) | ||||||
) | ||||||
except ImportError: | ||||||
# No engine | ||||||
... | ||||||
Check notice Code scanning / CodeQL Statement has no effect Note test
This statement has no effect.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine as a temporary hack.
However, there's a TODO message a few lines below mentioning that the insertion of a non-distributed data can be speeded-up significantly by changing
apply_full_axis...
call to theapply_select_indices(item_to_distribute=value, ...)
modin/modin/core/storage_formats/pandas/query_compiler.py
Lines 2945 to 2946 in 781fed8
This way, we wouldn't need to wrap the newly created dataframe but propagate the value directly to the partitions, which, in theory, should be quite fast.
Could you please quickly check the approach from TODO message makes sense? If it'll be too inefficient or too complex to implement, we can stick to the changes introduced by this PR