PERF-#6433: Implement '.dropna()' using map-reduce pattern #6472
Conversation
Signed-off-by: Dmitry Chigarev <[email protected]>
processable_amount_of_partitions = (
    np.prod(self._modin_frame._partitions.shape) < CpuCount.get() * 32
)
Otherwise I was getting ~60s for a square-like frame with a shape of (5000, 5000), when it should normally complete in about 1-2 seconds.
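The threshold in the snippet above can be sketched as a standalone predicate. This is illustrative only: `fits_map_reduce_path`, `partitions_shape`, and `cpu_count` are hypothetical names standing in for Modin's internal `self._modin_frame._partitions.shape` and `CpuCount.get()`.

```python
import numpy as np

def fits_map_reduce_path(partitions_shape, cpu_count):
    """Return True when the frame has few enough partitions for the
    row-axis map-reduce dropna path to be worthwhile."""
    return bool(np.prod(partitions_shape) < cpu_count * 32)

# A square frame split into a 32x32 partition grid on an 8-core machine
# has 1024 partitions, exceeding the 8 * 32 = 256 budget:
fits_map_reduce_path((32, 32), 8)  # 1024 < 256 -> False
fits_map_reduce_path((4, 4), 8)    # 16 < 256 -> True
```

The multiplier (32 per core) is the empirically chosen constant from the diff; past that point the per-partition kernel overhead dominates, which matches the ~60s regression described above.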
@@ -2868,13 +2909,19 @@ def _pick_axis(get_axis, sizes_cache):
     if axis == 0:
         # Pass shape caches instead of values in order to not trigger shape computation.
         new_index, new_row_lengths = _pick_axis(
-            self._get_index, self._row_lengths_cache
+            self.copy_index_cache, self._row_lengths_cache
.copy_index_cache() doesn't trigger index materialization, unlike ._get_index(), which does.
processable_amount_of_partitions = (
    np.prod(self._modin_frame._partitions.shape) < CpuCount.get() * 32
)
Let's not mix abstraction levels and separate this condition into a dataframe method.
Let's separate this condition into a dataframe method.
Well, that would be quite hard... Here I want to dispatch between two different implementations of .dropna(). Since the dataframe level doesn't define dropna, I have to do this dispatching at the QC level.
The idea of a "general dispatcher" at the dataframe level was discussed in #5394, but trying the suggested ideas didn't really pan out, so there's a lot of work to be done in this regard.
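The dispatch being described can be sketched as follows. All names here (`QueryCompilerSketch`, `_dropna_map_reduce`, `_dropna_full_axis`) are illustrative assumptions, not Modin's real API; the point is only where the branch lives.

```python
class QueryCompilerSketch:
    """Illustrative QC that picks a dropna algorithm itself, because the
    underlying dataframe layer defines no dropna to dispatch through."""

    def __init__(self, num_parts, cpu_count):
        self.num_parts = num_parts
        self.cpu_count = cpu_count

    def dropna(self, axis=0, how="any"):
        # The dataframe level knows nothing about dropna, so the choice
        # between algorithms has to happen at the QC level.
        if axis == 1 and self.num_parts < self.cpu_count * 32:
            return self._dropna_map_reduce(how)
        return self._dropna_full_axis(how)

    def _dropna_map_reduce(self, how):
        return "map-reduce path"

    def _dropna_full_axis(self, how):
        return "full-axis path"
```

A general mechanism for this kind of per-operation dispatch at the dataframe level is what #5394 discusses; until that exists, each QC method carries its own branch like the one above.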
Just hide the implementation details, something like:
processable_amount_of_partitions = self._modin_frame.processable_amount_of_partitions_for_dropna
Instead of the .processable_amount_of_partitions_for_dropna field, I simply added a .num_parts property and kept the deciding logic at the QC level, so the line now looks like:
processable_amount_of_partitions = self._modin_frame.num_parts < CpuCount.get() * 32
It feels wrong to add a .processable_amount_of_partitions_for_dropna field, as the modin_frame layer should know nothing about dropna and how many partitions its implementation prefers.
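The resulting split of responsibilities can be sketched like this. `ModinFrameSketch` is an illustrative stand-in, not the real `modin_frame` class; it shows the frame exposing only a generic partition count while the dropna-specific threshold stays out of it.

```python
import numpy as np

class ModinFrameSketch:
    """Illustrative frame: knows its partitions, knows nothing of dropna."""

    def __init__(self, partitions):
        self._partitions = partitions  # 2D array of partition handles

    @property
    def num_parts(self):
        """Total number of partitions; carries no per-operation policy."""
        return int(np.prod(self._partitions.shape))

frame = ModinFrameSketch(np.empty((4, 8), dtype=object))
frame.num_parts  # -> 32
```

The QC then consumes `num_parts` in its own condition, keeping the operation-specific magic constant (`CpuCount.get() * 32`) at the layer that actually implements dropna.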
LGTM!
This PR replaces a full-column implementation of .dropna() with a chain of two row-axis/map functions. At the first stage, we submit a row-axis function that evaluates, for each row partition, whether each column should be dropped and builds a boolean mask; at the second stage, those masks are merged and the columns marked for removal are dropped.
perf results
Here's the comparison for df.dropna(axis=1, how="any") between these two implementations:
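The measurement snippet itself is collapsed in the page, so it is not reproduced here. A generic harness for this kind of comparison might look like the following; it is illustrative, not the author's code, and `make_frame` is a hypothetical helper.

```python
import timeit
import numpy as np
import pandas as pd  # swap in `import modin.pandas as pd` to measure Modin

def make_frame(n):
    """Square float frame with NaNs in every other column of row 0,
    so dropna(axis=1, how='any') has real work to do."""
    data = np.random.rand(n, n)
    data[0, ::2] = np.nan
    return pd.DataFrame(data)

df = make_frame(200)  # the PR benchmarks (5000, 5000); kept small here
elapsed = timeit.timeit(lambda: df.dropna(axis=1, how="any"), number=3)
```

Running the same harness once against stock Modin and once against this branch is the shape of comparison the numbers above came from.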
What do these changes do?
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- commits signed with git commit -s
- resolves Implement .dropna() using MapReduce pattern #6433
- docs/development/architecture.rst is up-to-date