-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#4909: Properly implement map operator #5118
base: master
Are you sure you want to change the base?
Changes from all commits
da415ad
af1926a
be8ff60
ea1c425
0cd70f5
cb5c99b
f506eed
2d987b0
81afa94
07aef83
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,6 +19,9 @@ | |
|
||
from abc import ABC, abstractmethod | ||
from typing import List, Hashable, Optional, Callable, Union, Dict | ||
|
||
import pandas | ||
|
||
from modin.core.dataframe.base.dataframe.utils import Axis, JoinType | ||
|
||
|
||
|
@@ -91,8 +94,10 @@ def filter_by_types(self, types: List[Hashable]) -> "ModinDataframe": | |
def map( | ||
self, | ||
function: Callable, | ||
*, | ||
axis: Optional[Union[int, Axis]] = None, | ||
dtypes: Optional[str] = None, | ||
dtypes: Optional[Union[pandas.Series, type]] = None, | ||
copy_dtypes: bool = False, | ||
) -> "ModinDataframe": | ||
""" | ||
Apply a user-defined function row-wise if `axis`=0, column-wise if `axis`=1, and cell-wise if `axis` is None. | ||
|
@@ -102,11 +107,14 @@ def map( | |
function : callable(row|col|cell) -> row|col|cell | ||
The function to map across the dataframe. | ||
axis : int or modin.core.dataframe.base.utils.Axis, optional | ||
The axis to map over. | ||
dtypes : str, optional | ||
The axis to map over. If None, the map will be performed element-wise. | ||
dtypes : pandas.Series or scalar type, optional | ||
The data types for the result. This is an optimization | ||
because there are functions that always result in a particular data | ||
type, and this allows us to avoid (re)computing it. | ||
copy_dtypes : bool, default: False | ||
If True, the dtypes of the resulting dataframe are copied from the original, | ||
and the ``dtypes`` argument is ignored. | ||
Comment on lines
+115
to
+117
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why do we offer a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not really sure, though my guess is that the frequently dimension-reducing nature of reduce/tree-reduce makes the argument less relevant for those cases. Here, I introduced |
||
|
||
Returns | ||
------- | ||
|
@@ -258,7 +266,8 @@ def reduce( | |
self, | ||
axis: Union[int, Axis], | ||
function: Callable, | ||
dtypes: Optional[str] = None, | ||
*, | ||
dtypes: Optional[pandas.Series] = None, | ||
) -> "ModinDataframe": | ||
""" | ||
Perform a user-defined aggregation on the specified axis, where the axis reduces down to a singleton. | ||
|
@@ -269,7 +278,7 @@ def reduce( | |
The axis to perform the reduce over. | ||
function : callable(row|col) -> single value | ||
The reduce function to apply to each column. | ||
dtypes : str, optional | ||
dtypes : pandas.Series, optional | ||
The data types for the result. This is an optimization | ||
because there are functions that always result in a particular data | ||
type, and this allows us to avoid (re)computing it. | ||
|
@@ -291,7 +300,8 @@ def tree_reduce( | |
axis: Union[int, Axis], | ||
map_func: Callable, | ||
reduce_func: Optional[Callable] = None, | ||
dtypes: Optional[str] = None, | ||
*, | ||
dtypes: Optional[pandas.Series] = None, | ||
) -> "ModinDataframe": | ||
""" | ||
Perform a user-defined aggregation on the specified axis, where the axis reduces down to a singleton using a tree-reduce computation pattern. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,6 +19,12 @@ | |
""" | ||
|
||
from enum import Enum | ||
import sys | ||
|
||
if sys.version_info.minor < 8: | ||
from typing_extensions import Literal | ||
else: | ||
from typing import Literal # type: ignore | ||
|
||
|
||
class Axis(Enum): # noqa: PR01 | ||
|
@@ -36,6 +42,10 @@ class Axis(Enum): # noqa: PR01 | |
CELL_WISE = None | ||
|
||
|
||
AxisInt = Literal[0, 1] | ||
"""Type for the two possible integer values of an axis argument (0 or 1).""" | ||
|
||
|
||
Comment on lines
+45
to
+48
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is this needed? I mean, why would we extend internal dataframe API to also be able to accept There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A lot of the codebase (mostly the query compiler) is written to call dataframe methods with a literal int rather than the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't see why we need this |
||
class JoinType(Enum): # noqa: PR01 | ||
""" | ||
An enum that represents the `join_type` argument provided to the algebra operators. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly, there was a discussion regarding limiting the usage of pandas entities in the base classes of Modin internals. Some executions may not require pandas at all and wouldn't like to deal with handling breaking changes introduced by some pandas updates.
May we define the
dtypes
type as something abstract likecollections.abc.Mapping
so every execution would use whatever container they like?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that makes senes. Is there some other generic container that would accept
pandas.Series
though? It seems like it's not a subclass ofMapping
.