BUG: KeyError for TimeGrouper with df.group_by() #5091

benaurelschill · 2022-10-05T21:24:31Z

Modin version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd
data = {'id': [1,2],'time_stamp': ["2022-03-24 23:53:09", "2022-03-24 21:53:09"], 'count':[5,5]}
data_frame = pd.DataFrame(data)
data_frame['time_stamp'] = pd.to_datetime(data_frame['time_stamp'])
print(data_frame.dtypes)
daily = data_frame.groupby(pd.Grouper(key='time_stamp', freq='D')).mean().reset_index()
print(daily)

Issue Description

The KeyError below is raised with the newest Modin version when installed from the GitHub master branch.

Expected Behavior

Modin should group data and aggregate data by date.
Pandas gives the following output for daily:
  time_stamp   id  count
0 2022-03-24  1.5    5.0
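
For reference, the expected output can be checked against plain pandas. This is a minimal sketch that assumes only pandas is installed. The names `expected_frame` and `expected_daily` are illustrative and do not come from the reproducer.

```python
import pandas

# Same data as in the reproducer, but built with plain pandas instead of Modin.
data = {
    "id": [1, 2],
    "time_stamp": ["2022-03-24 23:53:09", "2022-03-24 21:53:09"],
    "count": [5, 5],
}
expected_frame = pandas.DataFrame(data)
expected_frame["time_stamp"] = pandas.to_datetime(expected_frame["time_stamp"])

# Both rows fall on the same calendar day, so a daily Grouper yields one group.
expected_daily = (
    expected_frame.groupby(pandas.Grouper(key="time_stamp", freq="D"))
    .mean()
    .reset_index()
)
print(expected_daily)
```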

Error Logs

Stack trace

KeyError                                  Traceback (most recent call last)
Input In [35], in <cell line: 6>()
      4 data_frame['time_stamp'] = pd.to_datetime(data_frame['time_stamp'])
      5 print(data_frame.dtypes)
----> 6 daily = data_frame.groupby(pd.Grouper(key='time_stamp', freq='D')).mean().reset_index()
      7 print(daily)

File /usr/local/lib/python3.9/site-packages/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File /usr/local/lib/python3.9/site-packages/modin/pandas/groupby.py:138, in DataFrameGroupBy.mean(self, numeric_only)
    136 def mean(self, numeric_only=None):
    137     return self._check_index(
--> 138         self._wrap_aggregation(
    139             type(self._query_compiler).groupby_mean,
    140             numeric_only=numeric_only,
    141             agg_kwargs=dict(numeric_only=numeric_only),
    142         )
    143     )

File /usr/local/lib/python3.9/site-packages/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File /usr/local/lib/python3.9/site-packages/modin/pandas/groupby.py:1095, in DataFrameGroupBy._wrap_aggregation(self, qc_method, numeric_only, agg_args, agg_kwargs, **kwargs)
   1091 else:
   1092     groupby_qc = self._query_compiler
   1094 result = type(self._df)(
-> 1095     query_compiler=qc_method(
   1096         groupby_qc,
   1097         by=self._by,
   1098         axis=self._axis,
   1099         groupby_kwargs=self._kwargs,
   1100         agg_args=agg_args,
   1101         agg_kwargs=agg_kwargs,
   1102         drop=self._drop,
   1103         **kwargs,
   1104     )
   1105 )
   1106 if self._squeeze:
   1107     return result.squeeze()

File /usr/local/lib/python3.9/site-packages/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File /usr/local/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py:2616, in PandasQueryCompiler.groupby_mean(self, by, axis, groupby_kwargs, agg_args, agg_kwargs, drop)
   2613     count_df = sums_counts_df.iloc[:, len(sums_counts_df.columns) // 2 :]
   2614     return sum_df / count_df
-> 2616 result = GroupByReduce.register(
   2617     lambda dfgb, **kwargs: pandas.concat(
   2618         [dfgb.sum(**kwargs), dfgb.count()],
   2619         axis=1,
   2620         copy=False,
   2621     ),
   2622     _groupby_mean_reduce,
   2623     default_to_pandas_func=lambda dfgb, **kwargs: dfgb.mean(**kwargs),
   2624 )(
   2625     query_compiler=qc_with_converted_datetime_cols,
   2626     by=by,
   2627     axis=axis,
   2628     groupby_kwargs=groupby_kwargs,
   2629     agg_args=agg_args,
   2630     agg_kwargs=agg_kwargs,
   2631     drop=drop,
   2632 )
   2634 if len(datetime_cols) > 0:
   2635     result = result.astype({col: dtype for col, dtype in datetime_cols.items()})

File /usr/local/lib/python3.9/site-packages/modin/core/dataframe/algebra/groupby.py:68, in GroupByReduce.register.<locals>.<lambda>(*args, **kwargs)
     61     reduce_func = map_func
     62 assert not (
     63     isinstance(map_func, dict) ^ isinstance(reduce_func, dict)
     64 ) and not (
     65     callable(map_func) ^ callable(reduce_func)
     66 ), "Map and reduce functions must be either both dict or both callable."
---> 68 return lambda *args, **kwargs: cls.caller(
     69     *args, map_func=map_func, reduce_func=reduce_func, **kwargs, **call_kwds
     70 )

File /usr/local/lib/python3.9/site-packages/modin/core/dataframe/algebra/groupby.py:306, in GroupByReduce.caller(cls, query_compiler, by, map_func, reduce_func, axis, groupby_kwargs, agg_args, agg_kwargs, drop, method, default_to_pandas_func)
    300     if default_to_pandas_func is None:
    301         default_to_pandas_func = (
    302             (lambda grp: grp.agg(map_func))
    303             if isinstance(map_func, dict)
    304             else map_func
    305         )
--> 306     return query_compiler.default_to_pandas(
    307         lambda df: default_to_pandas_func(
    308             df.groupby(by=by, axis=axis, **groupby_kwargs),
    309             *agg_args,
    310             **agg_kwargs,
    311         )
    312     )
    314 # The bug only occurs in the case of Categorical 'by', so we might want to check whether any of
    315 # the 'by' dtypes is Categorical before going into this branch, however triggering 'dtypes'
    316 # computation if they're not computed may take time, so we don't do it
    317 if not groupby_kwargs.get("sort", True) and isinstance(
    318     by, type(query_compiler)
    319 ):

File /usr/local/lib/python3.9/site-packages/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File /usr/local/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py:253, in PandasQueryCompiler.default_to_pandas(self, pandas_op, *args, **kwargs)
    247 args = (a.to_pandas() if isinstance(a, type(self)) else a for a in args)
    248 kwargs = {
    249     k: v.to_pandas if isinstance(v, type(self)) else v
    250     for k, v in kwargs.items()
    251 }
--> 253 result = pandas_op(self.to_pandas(), *args, **kwargs)
    254 if isinstance(result, pandas.Series):
    255     if result.name is None:

File /usr/local/lib/python3.9/site-packages/modin/core/dataframe/algebra/groupby.py:308, in GroupByReduce.caller.<locals>.<lambda>(df)
    300     if default_to_pandas_func is None:
    301         default_to_pandas_func = (
    302             (lambda grp: grp.agg(map_func))
    303             if isinstance(map_func, dict)
    304             else map_func
    305         )
    306     return query_compiler.default_to_pandas(
    307         lambda df: default_to_pandas_func(
--> 308             df.groupby(by=by, axis=axis, **groupby_kwargs),
    309             *agg_args,
    310             **agg_kwargs,
    311         )
    312     )
    314 # The bug only occurs in the case of Categorical 'by', so we might want to check whether any of
    315 # the 'by' dtypes is Categorical before going into this branch, however triggering 'dtypes'
    316 # computation if they're not computed may take time, so we don't do it
    317 if not groupby_kwargs.get("sort", True) and isinstance(
    318     by, type(query_compiler)
    319 ):

File /usr/local/lib64/python3.9/site-packages/pandas/core/frame.py:8392, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   8389     raise TypeError("You have to supply one of 'by' and 'level'")
   8390 axis = self._get_axis_number(axis)
-> 8392 return DataFrameGroupBy(
   8393     obj=self,
   8394     keys=by,
   8395     axis=axis,
   8396     level=level,
   8397     as_index=as_index,
   8398     sort=sort,
   8399     group_keys=group_keys,
   8400     squeeze=squeeze,
   8401     observed=observed,
   8402     dropna=dropna,
   8403 )

File /usr/local/lib64/python3.9/site-packages/pandas/core/groupby/groupby.py:959, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    956 if grouper is None:
    957     from pandas.core.groupby.grouper import get_grouper
--> 959     grouper, exclusions, obj = get_grouper(
    960         obj,
    961         keys,
    962         axis=axis,
    963         level=level,
    964         sort=sort,
    965         observed=observed,
    966         mutated=self.mutated,
    967         dropna=self.dropna,
    968     )
    970 self.obj = obj
    971 self.axis = obj._get_axis_number(axis)

File /usr/local/lib64/python3.9/site-packages/pandas/core/groupby/grouper.py:787, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    785 # a passed-in Grouper, directly convert
    786 if isinstance(key, Grouper):
--> 787     binner, grouper, obj = key._get_grouper(obj, validate=False)
    788     if key.key is None:
    789         return grouper, frozenset(), obj

File /usr/local/lib64/python3.9/site-packages/pandas/core/resample.py:1707, in TimeGrouper._get_grouper(self, obj, validate)
   1705 def _get_grouper(self, obj, validate: bool = True):
   1706     # create the resampler and return our binner
-> 1707     r = self._get_resampler(obj)
   1708     return r.binner, r.grouper, r.obj

File /usr/local/lib64/python3.9/site-packages/pandas/core/resample.py:1683, in TimeGrouper._get_resampler(self, obj, kind)
   1664 def _get_resampler(self, obj, kind=None):
   1665     """
   1666     Return my resampler or raise if we have an invalid axis.
   1667 
   (...)
   1681 
   1682     """
-> 1683     self._set_grouper(obj)
   1685     ax = self.ax
   1686     if isinstance(ax, DatetimeIndex):

File /usr/local/lib64/python3.9/site-packages/pandas/core/groupby/grouper.py:385, in Grouper._set_grouper(self, obj, sort)
    383     else:
    384         if key not in obj._info_axis:
--> 385             raise KeyError(f"The grouper name {key} is not found")
    386         ax = Index(obj[key], name=key)
    388 else:

KeyError: 'The grouper name time_stamp is not found'

Installed Versions

Versions

INSTALLED VERSIONS

commit : 621bc10
python : 3.9.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.14.0-70.26.1.el9_0.x86_64
Version : #1 SMP PREEMPT Fri Sep 2 16:07:40 EDT 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies

modin : 0.16.0
ray : 2.0.0
dask : 2022.01.1
distributed : 2022.01.1
hdk : None

pandas dependencies

pandas : 1.5.0
numpy : 1.23.2
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 53.0.0
pip : 21.2.3
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.5
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.7.1
gcsfs : None
matplotlib : 3.5.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 6.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.0
snappy : None
sqlalchemy : 1.4.40
tables : None
tabulate : 0.8.10
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None


benaurelschill · 2022-10-06T01:01:30Z

Updated the issue with reproducible code and the complete error message.

mvashishtha · 2022-10-06T02:10:42Z

@benaurelschill thanks for reporting this issue. Your script reproduces the error for me. I'm taking a look to see what's going wrong and whether this can be fixed quickly.

mvashishtha · 2022-10-06T04:39:31Z

The chain of events producing the bug:

We don't compute drop correctly in modin.pandas.DataFrame._groupby when by is a Grouper object. In the reproducer's case, we should know that we're going to drop the one column named in the Grouper object.
Because self._drop is wrong, _internal_by thinks there are no by columns:

modin/modin/pandas/groupby.py

Line 377 in 621bc10

if self._drop:

... although even if self._drop were correct, _internal_by might not be able to deduce the by columns from the grouper.

When dropping non-numeric columns, we assume we can drop the non-numeric key column, because we don't think it's in the by:

modin/modin/pandas/groupby.py

Lines 1078 to 1089 in 621bc10

    
           mask_cols = [ 
        
               col 
        
               for col, dtype in self._query_compiler.dtypes.items() 
        
               if ( 
        
                   is_numeric_dtype(dtype) 
        
                   or ( 
        
                       isinstance(dtype, pandas.CategoricalDtype) 
        
                       and is_numeric_dtype(dtype.categories.dtype) 
        
                   ) 
        
                   or col in by_cols 
        
               ) 
        
           ]

Because of the Grouper, we try to do the groupby by defaulting to pandas, but by that point the key column is missing from the frame, so pandas raises the KeyError.

It looks like we need to fix _drop and the _internal_by.
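
To illustrate the chain of events, here is a pandas-only sketch. The helper `numeric_or_by_columns` is a hypothetical name: it mirrors the `mask_cols` comprehension quoted above, but it is not Modin's actual code path. Assuming that `by_cols` ends up empty when `by` is a Grouper, the datetime key column is filtered out, and pandas then raises the same KeyError as in the stack trace.

```python
import pandas
from pandas.api.types import is_numeric_dtype


def numeric_or_by_columns(dtypes, by_cols):
    """Hypothetical stand-in for the mask_cols logic: keep numeric and 'by' columns."""
    return [
        col
        for col, dtype in dtypes.items()
        if (
            is_numeric_dtype(dtype)
            or (
                isinstance(dtype, pandas.CategoricalDtype)
                and is_numeric_dtype(dtype.categories.dtype)
            )
            or col in by_cols
        )
    ]


frame = pandas.DataFrame(
    {
        "id": [1, 2],
        "time_stamp": pandas.to_datetime(
            ["2022-03-24 23:53:09", "2022-03-24 21:53:09"]
        ),
        "count": [5, 5],
    }
)

# Case 1: the Grouper's key is not recognized as a 'by' column, so it is dropped.
kept_when_by_unknown = numeric_or_by_columns(frame.dtypes, by_cols=[])
try:
    frame[kept_when_by_unknown].groupby(pandas.Grouper(key="time_stamp", freq="D"))
    error_message = None
except KeyError as err:
    error_message = str(err)

# Case 2: the key is known to be a 'by' column, so it survives the mask.
kept_when_by_known = numeric_or_by_columns(frame.dtypes, by_cols=["time_stamp"])
daily_ok = (
    frame[kept_when_by_known]
    .groupby(pandas.Grouper(key="time_stamp", freq="D"))
    .mean()
    .reset_index()
)

print(kept_when_by_unknown, error_message)
print(daily_ok)
```

The second case suggests why both _drop and _internal_by matter: the mask only keeps the key column if the by columns can be deduced from the Grouper.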

Meanwhile, doing a groupby on the timestamp column without a Grouper also fails, but that's a separate issue, #5099. Fixing #5099 might be a prerequisite for fixing this bug, too.

@benaurelschill the modin contributors will try to fix this bug soon. Meanwhile, you can work around this bug by doing the entire groupby step in pandas, then converting the result to modin, e.g. as follows (for details see #896):

daily = pd.DataFrame(data_frame._to_pandas().groupby(pd.Grouper(key='time_stamp', freq='D')).mean().reset_index())
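
The workaround can also be wrapped in a small helper. This is only a sketch: `daily_mean_via_pandas` and its `to_modin` parameter are hypothetical names that are not part of Modin, and `_to_pandas()` is the same conversion method used in the one-liner above. The helper also accepts a plain pandas frame so that it can be tried without Modin installed.

```python
import pandas


def daily_mean_via_pandas(frame, key, to_modin=None):
    """Hypothetical helper: run the Grouper-based groupby entirely in pandas.

    `frame` may be a Modin frame (anything exposing `_to_pandas()`) or a
    plain pandas DataFrame. `to_modin` is an optional constructor, for
    example modin.pandas.DataFrame, used to convert the result back.
    """
    pandas_frame = frame._to_pandas() if hasattr(frame, "_to_pandas") else frame
    result = (
        pandas_frame.groupby(pandas.Grouper(key=key, freq="D")).mean().reset_index()
    )
    return to_modin(result) if to_modin is not None else result


# Demonstration with a plain pandas frame (no Modin required).
sample = pandas.DataFrame(
    {
        "id": [1, 2],
        "time_stamp": pandas.to_datetime(
            ["2022-03-24 23:53:09", "2022-03-24 21:53:09"]
        ),
        "count": [5, 5],
    }
)
print(daily_mean_via_pandas(sample, "time_stamp"))
```

With Modin, the call would presumably be `daily_mean_via_pandas(data_frame, "time_stamp", to_modin=pd.DataFrame)`, where `pd` is `modin.pandas`.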


benaurelschill added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Oct 5, 2022

benaurelschill changed the title ~~BUG: Keyerror for TimeGrouper with df.group_by()~~ BUG: GrouperNotFound for TimeGrouper with df.group_by() Oct 6, 2022

mvashishtha changed the title ~~BUG: GrouperNotFound for TimeGrouper with df.group_by()~~ BUG: KeyError for TimeGrouper with df.group_by() Oct 6, 2022

mvashishtha added P1 Important tasks that we should complete soon and removed Triage 🩹 Issues that need triage labels Oct 6, 2022

mvashishtha mentioned this issue Oct 12, 2022

Grouper on datetime working in pandas but not modin #2939

Closed

anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 19, 2023

pyrito pushed a commit to pyrito/modin that referenced this issue May 19, 2023

FIX-modin-project#5091: Handle pd.Grouper objects correctly

f39e496

Signed-off-by: Karthik Velayutham <[email protected]>

pyrito mentioned this issue May 19, 2023

FIX-#5091: Handle pd.Grouper objects correctly #6174

Merged

7 tasks

devin-petersohn closed this as completed in #6174 May 22, 2023

devin-petersohn pushed a commit that referenced this issue May 22, 2023

FIX-#5091: Handle pd.Grouper objects correctly (#6174)

681c326

Signed-off-by: Karthik Velayutham <[email protected]>

pyrito pushed a commit to ponder-org/modin-public that referenced this issue May 22, 2023

FIX-modin-project#5091: Handle pd.Grouper objects correctly (modin-pr…

161b49a

…oject#6174) Signed-off-by: Karthik Velayutham <[email protected]>
