New feature: Lag or windows features grouped by #727

Ezzaldin97 · 2024-02-25T21:40:43Z

Create a New feature for the #668 issue

create a feature for lag features to create lag features based on a list or variable.
create a feature for window features to create rolling window features based on a list or variable.
create a feature for expanding window features to create expanding window features based on a list or variable.
fix type-hint error in drop_psi_features.

codecov · 2024-02-25T22:07:23Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.33%. Comparing base (1a090ba) to head (9d999b0).
Report is 4 commits behind head on main.

❗ Current head 9d999b0 differs from pull request most recent head 09db782. Consider uploading reports for the commit 09db782 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #727      +/-   ##
==========================================
+ Coverage   98.31%   98.33%   +0.01%     
==========================================
  Files         103      103              
  Lines        3928     3965      +37     
  Branches      764      774      +10     
==========================================
+ Hits         3862     3899      +37     
  Misses         23       23              
  Partials       43       43

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

feature_engine/selection/drop_psi_features.py

feature_engine/timeseries/forecasting/base_forecast_transformers.py

VascoSch92 · 2024-02-26T05:44:14Z

feature_engine/timeseries/forecasting/base_forecast_transformers.py

-    _check_X_matches_training_df,
-    check_X,
-)
+from feature_engine.dataframe_checks import (_check_contains_inf,


I thought that black, isort and flack8 will automatically adjust the formatting and code style, because no problems appeared during development, and CI

I'm not sure that this is the case...

On circleCI we have automatic checks for formatting, but it does not fix it automatically. You need to isort and black your files before pushing them for the tests to pass.

I am going to resolve it, and push again.

VascoSch92 · 2024-02-26T05:56:29Z

feature_engine/timeseries/forecasting/base_forecast_transformers.py

+            if isinstance(group_by_variables, list):
+                if len(set(group_by_variables)) != len(group_by_variables):
+                    raise ValueError("group_by_variables contains duplicate values")
+


To improve the code I will do the following: after the checks on group_by_variables, I will put this variable into a list if the variable is just a string. In this way from now on you know that you are working with a list of string and you don't need to do other checks (like for example the one at line 180).

You could do something like

if group_by_variables: if isinstance(group_by_variables, str): self.group_by_variables = [group_by_variables] elif not ( isinstance(group_by_variables, list) and all(isinstance(element, str) for element in group_by_variables) ): raise ValueError( "group_by_variables must be an string or a list of strings. " f"Got {group_by_variables} instead." ) else: # note that if you are here, then group_by_variables is a list if len(set(group_by_variables)) != len(group_by_variables): raise ValueError("group_by_variables contains duplicate

Unfortunately, for consistency with the sklearn api we cannot modify the input parameters. We need to leave them as they are.

solegalli · 2024-02-28T08:30:11Z

feature_engine/timeseries/forecasting/base_forecast_transformers.py

@@ -51,6 +42,9 @@ class BaseForecastTransformer(BaseEstimator, TransformerMixin, GetFeatureNamesOu

    {drop_original}

+    group_by_variables: str, list of str, default=None


I'd call this method group_by to keep it similar to pandas method.

Also, we do allow variable names to be integers, so in theory it could also take int and list of ints.

yes, sure I am going to rename this parameter, and also allow to pass it as integer or list of integers, and use check_variables methods to check if it exists in dataframe, thank you.

solegalli · 2024-02-28T08:32:18Z

feature_engine/timeseries/forecasting/base_forecast_transformers.py

+            # check if passed list has duplicates.
+            if isinstance(group_by_variables, list):
+                if len(set(group_by_variables)) != len(group_by_variables):
+                    raise ValueError("group_by_variables contains duplicate values")


looks like tests for these 2 error catches are missing? Haven't gotten to the end of the PR so forgive me if I am wrong.

I am going to add some test cases for them.

solegalli · 2024-02-28T08:36:27Z

feature_engine/timeseries/forecasting/base_forecast_transformers.py

@@ -165,6 +177,18 @@ def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
        if self.missing_values == "raise":
            self._check_na_and_inf(X)

+        if self.group_by_variables:


Do we need to add this functionality to feature-engine? Pandas will fail if the variables are not in the dataframe.

@solegalli Yes, That's True Pandas will fail because of that, I am going to remove it, and push again.

solegalli · 2024-02-28T08:37:40Z

feature_engine/timeseries/forecasting/base_forecast_transformers.py

@@ -165,6 +177,18 @@ def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
        if self.missing_values == "raise":
            self._check_na_and_inf(X)

+        if self.group_by_variables:


Another thing that comes before this, in line 168, we are explicity forbidding duplicate values in the index, but if we allow groupby, we need to think what we do with this check

@solegalli I think no duplicates will be in resultant dataframe, because we are creating the features for every group like the example below:

>>> import pandas as pd >>> from feature_engine.timeseries.forecasting import ExpandingWindowFeatures >>> X = pd.DataFrame(dict(date = ["2022-09-18", >>> "2022-09-19", >>> "2022-09-20", >>> "2022-09-21", >>> "2022-09-22", >>> "2022-09-18", >>> "2022-09-19", >>> "2022-09-20", >>> "2022-09-21", >>> "2022-09-22"], >>> x1 = [1,2,3,4,5, 3,5,6,8,11], >>> x2 = [6,7,8,9,10, 2,9,10,15,2], >>> x3=['a','a','a','a','a', 'b','b','b','b','b'] >>> )) >>> ewf = ExpandingWindowFeatures(group_by_variables='x3') >>> ewf.fit_transform(X)

the result is:

date x1 x2 x3 x1_expanding_mean x2_expanding_mean 0 2022-09-18 1 6 a NaN NaN 1 2022-09-19 2 7 a 1.000000 6.0 2 2022-09-20 3 8 a 1.500000 6.5 3 2022-09-21 4 9 a 2.000000 7.0 4 2022-09-22 5 10 a 2.500000 7.5 5 2022-09-18 3 2 b NaN NaN 6 2022-09-19 5 9 b 3.000000 2.0 7 2022-09-20 6 10 b 4.000000 5.5 8 2022-09-21 8 15 b 4.666667 7.0 9 2022-09-22 11 2 b 5.500000 9.0

solegalli · 2024-02-28T08:43:44Z

feature_engine/timeseries/forecasting/expanding_window_features.py

+            returned expanding window features
+        """
+        tmp_data = []
+        for _, group in grouped_df:


We do we need to loop?

Are we creating a grouped df for every variable passed to group_by_variables?

And is this the desired functionality? For time series forecasting, would we not have all ts in 1 col and then we would group by one or more variables that identify the ts, but we would not create many groups?

When would we need to create many groups?

let me explain what I need to do here, the reason behind adding group_by_variables to time series transformers is because of this issue #668 , when we need to create some lags, rolling window, or expanding window features based on a set of groups.
the above code loop over the set of groups to create the features for every group then concatenate them, and sort by index to return the dataframe to its original
let me explain it in the following code

X = pd.DataFrame(dict(date = ["2022-09-18", "2022-09-19", "2022-09-20", "2022-09-21", "2022-09-22", "2022-09-18", "2022-09-19", "2022-09-20", "2022-09-21", "2022-09-22"], x1 = [1,2,3,4,5, 3,5,6,8,11], x2 = [6,7,8,9,10, 2,9,10,15,2], x3=['a','a','a','a','a', 'b','b','b','b','b'], x4=['c','c','c','w','w','c','c','w','w','w'] )) X_grouped = X.groupby(['x3', 'x4']) for _, group in X_grouped: print(group)

the result is the dataframes of every group of ('x3', 'x4')

date x1 x2 x3 x4 0 2022-09-18 1 6 a c 1 2022-09-19 2 7 a c 2 2022-09-20 3 8 a c date x1 x2 x3 x4 3 2022-09-21 4 9 a w 4 2022-09-22 5 10 a w date x1 x2 x3 x4 5 2022-09-18 3 2 b c 6 2022-09-19 5 9 b c date x1 x2 x3 x4 7 2022-09-20 6 10 b w 8 2022-09-21 8 15 b w 9 2022-09-22 11 2 b w

I see. Thank you for the explanation. Pandas should apply shift and rolling and expanding to the groups out of the box, there is no need to loop, as far as I understand. See for example these resources: https://www.statology.org/pandas-lag-by-group/

feature_engine/timeseries/forecasting/expanding_window_features.py

solegalli

Hey @Ezzaldin97

Thank you so much for the enormous contribution!

Apologies for the delayed review. There are a few things that I don't understand regarding the logic we are trying to implement.

this transformer is designed for forecasting. In forecasting we normally have all time series in one column and then we have one or more columns that identify them with an id or a combination of characteristics for example store number and product number.

So I thought we'd group one either by one or many variables. But the loop threw me off. maybe there is something that I am not understanding?

Or maybe you see a use case for multiple grouping? in which case we'd need to allow both functionalities to coexist,

Thank you for your patience and support!

Ezzaldin97 · 2024-03-18T21:02:14Z

hi @solegalli,
I am looking forward to your feedback and any suggestions you may have to improve the pull request. Thank you in advance for your time and assistance.

solegalli

Hi @Ezzaldin97

Thank you so much for the amount of work that went into implementing this functionality. I am really sorry for the delay in taking a look at the proposed changes. I know it is frustrating when you put a lot of work towards something, and it is not acknowledged. I apologize. I had a very tight agenda in the last 2 months, which means that I couldn't really catch up with what was going on with Feature-engine.

I think we are aligned with the idea of the functionality that we need to add to these classes. My first impression is that the logic should be much simpler, because pandas can groupby, and then apply shift and rolling to the groups directly, without us having to loop over the groups.

It would be great if you could take a look and try to simplify the logic of the classes and outsource everything to pandas.

Let me know if this makes sense. Thank you so much for your effort!

solegalli · 2024-03-30T17:57:46Z

feature_engine/selection/drop_psi_features.py

@@ -475,7 +475,7 @@ def fit(self, X: pd.DataFrame, y: pd.Series = None):
                threshold_cat = self.threshold

        # Compute the PSI by looping over the features
-        self.psi_values_ = {}
+        self.psi_values_: Dict = {}


We resolved this in a different PR. Could we remove this change from here please?

solegalli · 2024-03-30T17:59:54Z

feature_engine/timeseries/forecasting/base_forecast_transformers.py

@@ -51,6 +51,9 @@ class BaseForecastTransformer(BaseEstimator, TransformerMixin, GetFeatureNamesOu

    {drop_original}

+    group_by: str, str, int, or list of strings or integers, default=None


we are using pandas groupby under the hood, so the docs here should probably be identical or just a summary of what we see here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html and then refer the user to pandas groupby's documentation for more details

I have updated it using a summary of pandas groupby

solegalli · 2024-03-30T18:00:34Z

feature_engine/timeseries/forecasting/base_forecast_transformers.py

@@ -81,6 +85,7 @@ def __init__(
        self.variables = _check_variables_input_value(variables)
        self.missing_values = missing_values
        self.drop_original = drop_original
+        self.group_by = _check_variables_input_value(group_by)


if we defer the functionality to pandas, then we don't need this check. We just assign and let pandas handle the rest.

That's right, pandas will handle it, Thanks for your help 🙏

solegalli · 2024-03-30T18:02:02Z

feature_engine/timeseries/forecasting/expanding_window_features.py

@@ -93,6 +93,9 @@ class ExpandingWindowFeatures(BaseForecastTransformer):

    {drop_original}

+    group_by: str, str, int, or list of strings or integers, default=None


if we are repeating the same string over and over, instead of writing it multiple times, we'd create a single text in the _docstrings module, and import it instead. Like we do with fit_transform for example.

added to _docstring, Thanks for the clarification.

solegalli · 2024-03-30T18:02:41Z

feature_engine/timeseries/forecasting/expanding_window_features.py

@@ -139,6 +142,36 @@ class ExpandingWindowFeatures(BaseForecastTransformer):
    2  2022-09-20   3   8                1.5                6.5
    3  2022-09-21   4   9                2.0                7.0
    4  2022-09-22   5  10                2.5                7.5
+    create expanding window features based on other variables.


the example in the class' docstrings is just meant for the user to "copy and paste" a simple example, not a full blown demo. For that we have the user guide. Could we please keep the original example?

solegalli · 2024-03-30T18:09:55Z

feature_engine/timeseries/forecasting/expanding_window_features.py

+        """
+        tmp_data = []
+        for _, group in grouped_df:
+            tmp = (


I don't think we need to loop over each group. Pandas does that under the hood if I recall correctly. So we'd just add groupby before .expanding. Check these resources:

https://www.statology.org/pandas-lag-by-group/
https://stackoverflow.com/questions/37231844/pandas-creating-a-lagged-column-with-grouped-data

I found a simple way to perform the group_by operation to calculate expanding window features using the .apply() method in pandas

solegalli · 2024-03-30T18:10:11Z

feature_engine/timeseries/forecasting/lag_features.py

@@ -117,6 +120,26 @@ class LagFeatures(BaseForecastTransformer):
    2  2022-09-20   3   8       2.0       7.0       1.0       6.0
    3  2022-09-21   4   9       3.0       8.0       2.0       7.0
    4  2022-09-22   5  10       4.0       9.0       3.0       8.0
+    create lags based on other variables.
+    >>> import pandas as pd


Could we please keep the original example? Demos go in the user-guide :)

solegalli · 2024-03-30T18:11:35Z

feature_engine/timeseries/forecasting/lag_features.py

+            lag feature or dataframe of lag features
+        """
+        tmp_data = []
+        for _, group in grouped_df:


Do we need to loop over the groups to apply the lags? pandas does the lags per group automatically.

I tried many approaches to simplify this approach, but it is only working when using periods argument with shift() method like the line in 231
, however when using freq argument with shift() method it doesn't work, so I used loop to make it work.
kindly advice if we can simplify it.

solegalli · 2024-03-30T18:15:23Z

feature_engine/timeseries/forecasting/expanding_window_features.py

+            returned expanding window features
+        """
+        tmp_data = []
+        for _, group in grouped_df:


I see. Thank you for the explanation. Pandas should apply shift and rolling and expanding to the groups out of the box, there is no need to loop, as far as I understand. See for example these resources: https://www.statology.org/pandas-lag-by-group/

Ezzaldin97 · 2024-04-02T01:10:20Z

Hi @solegalli
I have updated, and pushed the latest changes according to your last instructions, but there is still no enhancement in lag_features.py file can be done, I have checked many solution to remove the loop when working with freq argument, but no one get the expected output.
the unit test of CircleCI unexpectedly failed, I don't know why, but all tests passed on local

please advice to resolve this issue

KananMahammadli · 2024-08-30T13:47:59Z

@solegalli @Ezzaldin97 if it is ok with you, I would like to implement lag and window feature updates. I wanted to first confirm the design of the implementation before starting. When working with multiple time series, we often have covariates in time-series order, some having future values available (such as weather features), some not; we also use these lag and window features for a target. Therefore, for each variable, one needs to apply different window sizes, and also, for each window, different aggregation methods become important. I suggest using the define method: window_size(s) pairs as a dictionary for each variable and providing them as a final dictionary.
Ex:

Python

feature_method_window_size_dict = {
   "feature1": {
             "mean": 8,
             "std": [8, 9, 10],
            },
   "feature2": {
             "sum": [2, 6],
             "max":  [3, 5, 7],
            },
}

after agreeing on the design, I can start on this feature.

solegalli · 2024-08-30T17:54:39Z

Hey @KananMahammadli

I do agree with you that we'd use different windows and different operations for different features, and allowing the split with the dictionary will be handy for the most advanced user. For less experienced users, the more the class can resolve by itself the better, and in this sense, applying the same window and same functions to all features is sort of the simplest.

This PR is about allowing the transformers to create and add the lags and windows grouped by a third variable or variables that would, generally indicate some grouping, like name of the product, or land of sale, or this kind of thing.

A different PR could be to extend the functionality of the class so that it offers more granularity for the advanced user in the method that you just described. And if we do so, we need to do it in a way that does not break backward compatibility whenever possible.

I'd prefer if we could add the grouping function first and the granularity later. But this is open source, so we welcome all help, if you want to start otherwise, you are free to do it :)

KananMahammadli · 2024-08-30T18:31:20Z

hi @solegalli,

Sorry for not being clear enough; since this PR doesn't move forward, I thought of handling groupby in a separate PR (because finding why implementation here fails and correcting the bugs is time-consuming, modifying the current version of the feature engine to support groups is a better option). While adding the groupby, I also considered adding the flexibility of variables. I agree that it can confuse some users. Therefore I would like to get your opinion; we can do it in two ways I think:

separate classes for fixed and flexible variable usage
in this setup, I will first add groupby feature to the current fixed variable method, then create a separate class for a flexible method that also supports the groupby
single class with enabler
here we can add enable_flexible_variables, by default this is False, and in that case, we just use the variables argument, and it becomes the same as the current logic, but if given True, then we require mapping to be provided and check if the provided dictionary aligns with the format requirement

solegalli · 2024-08-31T10:03:42Z

Hi @KananMahammadli

Thanks for the detail.

Feel free to start a new PR for the groupby function. Please add a note to mention that it superseeds this PR so we can keep track :)

I think single class is easier to maintain and it avoids boilerplate. So I'd suggest sticking to one class, unless the code base is very different, in which case, it could be justified. For end users I think it gets confusing if we have 2 classes that do more or less the same.

If we add the new functionality in one class, it helps if we keep the addition of additional parameters to the minimum. Here for example:

feature_engine/feature_engine/imputation/arbitrary_number.py

Line 66 in b7133e8

imputer_dict: dict, default=None

If the user enters an imputer_dict, that superseeds the parameter arbitrary value, and we make it clear in the docstrings, so that the user only needs to change one parameter for the intended functionality.

The classes for lags and windows, as they stand today, they have the same parameters that pandas lag and rolling and shift. So I'd probably would not change that, because the users, or at least some of them, are familiar with those parameters already. So it is easy to jump from pandas to feature-engine.

I would add an extra parameter called, I don't know lag_dictionary, or, something informative (can't think of anything now), and when that parameter is not None, then we have the special functionality that you mention.

It would be easier to review, if we have 1 pr for each change. Thanks a lot for the initiative.

Ezzaldin97 added 9 commits February 23, 2024 23:03

add group by variables to base forecast transformer

4d653a9

add group by variables to lag_features

4e9d849

add group by window features

7f40391

add group by expanding window features

b476748

add test cases of groupby timeseries features

02c59bd

ensure code style tests

0dd92cc

fixing typehint errors

47de2d6

fixing docs indentation issue

dd43c27

fixing docs indentation issue in lag_features

7459811

VascoSch92 reviewed Feb 26, 2024

View reviewed changes

solegalli reviewed Feb 28, 2024

View reviewed changes

feature_engine/timeseries/forecasting/expanding_window_features.py Show resolved Hide resolved

solegalli reviewed Feb 28, 2024

View reviewed changes

Ezzaldin97 added 6 commits February 29, 2024 17:42

adjust formatting and code style in tests

12aa825

refactoring timeseries & reformatting the code

c3bee66

adjust code formatting & style in tests

67725dc

fix create lag features using groupby & freq parameters

9cb01ea

adjust code style

72ce43c

add test cases to ensure code coverage

9d999b0

solegalli reviewed Mar 30, 2024

View reviewed changes

Ezzaldin97 added 4 commits April 1, 2024 17:02

add group_by docstring to _docstring

b7b8bc9

remove check input of group_by

ba375a4

enhance performance of group_by window features operations

90f08f4

enhance performance of group_by expanding window features operations

66baa75

Ezzaldin97 added 5 commits April 1, 2024 21:08

fix reindexing to original index after grouping bug

92f996d

fix reindexing to original index after grouping operation bug

152c037

replacing group_by docstring with group_by_docstring

5343e50

adjust code-style and formatting

ef1eaa8

remove white spaces

09db782

		@@ -51,6 +42,9 @@ class BaseForecastTransformer(BaseEstimator, TransformerMixin, GetFeatureNamesOu

		{drop_original}

		group_by_variables: str, list of str, default=None

		@@ -51,6 +51,9 @@ class BaseForecastTransformer(BaseEstimator, TransformerMixin, GetFeatureNamesOu

		{drop_original}

		group_by: str, str, int, or list of strings or integers, default=None

		@@ -93,6 +93,9 @@ class ExpandingWindowFeatures(BaseForecastTransformer):

		{drop_original}

		group_by: str, str, int, or list of strings or integers, default=None

New feature: Lag or windows features grouped by #727

Are you sure you want to change the base?

New feature: Lag or windows features grouped by #727

Conversation

Ezzaldin97 commented Feb 25, 2024

codecov bot commented Feb 25, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

solegalli Feb 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

solegalli left a comment

Choose a reason for hiding this comment

Ezzaldin97 commented Mar 18, 2024

solegalli left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Ezzaldin97 commented Apr 2, 2024

KananMahammadli commented Aug 30, 2024

solegalli commented Aug 30, 2024

KananMahammadli commented Aug 30, 2024 • edited Loading

solegalli commented Aug 31, 2024

codecov bot commented Feb 25, 2024 •

edited

Loading

solegalli Feb 28, 2024 •

edited

Loading

KananMahammadli commented Aug 30, 2024 •

edited

Loading