Methodological mistake in the "Feature Extraction and Selection" Jupyter Notebook #1020

SilverSolver · 2023-05-05T10:04:49Z

Hi.
First of all, I would like to thank the creators and contributors of the tsfresh library for their work. I'm not experienced with time series in ML (I usually work in the NLP domain), but I have found this library helpful and easy to use. The documentation looks quite easy to follow, even for a specialist from another ML domain. However, I noticed a methodological problem in the "01 Feature Extraction and Selection" notebook ( https://github.com/blue-yonder/tsfresh/blob/main/notebooks/01%20Feature%20Extraction%20and%20Selection.ipynb ).

Here, you filter features using a statistical test before creating the train-test split, which contradicts the "do not look at the test set" principle. Doing this may leak subtle information about the test set into the training set, which makes the test score an overly optimistic estimate of your model's quality. You can read more about this and similar problems in the following great guide: https://arxiv.org/pdf/2108.02497.pdf (see "Other common examples of information leakage are carrying out feature selection before partitioning the data").

So, the correct approach is to do the split first and only then apply the feature selection procedure, using the training data alone.
I.e., instead of this:

X_filtered = select_features(X, y)
X_full_train, X_full_test, y_train, y_test = train_test_split(X, y, test_size=.4)
X_filtered_train, X_filtered_test = X_full_train[X_filtered.columns], X_full_test[X_filtered.columns]

do something like this:

X_full_train, X_full_test, y_train, y_test = train_test_split(X, y, test_size=.4)
X_filtered_train = select_features(X_full_train, y_train)
X_filtered_test = X_full_test[X_filtered_train.columns]

I hope this info is helpful and will help to make the documentation better. <3

nils-braun · 2023-05-16T21:23:06Z

Maintainer

Great! Thanks @SilverSolver!
Would you like to do a quick PR to fix this?
Thanks for spotting!
