Methodological mistake in the "Feature Extraction and Selection" Jupyter Notebook #1020
SilverSolver
started this conversation in
General
Replies: 1 comment
-
Great! Thanks @SilverSolver!
-
Hi.
First of all, I would like to thank the creators and contributors of the tsfresh library for their work. I'm not experienced with time series in ML (I usually work in the NLP domain), but I have found this library helpful and easy to use. The documentation looks quite easy to follow, even for a specialist from another ML domain. However, I noticed a methodological problem in the "01 Feature Extraction and Selection" notebook ( https://github.com/blue-yonder/tsfresh/blob/main/notebooks/01%20Feature%20Extraction%20and%20Selection.ipynb ).
Here, you filter features with a statistical test before creating the train-test split, which contradicts the "do not look at the test set" principle. The test uses the target values of all samples, including those that later end up in the test set, so subtle information about the test set can leak into training. This makes the evaluation of your model less reliable, typically over-optimistic. You can read a bit more about this and similar problems in the following great guide: https://arxiv.org/pdf/2108.02497.pdf (see "Other common examples of information leakage are carrying out feature selection before partitioning the data").
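So the correct order is to do the split first and only then apply the feature selection procedure.
I.e. instead of selecting features on the full dataset and splitting afterwards,
do the split, select features on the training set only, and keep the same columns in the test set: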
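A minimal, library-agnostic sketch of the split-first order, using only the Python standard library. This is not the notebook's code: `train_test_split_rows`, `select_by_mean_gap` and `split_then_select` are hypothetical helpers, and the mean-gap selector is only a stand-in for a real statistical test. With tsfresh, the selector role would presumably be played by `select_features` called on the training rows and training labels only.

```python
import random


def train_test_split_rows(rows, labels, test_fraction=0.4, seed=0):
    """Shuffle the indices once and split rows/labels into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(seq, idx):
        return [seq[i] for i in idx]

    return (take(rows, train_idx), take(rows, test_idx),
            take(labels, train_idx), take(labels, test_idx))


def select_by_mean_gap(rows, labels, min_gap=1.0):
    """Hypothetical stand-in selector (assumption, not a tsfresh API):
    keep features whose class means differ by at least min_gap."""
    selected = []
    for name in rows[0]:
        pos = [r[name] for r, y in zip(rows, labels) if y == 1]
        neg = [r[name] for r, y in zip(rows, labels) if y == 0]
        if pos and neg:
            gap = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
            if gap >= min_gap:
                selected.append(name)
    return selected


def split_then_select(rows, labels, selector, test_fraction=0.4, seed=0):
    # 1. Split first, so the test rows are set aside before any selection.
    X_train, X_test, y_train, y_test = train_test_split_rows(
        rows, labels, test_fraction, seed)
    # 2. Choose features from the training rows and training labels only.
    kept = selector(X_train, y_train)

    # 3. Keep the same columns in both parts; test labels are never used.
    def project(part):
        return [{name: row[name] for name in kept} for row in part]

    return project(X_train), project(X_test), y_train, y_test, kept


# Toy data: "signal" tracks the label, "noise" does not.
rng = random.Random(42)
labels = [i % 2 for i in range(40)]
rows = [{"signal": 3.0 * y + rng.random(), "noise": rng.random()}
        for y in labels]

X_train, X_test, y_train, y_test, kept = split_then_select(
    rows, labels, select_by_mean_gap)
print(kept)
```

The important part is the order of steps 1 to 3: the selector never sees the test rows or their labels, and the test set is only reduced to the columns that were chosen on the training set.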
I hope this is helpful and helps make the documentation better. <3