Guide train_test_split - improvement B - indepth analysis #498
Labels
enhancement
New feature or request
needs-triage
This has been recently submitted and needs attention
Is your feature request related to a problem? Please describe.
As a data scientist, I want to be guided in the choice of arguments to scikit-learn's train_test_split function, without receiving so many warnings that they exceed my cognitive budget (say, two warnings at most).
This is a follow-up to issue #492.
Regarding warning 7 (drift), we want a way to dig further.
Describe the solution you'd like
As input, we have a p-value indicating drift, together with the feature importances. We know that high-cardinality features are biased toward inflated feature importance in random forests. To be really clean, we should compute feature importance on a test set (this is not what scikit-learn's default importances do today, and the limitation is a known one).
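One way to get importances that are not inflated by high-cardinality features is permutation importance evaluated on the held-out test set. This is a sketch using scikit-learn's existing `permutation_importance` helper on toy data; the dataset and parameters are illustrative, not part of the proposed design:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy data; shapes and hyperparameters are illustrative.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance computed on the held-out test set, which avoids
# the high-cardinality bias of impurity-based feature_importances_.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```

Unlike `rf.feature_importances_`, this measures how much test-set score drops when each column is shuffled, so it reflects generalization rather than training-set splits.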
We would like to iteratively remove the features that are known to drift (or that have already been analyzed), and check whether drift persists among the remaining features.
The design of the interaction still has to be drawn.
Describe alternatives you've considered, if relevant
No response
Additional context
No response