Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add missing_only=True to all imputers to use in combination with variables=None #388

Open
solegalli opened this issue Mar 14, 2022 · 13 comments · May be fixed by #698
Open

add missing_only=True to all imputers to use in combination with variables=None #388

solegalli opened this issue Mar 14, 2022 · 13 comments · May be fixed by #698

Comments

@solegalli
Copy link
Collaborator

Add missing_only functionality to all imputers to use in combination with variables=None

When variables is None, the imputers select all numerical, or categorical or all variables by default. With the missing_only, it would select only those from each subgroup that show missing data during fit.

@solegalli
Copy link
Collaborator Author

As part of this change, ensure missing_only only works when variables is None in the AddMissingIndicator and DropMissingData classes.

At the moment, the transformers will also check the users variables.

@AMM53
Copy link

AMM53 commented Jul 12, 2022

Will be having a look at it later.

@Morgan-Sell
Copy link
Collaborator

@solegalli,

Has anyone worked on this issue? I don't see a PR.

Should we introduce missing_only as param for BaseImputer()?

@solegalli
Copy link
Collaborator Author

It's all yours if you want it @Morgan-Sell :)

The BaseImputer() looks like the right place. I look forward to the PR!

@Morgan-Sell
Copy link
Collaborator

Morgan-Sell commented Sep 18, 2023

Sounds good, @solegalli.

To confirm the intention, this functionality will review all the values for each variable. It will then return only the variables that have missing values. For example (imagine the dictionaries are dataframes):

data = {
var_A: [1, 2, 3, 4, 5],
var_B: ['a', NaN, 'b', NaN, NaN],
var_C: [1.3, 3.0, 4.0, 8.8, NaN,
var_D: ['z', 'y', 'x', 'w', 'v'],
}

If variables=None and missing_only=True, then which of the following would be returned after fit():

Option A:

data_tr = {
var_B: ['a', NaN, 'b', NaN, NaN],
var_C: [1.3, 3.0, 4.0, 8.8, NaN],
}

Option B:

data_tr = {
var_B: [NaN, NaN NaN,
var_C: [3.0, 8.8, NaN],
}

Or, is it a different option?

I'm assuming Option A, but I want to confirm ;)

@solegalli
Copy link
Collaborator Author

Option A.

But note that all fit() methods return self.

What needs to happen in fit() when variables=None and missing_only=True is that self.variables_=[var_B, var_C].

There are a few things to consider:

  • we can add this parameter in the init of BaseImputer
  • but we need to add the functionality individually perhaps in each imputer (don't remember the code base, maybe we have some generic for numerical imputers?)
  • numerical imputers will capture all numerical variables when variables=None, or all numerical variables with nan when variable=None and missing_only=True
  • the categorical imputer however, will capture all categorical variables when variables=None and all categorical with NAN when variable=None and missing_only=True

@Morgan-Sell
Copy link
Collaborator

Yeah, I see what you're saying about bullet #2.

Since self.variables_ is created at each individual imputer, we must develop the functionality for each imputer.

However, at a minimum, can we develop the _get_numerical_vars_with_missing_values() method at the BaseImputer?

It feels good to be back ;)

@solegalli
Copy link
Collaborator Author

Maybe we create the function you mention in the variable handling module: https://github.com/feature-engine/feature_engine/tree/main/feature_engine/variable_handling

Because I think the categorical imputer also uses the base imputer, and then it has a method that is irrelevant.

@Morgan-Sell
Copy link
Collaborator

Why doesn't this apply to categorical variables?

@solegalli
Copy link
Collaborator Author

the function you mention is to select numerical variables. You need a different function for categoricals

@Morgan-Sell
Copy link
Collaborator

Yeah, I don't understand why the function cannot look for missing values in all variables. What am I missing?

@Morgan-Sell
Copy link
Collaborator

Because I think the categorical imputer also uses the base imputer, and then it has a method that is irrelevant.

Does this also mean that we cannot introduce missing_only at the BaseImputer() level? Or, is it ok for a transformer not to use all the init params from its parent class?

@solegalli
Copy link
Collaborator Author

you gave the function this name: _get_numerical_vars_with_missing_values(), so I thought it was meant to scan numerical variables only.

I trust your judgement. Tag me when you want me to have a look :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants