Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OHE: User can select which categories to encoded for selected variables #667

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

Morgan-Sell
Copy link
Collaborator

Closes #303.

Objective:
Create functionality so a user can encode certain categories that may not be the most frequent. The functionality is explained in this thread.

Will create a new init param called custom_categories that accepts a dictionary with selected features as keys and lists of the desired categories as values.

Both top_categories and user_categories cannot be used at the same time.

@Morgan-Sell
Copy link
Collaborator Author

hi @solegalli,

I think we're getting close on this one.

Are there any other unit tests that we should implement? According to the coverage report, OHE has 100% coverage.

I'll dig into the type error.

@Morgan-Sell
Copy link
Collaborator Author

Disregard my comment about code coverage. It seems that the transformer is very poorly "covered". I'll work on this.

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Morgan-Sell

I had a quick look at the logic of the new functionality. There are some things that we need to give more thought to. Would be great it you could have a look and tell me what you think?

Thank you!

feature_engine/encoding/one_hot.py Outdated Show resolved Hide resolved
The dicitonary values are lists of the categories for each selected variable
that are to be one-hot encoded.

If `custom_categories` is being used, `top_categories` must equal None.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things to think:

  • I think custom_categories should override top_categories. So no matter what the user enters, custom_categories takes precedence. It is easier to use like this imo. So here we should say "when custom_categories is entered, top_categories will be ignored.

The other thing to consider is, how do we want custom_categories to interplay with variables. Are we going to allow the user to specify categories for 3 variables for example, while encoding 6? What did you have in mind for this?

For examples vars a,b,c,d,e are categorical, if variables=None, all will be encoded. If user passes a dictionary with keys, b and c, will variables a,d and e also be encoded? or are we making custom_categories take precedence over variables?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, Dr. Che, you never let me get away easy. No wonder I enjoy working w/ you ;)

I'm going to suggest that variables is superior. I envision scenarios in which users would like to encode all variables but would like to select which categories are encoded in certain variables.

Consequently, this transformer will require a check when variables does not equal None and the user utilizes custom_categories. The transformer should confirm that the variables included in custom_categories are a subset of the variables inputted into the variables param.

What do you think? ;)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

drop_last: bool = False,
drop_last_binary: bool = False,
variables: Union[None, int, str, List[Union[str, int]]] = None,
variables: Union[
None, int, str, List[Union[str, int]], Iterable[Union[str, int]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you add iterable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before refactoring the code, MyPy raised the following errors:

feature_engine/encoding/one_hot.py:221: error: Argument 1 to "sorted" has incompatible type "Union[int, str, List[Union[str, int]], None]"; expected "Iterable[Union[str, int]]"
feature_engine/encoding/one_hot.py:223: error: Argument 1 to "sorted" has incompatible type "Union[int, str, List[Union[str, int]], None]"; expected "Iterable[Union[str, int]]"

feature_engine/encoding/one_hot.py Outdated Show resolved Hide resolved
f"its values. Got {custom_categories} instead."
)

# check that custom_categories variable match variables
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if custom_categories should exactly match variables, I think we should just make this parameter take precendence over variables, and ignore variables altogether.

The alternative is, if we allow the user to specify None or a group of variables in variables, while custom_categories contains only a subset of these. Could be practical for the user, but a headache for us. Need to think / have a look at how much code we should change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reading this comment after responding to your prior question.

I agree that it will be a headache for us and useful for the user. I'll think about this more too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright then, let me know if you find an elegant solution :)

"""
for var, categories in self.custom_categories.items():
unique_values = set(X[var].unique())
if not set(categories).issubset(unique_values):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if, the user wants a dummy for a category that is not present in the train set? should we contemplate this or is too unlikely?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When someone is OHE-ing a category that is not present in the training set, are they putting it into a "catch-all" bucket, e.g. unlearned_categories? Is this common?

I think it's unlikely and defeats the purpose of the custom_categories param. I see custom_categories being used to protect against unlearned categories.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is an example:

I work in a car insurance company. My historical data has 3 types of fuel: gas, petrol, diesel. But, I know that electric vehicles are coming. So I want an additional category: electric.

In the above scenario, it might help if we allow the additional bespoke category.

On the other hand, a model can't be train in a category that does not exist, so the results would be unreliable. Maybe we don't allow them for now?

@Morgan-Sell
Copy link
Collaborator Author

Hi @solegalli,

When you're back from vacay, lmk what you think.

Also, I don't know why I said that OHE has a low code coverage score. I just ran the coverage report - the transformer received 100% coverage!

@solegalli
Copy link
Collaborator

Hey @Morgan-Sell

I disappeared for a while, so I am a bit lost. I guess you were waiting for my input here, no? You let me know when I need to take another look? I am away from next Thursday, for a month, ehem, (red face)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OHE: allow encoding of specific, user desired categories
2 participants