Adding a MedianEncoder on encoding category #568

ricardordb · 2022-11-26T00:37:24Z

Hello!
I am working on a project and we found that median encoding works better for our kind of data.
So I replaced the mean function in the MeanEncoder with the median function, creating a new encoder.
I have already tested the new encoder on some data and it works perfectly.
I forked it here: feature_engine

What do you think about adding a MedianEncoder to the project?
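For readers following the thread, the idea can be sketched in plain pandas. This is only an illustration of median target encoding, not the code from the fork; the function name `fit_median_encoding` and the toy data are hypothetical.

```python
import pandas as pd


def fit_median_encoding(X, y, variable):
    """Map each category of `variable` to the median of y within that category.

    Hypothetical helper for illustration only; it is not part of feature_engine.
    """
    return y.groupby(X[variable]).median().to_dict()


# Toy data: category "a" contains one large target value (10.0).
X = pd.DataFrame({"city": ["a", "a", "a", "b", "b"]})
y = pd.Series([1.0, 2.0, 10.0, 3.0, 5.0])

median_mapping = fit_median_encoding(X, y, "city")
mean_mapping = y.groupby(X["city"]).mean().to_dict()

# Replace each category with its learned value.
X_encoded = X["city"].map(median_mapping)
```

On this toy data the median for "a" is 2.0 while the mean is about 4.33, which shows why the two encoders can differ when the target has outliers within a category.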

solegalli · 2022-11-28T09:28:43Z

Hi @ricardordb Thanks a lot for your suggestion!

@glevv what do you think about this suggestion?

My thoughts:

In principle, if we use target mean encoding, I don't see why we couldn't also use the target median. It sounds like a small difference from a statistical perspective.

In practice, I don't know how much the median encoder would improve the model performance over the mean encoder (I guess we don't have enough data on that).

I guess we could leave the performance bit to the user, but by adding a transformer to the library, we are sort of legitimizing its use. Less experienced users may think this is a mainstream encoding method, when this is probably not the case?

The MeanEncoder functionality is based on the article by Micci-Barreca, which explains the logic based on Bayes, and also the use of smoothing.

Looking at the class @ricardordb developed, it looks like the smoothing functionality should be removed. And if we were to include the class, we would have to change the docstrings substantially so as not to mislead people into thinking that the new transformer is based on the same article.

@ricardordb do you have references supporting the use of this class?
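To make the smoothing point concrete, here is a rough sketch of a smoothed mean encoding. It assumes a simple weight of n / (n + m), where n is the category count and m is a hypothetical smoothing parameter; this is a common simplification and not necessarily the exact formula used in MeanEncoder or in the article.

```python
import pandas as pd


def smoothed_mean_encoding(categories, y, m=2.0):
    """Blend each category mean with the overall (prior) mean.

    Assumption for illustration: weight = n / (n + m), with n the number of
    observations in the category. Small categories are pulled toward the prior.
    """
    prior = y.mean()
    stats = y.groupby(categories).agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + m)
    return (weight * stats["mean"] + (1 - weight) * prior).to_dict()


categories = pd.Series(["a", "a", "a", "b", "b"])
y = pd.Series([1.0, 2.0, 10.0, 3.0, 5.0])

smoothed = smoothed_mean_encoding(categories, y, m=2.0)
```

The blend relies on averaging the category statistic with a global mean, which is part of why it does not carry over directly to a median-based encoder.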

glevv · 2022-11-28T13:22:26Z

@solegalli

This is a special case of the quantile encoder:
http://contrib.scikit-learn.org/category_encoders/quantile.html

I know of this method, but I have never used it, since I saw no point in using it over target encoding.
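The "special case" relationship can be checked with a small pandas sketch: encoding each category with the 0.5 quantile of the target gives the same values as encoding with the median. This uses plain pandas only and is not the category_encoders implementation; the helper name `quantile_encoding` is hypothetical.

```python
import pandas as pd


def quantile_encoding(categories, y, q):
    """Map each category to the q-th quantile of y within that category.

    Hypothetical helper for illustration; no smoothing is applied.
    """
    return y.groupby(categories).quantile(q).to_dict()


categories = pd.Series(["a", "a", "a", "b", "b"])
y = pd.Series([1.0, 2.0, 10.0, 3.0, 5.0])

q50 = quantile_encoding(categories, y, 0.5)
medians = y.groupby(categories).median().to_dict()

# Other quantiles give different encodings of the same categories.
q25 = quantile_encoding(categories, y, 0.25)
```

With `q=0.5` the mapping matches the per-category medians, so a quantile encoder would cover the median encoder as one setting.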

solegalli · 2022-11-29T13:02:27Z

Thank you @glevv

This tells us 2 things:

First, I still have a lot to learn (lol).

And second, if we were to implement median encoding, we should probably first read the references cited for the quantile encoder in category encoders, to better understand its use and functionality. Based on that literature, we could then potentially create a quantile encoder rather than just a median encoder.

Since it exists in category encoders, I don't think this is urgent, but if someone thinks it is worth it, I would be happy to make it part of feature-engine as well.

@ricardordb have you used the quantile encoder?

ricardordb · 2022-11-29T18:16:16Z

On my side, I am doing some research and tests to check whether we can justify this component. I will now look at Quantile Encoding, as suggested by @glevv. Once I have more information, I will report my findings back to you.


-- Ricardo Rezende
