Adding a MedianEncoder on encoding category #568
Comments
Hi @ricardordb, thanks a lot for your suggestion!
@glevv what do you think about this suggestion?
My thoughts:
In principle, if we use target mean encoding, I don't see why we should not also use target median. It sounds like a small difference from a statistical perspective.
In practice, I don't know how much the median encoder would improve model performance over the mean encoder (I guess we don't have enough data on that).
I guess we could leave the performance question to the user, but by adding a transformer to the library, we are in a sense legitimizing its use. Less experienced users may think this is a mainstream encoding method, when this is probably not the case?
The MeanEncoder functionality is based on the article by Micci-Barreca, which explains the logic in Bayesian terms and also the use of smoothing. Looking at the class @ricardordb developed, it looks like the smoothing functionality should be removed, and, if we were to include the class, we would have to change the docstrings substantially so as not to mislead people into thinking that the new transformer is based on the same article.
@ricardordb do you have references supporting the use of this class? |
This is a special case of the quantile encoder. I know of this method but have not used it, since I saw no point in using it over target encoding. |
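To make the "special case" remark concrete, here is a minimal sketch, assuming plain pandas and an unsmoothed per-category quantile. The function name `fit_quantile_encoding` and the toy data are hypothetical, and this is not the implementation used in category encoders or feature-engine. With `q=0.5` the learned mapping coincides with the per-category median.

```python
import pandas as pd


def fit_quantile_encoding(X, y, variable, q=0.5):
    """Learn a category -> q-th quantile of the target mapping (no smoothing).

    Hypothetical helper for illustration only.
    """
    return y.groupby(X[variable]).quantile(q).to_dict()


# Toy data (made up for this sketch).
X = pd.DataFrame({"color": ["r", "r", "r", "r", "g", "g", "g"]})
y = pd.Series([1.0, 2.0, 3.0, 10.0, 5.0, 7.0, 9.0])

# q=0.5 reproduces the per-category median.
quantile_map = fit_quantile_encoding(X, y, "color", q=0.5)
median_map = y.groupby(X["color"]).median().to_dict()

# Other quantiles give a different encoding of the same column.
upper_map = fit_quantile_encoding(X, y, "color", q=0.75)
```

A general quantile encoder would therefore cover the median case through a single parameter, which is the argument for implementing the broader transformer rather than a median-only one.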
Thank you @glevv. This tells us two things. First, I still have a lot to learn (lol). Second, if we were to implement median encoding, we should probably read the references for the Quantile encoder in category encoders to better understand its use and functionality, and potentially create a quantile encoder rather than just a median encoder, based on the literature. Since it already exists in category encoders, I don't think this is urgent, but if someone thinks it is worth it, I would be happy to make it part of feature-engine as well. @ricardordb have you used the quantile encoder? |
On my side, I am doing some research and tests to check whether we can justify
this component.
I will now look at quantile encoding, as suggested by @glevv.
Once I have more information, I will report my findings back to you.
(Sent by email in reply to Soledad Galli's comment of Mon, Nov 28, 2022, 6:28 AM, which appears in full as the first comment above.)
--
Ricardo Rezende
|
Hello!
I am working on a project, and we found that median encoding works better for our kind of data.
So I replaced the mean function in the MeanEncoder with the median function, creating a new encoder.
I have already tested the new encoder on some data, and it works as expected.
I forked it here: feature_engine
What do you think about adding a MedianEncoder to the project?
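For readers following along, the idea can be sketched as below. This is only an illustration under stated assumptions, not the code in the fork: the helpers `fit_median_encoding` and `apply_encoding` and the toy data are hypothetical, the sketch uses plain pandas rather than the feature-engine API, and it applies no smoothing.

```python
import pandas as pd


def fit_median_encoding(X, y, variable):
    """Learn a category -> median(target) mapping for one column.

    Hypothetical helper for illustration only.
    """
    return y.groupby(X[variable]).median().to_dict()


def apply_encoding(X, variable, mapping):
    """Return a copy of X with the column replaced by its learned encoding."""
    X = X.copy()
    X[variable] = X[variable].map(mapping)
    return X


# Toy data (made up): category "a" contains one extreme target value.
X = pd.DataFrame({"city": ["a", "a", "a", "b", "b", "b"]})
y = pd.Series([1.0, 2.0, 30.0, 4.0, 5.0, 6.0])

median_map = fit_median_encoding(X, y, "city")
mean_map = y.groupby(X["city"]).mean().to_dict()

encoded = apply_encoding(X, "city", median_map)
```

In this toy example the outlier pulls the mean for category "a" well above its median, which is the kind of data where the two encodings differ most.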