Circular categories encoding #226

Open
DelgadoPanadero opened this issue Dec 6, 2019 · 3 comments

@DelgadoPanadero
Hi! I came here looking for a way to encode categorical variables that have a circular distance relation (such as the days of the week, where the last day, Sunday, is very close to the first one, Monday) while preserving this characteristic.

I think that none of the encoders in this package support this behaviour. Am I right? If so, I have some ideas about how to implement it. If I develop this, would you like me to submit it as a pull request?

@janmotl
Collaborator

janmotl commented Dec 6, 2019

Hi. I am curious to see what you propose.

@DelgadoPanadero
Author

DelgadoPanadero commented Dec 22, 2019

I think there may be different solutions depending on the problem you are tackling. For instance, in the case of the days of the week that I mentioned before, almost everyone would use an integer variable from 1 to 7 to encode the days as numbers:

int_day(thursday) = 4

In my opinion, a better approach could be to use two variables rather than only one, which represent the x and y coordinates as follows:

x = cos( 2pi * int_day/7)
y = sin( 2pi * int_day/7)

With this representation, the Euclidean distance between any two consecutive days of the week is the same, including Sunday and Monday (the last day and the first day). With this "circular" representation, the transformation would be something like this:

circular_representation(thursday) = ( cos(2pi * int_day(thursday)/7), sin(2pi * int_day(thursday)/7) )
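
As a minimal sketch of the transformation above, the following uses only the Python standard library. The `DAYS` list, the `period` parameter, and the `euclidean` helper are illustrative assumptions that do not come from this thread or from the package; `int_day` and `circular_representation` follow the names used in the formulas.

```python
import math

# Assumed ordering: Monday = 1 ... Sunday = 7, matching int_day(thursday) = 4.
DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]


def int_day(day):
    """Map a day name to its 1-based position in the week."""
    return DAYS.index(day.lower()) + 1


def circular_representation(day, period=7):
    """Return the (x, y) point on the unit circle for the given day."""
    angle = 2 * math.pi * int_day(day) / period
    return (math.cos(angle), math.sin(angle))


def euclidean(a, b):
    """Euclidean distance between two 2-D points (hypothetical helper)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Consecutive days are equally far apart, including the Sunday/Monday wrap-around.
sun_mon = euclidean(circular_representation("sunday"),
                    circular_representation("monday"))
mon_tue = euclidean(circular_representation("monday"),
                    circular_representation("tuesday"))
print(math.isclose(sun_mon, mon_tue))
```

With the plain integer encoding, Sunday and Monday would be 6 units apart, while here they are as close as any other pair of neighbouring days.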

I think this only solves the Euclidean distance problem. Other problems may require other representations, and there are other kinds of dependence between categories besides this "circular" case.

@janmotl
Collaborator

janmotl commented Dec 23, 2019

I tested this transformation in the past with different models, and it never led to an improvement. Since then, whenever I have cyclical features and feel the need to preserve the circularity, I just use distance-based models with a distance that respects the circularity.
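
The comment does not say which distance is meant. One possible reading, shown here only as an assumption, is a wrap-around distance computed directly on the integer encoding. The function name `circular_distance` and its `period` parameter are hypothetical.

```python
def circular_distance(a, b, period=7):
    """Shortest distance between two positions on a cycle of length `period`."""
    diff = abs(a - b) % period
    # Going "the other way round" the cycle may be shorter.
    return min(diff, period - diff)


# Sunday (7) and Monday (1) are one step apart rather than six.
print(circular_distance(7, 1))
```

A function of this kind could be supplied wherever a distance-based model accepts a user-defined metric, which keeps the feature as a single integer column.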

For a non-exhaustive critique of this transformation, see the comment from T. Bush at https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/.

Nevertheless, if you find (and document) at least one scenario on a real dataset where this transformation improves the accuracy of the model (and it's not just random fluctuation), the transformation will be a welcome extension of this library.
