Circular categories encoding #226
Comments
Hi. I am curious to see what you propose.
I think there may be different solutions depending on the problem you are tackling. For instance, in the case of the days of the week that I mentioned before, almost everyone will use an integer variable from 1 to 7 to encode the days as numbers:
In my opinion, a better approach could be to use two variables rather than only one, which represent the x and y coordinates as follows:
With this representation, the distance between any two consecutive days of the week is the same, even in the case of Sunday and Monday (last day and first day). With this "circular" representation, the transformation would be something like this:
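A minimal sketch of such a transformation, assuming the days are numbered 1 to 7 and mapped onto the unit circle with cosine and sine. The names `encode_cyclical` and `euclidean` are hypothetical helpers for illustration; they are not part of this library.

```python
import math


def encode_cyclical(value, period, start=1):
    """Map a cyclical integer (e.g. day of week 1..7) to (x, y) on the unit circle.

    Hypothetical helper: `start` is the smallest value of the cycle.
    """
    angle = 2.0 * math.pi * (value - start) / period
    return math.cos(angle), math.sin(angle)


def euclidean(p, q):
    """Plain Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Encode Monday (1) through Sunday (7).
days = {d: encode_cyclical(d, period=7) for d in range(1, 8)}

# Consecutive days are equally far apart, including Sunday -> Monday.
mon_tue = euclidean(days[1], days[2])
sun_mon = euclidean(days[7], days[1])
print(mon_tue, sun_mon)
```

With the raw integer encoding, Sunday and Monday would be 6 apart; with the two-column encoding they are as close as any other pair of consecutive days.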
I think that it just solves the Euclidean problem, but there are other problems that may require other representations, as well as other kinds of dependence between categorical variables, not just this "circular" case.
I tested this transformation in the past with different models, and it never led to an improvement. Since then, whenever I have cyclical features and feel the need to preserve the circularity, I just use distance-based models with a distance that respects the circularity. For a non-exhaustive critique of this transformation, see the comment from T. Bush at https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/. Nevertheless, if you find (and document) at least one scenario (on some real dataset) where this transformation improves the accuracy of the model (and it's not just random fluctuation), the transformation will be a welcome extension of this library.
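For illustration, a distance that respects the circularity could look like the sketch below. It assumes integer-coded days and a known period; `circular_distance` is a hypothetical name, not a function of this library, and is not necessarily the exact distance used in the tests described above.

```python
def circular_distance(a, b, period=7):
    """Shortest distance between two positions on a cycle of length `period`."""
    diff = abs(a - b) % period
    # Going "the other way round" the cycle may be shorter.
    return min(diff, period - diff)


# Sunday (7) and Monday (1) are one step apart on the cycle.
print(circular_distance(7, 1))
# Tuesday (2) and Saturday (6) are closer going backwards through the weekend.
print(circular_distance(2, 6))
```

A function like this can be passed as a custom metric to a distance-based model that accepts callables, which keeps the feature as a single column rather than splitting it into two.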
Hi! I came here searching for how to encode categorical variables that have a circular distance relation (such as the days of the week, where the last day, Sunday, is very close to the first one, Monday) while preserving this characteristic.
I think that none of the encodings in this package support this behaviour. Am I right? If so, I have some ideas about how to implement this. If I develop it, would you like me to submit it as a pull request?