Better Ideas for Classifiers with Categoricals #35

Closed
shaycrk opened this issue Oct 20, 2017 · 1 comment

Comments

@shaycrk
Contributor

shaycrk commented Oct 20, 2017

The two classifiers in #34 are a first pass at handling categorical features more intelligently than treating them as a set of independent dummies. Here are a few ideas for going further. Each would need a deeper rewrite of the sklearn decision tree code, and should ideally be more efficient than trying all 2^n partitions of a categorical:

  1. Respect categoricals when subsetting features at each node (using the same logic as in Classifiers that respect categoricals #34, but applied at each decision point rather than once for the entire tree).
  2. Respect categoricals when subsetting features at each node. Then, if a categorical value is chosen to split on, ensure that the categorical is included in the features considered at subsequent nodes (or perhaps include it with a selection probability that decays as you go down the tree).
  3. Respect categoricals when subsetting features at each node. Then, for each categorical in the selected subset, train a simple logit model of the outcome on that categorical's value columns, possibly on a sub-sample of the data for efficiency. (Since triage categoricals are actually aggregations of categoricals, a model is better than simple correlations or conditional averages.) Treat the model's score as a continuous variable to split on, in place of the categorical value columns themselves. If the score is chosen, the node will need to keep track of the categorical's columns and the logit model in order to predict on new examples. This approach would allow splitting on all values of the categorical at once without having to attempt every possible combination.
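
Ideas 1 and 2 could be prototyped outside the tree code before touching any sklearn internals. The sketch below is a minimal illustration in plain Python, not the logic from #34: the function `sample_feature_groups`, the `groups` mapping, the `decay` parameter, and the example column names are all hypothetical. It samples whole feature groups for a node, so a categorical's value columns are kept or dropped together, and it re-adds a previously split categorical with a probability that shrinks with depth.

```python
import random


def sample_feature_groups(groups, n_groups, rng, carried=None,
                          depth_since_split=0, decay=0.5):
    """Pick the feature groups to consider at one tree node.

    groups: dict mapping a group name to its list of column names. A
        categorical is one group holding all of its value columns; a
        numeric feature is a group with a single column.
    carried: name of a categorical that was split on higher up, if any.
    depth_since_split: levels between that split and this node.
    decay: assumed per-level decay of the carry-over probability.
    """
    names = sorted(groups)
    chosen = set(rng.sample(names, min(n_groups, len(names))))
    # Idea 2: force the carried categorical back in with probability
    # decay ** depth_since_split (always, directly below the split).
    if carried is not None and rng.random() < decay ** depth_since_split:
        chosen.add(carried)
    # Idea 1: expand groups to columns, so a categorical is in or out whole.
    columns = [col for name in sorted(chosen) for col in groups[name]]
    return sorted(chosen), columns


# Hypothetical feature layout: two numeric features, two categoricals.
groups = {
    "age": ["age"],
    "prior_count": ["prior_count"],
    "offense": ["offense_a", "offense_b", "offense_c"],
    "facility": ["facility_x", "facility_y"],
}
rng = random.Random(0)
chosen, columns = sample_feature_groups(
    groups, 2, rng, carried="offense", depth_since_split=0
)
```

Setting `decay=1.0` would correspond to always including the categorical at subsequent nodes, and `decay=0.0` to dropping the carry-over after the first level.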
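
Idea 3 replaces a categorical's value columns with a single model score that a node can threshold like any numeric feature. The following is a rough sketch under stated assumptions: the helper names `fit_logit` and `predict_score`, the toy data, and the hand-rolled gradient-descent fit are illustrative only (a real prototype would more likely use an existing logistic regression implementation such as sklearn's `LogisticRegression`), and nothing here hooks into the tree-building code.

```python
import math


def predict_score(x, weights, bias):
    """Logit score in (0, 1) for one row of a categorical's value columns."""
    z = bias + sum(w * xj for w, xj in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(rows, labels, lr=0.5, epochs=500):
    """Fit a plain logistic regression by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_score(x, weights, bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Toy data: three value columns of one categorical and a binary outcome.
# With aggregated categoricals, rows could hold counts or fractions instead.
rows = [
    [1, 0, 0], [1, 0, 0], [1, 0, 0],
    [0, 1, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1], [0, 0, 1],
]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
weights, bias = fit_logit(rows, labels)

# One continuous score per example. A node that splits on this score would
# store the column names, weights, and bias to score new examples later.
scores = [predict_score(x, weights, bias) for x in rows]
```

Because the score orders examples by predicted outcome, a single threshold on it can separate any prefix of that ordering, which is how this avoids enumerating combinations of categorical values.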
@jesteria
Member

This issue was moved to dssg/triage#296
