The two classifiers in #34 are a first pass at handling categorical features more intelligently than a bunch of independent dummies, but here are a few ideas for taking this further. Each would require a deeper rewrite of the sklearn decision tree code, and ideally would be more efficient than trying all 2^n partitions of a categorical's values:
- Respect categoricals when subsetting features at each node (using the same logic as in #34, but at each decision point rather than just once for the entire tree).
- Respect categoricals when subsetting features at each node, and, if a categorical's value is chosen to split on, ensure that categorical is included in the considered features at subsequent nodes (or maybe with a selection probability that falls off with some decay as you go down the tree). A rough sketch of these first two ideas follows this list.
- Respect categoricals when subsetting features at each node, and, for each categorical in the selected subset, train a simple logit model for the outcome across the categorical's values (since triage categoricals are actually aggregations of categoricals, a model is better here than simple correlations/conditional averages), possibly trained on a sub-sample of the data for efficiency. Then consider this score as a continuous variable to split on, rather than the categorical's value columns themselves. If chosen, the node would need to keep track of the categorical's columns and the logit model for predicting on new examples. This approach would allow splitting on all values of the categorical concurrently without having to attempt all possible combinations (see the second sketch below).
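For the first two ideas, here's a minimal sketch of what group-aware feature subsetting at a node could look like. Everything here is illustrative, not existing triage or sklearn code: `groups` (a hypothetical mapping from each categorical's name to its dummy-column indices, with standalone numeric features as singleton groups), `chosen_depths`, and the decay scheme are all made-up names/details.

```python
import numpy as np

def sample_feature_groups(groups, max_groups, chosen_depths=None, depth=0,
                          decay=0.5, rng=None):
    """Sample whole categoricals (dummy-column groups) at a node, rather
    than sampling individual dummy columns independently (idea 1).

    chosen_depths maps each categorical already split on along the path
    to the depth where it was chosen; it is re-included with probability
    decay ** (depth - chosen_depth), so the guarantee fades as you go
    down the tree (idea 2). decay=1.0 always keeps it in the candidates.
    """
    rng = rng or np.random.default_rng()
    chosen_depths = chosen_depths or {}
    forced = [g for g, d in chosen_depths.items()
              if rng.random() < decay ** (depth - d)]
    rest = [g for g in groups if g not in forced]
    n_extra = max(0, min(max_groups, len(groups)) - len(forced))
    extra = list(rng.choice(rest, size=n_extra, replace=False)) if n_extra else []
    # Expand the sampled categoricals back into the concrete dummy-column
    # indices the splitter will consider at this node.
    return {name: groups[name] for name in forced + extra}
```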
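And a sketch of the third idea, assuming a binary outcome; `logit_score_feature` and its arguments are hypothetical names. For each sampled categorical, fit a small logistic regression on (a sub-sample of) the node's rows over that categorical's dummy columns, then expose the predicted probability as a single continuous candidate feature, so a single threshold on the score implicitly partitions all of the categorical's values into two groups at once:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logit_score_feature(X, y, cols, subsample=10_000, rng=None):
    """Fit a logit on one categorical's dummy columns (`cols`) and return
    its predicted probability as one continuous candidate feature."""
    rng = rng or np.random.default_rng()
    idx = np.arange(len(y))
    if len(idx) > subsample:  # fit on a sub-sample for efficiency
        idx = rng.choice(idx, size=subsample, replace=False)
    model = LogisticRegression().fit(X[np.ix_(idx, cols)], y[idx])
    # Score every row: the splitter can then search thresholds on this
    # column exactly as it would for any ordinary continuous feature.
    return model.predict_proba(X[:, cols])[:, 1], model
```

A node that splits on this score would have to persist `cols` and `model` so the score can be recomputed for new examples at predict time; that bookkeeping is the price of avoiding the 2^n enumeration of value subsets.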