
Classifiers that respect categoricals #34

Open · wants to merge 13 commits into master
Conversation

@shaycrk (Contributor) commented Oct 20, 2017

This PR creates a new pair of classifiers, CatInATree and CatInAForest, that respect categoricals when randomly sampling features. A few notes:

  • To support these, the PR also adds a couple of utilities that detect columns that should be grouped together as categoricals, and modifies the _train method of ModelTrainer to pass the detected column groupings through.
  • Note that because detecting categoricals relies on the feature config, the grouping will be incomplete if column names are being truncated.
  • The implementation simply selects a random subset of features such that all values of a categorical are selected together, then trains a decision tree on that entire pre-selected subset throughout (and also uses this modified decision tree inside a bagging classifier).
  • It does not, however, re-subset the features at each node the way a true random forest would -- that would require a more substantial rewrite of the internals of sklearn's decision trees, but would be a useful future project.
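The group-aware sampling described above can be sketched roughly as follows. This is an illustrative sketch only: the function name `sample_feature_groups` and the list-of-groups format are assumptions, not the PR's actual API.

```python
# Illustrative sketch (not the PR's implementation): sample whole column
# groups at random so all dummy columns of a categorical stay together.
import random


def sample_feature_groups(groups, max_features, rng=None):
    """Randomly pick whole groups of columns until roughly max_features
    columns are selected.

    groups -- list of lists of column names; each inner list holds all the
              dummy columns of one categorical (or one lone numeric column).
    """
    rng = rng or random.Random()
    shuffled = list(groups)
    rng.shuffle(shuffled)
    chosen = []
    for group in shuffled:
        if chosen and len(chosen) + len(group) > max_features:
            break  # stop before overshooting the feature budget
        chosen.extend(group)
    return chosen


# A decision tree (or a bagging ensemble of such trees) would then be fit
# on the pre-selected columns for the entire training run -- unlike a true
# random forest, which re-samples features at every node.
```

Because groups are taken or skipped as a unit, a categorical's dummy columns can never be split across the selected/unselected boundary.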

Creates a new pair of classifiers, CatInATree and CatInAForest, that respect categoricals when randomly sampling features.
remove the min_impurity_decrease parameter from DecisionTreeClassifier call to conform to sklearn 0.18
@codecov-io commented Oct 20, 2017

Codecov Report

Merging #34 into master will increase coverage by <.01%.
The diff coverage is 90.72%.


@@            Coverage Diff             @@
##           master      #34      +/-   ##
==========================================
+ Coverage   91.23%   91.23%   +<.01%     
==========================================
  Files          13       13              
  Lines         844      993     +149     
==========================================
+ Hits          770      906     +136     
- Misses         74       87      +13
Impacted Files                       Coverage Δ
catwalk/utils.py                     82.67% <100%> (+5.12%) ⬆️
catwalk/model_trainers.py            93.19% <100%> (+0.99%) ⬆️
catwalk/estimators/transformers.py   86.56% <80%> (-13.44%) ⬇️
catwalk/estimators/classifiers.py    92.92% <92.85%> (-0.27%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9c08327...8a72cbf.

for non-categorical columns where imputation was performed, ensure the flag and
underlying column always come together for models that respect categoricals
(in the future, we may want to consider passing these separately but for the
purposes here we just add them in with the categoricals)
@shaycrk (Contributor, Author) commented Oct 21, 2017

@hakoenig made the good point that, with the new imputation branch, we should also ensure that imputed flags always get picked up along with their underlying columns, so I pushed a quick change that adds those groupings here as well.
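A minimal sketch of that flag/column pairing, assuming a hypothetical `_imputed` suffix convention (the function name and suffix are illustrative, not the PR's actual code):

```python
def group_imputed_flags(columns, flag_suffix="_imputed"):
    """Group each imputation-flag column with its underlying column so the
    pair is always sampled together. The suffix is a hypothetical convention
    used here for illustration only.
    """
    flags = {c for c in columns if c.endswith(flag_suffix)}
    groups = []
    for col in columns:
        if col in flags:
            continue  # flags get attached to their base column below
        flag = col + flag_suffix
        groups.append([col, flag] if flag in flags else [col])
    # an orphan flag with no base column stays in a group of its own
    for flag in flags:
        if flag[: -len(flag_suffix)] not in columns:
            groups.append([flag])
    return groups
```

These pairs can then be fed into the same group-aware sampling as the categorical groupings, so a flag and its underlying column are never split apart.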
