
investigate feature importance of lightgbm #32

Open
aranas opened this issue May 18, 2018 · 8 comments

aranas (Member) commented May 18, 2018

No description provided.

aranas self-assigned this May 24, 2018
aranas (Member, Author) commented Jun 5, 2018

[Image: feature_importance plot]

johannadevos (Member) commented

How are the numbers on the x-axis calculated?

andregalvez79 (Contributor) commented

Thank you! I would also like to know what the numbers on the x-axis mean. Perhaps then we can apply a cut-off around 50, or lower? Does anyone know of a rule of thumb or a reference for selecting features?

johannadevos (Member) commented

In general, you want each feature to explain variance that is not yet explained by any preceding features. Sophie mentioned on Slack that there is one feature that apparently "drives 99% of our prediction accuracy". That means that all of the other features together explain the remaining 1%. It is very likely that within this 1%, there are again one or a few features doing all the work. I don't know of any references on this, but common sense tells me we should simply eliminate all of the features that do not explain any variance in the data.

andregalvez79 (Contributor) commented

OK, but which variables are the ones that don't explain any variance? Are you suggesting we only keep the variable that drives 99% of the predictions?

johannadevos (Member) commented

No, we can also keep one or more additional variables if they account for a substantial portion of the 1%. I don't know which variables we are talking about; perhaps Sophie can tell us that?

aranas (Member, Author) commented Jun 5, 2018

Okay, so the 99% is not a real number; it was just my way of expressing that if we only keep the feature with the highest importance, we still get very good classification accuracies. You can run the "feature_selection" script yourself to see the numbers, but I have also posted the output below. Basically, it leaves out features in order of importance (dropping the ones with low importance first; in the output below, n = number of features left). Normally you would expect that, at some point, even the less important features still add to your accuracy, so there should be a drop-off where making the model simpler also harms predictive power (see for example the bottom output of this post: https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/). In our case, however, you can leave out almost all features, which I found very surprising (a sketch of this thresholding loop follows the output below):
Thresh=12.000, n=22, auc: 0.94%
Thresh=24.000, n=21, auc: 0.94%
Thresh=25.000, n=20, auc: 0.94%
Thresh=35.000, n=19, auc: 0.94%
Thresh=44.000, n=18, auc: 0.94%
Thresh=57.000, n=17, auc: 0.94%
Thresh=70.000, n=16, auc: 0.94%
Thresh=78.000, n=15, auc: 0.94%
Thresh=92.000, n=14, auc: 0.94%
Thresh=106.000, n=13, auc: 0.94%
Thresh=120.000, n=12, auc: 0.94%
Thresh=124.000, n=11, auc: 0.94%
Thresh=125.000, n=10, auc: 0.94%
Thresh=125.000, n=10, auc: 0.94%
Thresh=141.000, n=8, auc: 0.94%
Thresh=147.000, n=7, auc: 0.94%
Thresh=155.000, n=6, auc: 0.94%
Thresh=160.000, n=5, auc: 0.94%
Thresh=177.000, n=4, auc: 0.93%
Thresh=186.000, n=3, auc: 0.93%
Thresh=209.000, n=2, auc: 0.94%
Thresh=253.000, n=1, auc: 0.93%
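For anyone who wants to reproduce this kind of table without the project's feature_selection script, a minimal sketch of the thresholding loop could look like the following. It assumes a binary target, a plain train/test split, and default LGBMClassifier settings; none of these necessarily match the actual script.

```python
import numpy as np
import lightgbm as lgb
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def auc_per_importance_threshold(X, y):
    """Refit on progressively smaller feature sets and report test AUC."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit once on all features to obtain the (split-based) importances.
    full_model = lgb.LGBMClassifier()
    full_model.fit(X_train, y_train)

    # Use each importance value as a selection threshold,
    # from least to most important feature.
    for thresh in np.sort(full_model.feature_importances_):
        selector = SelectFromModel(full_model, threshold=thresh, prefit=True)
        X_train_sel = selector.transform(X_train)
        X_test_sel = selector.transform(X_test)

        # Refit on the reduced feature set and score on the held-out data.
        reduced_model = lgb.LGBMClassifier()
        reduced_model.fit(X_train_sel, y_train)
        y_score = reduced_model.predict_proba(X_test_sel)[:, 1]

        auc = roc_auc_score(y_test, y_score)
        print("Thresh=%.3f, n=%d, auc: %.2f" % (thresh, X_train_sel.shape[1], auc))
```

SelectFromModel keeps every feature whose importance is at least the current threshold, so n shrinks as the threshold grows, which mirrors the table above.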

With respect to the x-axis values, this is what the lightgbm documentation says:

feature_importance(importance_type='split', iteration=-1)
Parameters: importance_type (string, optional (default="split")) – How the importance is calculated. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

I do get warnings about the categorical features when I run the script, so I am still wondering whether there might be something wrong with it.
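Assuming the plot was produced with the default importance_type='split', the x-axis values are counts of how often each feature is used in a split. A minimal, self-contained sketch (with made-up toy data and feature names, not our dataset) that prints both measures from the quoted API:

```python
import numpy as np
import lightgbm as lgb

# Toy data purely for illustration; feature names are invented.
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = (X[:, 0] > 0.5).astype(int)

train_set = lgb.Dataset(X, label=y, feature_name=["f0", "f1", "f2", "f3"])
booster = lgb.train({"objective": "binary", "verbosity": -1},
                    train_set, num_boost_round=50)

# "split": how many times each feature is used in a split (the default).
print(booster.feature_importance(importance_type="split"))
# "gain": total gain of all splits that use the feature.
print(booster.feature_importance(importance_type="gain"))
```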

aranas (Member, Author) commented Jun 7, 2018

FYI, this is the output when I leave out the confidence features:
Thresh=43.000, n=11, auc: 0.93%
Thresh=47.000, n=10, auc: 0.94%
Thresh=80.000, n=9, auc: 0.93%
Thresh=94.000, n=8, auc: 0.93%
Thresh=107.000, n=7, auc: 0.94%
Thresh=129.000, n=6, auc: 0.92%
Thresh=134.000, n=5, auc: 0.93%
Thresh=153.000, n=4, auc: 0.94%
Thresh=154.000, n=3, auc: 0.92%
Thresh=165.000, n=2, auc: 0.92%
Thresh=233.000, n=1, auc: 0.91%

[Image: feature_importance_noconf plot]
