
investigate feature importance of lightgbm #32

Open
aranas opened this issue May 18, 2018 · 8 comments

aranas (Member) commented May 18, 2018

No description provided.

aranas self-assigned this May 24, 2018
aranas (Member, Author) commented Jun 5, 2018

[Image: feature_importance plot]

johannadevos (Member) commented

How are the numbers on the x-axis calculated?

andregalvez79 (Contributor) commented

Thank you! I would also like to know what the numbers on the x-axis mean. Perhaps then we can apply a cut-off around 50, or lower? Does anyone know of a rule of thumb or a reference for selecting features?

johannadevos (Member) commented

In general, you want each feature to explain variance that is not yet explained by any preceding features. Sophie mentioned on Slack that there is one feature that apparently "drives 99% of our prediction accuracy". That means that all of the other features together explain the remaining 1%. It is very likely that within this 1%, there are again one or a few features doing all the work. I don't know of any references on this, but common sense tells me we should simply eliminate all of the features that do not explain any variance in the data.

andregalvez79 (Contributor) commented

OK, but which variables are the ones that don't explain any variance? Are you suggesting we only keep the variable that drives 99% of the predictions?

johannadevos (Member) commented

No, we can also keep one or more additional variables if they account for a substantial portion of the 1%. I don't know which variables we are talking about; perhaps Sophie can tell us that?

aranas (Member, Author) commented Jun 5, 2018

Okay, so the 99% is not a real number; it was just my way of expressing that if we only keep the feature with the highest importance, we still get very good classification accuracies. You can run the "feature_selection" script yourself to see the numbers, but I have also posted the output below. Basically, it leaves out features in order of importance (dropping the ones with low importance first; in the output below, n = number of features left). Normally you would expect that, at some point, even the less important features still add to your accuracy, so there should be a drop-off where making the model simpler also harms predictive power (see for example the bottom output of this post: https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/). In our case, however, you can leave out almost all features, which I found very surprising (a sketch of this thresholding loop follows the output below):
Thresh=12.000, n=22, auc: 0.94%
Thresh=24.000, n=21, auc: 0.94%
Thresh=25.000, n=20, auc: 0.94%
Thresh=35.000, n=19, auc: 0.94%
Thresh=44.000, n=18, auc: 0.94%
Thresh=57.000, n=17, auc: 0.94%
Thresh=70.000, n=16, auc: 0.94%
Thresh=78.000, n=15, auc: 0.94%
Thresh=92.000, n=14, auc: 0.94%
Thresh=106.000, n=13, auc: 0.94%
Thresh=120.000, n=12, auc: 0.94%
Thresh=124.000, n=11, auc: 0.94%
Thresh=125.000, n=10, auc: 0.94%
Thresh=125.000, n=10, auc: 0.94%
Thresh=141.000, n=8, auc: 0.94%
Thresh=147.000, n=7, auc: 0.94%
Thresh=155.000, n=6, auc: 0.94%
Thresh=160.000, n=5, auc: 0.94%
Thresh=177.000, n=4, auc: 0.93%
Thresh=186.000, n=3, auc: 0.93%
Thresh=209.000, n=2, auc: 0.94%
Thresh=253.000, n=1, auc: 0.93%
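For anyone who wants to reproduce this kind of table without the project's feature_selection script, a minimal sketch of the thresholding loop could look like the following. It assumes a binary target, a plain train/test split, and default LGBMClassifier settings; none of these necessarily match the actual script.

```python
import numpy as np
import lightgbm as lgb
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def auc_per_importance_threshold(X, y):
    """Refit on progressively smaller feature sets and report test AUC."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit once on all features to obtain the (split-based) importances.
    full_model = lgb.LGBMClassifier()
    full_model.fit(X_train, y_train)

    # Use each importance value as a selection threshold,
    # from least to most important feature.
    for thresh in np.sort(full_model.feature_importances_):
        selector = SelectFromModel(full_model, threshold=thresh, prefit=True)
        X_train_sel = selector.transform(X_train)
        X_test_sel = selector.transform(X_test)

        # Refit on the reduced feature set and score on the held-out data.
        reduced_model = lgb.LGBMClassifier()
        reduced_model.fit(X_train_sel, y_train)
        y_score = reduced_model.predict_proba(X_test_sel)[:, 1]

        auc = roc_auc_score(y_test, y_score)
        print("Thresh=%.3f, n=%d, auc: %.2f" % (thresh, X_train_sel.shape[1], auc))
```

SelectFromModel keeps every feature whose importance is at least the current threshold, so n shrinks as the threshold grows, which mirrors the table above.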

With respect to the x-axis values, this is what the lightgbm documentation says:

feature_importance(importance_type='split', iteration=-1)
Parameters: importance_type (string, optional (default="split")) – How the importance is calculated. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

I do get warnings about the categorical features when I run the script, so I am still wondering whether there might be something wrong with it.
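Assuming the plot was produced with the default importance_type='split', the x-axis values are counts of how often each feature is used in a split. A minimal, self-contained sketch (with made-up toy data and feature names, not our dataset) that prints both measures from the quoted API:

```python
import numpy as np
import lightgbm as lgb

# Toy data purely for illustration; feature names are invented.
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = (X[:, 0] > 0.5).astype(int)

train_set = lgb.Dataset(X, label=y, feature_name=["f0", "f1", "f2", "f3"])
booster = lgb.train({"objective": "binary", "verbosity": -1},
                    train_set, num_boost_round=50)

# "split": how many times each feature is used in a split (the default).
print(booster.feature_importance(importance_type="split"))
# "gain": total gain of all splits that use the feature.
print(booster.feature_importance(importance_type="gain"))
```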

aranas (Member, Author) commented Jun 7, 2018

FYI, this is the output when I leave out the confidence features:
Thresh=43.000, n=11, auc: 0.93%
Thresh=47.000, n=10, auc: 0.94%
Thresh=80.000, n=9, auc: 0.93%
Thresh=94.000, n=8, auc: 0.93%
Thresh=107.000, n=7, auc: 0.94%
Thresh=129.000, n=6, auc: 0.92%
Thresh=134.000, n=5, auc: 0.93%
Thresh=153.000, n=4, auc: 0.94%
Thresh=154.000, n=3, auc: 0.92%
Thresh=165.000, n=2, auc: 0.92%
Thresh=233.000, n=1, auc: 0.91%

[Image: feature_importance_noconf plot]
