---
title: "xgboost"
author: "liuc"
date: "1/17/2022"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## xgboost
> https://juliasilge.com/blog/board-games/
> https://juliasilge.com/blog/xgboost-tune-volleyball/
> https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
> http://fancyerii.github.io/books/xgboost/
> https://www.quora.com/What-is-the-difference-between-the-R-gbm-gradient-boosting-machine-and-xgboost-extreme-gradient-boosting
> https://xgboost.readthedocs.io/en/latest/tutorials/model.html (key reference)
> https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html
XGBoost (Extreme Gradient Boosting) is a boosted tree model within the tree-based family, whereas random forest belongs to bagging. XGBoost is an efficient system implementation of gradient boosting rather than a single algorithm: besides trees (`gbtree`), its base learners can also be linear models (`gblinear`), while GBDT refers specifically to the gradient boosted decision tree algorithm. Traditional GBDT uses CART as the base learner; because XGBoost also supports linear base learners, in that case it is equivalent to logistic regression (classification) or linear regression (regression) with L1 and L2 regularization.
It works by building an ensemble of decision trees: each tree is trained on the data (optionally a subsample) and predicts the target variable, and the predictions from all trees are combined into a final prediction. The trees are built iteratively to minimize a loss function, usually the sum of squared errors for regression. In each iteration the algorithm searches for the best feature and split value at each node, builds a tree from those splits, and adds the new tree's contribution so that the loss keeps decreasing as the model trains.
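A minimal conceptual sketch of that loop for squared-error loss (this illustrates the boosting idea, not xgboost's actual implementation; the helper name `boost_sketch` and the use of `rpart` are purely for illustration):

```{r}
library(rpart)

# fit trees sequentially: each tree models the current residuals (the negative
# gradient of squared-error loss) and is added with a small learning rate
boost_sketch <- function(df, y, n_trees = 50, learn_rate = 0.1, depth = 2) {
  pred <- rep(mean(df[[y]]), nrow(df)) # start from the mean prediction
  x <- df[setdiff(names(df), y)]
  trees <- vector("list", n_trees)
  for (i in seq_len(n_trees)) {
    resid <- df[[y]] - pred # what the earlier trees have not yet explained
    trees[[i]] <- rpart(.resid ~ ., data = cbind(x, .resid = resid), maxdepth = depth)
    pred <- pred + learn_rate * predict(trees[[i]], x) # shrink each tree's contribution
  }
  list(trees = trees, train_pred = pred)
}

# e.g. boost_sketch(mtcars, "mpg")$train_pred
```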
The XGBoost library focuses heavily on computational speed and model performance. XGBoost supports three main forms of gradient boosting (see the engine-argument sketch after this list):
1. The Gradient Boosting algorithm, also known as a gradient boosting machine, with a learning rate.
2. Stochastic Gradient Boosting, with subsampling of rows, of columns, and of columns per split.
3. Regularized Gradient Boosting, with L1 and L2 regularization.
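These three forms map onto native xgboost parameters that can be passed through `set_engine()`. The values below are placeholders for illustration only (tidymodels is attached in the setup chunk below, hence `eval=FALSE` here):

```{r, eval=FALSE}
boost_tree(trees = 1000, learn_rate = 0.01) %>% # (1) gradient boosting with a learning rate (eta)
  set_engine(
    "xgboost",
    subsample = 0.8,        # (2) stochastic boosting: subsample rows for each tree
    colsample_bytree = 0.8, # (2) subsample columns for each tree
    lambda = 1, alpha = 0   # (3) regularized boosting: L2 / L1 penalties
  ) %>%
  set_mode("regression")
```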
*XGBoost vs. Random Forest:* The main difference is that in random forests the trees are independent, while in boosting tree N+1 focuses its learning on the loss, i.e., on what was not modeled well by tree N. This difference matters for a corner case in feature-importance analysis: correlated features.
*What data suits xgboost:*
XGBoost manages only `numeric` vectors.
What to do when you have categorical data?
To answer that question we will convert categorical variables to numeric ones, i.e., transform the categorical data into dummy variables. Several encoding methods exist; one-hot encoding is a common approach. We will use dummy contrast coding, which is popular because it produces a "full rank" encoding (also see the blog post by Max Kuhn).
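For example, with a recipe, dummy or one-hot encoding of a nominal column looks like this (a sketch with a made-up data frame `some_df` and columns `outcome`/`color`, not the board-game data used below):

```{r, eval=FALSE}
library(recipes)

some_df <- data.frame(
  outcome = rnorm(6),
  color = factor(c("red", "blue", "red", "green", "blue", "red"))
)

recipe(outcome ~ ., data = some_df) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% # one indicator column per level
  prep() %>%
  bake(new_data = NULL)
```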
*Key xgboost hyperparameters:* mtry (randomly selected predictors), trees, min_n (minimal node size), learn_rate (learning rate), loss_reduction (minimum loss reduction), tree_depth, stop_iter, sample_size (proportion of observations sampled).
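Each of these has a corresponding parameter object in `dials`, which is a quick way to inspect the default tuning ranges (a sketch; `dials` is attached with tidymodels in the setup chunk below, hence `eval=FALSE` here):

```{r, eval=FALSE}
# default tuning ranges that dials assumes for the main boost_tree() hyperparameters
trees()
min_n()
tree_depth()
learn_rate()
loss_reduction()
sample_prop() # used for sample_size in the grids below
mtry()        # upper bound must be finalized against the data, e.g. finalize(mtry(), game_train)
```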
```{r, echo=FALSE, include=FALSE}
library(tidyverse)
library(tidymodels)
library(xgboost)
library(SHAPforxgboost)
library(textrecipes)
library(finetune)
library(vip)
tidymodels_prefer()
```
### Regression
The example data come from the first link above; the task is to predict each board game's average rating.
```{r}
# df <- read_delim('./datasets/prostata.tab', delim = '\t') %>%
# select(-t) %>%
# filter(!class %in% c('discrete', 'class')) %>%
# mutate(across(ends_with('_at'), as.numeric))
# head(df)[1:5]
# dim(df)
ratings <- read_csv("./datasets/ratings.csv")
details <- read_csv("./datasets/details.csv")
ratings_joined <-
ratings %>%
left_join(details, by = "id")
# distribution of the average rating
ggplot(ratings_joined, aes(average)) +
geom_histogram(alpha = 0.8)
# a quick look at the relationship between minimum recommended age and rating
ratings_joined %>%
filter(!is.na(minage)) %>%
mutate(minage = cut_number(minage, 4)) %>%
ggplot(aes(minage, average, fill = minage)) +
geom_boxplot(alpha = 0.2, show.legend = FALSE)
```
Let's start our modeling.
```{r}
set.seed(123)
# model on a subset of the features
game_split <-
ratings_joined %>%
select(name, average, matches("min|max"), boardgamecategory) %>%
na.omit() %>%
initial_split(strata = average)
game_train <- training(game_split)
game_test <- testing(game_split)
set.seed(234)
game_folds <- vfold_cv(game_train, strata = average)
game_folds
```
*set up our feature engineering:*
Sometimes a dataset requires more care and custom feature engineering; the tidymodels ecosystem provides lots of fluent options for common use cases and then the ability to extend our framework for more specific needs while maintaining good statistical practice.
```{r}
split_category <- function(x) {
x %>%
str_split(", ") %>%
map(str_remove_all, "[:punct:]") %>%
map(str_squish) %>%
map(str_to_lower) %>%
map(str_replace_all, " ", "_")
}
game_rec <-
recipe(average ~ ., data = game_train) %>%
update_role(name, new_role = "id") %>%
step_tokenize(boardgamecategory, custom_token = split_category) %>%
step_tokenfilter(boardgamecategory, max_tokens = 30) %>%
step_tf(boardgamecategory)
## just to make sure this works as expected
game_prep <- prep(game_rec)
bake(game_prep, new_data = NULL) %>% str()
```
Now let’s create a tunable xgboost model specification, with only some of the most important hyperparameters tunable, and combine it with our preprocessing recipe in a workflow().
```{r}
xgb_spec <-
boost_tree(
trees = tune(),
mtry = tune(),
min_n = tune(),
learn_rate = 0.01
) %>%
set_engine("xgboost") %>%
set_mode("regression")
# xgb_spec <- boost_tree(
# trees = 1000,
# tree_depth = tune(),
# min_n = tune(),
# loss_reduction = tune(),## first three: model complexity
# sample_size = tune(),
# mtry = tune(),
# learn_rate = tune()
# ) %>%
# set_engine("xgboost") %>%
# set_mode("classification")
xgb_wf <- workflow(game_rec, xgb_spec)
xgb_wf
```
Use `tune_race_anova()` to eliminate parameter combinations that are not doing well.
```{r}
doParallel::registerDoParallel()
# Space-filling parameter grid covering the tunable arguments of xgb_spec
xgb_grid <- grid_latin_hypercube(
  trees(),
  min_n(),
  finalize(mtry(), game_train),
  size = 30
)
xgb_grid
# IT’S TIME TO TUNE.
set.seed(234)
# xgb_res <- tune_grid(
# xgb_wf,
# resamples = game_folds,
# grid = xgb_grid,
# control = control_grid(save_pred = TRUE)
# )
#
# xgb_res
# tune_race_anova() differs from tune_grid() in that it drops clearly
# underperforming parameter combinations early instead of evaluating every
# combination on every resample. grid = 20 below asks for a 20-point
# space-filling grid; xgb_grid above could be supplied instead.
set.seed(234)
xgb_game_rs <-
tune_race_anova(
xgb_wf,
game_folds,
grid = 20,
control = control_race(verbose_elim = TRUE)
)
xgb_game_rs
```
```{r}
# load('./datasets/xgb_game_rs.rda')
xgb_game_rs %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  select(mean, mtry, trees, min_n) %>%
  pivot_longer(c(mtry, trees, min_n),
    values_to = "value",
    names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "RMSE")
```
*Evaluate models*
Notice how we saved a TON of time by not evaluating the parameter combinations that were clearly doing poorly on all the resamples; we only kept going with the good parameter combinations.
```{r}
plot_race(xgb_game_rs)
show_best(xgb_game_rs)
# last_fit() to fit one final time to the training data and evaluate one final time on the testing data.
xgb_last <-
xgb_wf %>%
finalize_workflow(select_best(xgb_game_rs, "rmse")) %>%
last_fit(game_split)
xgb_last
```
Let’s start with model-based variable importance using the `vip` package.
xgboost does not offer direct variable explanations the way a linear model does, but several methods can recover the importance of the predictors.
```{r}
# ?xgb.importance
xgb_fit <- extract_fit_parsnip(xgb_last)
vip::vip(xgb_fit, geom = "point", num_features = 12)
```
*Interpretation:* the plot above shows that maximum playing time and minimum age are the most important predictors driving the predicted game rating.
```{r}
xgb_last %>% collect_metrics()
```
*Use Shapley Additive Explanations (SHAP) to assess feature importance*
Why SHAP values
SHAP’s main advantages are local explanation and consistency in global model structure.
Tree-based machine learning models (random forest, gradient boosted trees, XGBoost) are the most popular non-linear models today. SHAP (SHapley Additive exPlanations) values are claimed to be the most advanced method for interpreting results from tree-based models. They are based on Shapley values from game theory, and present feature importance via each feature's marginal contribution to the model outcome.
This GitHub page explains the Python package developed by Scott Lundberg. Here we show all the visualizations in R. The `xgboost::xgb.plot.shap()` function can also make simple dependence plots.
> https://liuyanguu.github.io/post/2019/07/18/visualization-of-shap-for-xgboost/
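As a quick alternative to the SHAPforxgboost plots below, xgboost's own `xgb.plot.shap()` can draw dependence plots directly from the extracted booster. This sketch reuses `xgb_fit` and `game_prep` from the chunks above; the argument names follow my reading of the xgboost documentation:

```{r, eval=FALSE}
booster <- extract_fit_engine(xgb_fit) # the underlying xgb.Booster
game_mat <- bake(game_prep, has_role("predictor"), new_data = NULL, composition = "matrix")

# dependence plots for the 4 features with the largest mean |SHAP|
xgboost::xgb.plot.shap(data = game_mat, model = booster, top_n = 4, n_col = 2)
```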
```{r}
# Please note that the SHAP values are generated by 'XGBoost' and 'LightGBM'; we just plot them.
# This package currently only works with 'XGBoost' and 'LightGBM' models
library(SHAPforxgboost)
# # To prepare the long-format data:
game_shap <-
shap.prep(
xgb_model = extract_fit_engine(xgb_fit),
X_train = bake(game_prep,
has_role("predictor"),
new_data = NULL,
composition = "matrix"
)
)
shap.plot.summary(game_shap)
# Or create partial dependence plots for specific variables:
shap.plot.dependence(
game_shap,
x = "minage",
color_feature = "minplayers",
size0 = 1.2,
smooth = FALSE, add_hist = TRUE
)
# SHAP force plot
# The SHAP force plot basically stacks these SHAP values for each observation, and show how the final output was obtained as a sum of each predictor’s attributions.
```
*Interpretation:* the first plot can be read as a variable-importance summary, analogous to the VIP plot. The second is a dependence plot: for a chosen variable, it plots the SHAP values against the feature values, colored by another feature.
```{r}
# SHAP force plot
# To return the SHAP values and ranked features by mean|SHAP|
shap_values <- shap.values(
  xgb_model = extract_fit_engine(xgb_fit),
  # X_train must be the same baked predictor matrix the booster was trained on
  X_train = bake(game_prep, has_role("predictor"), new_data = NULL, composition = "matrix")
)
# choose to show top 4 features by setting `top_n = 4`,
# set 6 clustering groups of observations.
plot_data <- shap.prep.stack.data(shap_contrib = shap_values$shap_score, top_n = 4, n_groups = 6)
# you may choose to zoom in at a location, and set y-axis limit using `y_parent_limit`
shap.plot.force_plot(plot_data, zoom_in_location = 5000, y_parent_limit = c(-0.1,0.1))
# plot the 6 clusters
shap.plot.force_plot_bygroup(plot_data)
```
### Classification
This is a classification dataset, from the second link above:
```{r}
vb_matches <- readr::read_csv('./datasets/vb_matches.csv', guess_max = 80000)
vb_matches
vb_parsed <- vb_matches %>%
transmute(
circuit,
gender,
year,
w_attacks = w_p1_tot_attacks + w_p2_tot_attacks,
w_kills = w_p1_tot_kills + w_p2_tot_kills,
w_errors = w_p1_tot_errors + w_p2_tot_errors,
w_aces = w_p1_tot_aces + w_p2_tot_aces,
w_serve_errors = w_p1_tot_serve_errors + w_p2_tot_serve_errors,
w_blocks = w_p1_tot_blocks + w_p2_tot_blocks,
w_digs = w_p1_tot_digs + w_p2_tot_digs,
l_attacks = l_p1_tot_attacks + l_p2_tot_attacks,
l_kills = l_p1_tot_kills + l_p2_tot_kills,
l_errors = l_p1_tot_errors + l_p2_tot_errors,
l_aces = l_p1_tot_aces + l_p2_tot_aces,
l_serve_errors = l_p1_tot_serve_errors + l_p2_tot_serve_errors,
l_blocks = l_p1_tot_blocks + l_p2_tot_blocks,
l_digs = l_p1_tot_digs + l_p2_tot_digs
) %>%
na.omit()
winners <- vb_parsed %>%
select(circuit, gender, year,
w_attacks:w_digs) %>%
rename_with(~ str_remove_all(., "w_"), w_attacks:w_digs) %>%
mutate(win = "win")
losers <- vb_parsed %>%
select(circuit, gender, year,
l_attacks:l_digs) %>%
rename_with(~ str_remove_all(., "l_"), l_attacks:l_digs) %>%
mutate(win = "lose")
vb_df <- bind_rows(winners, losers) %>%
mutate_if(is.character, factor)
```
An XGBoost model is based on trees, so we don’t need to do much preprocessing for our data; we don’t need to worry about the factors or centering or scaling our data.
Doesn't `xgboost` accept only numeric input? The R/tidymodels interface handles the encoding for us, but it is still worth double-checking the data.
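A quick sanity check of the column types before modeling:

```{r}
# circuit, gender, and win are factors; the remaining columns are numeric
glimpse(vb_df)
```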
```{r, eval=FALSE}
usemodels::use_xgboost(win ~ ., data = vb_df)
```
```{r}
set.seed(123)
vb_split <- initial_split(vb_df, strata = win)
vb_train <- training(vb_split)
vb_test <- testing(vb_split)
xgb_spec <- boost_tree(
trees = 1000,
tree_depth = tune(), min_n = tune(),
loss_reduction = tune(), ## first three: model complexity
sample_size = tune(), mtry = tune(), ## randomness
learn_rate = tune() ## step size
) %>%
set_engine("xgboost",
seed = 42 # set the engine seed so results are reproducible in last_fit()
) %>%
set_mode("classification")
xgb_spec
```
`Space-filling parameter grids` cover the parameter space efficiently with relatively few points; Latin hypercube designs are one example:
```{r}
# ?grid_latin_hypercube
xgb_grid <- grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), vb_train),
learn_rate(),
size = 10
)
xgb_grid
xgb_wf <- workflow() %>%
add_formula(win ~ .) %>%
add_model(xgb_spec)
xgb_wf
set.seed(123)
vb_folds <- vfold_cv(vb_train, strata = win)
vb_folds
```
This takes quite a while to run on my small laptop.
```{r}
doParallel::registerDoParallel()
set.seed(234)
xgb_res <- tune_grid(
xgb_wf,
resamples = vb_folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
# tuning results after cross-validation over the grid
xgb_res
```
```{r}
# load('./datasets/xgb_res.rda')
collect_metrics(xgb_res)
xgb_res %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
select(mean, mtry:sample_size) %>%
pivot_longer(mtry:sample_size,
values_to = "value",
names_to = "parameter"
) %>%
ggplot(aes(value, mean, color = parameter)) +
geom_point(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x") +
labs(x = NULL, y = "AUC")
```
*Interpretation:* how each tuned parameter relates to AUC. For example, AUC looks best when mtry is around 4, min_n around 11, and learn_rate around 0.007.
`select_best()` returns the best tuning parameters:
```{r}
best_auc <- select_best(xgb_res, "roc_auc")
best_auc
```
```{r}
final_xgb <- finalize_workflow(
xgb_wf,
best_auc
)
final_xgb
```
```{r}
library(vip)
final_xgb %>%
fit(data = vb_train) %>%
extract_fit_parsnip() %>% # pull_workflow_fit() is deprecated
vip(geom = "point")
```
A closer look at the *final_res* object:
`last_fit()` fits the workflow with the finalized hyperparameters once on the training set and evaluates it once on the test set.
```{r}
final_res <- last_fit(final_xgb, vb_split)
collect_metrics(final_res)
```
```{r}
xgb_fit <- extract_fit_parsnip(final_res)
vip::vip(xgb_fit, geom = "point", num_features = 12)
```
`f_model` and `xgb_fit` should be identical, but they do not appear to be. Since `fit()` itself should not obviously involve a random seed, it is worth working out why.
The hyperparameters of `f_model` and `xgb_fit` look identical, yet the fitted values differ.
> https://github.com/tidymodels/tune/issues/300
> https://github.com/tidymodels/tune/pull/275
It seems this issue tends to appear for models that involve a random seed; specifying the seed in the model specification (as done above with `set_engine("xgboost", seed = 42)`) avoids it.
```{r}
f_model <- final_xgb %>%
fit(data = vb_train)
xgb_fit
test_predict <- predict(f_model, new_data = vb_test, type = "prob")
# test_predict does not match the predictions below; with no obvious source of randomness, why do they differ?
final_res$.predictions
```
Model performance on the test set:
```{r}
final_res %>%
collect_predictions() %>%
roc_curve(win, .pred_win) %>% # data for plotting the ROC curve
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_line(size = 1.5, color = "midnightblue") +
geom_abline(
lty = 2, alpha = 0.5,
color = "gray50",
size = 1.2
)
```
Save the model:
```{r}
# final_res$.workflow[[1]]
extract_workflow(final_res) %>%
readr::write_rds('') # supply an output file path here
# yaml::write_yaml()
library(vetiver)
v <- vetiver_model(
extract_workflow(final_res),
"your model name",
metadata = list(metrics = collect_metrics(final_res) %>% dplyr::select(-.config))
)
v
```
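Later the saved workflow can be read back and used for prediction. The path below is a placeholder; use whatever path was supplied to `write_rds()` above:

```{r, eval=FALSE}
# placeholder path -- replace with the path used in write_rds() above
saved_wf <- readr::read_rds("xgb_final_workflow.rds")
predict(saved_wf, new_data = vb_test, type = "prob")
```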
*SHAP*
```{r}
library(SHAPforxgboost)
# This workflow used add_formula() rather than a recipe, so there is no prepped
# recipe to bake(); instead, pull the processed predictors from the fitted
# workflow's mold (these should match the matrix xgboost was trained on).
vb_shap <-
  shap.prep(
    xgb_model = extract_fit_engine(xgb_fit),
    X_train = as.matrix(extract_mold(extract_workflow(final_res))$predictors)
  )

shap.plot.summary(vb_shap)

# Or create dependence plots for specific variables (columns of the volleyball data):
shap.plot.dependence(
  vb_shap,
  x = "kills",
  color_feature = "errors",
  size0 = 1.2,
  smooth = FALSE, add_hist = TRUE
)
```
### use xgboost package
The fit extracted with `extract_fit_parsnip(final_res)` wraps a native `xgb.Booster` object; `xgboost::xgb.train()` is the lower-level interface for training one without tidymodels.
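A minimal sketch of training a comparable model with the xgboost package directly, bypassing tidymodels. The one-hot encoding via `model.matrix()` and the parameter values below are illustrative choices, not the tuned values from above:

```{r, eval=FALSE}
library(xgboost)

# xgboost only accepts numeric input, so dummy encode the factor predictors
# and code the outcome as 0/1 ("win" = 1)
x_train <- model.matrix(win ~ . - 1, data = vb_train)
x_test <- model.matrix(win ~ . - 1, data = vb_test)
y_train <- as.integer(vb_train$win == "win")
y_test <- as.integer(vb_test$win == "win")

dtrain <- xgb.DMatrix(x_train, label = y_train)
dtest <- xgb.DMatrix(x_test, label = y_test)

fit_native <- xgb.train(
  params = list(
    objective = "binary:logistic",
    eta = 0.01, # learning rate
    max_depth = 6,
    subsample = 0.8
  ),
  data = dtrain,
  nrounds = 500,
  watchlist = list(train = dtrain, test = dtest), # monitor train/test loss
  verbose = 0
)

head(predict(fit_native, x_test)) # predicted probabilities of "win"
```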