---
title: "xgboost"
author: "liuc"
date: "1/17/2022"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## xgboost
> https://juliasilge.com/blog/board-games/
> https://juliasilge.com/blog/xgboost-tune-volleyball/
> https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
> http://fancyerii.github.io/books/xgboost/
> https://www.quora.com/What-is-the-difference-between-the-R-gbm-gradient-boosting-machine-and-xgboost-extreme-gradient-boosting
> https://xgboost.readthedocs.io/en/latest/tutorials/model.html (key reference)
> https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html
XGBoost (Extreme Gradient Boosting) is a boosted tree model within the tree-based family, whereas random forest belongs to bagging. XGBoost is an efficient system implementation of gradient boosting rather than a single algorithm: besides trees (`gbtree`), its base learners can also be linear models (`gblinear`), while GBDT refers specifically to the gradient boosted decision tree algorithm. Traditional GBDT uses CART as the base learner; because XGBoost also supports linear base learners, in that case it is equivalent to logistic regression (classification) or linear regression (regression) with L1 and L2 regularization.
It works by building an ensemble of decision trees: each tree is trained on the data (optionally a subsample) and predicts the target variable, and the predictions from all trees are combined into a final prediction. The trees are built iteratively to minimize a loss function, usually the sum of squared errors for regression. In each iteration the algorithm searches for the best feature and split value at each node, builds a tree from those splits, and adds the new tree's contribution so that the loss keeps decreasing as the model trains.
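A minimal conceptual sketch of that loop for squared-error loss (this illustrates the boosting idea, not xgboost's actual implementation; the helper name `boost_sketch` and the use of `rpart` are purely for illustration):

```{r}
library(rpart)

# fit trees sequentially: each tree models the current residuals (the negative
# gradient of squared-error loss) and is added with a small learning rate
boost_sketch <- function(df, y, n_trees = 50, learn_rate = 0.1, depth = 2) {
  pred <- rep(mean(df[[y]]), nrow(df)) # start from the mean prediction
  x <- df[setdiff(names(df), y)]
  trees <- vector("list", n_trees)
  for (i in seq_len(n_trees)) {
    resid <- df[[y]] - pred # what the earlier trees have not yet explained
    trees[[i]] <- rpart(.resid ~ ., data = cbind(x, .resid = resid), maxdepth = depth)
    pred <- pred + learn_rate * predict(trees[[i]], x) # shrink each tree's contribution
  }
  list(trees = trees, train_pred = pred)
}

# e.g. boost_sketch(mtcars, "mpg")$train_pred
```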
The XGBoost library focuses heavily on computational speed and model performance. XGBoost supports three main forms of gradient boosting (see the engine-argument sketch after this list):
1. The Gradient Boosting algorithm, also known as a gradient boosting machine, with a learning rate.
2. Stochastic Gradient Boosting, with subsampling of rows, of columns, and of columns per split.
3. Regularized Gradient Boosting, with L1 and L2 regularization.
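These three forms map onto native xgboost parameters that can be passed through `set_engine()`. The values below are placeholders for illustration only (tidymodels is attached in the setup chunk below, hence `eval=FALSE` here):

```{r, eval=FALSE}
boost_tree(trees = 1000, learn_rate = 0.01) %>% # (1) gradient boosting with a learning rate (eta)
  set_engine(
    "xgboost",
    subsample = 0.8,        # (2) stochastic boosting: subsample rows for each tree
    colsample_bytree = 0.8, # (2) subsample columns for each tree
    lambda = 1, alpha = 0   # (3) regularized boosting: L2 / L1 penalties
  ) %>%
  set_mode("regression")
```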
*XGBoost vs. Random Forest:* The main difference is that in random forests the trees are independent, while in boosting tree N+1 focuses its learning on the loss, i.e., on what was not modeled well by tree N. This difference matters for a corner case in feature-importance analysis: correlated features.
*What data suits xgboost:*
XGBoost manages only `numeric` vectors.
What to do when you have categorical data?
To answer that question we will convert categorical variables to numeric ones, i.e., transform the categorical data into dummy variables. Several encoding methods exist; one-hot encoding is a common approach. We will use dummy contrast coding, which is popular because it produces a "full rank" encoding (also see the blog post by Max Kuhn).
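For example, with a recipe, dummy or one-hot encoding of a nominal column looks like this (a sketch with a made-up data frame `some_df` and columns `outcome`/`color`, not the board-game data used below):

```{r, eval=FALSE}
library(recipes)

some_df <- data.frame(
  outcome = rnorm(6),
  color = factor(c("red", "blue", "red", "green", "blue", "red"))
)

recipe(outcome ~ ., data = some_df) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% # one indicator column per level
  prep() %>%
  bake(new_data = NULL)
```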
*Key xgboost hyperparameters:* mtry (randomly selected predictors), trees, min_n (minimal node size), learn_rate (learning rate), loss_reduction (minimum loss reduction), tree_depth, stop_iter, sample_size (proportion of observations sampled).
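Each of these has a corresponding parameter object in `dials`, which is a quick way to inspect the default tuning ranges (a sketch; `dials` is attached with tidymodels in the setup chunk below, hence `eval=FALSE` here):

```{r, eval=FALSE}
# default tuning ranges that dials assumes for the main boost_tree() hyperparameters
trees()
min_n()
tree_depth()
learn_rate()
loss_reduction()
sample_prop() # used for sample_size in the grids below
mtry()        # upper bound must be finalized against the data, e.g. finalize(mtry(), game_train)
```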
```{r, echo=FALSE, include=FALSE}
library(tidyverse)
library(tidymodels)
library(xgboost)
library(SHAPforxgboost)
library(textrecipes)
library(finetune)
library(vip)
tidymodels_prefer()
```
### Regression
The example data come from the first link above; the task is to predict each board game's average rating.
```{r}
# df <- read_delim('./datasets/prostata.tab', delim = '\t') %>%
# select(-t) %>%
# filter(!class %in% c('discrete', 'class')) %>%
# mutate(across(ends_with('_at'), as.numeric))
# head(df)[1:5]
# dim(df)
ratings <- read_csv("./datasets/ratings.csv")
details <- read_csv("./datasets/details.csv")
ratings_joined <-
ratings %>%
left_join(details, by = "id")
# distribution of the average rating
ggplot(ratings_joined, aes(average)) +
geom_histogram(alpha = 0.8)
# a quick look at the relationship between minimum recommended age and rating
ratings_joined %>%
filter(!is.na(minage)) %>%
mutate(minage = cut_number(minage, 4)) %>%
ggplot(aes(minage, average, fill = minage)) +
geom_boxplot(alpha = 0.2, show.legend = FALSE)
```
Let's start our modeling.
```{r}
set.seed(123)
# model on a subset of the features
game_split <-
ratings_joined %>%
select(name, average, matches("min|max"), boardgamecategory) %>%
na.omit() %>%
initial_split(strata = average)
game_train <- training(game_split)
game_test <- testing(game_split)
set.seed(234)
game_folds <- vfold_cv(game_train, strata = average)
game_folds
```
*set up our feature engineering:*
Sometimes a dataset requires more care and custom feature engineering; the tidymodels ecosystem provides lots of fluent options for common use cases and then the ability to extend our framework for more specific needs while maintaining good statistical practice.
```{r}
split_category <- function(x) {
x %>%
str_split(", ") %>%
map(str_remove_all, "[:punct:]") %>%
map(str_squish) %>%
map(str_to_lower) %>%
map(str_replace_all, " ", "_")
}
game_rec <-
recipe(average ~ ., data = game_train) %>%
update_role(name, new_role = "id") %>%
step_tokenize(boardgamecategory, custom_token = split_category) %>%
step_tokenfilter(boardgamecategory, max_tokens = 30) %>%
step_tf(boardgamecategory)
## just to make sure this works as expected
game_prep <- prep(game_rec)
bake(game_prep, new_data = NULL) %>% str()
```
Now let’s create a tunable xgboost model specification, with only some of the most important hyperparameters tunable, and combine it with our preprocessing recipe in a workflow().
```{r}
xgb_spec <-
boost_tree(
trees = tune(),
mtry = tune(),
min_n = tune(),
learn_rate = 0.01
) %>%
set_engine("xgboost") %>%
set_mode("regression")
# xgb_spec <- boost_tree(
# trees = 1000,
# tree_depth = tune(),
# min_n = tune(),
# loss_reduction = tune(),## first three: model complexity
# sample_size = tune(),
# mtry = tune(),
# learn_rate = tune()
# ) %>%
# set_engine("xgboost") %>%
# set_mode("classification")
xgb_wf <- workflow(game_rec, xgb_spec)
xgb_wf
```
Use `tune_race_anova()` to eliminate parameter combinations that are not doing well.
```{r}
doParallel::registerDoParallel()
# Space-filling parameter grid covering the tunable arguments of xgb_spec
xgb_grid <- grid_latin_hypercube(
  trees(),
  min_n(),
  finalize(mtry(), game_train),
  size = 30
)
xgb_grid
# IT’S TIME TO TUNE.
set.seed(234)
# xgb_res <- tune_grid(
# xgb_wf,
# resamples = game_folds,
# grid = xgb_grid,
# control = control_grid(save_pred = TRUE)
# )
#
# xgb_res
# tune_race_anova() differs from tune_grid() in that it drops clearly
# underperforming parameter combinations early instead of evaluating every
# combination on every resample. grid = 20 below asks for a 20-point
# space-filling grid; xgb_grid above could be supplied instead.
set.seed(234)
xgb_game_rs <-
tune_race_anova(
xgb_wf,
game_folds,
grid = 20,
control = control_race(verbose_elim = TRUE)
)
xgb_game_rs
```
```{r}
# load('./datasets/xgb_game_rs.rda')
xgb_game_rs %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  select(mean, mtry, trees, min_n) %>%
  pivot_longer(c(mtry, trees, min_n),
    values_to = "value",
    names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "RMSE")
```
*Evaluate models*
Notice how we saved a TON of time by not evaluating the parameter combinations that were clearly doing poorly on all the resamples; we only kept going with the good parameter combinations.
```{r}
plot_race(xgb_game_rs)
show_best(xgb_game_rs)
# last_fit() to fit one final time to the training data and evaluate one final time on the testing data.
xgb_last <-
xgb_wf %>%
finalize_workflow(select_best(xgb_game_rs, "rmse")) %>%
last_fit(game_split)
xgb_last
```
Let’s start with model-based variable importance using the `vip` package.
xgboost does not offer direct variable explanations the way a linear model does, but several methods can recover the importance of the predictors.
```{r}
# ?xgb.importance
xgb_fit <- extract_fit_parsnip(xgb_last)
vip::vip(xgb_fit, geom = "point", num_features = 12)
```
*Interpretation:* the plot above shows that maximum playing time and minimum age are the most important predictors driving the predicted game rating.
```{r}
xgb_last %>% collect_metrics()
```
*Use Shapley Additive Explanations (SHAP) to assess feature importance*
Why SHAP values
SHAP’s main advantages are local explanation and consistency in global model structure.
Tree-based machine learning models (random forest, gradient boosted trees, XGBoost) are the most popular non-linear models today. SHAP (SHapley Additive exPlanations) values are claimed to be the most advanced method for interpreting results from tree-based models. They are based on Shapley values from game theory, and present feature importance via each feature's marginal contribution to the model outcome.
This GitHub page explains the Python package developed by Scott Lundberg. Here we show all the visualizations in R. The `xgboost::xgb.plot.shap()` function can also make simple dependence plots.
> https://liuyanguu.github.io/post/2019/07/18/visualization-of-shap-for-xgboost/
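As a quick alternative to the SHAPforxgboost plots below, xgboost's own `xgb.plot.shap()` can draw dependence plots directly from the extracted booster. This sketch reuses `xgb_fit` and `game_prep` from the chunks above; the argument names follow my reading of the xgboost documentation:

```{r, eval=FALSE}
booster <- extract_fit_engine(xgb_fit) # the underlying xgb.Booster
game_mat <- bake(game_prep, has_role("predictor"), new_data = NULL, composition = "matrix")

# dependence plots for the 4 features with the largest mean |SHAP|
xgboost::xgb.plot.shap(data = game_mat, model = booster, top_n = 4, n_col = 2)
```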
```{r}
# Please note that the SHAP values are generated by 'XGBoost' and 'LightGBM'; we just plot them.
# This package currently only works with 'XGBoost' and 'LightGBM' models
library(SHAPforxgboost)
# # To prepare the long-format data:
game_shap <-
shap.prep(
xgb_model = extract_fit_engine(xgb_fit),
X_train = bake(game_prep,
has_role("predictor"),
new_data = NULL,
composition = "matrix"
)
)
shap.plot.summary(game_shap)
# Or create partial dependence plots for specific variables:
shap.plot.dependence(
game_shap,
x = "minage",
color_feature = "minplayers",
size0 = 1.2,
smooth = FALSE, add_hist = TRUE
)
# SHAP force plot
# The SHAP force plot basically stacks these SHAP values for each observation, and show how the final output was obtained as a sum of each predictor’s attributions.
```
*Interpretation:* the first plot can be read as a variable-importance summary, analogous to the VIP plot. The second is a dependence plot: for a chosen variable, it plots the SHAP values against the feature values, colored by another feature.
```{r}
# SHAP force plot
# To return the SHAP values and ranked features by mean|SHAP|
shap_values <- shap.values(
  xgb_model = extract_fit_engine(xgb_fit),
  # X_train must be the same baked predictor matrix the booster was trained on
  X_train = bake(game_prep, has_role("predictor"), new_data = NULL, composition = "matrix")
)
# choose to show top 4 features by setting `top_n = 4`,
# set 6 clustering groups of observations.
plot_data <- shap.prep.stack.data(shap_contrib = shap_values$shap_score, top_n = 4, n_groups = 6)
# you may choose to zoom in at a location, and set y-axis limit using `y_parent_limit`
shap.plot.force_plot(plot_data, zoom_in_location = 5000, y_parent_limit = c(-0.1,0.1))
# plot the 6 clusters
shap.plot.force_plot_bygroup(plot_data)
```
### Classification
This is a classification dataset, from the second link above:
```{r}
vb_matches <- readr::read_csv('./datasets/vb_matches.csv', guess_max = 80000)
vb_matches
vb_parsed <- vb_matches %>%
transmute(
circuit,
gender,
year,
w_attacks = w_p1_tot_attacks + w_p2_tot_attacks,
w_kills = w_p1_tot_kills + w_p2_tot_kills,
w_errors = w_p1_tot_errors + w_p2_tot_errors,
w_aces = w_p1_tot_aces + w_p2_tot_aces,
w_serve_errors = w_p1_tot_serve_errors + w_p2_tot_serve_errors,
w_blocks = w_p1_tot_blocks + w_p2_tot_blocks,
w_digs = w_p1_tot_digs + w_p2_tot_digs,
l_attacks = l_p1_tot_attacks + l_p2_tot_attacks,
l_kills = l_p1_tot_kills + l_p2_tot_kills,
l_errors = l_p1_tot_errors + l_p2_tot_errors,
l_aces = l_p1_tot_aces + l_p2_tot_aces,
l_serve_errors = l_p1_tot_serve_errors + l_p2_tot_serve_errors,
l_blocks = l_p1_tot_blocks + l_p2_tot_blocks,
l_digs = l_p1_tot_digs + l_p2_tot_digs
) %>%
na.omit()
winners <- vb_parsed %>%
select(circuit, gender, year,
w_attacks:w_digs) %>%
rename_with(~ str_remove_all(., "w_"), w_attacks:w_digs) %>%
mutate(win = "win")
losers <- vb_parsed %>%
select(circuit, gender, year,
l_attacks:l_digs) %>%
rename_with(~ str_remove_all(., "l_"), l_attacks:l_digs) %>%
mutate(win = "lose")
vb_df <- bind_rows(winners, losers) %>%
mutate_if(is.character, factor)
```
An XGBoost model is based on trees, so we don’t need to do much preprocessing for our data; we don’t need to worry about the factors or centering or scaling our data.
Doesn't `xgboost` accept only numeric input? The R/tidymodels interface handles the encoding for us, but it is still worth double-checking the data.
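A quick sanity check of the column types before modeling:

```{r}
# circuit, gender, and win are factors; the remaining columns are numeric
glimpse(vb_df)
```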
```{r, eval=FALSE}
usemodels::use_xgboost(win ~ ., data = vb_df)
```
```{r}
set.seed(123)
vb_split <- initial_split(vb_df, strata = win)
vb_train <- training(vb_split)
vb_test <- testing(vb_split)
xgb_spec <- boost_tree(
trees = 1000,
tree_depth = tune(), min_n = tune(),
loss_reduction = tune(), ## first three: model complexity
sample_size = tune(), mtry = tune(), ## randomness
learn_rate = tune() ## step size
) %>%
set_engine("xgboost",
seed = 42 # set the engine seed so results are reproducible in last_fit()
) %>%
set_mode("classification")
xgb_spec
```
`Space-filling parameter grids` cover the parameter space efficiently with relatively few points; Latin hypercube designs are one example:
```{r}
# ?grid_latin_hypercube
xgb_grid <- grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), vb_train),
learn_rate(),
size = 10
)
xgb_grid
xgb_wf <- workflow() %>%
add_formula(win ~ .) %>%
add_model(xgb_spec)
xgb_wf
set.seed(123)
vb_folds <- vfold_cv(vb_train, strata = win)
vb_folds
```
This takes quite a while to run on my small laptop.
```{r}
doParallel::registerDoParallel()
set.seed(234)
xgb_res <- tune_grid(
xgb_wf,
resamples = vb_folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
# tuning results after cross-validation over the grid
xgb_res
```
```{r}
# load('./datasets/xgb_res.rda')
collect_metrics(xgb_res)
xgb_res %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
select(mean, mtry:sample_size) %>%
pivot_longer(mtry:sample_size,
values_to = "value",
names_to = "parameter"
) %>%
ggplot(aes(value, mean, color = parameter)) +
geom_point(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x") +
labs(x = NULL, y = "AUC")
```
*Interpretation:* how each tuned parameter relates to AUC. For example, AUC looks best when mtry is around 4, min_n around 11, and learn_rate around 0.007.
`select_best()` returns the best tuning parameters:
```{r}
best_auc <- select_best(xgb_res, "roc_auc")
best_auc
```
```{r}
final_xgb <- finalize_workflow(
xgb_wf,
best_auc
)
final_xgb
```
```{r}
library(vip)
final_xgb %>%
fit(data = vb_train) %>%
extract_fit_parsnip() %>% # pull_workflow_fit() is deprecated
vip(geom = "point")
```
A closer look at the *final_res* object:
`last_fit()` fits the workflow with the finalized hyperparameters once on the training set and evaluates it once on the test set.
```{r}
final_res <- last_fit(final_xgb, vb_split)
collect_metrics(final_res)
```
```{r}
xgb_fit <- extract_fit_parsnip(final_res)
vip::vip(xgb_fit, geom = "point", num_features = 12)
```
`f_model` and `xgb_fit` should be identical, but they do not appear to be. Since `fit()` itself should not obviously involve a random seed, it is worth working out why.
The hyperparameters of `f_model` and `xgb_fit` look identical, yet the fitted values differ.
> https://github.com/tidymodels/tune/issues/300
> https://github.com/tidymodels/tune/pull/275
It seems this issue tends to appear for models that involve a random seed; specifying the seed in the model specification (as done above with `set_engine("xgboost", seed = 42)`) avoids it.
```{r}
f_model <- final_xgb %>%
fit(data = vb_train)
xgb_fit
test_predict <- predict(f_model, new_data = vb_test, type = "prob")
# test_predict does not match the predictions below; with no obvious source of randomness, why do they differ?
final_res$.predictions
```
Model performance on the test set:
```{r}
final_res %>%
collect_predictions() %>%
roc_curve(win, .pred_win) %>% # data for plotting the ROC curve
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_line(size = 1.5, color = "midnightblue") +
geom_abline(
lty = 2, alpha = 0.5,
color = "gray50",
size = 1.2
)
```
Save the model:
```{r}
# final_res$.workflow[[1]]
extract_workflow(final_res) %>%
readr::write_rds('') # supply an output file path here
# yaml::write_yaml()
library(vetiver)
v <- vetiver_model(
extract_workflow(final_res),
"your model name",
metadata = list(metrics = collect_metrics(final_res) %>% dplyr::select(-.config))
)
v
```
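Later the saved workflow can be read back and used for prediction. The path below is a placeholder; use whatever path was supplied to `write_rds()` above:

```{r, eval=FALSE}
# placeholder path -- replace with the path used in write_rds() above
saved_wf <- readr::read_rds("xgb_final_workflow.rds")
predict(saved_wf, new_data = vb_test, type = "prob")
```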
*SHAP*
```{r}
library(SHAPforxgboost)
# This workflow used add_formula() rather than a recipe, so there is no prepped
# recipe to bake(); instead, pull the processed predictors from the fitted
# workflow's mold (these should match the matrix xgboost was trained on).
vb_shap <-
  shap.prep(
    xgb_model = extract_fit_engine(xgb_fit),
    X_train = as.matrix(extract_mold(extract_workflow(final_res))$predictors)
  )

shap.plot.summary(vb_shap)

# Or create dependence plots for specific variables (columns of the volleyball data):
shap.plot.dependence(
  vb_shap,
  x = "kills",
  color_feature = "errors",
  size0 = 1.2,
  smooth = FALSE, add_hist = TRUE
)
```
### use xgboost package
The fit extracted with `extract_fit_parsnip(final_res)` wraps a native `xgb.Booster` object; `xgboost::xgb.train()` is the lower-level interface for training one without tidymodels.
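A minimal sketch of training a comparable model with the xgboost package directly, bypassing tidymodels. The one-hot encoding via `model.matrix()` and the parameter values below are illustrative choices, not the tuned values from above:

```{r, eval=FALSE}
library(xgboost)

# xgboost only accepts numeric input, so dummy encode the factor predictors
# and code the outcome as 0/1 ("win" = 1)
x_train <- model.matrix(win ~ . - 1, data = vb_train)
x_test <- model.matrix(win ~ . - 1, data = vb_test)
y_train <- as.integer(vb_train$win == "win")
y_test <- as.integer(vb_test$win == "win")

dtrain <- xgb.DMatrix(x_train, label = y_train)
dtest <- xgb.DMatrix(x_test, label = y_test)

fit_native <- xgb.train(
  params = list(
    objective = "binary:logistic",
    eta = 0.01, # learning rate
    max_depth = 6,
    subsample = 0.8
  ),
  data = dtrain,
  nrounds = 500,
  watchlist = list(train = dtrain, test = dtest), # monitor train/test loss
  verbose = 0
)

head(predict(fit_native, x_test)) # predicted probabilities of "win"
```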