---
title: 'Lab 1c. Species Distribution Modeling - Decision Trees'
editor_options:
chunk_output_type: console
---
# Learning Objectives {.unnumbered}
There are two broad categories of Supervised Learning based on the type of response you're modeling in $y \sim x$: if $y$ is a continuous value, it's **Regression**; if $y$ is a categorical value, it's **Classification**.
A **binary** response, such as presence or absence, is a categorical value, so typically a Classification technique would be used. However, by transforming the response with a logit link function, we were able to use Regression techniques like generalized linear (`glm()`) and generalized additive (`gam()`) models.
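As a quick reminder of that earlier approach, here's a minimal sketch (not evaluated; it assumes the data frame `d` with a binary `present` column and environmental predictors prepared in the Setup below):

```{r glm-reminder, eval=FALSE}
# (sketch) model a binary response with a logit link,
# the Regression approach used in the earlier lab
mdl_glm <- glm(
  present ~ ., family = binomial(link = "logit"), data = d)
summary(mdl_glm)
```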
In this portion of the lab you'll apply **Decision Trees** as a **Classification** technique to the data, with the response being categorical (`factor(present)`):
- **Recursive Partitioning** (`rpart()`)\
Originally called classification & regression trees (CART), but that name is trademarked (Breiman, 1984).
- **Random Forest** (`ranger()`)\
Actually an ensemble model, i.e. an aggregate of many decision trees.
# Setup
```{r setup}
# global knitr chunk options
knitr::opts_chunk$set(
warning = FALSE,
message = FALSE)
# load packages
librarian::shelf(
caret, # m: modeling framework
dplyr, ggplot2, here, readr,
gridExtra, # X: arrange multiple plots
pdp, # X: partial dependence plots
ranger, # m: random forest modeling
rpart, # m: recursive partition modeling
rpart.plot, # m: recursive partition plotting
rsample, # d: split train/test data
skimr, # d: skim summarize data table
vip) # X: variable importance
# options
options(
scipen = 999,
readr.show_col_types = FALSE)
set.seed(42)
# graphical theme
ggplot2::theme_set(ggplot2::theme_light())
# paths
dir_data <- here("data/sdm")
pts_env_csv <- file.path(dir_data, "pts_env.csv")
# read data
pts_env <- read_csv(pts_env_csv)
d <- pts_env %>%
select(-ID) %>% # not used as a predictor x
mutate(
present = factor(present)) %>% # categorical response
na.omit() # drop rows with NA
skim(d)
```
## Split data into training and testing
```{r dt-data-prereq, echo=TRUE}
# split the data: 80% training, 20% testing, stratified on the response
d_split <- rsample::initial_split(d, prop = 0.8, strata = "present")
d_train <- rsample::training(d_split)
d_test  <- rsample::testing(d_split)
# show counts of the response (0 = absent, 1 = present)
table(d$present)
table(d_train$present)
```
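Because the split is stratified on `present`, the training split's class balance should closely match the full data; a quick check of the proportions:

```{r check-strata}
# compare class proportions: full data vs. stratified training split
prop.table(table(d$present))
prop.table(table(d_train$present))
```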
# Decision Trees
Reading: Boehmke & Greenwell (2020) Hands-On Machine Learning with R. [Chapter 9 Decision Trees](https://bradleyboehmke.github.io/HOML/DT.html)
## Partition, depth=1
```{r rpart-stump, echo=TRUE, fig.width=4, fig.height=3, fig.show='hold', fig.cap="Decision stump: a tree limited to a single split (maxdepth = 1).", out.width="48%"}
# run decision stump model
mdl <- rpart(
present ~ ., data = d_train,
control = list(
cp = 0, minbucket = 5, maxdepth = 1))
mdl
# plot tree
par(mar = c(1, 1, 1, 1))
rpart.plot(mdl)
```
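As a quick sanity check (a sketch added here, not part of the original lab), the stump's training accuracy shows how much a single split buys:

```{r stump-accuracy}
# (sketch) training accuracy of the single-split stump
p_stump <- predict(mdl, d_train, type = "class")
mean(p_stump == d_train$present)
```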
## Partition, depth=default
```{r rpart-default, echo=TRUE, fig.width=4, fig.height=3, fig.show='hold', fig.cap="Decision tree for classifying $present$ with default rpart() settings.", out.width="48%"}
# decision tree with defaults
mdl <- rpart(present ~ ., data = d_train)
mdl
rpart.plot(mdl)
# plot complexity parameter
plotcp(mdl)
# rpart cross validation results
mdl$cptable
```
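One natural follow-up, sketched here: use the `cptable` to prune the tree back to the complexity parameter with the lowest cross-validated error (`xerror`):

```{r prune-tree}
# (sketch) prune to the cp with the lowest cross-validated error
cp_best <- mdl$cptable[which.min(mdl$cptable[, "xerror"]), "CP"]
mdl_pruned <- prune(mdl, cp = cp_best)
rpart.plot(mdl_pruned)
```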
## Feature interpretation
```{r cp-table, fig.cap="Cross-validated accuracy rate for the 20 different $\\alpha$ parameter values in our grid search. Lower $\\alpha$ values (deeper trees) help to minimize errors.", fig.height=3}
# caret cross validation results
mdl_caret <- train(
present ~ .,
data = d_train,
method = "rpart",
trControl = trainControl(method = "cv", number = 10),
tuneLength = 20)
ggplot(mdl_caret)
```
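The winning complexity parameter can be pulled directly from the fitted `train` object:

```{r best-tune}
# best cp found by the caret grid search
mdl_caret$bestTune
```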
```{r dt-vip, fig.height=5.5, fig.cap="Variable importance based on the total reduction in node impurity for the $present$ decision tree."}
vip(mdl_caret, num_features = 40, bar = FALSE)
```
```{r dt-pdp, fig.width=10, fig.height= 3.5, fig.cap="Partial dependence plots to understand the relationship between lat, WC_bio2 and present."}
# Construct partial dependence plots
p1 <- partial(mdl_caret, pred.var = "lat") %>% autoplot()
p2 <- partial(mdl_caret, pred.var = "WC_bio2") %>% autoplot()
p3 <- partial(mdl_caret, pred.var = c("lat", "WC_bio2")) %>%
plotPartial(levelplot = FALSE, zlab = "yhat", drape = TRUE,
colorkey = TRUE, screen = list(z = -20, x = -60))
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 3)
```
# Random Forests
Reading: Boehmke & Greenwell (2020) Hands-On Machine Learning with R. [Chapter 11 Random Forests](https://bradleyboehmke.github.io/HOML/random-forest.html)
See also: [Random Forest -- Modeling methods | R Spatial](https://rspatial.org/raster/sdm/6_sdm_methods.html#random-forest)
## Fit
```{r out-of-box-rf}
# number of features
n_features <- length(setdiff(names(d_train), "present"))
# fit a default random forest model
mdl_rf <- ranger(present ~ ., data = d_train)
# out-of-bag (OOB) prediction error; for a classification
# forest this is the misclassification rate, not an RMSE
mdl_rf$prediction.error
```
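The `n_features` count comes in handy for tuning: ranger's default `mtry` for classification is `floor(sqrt(n_features))`, and it's worth checking nearby values. A minimal grid-search sketch over `mtry` and `min.node.size` (the candidate values are illustrative assumptions, following the HOML chapter's approach), ranking candidates by OOB error:

```{r rf-tune}
# (sketch) small grid of candidate hyperparameter values
hyper_grid <- expand.grid(
  mtry          = pmax(1, floor(n_features * c(0.25, 0.5, 0.75))),
  min.node.size = c(1, 5, 10),
  oob_error     = NA)

# fit a forest for each candidate and record its OOB error
for (i in seq_len(nrow(hyper_grid))) {
  fit <- ranger(
    present ~ ., data = d_train,
    mtry          = hyper_grid$mtry[i],
    min.node.size = hyper_grid$min.node.size[i],
    seed          = 42)
  hyper_grid$oob_error[i] <- fit$prediction.error
}

# candidates ranked best (lowest OOB error) first
head(hyper_grid[order(hyper_grid$oob_error), ])
```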
## Feature interpretation
```{r feature-importance}
# re-run model with impurity-based variable importance
mdl_impurity <- ranger(
present ~ ., data = d_train,
importance = "impurity")
# re-run model with permutation-based variable importance
mdl_permutation <- ranger(
present ~ ., data = d_train,
importance = "permutation")
```
```{r feature-importance-plot, fig.cap="Most important variables based on impurity (left) and permutation (right).", fig.height=4.5, fig.width=10}
p1 <- vip::vip(mdl_impurity, bar = FALSE)
p2 <- vip::vip(mdl_permutation, bar = FALSE)
gridExtra::grid.arrange(p1, p2, nrow = 1)
```
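Finally, a minimal sketch (an addition, not from the original lab) of checking the default forest against the held-out test split created earlier; the test error should roughly agree with the OOB error:

```{r rf-test-eval}
# (sketch) predict on the held-out test split
pred_test <- predict(mdl_rf, data = d_test)

# confusion matrix of predicted vs. observed presence
caret::confusionMatrix(pred_test$predictions, d_test$present)
```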