---
title: "06_Practical; Trees"
author: "Ryan Greenup 17805315"
date: "23 August 2019"
output:
  html_document:
    code_folding: hide
    keep_md: yes
    theme: flatly
    toc: yes
    toc_depth: 4
    toc_float: no
  pdf_document:
    toc: yes
always_allow_html: yes
## Shiny can be good but {.tabset} will be more compatible with PDF,
## but you can submit HTML in turnitin so it doesn't really matter.
## If a floating toc is used in the document, only use {.tabset} on more or less copy/pasted
## sections with different datasets
---
<!-- https://github.com/rstudio/rmarkdown/issues/1453#issuecomment-425327570 How to use echo -->
```{r setup, include = FALSE, results = "hide", eval = TRUE}
knitr::opts_chunk$set(echo = TRUE)
# Use pacman to install/load everything in one call
if (require('pacman')) {
  library('pacman')
} else {
  install.packages('pacman')
  library('pacman')
}
pacman::p_load(caret, scales, ggplot2, rmarkdown, shiny, ISLR, class, BiocManager, corrplot, plotly, tidyverse, latex2exp, stringr, reshape2, cowplot, ggpubr, rstudioapi, wesanderson, RColorBrewer, colorspace, gridExtra, grid, car, boot, colourpicker, tree, ggtree, mise, rpart, rpart.plot, knitr, MASS, magrittr)
# MASS isn't available for R 3.5...
set.seed(1) # Set the seed such that we might have comparable results
# Wipe all the damn variables to prevent mistakes
mise()
# Pretty-print the structure of a data frame as a kable table
kabstr <- function(df){
  data.frame(variable = names(df),
             class = sapply(df, typeof),
             first_values = sapply(df, function(x) paste0(head(x), collapse = ", ")),
             row.names = NULL) %>%
    kable()
}
```
# (Wk 6) Tree Based Methods
Material of Tue 30 August 2019, week 6
## Classification Trees
Let's see which variables are predictive of high car sales. The number of sales is continuous, so in order to fit a classification tree we will first construct a binary variable.
When using `tree` the response variable MUST be a factor, so be mindful of that.
When transforming columns to factors (see the toy example below):
* Don't use `as.factor`, because it doesn't allow specifying the order of the levels
* Use tibbles, because they'll tell you whether or not the column is ordered
* Feed the column into `factor()`, specify `levels = c()` in ascending order, and specify `ordered = TRUE`
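A tiny illustration of the difference (a toy vector, not from the data): `as.factor` orders the levels alphabetically, while `factor()` respects the order you give it.
```{r}
x <- c("Low", "High", "Low")
levels(as.factor(x))                                         # alphabetical: "High" "Low"
levels(factor(x, levels = c("Low", "High"), ordered = TRUE)) # the order we specified
```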
### Preview the data:
```{r}
CSeat.tb <- as_tibble(Carseats)
thresh <- (mean(CSeat.tb$Sales)+0.5*sd(CSeat.tb$Sales)) %>% round()
CSeat.tb$CatSales <- ifelse(CSeat.tb$Sales > thresh, "High", "Low")
CSeat.tb$CatSales <- factor(x = CSeat.tb$CatSales, levels = c("Low", "High"), ordered = TRUE)
CSeat.tb <- CSeat.tb[,c(12, 2:11)]
# Another way to reorder (printing the select result also previews the data)
dplyr::select(CSeat.tb, CatSales, everything())
```
### Fit
Now let's use all of the data to create a classification tree of the sales category.
Remember that we can use `.` to represent all the remaining variables of a data frame;
* If `Sales` was still in the data frame we could exclude it by prefixing it with `-` in the formula, but I think that's awfully confusing; instead I would recommend specifying a subset that drops the undesired columns, or just making another data frame, because it's way too easy to make a mistake like that IMO (see the sketch below)
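As a rough sketch of the two options, using a hypothetical `CSeatFull` frame that still contains the raw `Sales` column (and the `thresh` defined above):
```{r}
# Hypothetical frame that still contains the raw Sales column
CSeatFull <- as.data.frame(Carseats)
CSeatFull$CatSales <- factor(ifelse(CSeatFull$Sales > thresh, "High", "Low"),
                             levels = c("Low", "High"), ordered = TRUE)
# Option 1: exclude Sales inside the formula (easy to forget!)
mod.formula <- tree(CatSales ~ . - Sales, data = CSeatFull)
# Option 2: drop the column first, which I find less error-prone
mod.subset <- tree(CatSales ~ ., data = dplyr::select(CSeatFull, -Sales))
```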
```{r}
CatSalesModTree <- tree(formula = CatSales ~ ., data = as.data.frame(CSeat.tb))
CatSalesModTree
nrow(CSeat.tb)
CatSalesModTree %>% summary()
```
### Plot the Model {.tabset}
#### Base
This summary is next to useless, so let's plot it; when plotting, it is necessary to call `text` afterwards in order to have labels:
```{r}
plot(CatSalesModTree)
text (CatSalesModTree, pretty = 0, cex = 0.6)
```
#### Use `rpart`
The base plot looks truly awful; instead I would recommend using the `rpart` package, which creates a tree model using syntax identical to `tree`, and then you can call `rpart.plot` directly on the model and it will work.
Be mindful that the `tree` package keeps splitting the data until nodes fall below a minimum size (see `tree.control`), whereas the `rpart` package automatically performs 10-fold cross-validation [^rpart]; the ratio contained in the plot is the proportion of observations satisfying the specified condition, and the percentage is the percentage of all observations contained in that node [^rpvig].
[^rpvig]: Refer to the [rpart Vignette](http://www.milbo.org/rpart-plot/prp.pdf)
[^rpart]: [rpart.control](https://www.rdocumentation.org/packages/rpart/versions/4.1-15/topics/rpart.control) has a default list entry of `xval=10` which corresponds to the number of CV folds.
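As a sketch of where those defaults live (the parameter values below are the documented defaults of each package, written out explicitly rather than tuned choices):
```{r}
# tree: splitting stops once a child node would fall below these sizes (documented defaults)
ctrl.tree <- tree.control(nobs = nrow(CSeat.tb), mincut = 5, minsize = 10, mindev = 0.01)
tree.explicit <- tree(CatSales ~ ., data = as.data.frame(CSeat.tb), control = ctrl.tree)
# rpart: nodes with fewer than minsplit observations are not split; xval sets the CV folds
ctrl.rpart <- rpart.control(minsplit = 20, xval = 10)
rpart.explicit <- rpart(CatSales ~ ., data = CSeat.tb, control = ctrl.rpart)
```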
```{r}
CatSalesModTree.rpart <- rpart(formula = CatSales ~ ., data = CSeat.tb)
rpart.plot(CatSalesModTree.rpart, box.palette="OrGy", shadow.col="gray", nn=TRUE)
```
I have no clue whether an `rpart` tree object behaves well in other functions, so I'm just going to use one for plotting and the other for modelling until I have time to look into them.
### Evaluate the tree by using a validation Split
In order to very quickly evaluate the performance of this model, split the data and use the model on unseen data.
```{r}
train <- sample(1:nrow(Carseats), 200)
#Create the Categorical variable
CSeat <- Carseats
CSeat$CatSales <- ifelse(Carseats$Sales < round(mean(Carseats$Sales)), "Low", "High")
CSeat$CatSales <- factor(CSeat$CatSales, levels = c("Low", "High"), ordered = TRUE)
#Create the model on the training data only
carseats.tree <- tree(CatSales ~ ., data = CSeat[train,names(CSeat) != "Sales"])
# plot(carseats.tree) ;text(carseats.tree, pretty = 0, cex = 0.6)
carseats.tree.rpart <- rpart(CatSales ~ ., data = CSeat[train, -1], model = TRUE) # model = TRUE prevents errors in later calls
#rpart.plot(carseats.tree.rpart, nn = TRUE)
# Create Predictions on the testing data
test.pred <- predict(object = carseats.tree, newdata = CSeat[-train,], type = "class" )
# Filter out the observations in the testing data
test.obs <- CSeat$CatSales[-train]
# Now create the confusion Matrix
# This package prevents making mistakes
conf.mat <- caret::confusionMatrix(data = test.pred, reference = test.obs)
# The same table could otherwise be created as follows; always use the order (prediction, reference) as a standard
table("prediction" = test.pred, "reference" = test.obs)
```
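The inline rate below can also be sanity-checked by hand; the misclassification rate is just the proportion of disagreements between predictions and observations:
```{r}
# Should agree with 1 - accuracy from the caret confusion matrix
mean(test.pred != test.obs)
```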
The misclassification rate for unseen testing data is `r (1 - conf.mat$overall[1]) %>% round(2) %>% as.numeric() %>% percent()`, which is reasonably low, suggesting that the accuracy of this model is acceptable, particularly because this model's primary purpose is interpretability, not predictive capacity.
### Tree Pruning
This model may be overfit, so it is necessary to determine whether or not to 'prune' the tree, that is, reduce the model flexibility by reducing the number of terminal nodes. In this case the number of terminal nodes is an indicator of the model flexibility; by default, the `tree` package creates splits by choosing thresholds that minimise the RSS until the nodes reach a minimum size.
The problem with this strategy is that the model may be overfit: it may be too flexible, and the results will suffer from very little model bias but far too much variance.
The process of reducing flexibility in order to improve the model by balancing bias and variance is called 'pruning'.
In order to prune the tree, the best strategy is 10-fold cross-validation; fortunately this is all built in and we don't really have to think about it, the `cv.tree()` function will do it automatically.
#### Using the `cv.tree()` function
The `cv.tree()` function minimises the deviance by default, but the textbook notes[^classref] that the misclassification rate is preferable if prediction accuracy of the final pruned tree is the goal. Deviance is an analogue of RSS in the classification setting [^devstack], but this is outside the scope of this unit.
In order to specify to `cv.tree` that the cross-validation process should aim to minimise the '**misclassification rate**' rather than the '**deviance**', pass `FUN = prune.misclass` to `cv.tree()`. Be super careful: because we are changing the function, the `dev` output in the list will NOT be deviance; the values will be misclassification counts.
[^devstack]: [What is Deviance](https://stats.stackexchange.com/a/6610)
[^classref]: At p. 312, Chapter 8.1.2, of ISL
```{r}
set.seed(3)
tree.mod.cv <- cv.tree(object = carseats.tree, FUN = prune.misclass)
# tree.mod.cv <- cv.tree(object = CatSalesModTree, FUN = prune.misclass)
names(tree.mod.cv)[2] <- "misclassification" # with FUN = prune.misclass, 'dev' holds misclassification counts, so rename it
tree.mod.cv #%>% str()
```
This gives us a correspondence between the number of terminal nodes (a proxy for the flexibility of the model) and the expected misclassification on testing data.
Because this is cross-validation, the error is an estimate of the error on data unseen by the model; hence the size that attains the minimum value should be chosen.
#### Plot the Testing Error
```{r}
cv.error <- tibble(nodeSize = tree.mod.cv$size, error = tree.mod.cv$misclassification)
minval <- cv.error[cv.error$error <= min(cv.error$error), ]
minNode <- min(minval$nodeSize)
ggplot(cv.error, aes(x = nodeSize, y = error, group = 1)) +
  geom_point(size = 4, col = "IndianRed") +
  geom_line(alpha = 0.7, aes(col = -error), lwd = 1, show.legend = FALSE) +
  labs(y = "Estimated Misclassifications (CV)", x = "Number of Terminal Nodes",
       title = "Cross Validation of Tree Model") +
  geom_vline(xintercept = minNode, col = "purple") +
  theme_classic()
```
The plot suggests that the minimum misclassification for this model will occur when the tree is limited to `r minNode` terminal nodes.
#### Prune the Tree
Prune the tree by using the `prune.misclass` function:
```{r}
Cars.Clas.prune.cv <- prune.misclass(CatSalesModTree, best = minNode)
plot(Cars.Clas.prune.cv)
text(Cars.Clas.prune.cv, pretty = 0, cex = 0.7)
```
#### Assess the pruned model against the test data
Compare the model returned by CV to the unseen testing data:
```{r}
test.preds <- predict(object = Cars.Clas.prune.cv, newdata = CSeat[-train, -1], type = "class")
test.obs <- CSeat[-train, -1]$CatSales
cfConfMat <- confusionMatrix(data = test.preds, reference = test.obs)
mcr <- 1-cfConfMat$overall[1]
```
From this it can be determined that the misclassification rate for the pruned model is now `r percent(round(mcr, 2))`.
This tree represents the best trade-off between a model with too much variance and a model with too much bias.
## Regression Trees
Consider the `Boston` Data set:
```{r}
# kable(head(Boston))
head(Boston)
# kabstr(Boston)
# dim(Boston)
# summary(Boston)
```
This data set contains housing-related measurements for the suburbs of Boston; the response modelled here is `medv`, the median home value in $1000s USD.
### Separate the Training Data
Split the data into a training set and a test set in order to evaluate, later on, the difference made by the cross-validation pruning:
```{r}
set.seed(1)
train <- sample(1:nrow(Boston), size = 0.52*nrow(Boston))
names(Boston)
```
### Fit a Regression Tree to the Training Data
In order to fit the regression tree to the training data use the train variable as a subset:
```{r}
bost.mod.tree <- tree(formula =medv ~., data = Boston, subset = train)
# summary(bost.mod.tree)
plot(bost.mod.tree)
text(bost.mod.tree, pretty = 0, cex = 0.7)
```
#### Evaluate the Performance
```{r}
modSum <- summary(bost.mod.tree)
rss <- modSum$dev
# Divide the deviance (a training RSS) by n before taking the square root
rmse <- sqrt(modSum$dev / length(bost.mod.tree$y))
rmsernd <- signif(rmse, 1) * 1000
leaves <- modSum$size
```
This model has `r leaves` terminal nodes. The performance of this model can be evaluated by measuring the ***RSS*** reported by the model summary; taking the root of the mean provides that the average distance from a data point to the model prediction is $\pm$ `r signif(rmse, 1)` K USD.
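The RMSE above is computed on the training data. Since a test set was held out, it is worth sketching the corresponding test-set RMSE as well (same units, $1000s USD):
```{r}
# Predict medv for the held-out rows and compute the root mean squared error
bost.test.pred <- predict(bost.mod.tree, newdata = Boston[-train, ])
bost.test.rmse <- sqrt(mean((Boston$medv[-train] - bost.test.pred)^2))
bost.test.rmse
```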
### Prune the tree with CV
#### Perform Cross Validation
```{r}
cvvals <- cv.tree(object = bost.mod.tree, K = 10)
error <- sqrt(cvvals$dev / length(bost.mod.tree$y)) # the CV deviance is a total RSS, so divide by n before the square root
TermNodes <- factor(x = cvvals$size, ordered = TRUE)
dvdf <- tibble(TermNodes, error)
#Return the best number of leaves
minerror <- min(dvdf$error)
optn <- min(dvdf$TermNodes[dvdf$error==minerror]) #There could be multiple optimum values
```
#### Visualise the Validation Error
```{r}
ggplot(dvdf, aes(x = TermNodes, y = error, group = 1)) +
geom_point(col = "IndianRed", size = 3) +
geom_line(col = "RoyalBlue") +
geom_vline(xintercept = as.numeric(optn), col = "purple") +
theme_classic() +
labs(y = "Expected Error From model to Unseen Data ($50 K)", x = "Number of Terminal Nodes")
```
The plot suggests that the appropriate number of terminal nodes is `r optn`, and hence the model ought to be left unchanged; however, we will prune it for the sake of practice.
#### Prune the Tree
In order to prune the tree use the `prune.tree` function and specify `best=` as the desired number of leaves:
```{r}
prun.reg.mod <- prune.tree(tree = bost.mod.tree, best = 5)
plot(prun.reg.mod)
text(prun.reg.mod, pretty = 0, cex = 0.7)
```
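For the sake of comparison, a quick sketch of the pruned tree's error on the same held-out rows; since `r optn` terminal nodes was already optimal, pruning to 5 would typically be expected to trade a little accuracy for readability:
```{r}
# Test-set RMSE of the pruned tree, comparable with bost.test.rmse above
prun.test.pred <- predict(prun.reg.mod, newdata = Boston[-train, ])
sqrt(mean((Boston$medv[-train] - prun.test.pred)^2))
```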
### Using rpart
This could all have been done in one call to `rpart` like so; note, however, that here the data is not split, so the model is fit to all of the observations.
```{r}
BostonSyn <- Boston
bosTree <- rpart(formula = medv ~ ., data = Boston)
rpart.plot(bosTree)
# Human-readable labels, one per column of Boston, in the same order as names(Boston)
nicenames <- c("Crime Rate", "Large Zoning Proportion", "Proportion of Industry", "Near River?", "NO Pollution Level (parts per 10 million)", "Number of Rooms", "Proportion of 'Old' Buildings", "Distance to CBDs", "Accessibility of Highways", "Tax Rate", "Pupil-Teacher Ratio", "Proportion of Black Residents (transformed)", "Percentage of Lower-Status Residents", "Property Value")
```
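One possible use for `nicenames` and the `BostonSyn` copy (a sketch only; nothing below is required for the model above): convert the labels to syntactic column names and refit, so the plot reads naturally.
```{r}
# make.names() turns e.g. "Property Value" into the valid column name "Property.Value"
names(BostonSyn) <- make.names(nicenames)
bosTreeNice <- rpart(Property.Value ~ ., data = BostonSyn)
rpart.plot(bosTreeNice)
```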