---
title: 'M1-8: Supervised Machine Learning'
author: "Daniel S. Hain ([email protected])"
date: "12/09/2018"
output:
html_document:
df_print: paged
toc: yes
toc_float: true
number_sections: yes
---
```{r setup, include=FALSE}
### Generic preamble
Sys.setenv(LANG = "en")
### Clean Workspace (I like to start clean)
rm(list=ls()); graphics.off() # get rid of everything in the workspace
detachAllPackages <- function() { # Also, detach packages to avoid functions masked by others
basic.packages <- c("package:stats","package:graphics","package:grDevices","package:utils","package:datasets","package:methods","package:base")
package.list <- search()[ifelse(unlist(gregexpr("package:",search()))==1,TRUE,FALSE)]
package.list <- setdiff(package.list,basic.packages)
if (length(package.list)>0) for (package in package.list) detach(package, character.only=TRUE)
}
detachAllPackages(); rm(detachAllPackages)
### Load packages Standard
library(knitr) # For display of the markdown
### pimp up memory (to save on disk if necessary, only works on windows)
#memory.limit(10 * 10^10)
### Knitr options
opts_chunk$set(warning = FALSE,
message = FALSE,
echo = TRUE,
fig.align = "center"
)
```
Some housekeeping (again), installing necessary packages.
```{r}
list.of.packages <- c("devtools",
"rstudioapi",
"tidyverse",
"knitr",
"data.table",
"caret",
"caretEnsemble",
"recipes",
"ggridges",
"mlbench")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
rm(list.of.packages)
```
Load packages
```{r}
library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2, etc.
library(magrittr)
library(data.table) # Good format to work with large datasets
library(skimr) # Nice descriptives
```
Again, the whole exercise can be found executable on kaggle [HERE](https://www.kaggle.com/danielhain/sds-2018-m1-8-supervised-ml)
# Introduction
## Contrasting ML&AI with inferential statistics
### Inferential Statistics
* Mostly interested in producing good **parameter estimates**: Construct models with unbiased estimates of $\beta$, capturing the relationship between $x$ and $y$.
* Supposedly "structural" models: Causal effect of directionality $x \rightarrow y$, robust across a variety of observed as well as up-to-now unobserved settings.
* How: Carefully draw from theories and empirical findings, apply logical reasoning to formulate hypotheses.
* Typically, multivariate testing, ceteris paribus.
* Main concern: Minimize standard errors $\epsilon$ of $\beta$ estimates.
* Not overly concerned with the overall predictive power (e.g. $R^2$) of those models, but with various types of endogeneity issues, leading us to develop sophisticated **identification strategies**.
### ML\&AI Approach
* To a large extent driven by the needs of the private sector $\rightarrow$ data analysis is geared towards producing good **predictions** of outcomes $\rightarrow$ fits for $\hat{y}$, not $\hat{\beta}$
* Recommender systems: Amazon, Netflix, Spotify, etc.
* Risk scores: E.g. the likelihood that a particular person has an accident, turns sick, or defaults on their credit.
* Image classification: Finding cats \& dogs online
* Often rely on big data (N, $x_i$)
* Not overly concerned with the properties of parameter estimates, but very rigorous in optimizing the overall prediction accuracy.
* Often more flexibility wrt. the functional form, and non-parametric approaches.
* No "built-in" causality guarantee $\rightarrow$ verification techniques.
* Only sporadically used in econometric procedures, and often seen as "son of a lesser god".
## Issues with ML
### Generalization via "Out-of-Sample-Testing"
With so much freedom wrt. feature selection, functional form, etc., models are prone to over-fitting. And without constraints by asymptotic properties, causality and so forth, how can we generalize anything?
In ML, generalization is not achieved by statistical derivations and theoretical argumentation, but rather by answering the practical question: **How well would my model perform with new data?** To answer this question, **Out-of-Sample-Testing** is usually taken as the solution. Here, you do the following:
1. Split the dataset into a training and a test sample.
2. Fit your regression (train your model) on the training sample.
 * Optional: Tune hyperparameters by minimizing loss in a validation set.
 * Optional: Retrain the final model configuration on the whole training set.
3. Finally, evaluate the predictive power on the test sample, on which the model was not fitted.
An advanced version is **N-fold-Crossvalidation**, where this process is repeated several times during the **hyperparameter-tuning** phase (more on that later).
![](media/m8_cv_steps.png){width=750px}
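The out-of-sample logic above can also be sketched manually in base R; here a toy illustration on the built-in `mtcars` data (the actual workflow later on uses `caret` instead):

```{r}
# Minimal manual sketch of out-of-sample testing (toy example on mtcars)
set.seed(123)                                       # for a reproducible split
idx   <- sample(seq_len(nrow(mtcars)), size = floor(0.75 * nrow(mtcars)))
train <- mtcars[idx, ]                              # 75% training sample
test  <- mtcars[-idx, ]                             # 25% test sample
fit   <- lm(mpg ~ wt + hp, data = train)            # fit only on training data
pred  <- predict(fit, newdata = test)               # predict on unseen data
rmse  <- sqrt(mean((test$mpg - pred)^2))            # out-of-sample RMSE
rmse
```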
### Model complexity & Hyperparameter Tuning
As a rule-of-thumb: Richer and more complex functional forms and algorithms tend to be better at predicting complex real-world patterns. This is particularly true for high-dimensional (big) data.
![](media/m7_learningmodels.png){width=750px}
However, at some point flexible algorithms become so good at mimicking the patterns in our data that they **overfit**, meaning they are too closely tuned towards a specific dataset and might not reproduce the same accuracy on new data. Therefore, we aim at finding the **sweet spot** of high model complexity yet high-performing out-of-sample predictions.
![](media/m8_complexity_error.png){width=750px}
We usually do that in a process of **hyperparameter-tuning**, where we gear the different options the models offer towards high out-of-sample predictive performance. Mathematically speaking, we try to minimize a loss function $L(.)$ (e.g. RMSE) by solving the following problem:
$$\text{minimize} \underbrace{\sum_{i=1}^{n}L(f(x_i),y_i)}_{\text{in-sample loss}} \quad \text{over} \quad \overbrace{f \in F}^{\text{function class}} \quad \text{subject to} \quad \underbrace{R(f) \leq c}_{\text{complexity restriction}}$$
# Regression
## Introduction (a brief reminder)
Let's for a second recap linear regression techniques, foremost the common all-rounder and workhorse of statistical research for some 100 years.
**OLS = Ordinary Least Squares**
**Basic Properties**
* Outcome: continuous
* Predictors: continuous, dichotomous, categorical
* When to use: Predicting a phenomenon that scales and can be measured continuously
**Functional form**
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon $$
where:
* $y$ = Outcome, $x_i$ = observed value of feature $i$
* $\beta_0$ = Constant
* $\beta_i$ = Estimated effect of $x_i$ on $y$ , slope of the linear function
* $\epsilon$ = Error term
And that is what happens: let's imagine we plot some feature against an outcome we want to predict. OLS will just fit a straight line through your data.
![](media/m8_reg1.png#center){width=500px}
We do so by minimizing the sum of (squared) errors between our prediction-line and the observed outcome.
![](media/m8_reg2.png#center){width=500px}
### Executing a regression
So, lets get your hand dirty. We will load a standard dataset from `mlbench`, the BostonHousing dataset. It comes as a dataframe with 506 observations on 14 features, the last one `medv` being the outcome:
* `crim` per capita crime rate by town
* `zn` proportion of residential land zoned for lots over 25,000 sq.ft
* `indus` proportion of non-retail business acres per town
* `chas` Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* `nox` nitric oxides concentration (parts per 10 million)
* `rm` average number of rooms per dwelling
* `age` proportion of owner-occupied units built prior to 1940
* `dis` weighted distances to five Boston employment centres
* `rad` index of accessibility to radial highways
* `tax` full-value property-tax rate per USD 10,000
* `ptratio` pupil-teacher ratio by town
* `b` 1000(B - 0.63)^2 where B is the proportion of blacks by town
* `lstat` percentage of lower status of the population
* `medv` median value of owner-occupied homes in USD 1000's
Source: Harrison, D. and Rubinfeld, D.L. "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol.5, 81-102, 1978.
These data have been taken from the [UCI Repository Of Machine Learning Databases](ftp://ftp.ics.uci.edu/pub/machine-learning-databases)
```{r}
library(mlbench) # Library including many ML benchmark datasets
data(BostonHousing)
data <- BostonHousing %>% as_data_frame()
rm(BostonHousing)
data %<>%
select(medv, everything()) %>%
mutate(chas = as.numeric(chas))
```
Let's inspect the data for a moment.
```{r}
head(data)
glimpse(data)
skim(data) %>% skimr::kable()
```
Ok, let's straight away run our first linear model with the `lm()` function:
```{r}
fit.lm <- lm(medv ~ ., data = data)
summary(fit.lm)
```
So, how would you interpret the results? Estimate? P-value? $R^2$? If you don't know, consider taking the SDS track on Datacamp for statistics basics.
However, since this is a predictive exercise, we are foremost interested in how well the model predicts. So, let's use the `predict()` function.
```{r}
pred.lm <- predict(fit.lm)
head(pred.lm, 10)
```
A common measure of the predictive power of regression models is the *Root-Mean-Squared-Error* (RMSE), calculated as follows:
$$ RMSE = \sqrt{\frac{1}{n}\Sigma_{i=1}^{n}{\Big(y_i - \hat{y_i} \Big)^2}}$$
Keep in mind, this root-and-square construction does nothing to the error term except transforming negative into positive values. Let's calculate:
```{r}
error <- pull(data, medv) - pred.lm
# Calculate RMSE
sqrt(mean(error ^ 2))
```
We can also visualize the error term:
```{r}
# Some visualization
error %>% as_data_frame() %>%
ggplot(aes(value)) +
geom_histogram()
```
## ML workflows for regression analysis
### Define a subset for the hyperparameter tuning
However, how well does our OLS model predict out-of-sample? To find out, we have to set up an out-of-sample testing workflow. While that could be done manually, we will use the (my very favorite) `caret` (= **C**lassification **A**nd **RE**gression **T**raining) package, which has by now grown to be the most popular `R` package for ML, providing a standardized ML workflow for more than 100 model classes.
First, we will now split our data into a train and a test sample. We could create an index for subsetting manually. However, caret provides the handy `createDataPartition()` function, which does that for you. A nice feature is that it takes care that the outcomes ($y$) are proportionally distributed between the subsets. That is particularly important in case the outcomes are very unbalanced.
```{r}
library(caret)
index <- createDataPartition(y = data$medv, p = 0.75, list = FALSE) # 75% to 25% split
training <- data[index,]
test <- data[-index,]
```
### Preprocessing
Before tuning the models, there is still some final preprocessing to do. For typical ML preprocessing tasks, I like to use the `recipes` package. It lets you conveniently define a recipe of standard ML preprocessing steps. Afterwards, we can just use this recipe to "bake" our data, meaning performing all the steps in the recipe. Here, we do only some simple transformations. We normalize all numeric data by centering (subtracting the mean) and scaling (dividing by the standard deviation). Finally, we remove features with zero variance (we don't have any here, but it's good practice with unknown data).
```{r}
library(recipes)
reci <- recipe(medv ~ ., data = training) %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()) %>%
step_zv(all_predictors()) %>%
prep(data = training)
reci
```
Now we can "bake" our data, executing the recipe, and split it into outcomes and features (this is not strictly necessary, but usually makes for an easier workflow later on).
```{r}
# Training: split in y and x
x_train <-bake(reci, newdata = training) %>% select(-medv)
y_train <- training %>% pull(medv)
# test: split in y and x
x_test <-bake(reci, newdata = test) %>% select(-medv)
y_test <- test %>% pull(medv)
```
```{r, include = FALSE}
rm(reci, training, test)
```
### Defining the training and control feature
We now get seriously started with the hyperparameter tuning. We here do an n-fold crossvalidation, meaning that we split the train sample again into n subsamples, on which we both fit and test every hyperparameter configuration. That minimizes the chance that the results are driven by a particular sampling. Sounds cumbersome. However, caret allows us to automatize this workflow by defining a `trainControl()` object, which we pass to every model we fit later on.
Here, we have to define a set of parameters. First, the number of crossvalidation folds. We here do 10 (`method = "cv", number = 10`); given enough computing power, RAM and/or time, 5 or 10 folds is also what I would recommend for a real-life calibration. To be even safer, this whole process can be repeated several times (`method = "repeatedcv", repeats = x`). Be aware, though, that computation time increases accordingly, which might make a difference when working with large data.
```{r}
ctrl <- trainControl(method = "cv",
number = 10)
```
### Digression: Multicore usage and parallel computing
Since we have quite some models to run here, let's speed things up by utilizing parallel processing. We here use the packages `parallel` and `doMC`, which together with caret make parallel training very easy. So, let's exploit the cores we have on our server.
Note: doMC only works properly on Linux; on Windows, instead of doMC use the `doParallel` package and its `registerDoParallel()` function. Note: This one is a bit experimental, and does not always work reliably.
```{r}
# library(parallel)
# # Linux option
# library(doMC) # nice package that distributes among others caret train processes (if allowParallel = TRUE)
# registerDoMC(cores = round(detectCores() - 1) ) # On the server we have 16 cores, I want to leave one for my fellows.
# getDoParWorkers() # check how many "workers" are deployed
# # Windows option
# library(doParallel)
# cl <- parallel::makeCluster(detectCores() - 1)
# doParallel::registerDoParallel(cl)
# # Note : In doParallel, you need to stop your cluster when you are done with "parallel::stopCluster(cl)"
```
## Hyperparameter Tuning
So, we finally start fitting models. The plain linear model has no tunable hyperparameters, so for it we skip the tuning and only run the crossvalidation.
### Linear Regression
```{r}
fit.lm <- train(x = x_train,
y = y_train,
trControl = ctrl,
method = "lm")
summary(fit.lm)
fit.lm
```
Notice the higher RMSE compared to the initial in-sample testing example.
### Elastic Net
The elastic net has the functional form of a generalized linear model, plus an additional penalty term with a parameter $\lambda$, which penalizes every coefficient by its contribution to the model's loss, in the form of:
$$\lambda \sum_{p=1}^{P} \left[ \frac{(1 - \alpha)}{2} \beta_p^2 + \alpha |\beta_p| \right]$$
Here, we have 2 tunable parameters, $\lambda$ and $\alpha$. If $\alpha = 1$, we are left with $|\beta_p|$, turning it into the lately (also among econometricians) very popular **Least Absolute Shrinkage and Selection Operator** (LASSO) regression; with $\alpha = 0$, only the squared term remains, giving a ridge regression. Obviously, when $\lambda = 0$, the whole penalty vanishes, and we are again left with a generalized linear model. The first thing we do now is to set up a tuneGrid, a matrix of value combinations of the tunable hyperparameters we aim at optimizing. This is easily done with the base-R `expand.grid()` function.
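To see the shrinkage at work before tuning, here is a hedged toy illustration fitting a pure LASSO ($\alpha = 1$ in glmnet's parameterization) directly with `glmnet`; the features and the $\lambda$ value are chosen for illustration only:

```{r}
# Toy LASSO fit (alpha = 1) on mtcars, directly via glmnet
library(glmnet)
x <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat", "qsec")])
y <- mtcars$mpg
fit.lasso <- glmnet(x, y, alpha = 1, family = "gaussian")
coef(fit.lasso, s = 1)  # coefficients at lambda = 1: some are shrunk to exactly zero
```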
```{r}
tune.glmnet = expand.grid(alpha = seq(0, 1, length = 5),
lambda = seq(0, 1, length = 5))
tune.glmnet
```
Now, we utilize the caret `train()` function to carry out the grid search as well as the crossvalidation of our model. The train function is a wrapper around a huge variety (150+) of models from other packages, and carries out the model fitting workflow. Here, we pass the function our feature and outcome vectors (x, y), the parameters in the formerly defined trainControl object, the model class (here "glmnet" from the corresponding package), and our formerly defined tuneGrid. Let's go!
```{r}
fit.glmnet <- train(x = x_train,
y = y_train,
trControl = ctrl,
method = "glmnet", family = "gaussian",
tuneGrid = tune.glmnet)
fit.glmnet
```
We can also plot the results.
```{r}
ggplot(fit.glmnet)
```
### Model evaluation
So, taking it all together, let's see which model performs best out-of-sample:
```{r}
pred.lm <- predict(fit.lm, newdata = x_test)
pred.glm <- predict(fit.glmnet, newdata = x_test)
# RMSE OLS
sqrt(mean((y_test - pred.lm)^2))
# RMSE elastic net
sqrt(mean((y_test - pred.glm)^2))
```
Notice again the higher RMSE for the full out-of-sample testing. Here, the results are pretty similar, so in this case we see no real advantage of variable selection via `glmnet`.
# ML for Classification
## Introduction
### Reminder: Model assesments and metrics for classification problems
We remember that the most commonly used performance measure for regression problems is the **RMSE**. However, how do we assess models aimed at solving classification problems? Here, it is not that straightforward, and we could (depending on the task) use different measures.
#### The Confusion matrix and its metrics
The **Confusion Matrix** (in inferential statistics you would call it a **Classification Table**, so don't get confused) is the main source of most classification metrics.
![](media/m8_cf1.jpg){width=750px}
It is the 2x2 matrix with the following cells:
* **True Positive:** (TP)
* Interpretation: You predicted positive and it's true.
* You predicted that a woman is pregnant and she actually is.
* **True Negative:** (TN)
* Interpretation: You predicted negative and it's true.
* You predicted that a man is not pregnant and he actually is not.
* **False Positive:** (FP) - (Type 1 Error)
* Interpretation: You predicted positive and it's false.
* You predicted that a man is pregnant but he actually is not.
* **False Negative:** (FN) - (Type 2 Error)
* Interpretation: You predicted negative and it's false.
* You predicted that a woman is not pregnant but she actually is.
Just remember: we describe predicted values as **Positive** and **Negative**, and actual values as **True** and **False**. Out of combinations of these values, we can derive a set of different quality measures.
![](media/m8_metrics2.png){width=750px}
**Accuracy** (ACC)
$$ {ACC} ={\frac {\mathrm {TP} +\mathrm {TN} }{P+N}}={\frac {\mathrm {TP} +\mathrm {TN} }{\mathrm {TP} +\mathrm {TN} +\mathrm {FP} +\mathrm {FN} }} $$
**Sensitivity**, also called recall, hit rate, or true positive rate (TPR)
$$ {TPR} ={\frac {\mathrm {TP} }{P}}={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FN} }}=1-\mathrm {FNR} $$
**Specificity**, also called selectivity or true negative rate (TNR)
$$ {TNR} ={\frac {\mathrm {TN} }{N}}={\frac {\mathrm {TN} }{\mathrm {TN} +\mathrm {FP} }}=1-\mathrm {FPR} $$
**Precision**, also called positive predictive value (PPV)
$$ {PPV} ={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FP} }} $$
**F1 score**: is the harmonic mean of precision and sensitivity, meaning a weighted average of the true positive rate (recall) and precision.
$$ F_{1}=2\cdot {\frac {\mathrm {PPV} \cdot \mathrm {TPR} }{\mathrm {PPV} +\mathrm {TPR} }}={\frac {2\mathrm {TP} }{2\mathrm {TP} +\mathrm {FP} +\mathrm {FN} }} $$
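The metrics above can easily be computed by hand; here a small sketch with an invented confusion matrix (all counts are made up purely for illustration):

```{r}
# Computing the metrics above from an invented 2x2 confusion matrix
TP <- 40; TN <- 45; FP <- 10; FN <- 5
ACC <- (TP + TN) / (TP + TN + FP + FN)   # accuracy
TPR <- TP / (TP + FN)                    # sensitivity / recall
TNR <- TN / (TN + FP)                    # specificity
PPV <- TP / (TP + FP)                    # precision
F1  <- 2 * PPV * TPR / (PPV + TPR)       # harmonic mean of precision and recall
round(c(ACC = ACC, TPR = TPR, TNR = TNR, PPV = PPV, F1 = F1), 3)
```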
So, which one should we use? The answer is: it depends on the problem. Are you more sensitive towards false negatives or false positives? Do you have a balanced or unbalanced distribution of classes? Think about it for a moment.
#### ROC and AUC
A ROC curve (receiver operating characteristic curve; weird name, I know, it comes originally from signal processing) is a derivative of the confusion matrix and the predicted class probabilities.
![](media/m8_cf2.jpg){width=750px}
So, what does it tell us? The ROC is a graph showing the performance of a classification model at all classification thresholds. It plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
![](media/m8_roc1.png){width=750px}
**AUC** stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). It provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
AUC is desirable for the following two reasons:
1. AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
2. AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
However, both these reasons come with caveats, which may limit the usefulness of AUC in certain use cases:
* Scale invariance is not always desirable. For example, sometimes we really do need well calibrated probability outputs, and AUC won't tell us about that.
* Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase of false negatives). AUC isn't a useful metric for this type of optimization.
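The ranking interpretation of AUC can be checked by hand: with invented scores, AUC equals the share of positive/negative pairs where the positive example receives the higher score (ties would count half, but there are none here):

```{r}
# Hands-on check of the ranking interpretation of AUC (scores invented for illustration)
pos <- c(0.35, 0.80, 0.90)           # predicted scores of actual positives
neg <- c(0.10, 0.40, 0.20)           # predicted scores of actual negatives
auc <- mean(outer(pos, neg, ">"))    # fraction of correctly ranked pairs
auc                                  # 8 of 9 pairs ranked correctly
```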
## Introduction to the case
![](media/m8_churn.jpg){width=750px}
### Digression: Customer Churn: Hurts Sales, Hurts Company
Customer churn refers to the situation when a customer ends their relationship with a company, and it's a costly problem. Customers are the fuel that powers a business. Loss of customers impacts sales. Further, it's much more difficult and costly to gain new customers than it is to retain existing customers. As a result, organizations need to focus on reducing customer churn.
The good news is that machine learning can help. For many businesses that offer subscription based services, it's critical to both predict customer churn and explain what features relate to customer churn.
### IBM Watson Dataset
We now dive into the IBM Watson Telco Dataset. According to IBM, the business challenge is.
> A telecommunications company [Telco] is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving. Imagine that you're an analyst at this company and you have to find out who is leaving and why.
The dataset includes information about:
* Customers who left within the last month: `Churn`
* Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
* Customer account information: how long they've been a customer, contract, payment method, paperless billing, monthly charges, and total charges
* Demographic info about customers: gender, age range, and if they have partners and dependents
Let's load it:
```{r}
rm(list=ls()); graphics.off() # get rid of everything in the workspace
data <- readRDS("data/telco_churn.rds")
```
## Data Inspection, exploration, preprocessing
Let's inspect our data.
```{r}
head(data)
glimpse(data)
skim(data) %>% skimr::kable()
```
Let's do some preprocessing right away, and drop the customer ID, which we do not need. Let's also say we're lazy today, so we just drop observations with missing features (not good practice).
```{r}
data %<>%
select(-customerID) %>%
drop_na() %>%
select(Churn, everything())
```
Next, let's do a first visual inspection. Many models in the prediction exercise to follow require the conditional distribution of the features to differ between the outcome states to be predicted. So, let's take a look. Here, `ggplot2` plus the `ggridges` package is my favorite.
```{r,fig.height=5,fig.width=12.5}
require(ggridges)
data %>%
gather(variable, value, -Churn) %>%
ggplot(aes(y = as.factor(variable),
fill = as.factor(Churn),
x = percent_rank(value)) ) +
geom_density_ridges(alpha = 0.75)
```
Again, we split in a training and test dataset.
```{r}
library(caret)
index <- createDataPartition(y = data$Churn, p = 0.75, list = FALSE)
training <- data[index,]
test <- data[-index,]
```
We already saw that the numeric variable `TotalCharges` appears to be right-skewed. That is not a problem for predictive modeling per se, yet some transformation might still increase its predictive power. Let's see. The easiest approximation is to just check the correlation.
```{r}
library(corrr)
training %>%
select(Churn, TotalCharges) %>%
mutate(
Churn = Churn %>% as.factor() %>% as.numeric(),
LogTotalCharges = log(TotalCharges)
) %>%
correlate() %>%
focus(Churn) %>%
fashion()
```
Alright, seems as if it could be worth it. Now we can already write down our recipe.
```{r}
library(recipes)
reci <- recipe(Churn ~ ., data = training) %>%
step_discretize(tenure, options = list(cuts = 6)) %>%
step_log(TotalCharges) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_center(all_predictors(), -all_outcomes()) %>%
step_scale(all_predictors(), -all_outcomes()) %>%
prep(data = training)
reci
```
Now we just split again into predictors and outcomes, bake it all, and we are good to go.
```{r}
# Predictors
x_train <- bake(reci, newdata = training) %>% select(-Churn)
x_test <- bake(reci, newdata = test) %>% select(-Churn)
# Response variables for training and testing sets
y_train <- pull(training, Churn) %>% as.factor()
y_test <- pull(test, Churn) %>% as.factor()
```
We again define a `trainControl()` object. However, now we need some additional parameters.
```{r}
ctrl <- trainControl(method = "repeatedcv", # repeatedcv, boot, cv, LOOCV, timeslice OR adaptive etc.
number = 5, # Number of CV's
repeats = 3, # Number or repeats -> repeats * CV's
classProbs = TRUE, # Include probability of class prediction
summaryFunction = twoClassSummary, # Which type of summary statistics to deliver
returnData = FALSE, # Don't include the original data in the train() output
returnResamp = "final", # Important for resampling later
savePredictions = "final", # The predictions of every tune or only the final to be saved?
allowParallel = TRUE, # For parallel processing
verboseIter = FALSE)
metric <- "ROC" # Which metric should be optimized (more on that later)
```
### Standard logistic regression
Let's run the common all-rounder first:
```{r}
fit.logit <- train(x = x_train,
y = y_train,
trControl = ctrl,
metric = metric,
method = "glm", family = "binomial")
fit.logit
summary(fit.logit)
```
### Elastic net
We can surely also run an elastic net with a logit link (`family = "binomial"`). We reuse the workflow from the regression exercise: define a tuneGrid with `expand.grid()`, and let the caret `train()` function carry out the grid search and crossvalidation, this time also passing the metric to be optimized (ROC). Let's go!
```{r}
tune.glmnet = expand.grid(alpha = seq(0, 1, length = 3),
lambda = seq(0, 0.3, length = 7))
fit.glmnet <- train(x = x_train,
y = y_train,
trControl = ctrl,
metric = metric,
method = "glmnet", family = "binomial",
tuneGrid = tune.glmnet)
fit.glmnet
```
We can also plot the results.
```{r}
ggplot(fit.glmnet)
```
### Decision tree
#### Introduction
Ok, next we will do the same exercise for a classification tree. A brief reminder of what this interesting family of models is about:
* Mostly used in classification problems, on continuous as well as categorical variables.
* Idea: split the population or sample into two or more homogeneous sets (sub-populations), based on the most significant splitter/differentiator among the input variables.
* Repeat until a stopping criterion is reached; this leads to a tree-like structure.
![](media/m8_regtree0.png){width=750px}
This class became increasingly popular in business and other applications. Some reasons are:
* Easy to Understand: Decision tree output is very easy to understand even for people from non-analytical background.
* Useful in data exploration: The decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables.
* Data type is not a constraint: It can handle both numerical and categorical variables.
* Non Parametric Method: Decision tree is considered to be a non-parametric method. This means that decision trees have no assumptions about the space distribution and the classifier structure.
Some tree terminology:
* **Root Node:** Entire population or sample and this further gets divided into two or more homogeneous sets.
* **Splitting:** It is a process of dividing a node into two or more sub-nodes.
* **Decision Node:** When a sub-node splits into further sub-nodes, it is called a decision node.
* **Leaf/Terminal Node:** Nodes that do not split are called leaf or terminal nodes.
![](media/m8_regtree2.png){width=750px}
The choice of splits heavily affects a tree's accuracy. So how does the tree decide where to split? This differs across the large family of tree-like models. Common splitting criteria are:
* Gini Index
* $\chi^2$
* Reduction in $\sigma^2$
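To make the Gini criterion concrete, here is a minimal, self-contained sketch (with hypothetical class counts, not our data) of the impurity of a node and the weighted impurity of a candidate split:

```{r}
# Gini impurity of a node: 1 - sum of squared class proportions
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

# Hypothetical parent node with 10 "Yes" and 10 "No" observations
parent <- rep(c("Yes", "No"), each = 10)
gini(parent)  # 0.5, maximally impure for two classes

# A candidate split into two purer child nodes
left  <- c(rep("Yes", 8), rep("No", 2))
right <- c(rep("Yes", 2), rep("No", 8))
w <- c(length(left), length(right)) / length(parent)  # child weights
sum(w * c(gini(left), gini(right)))  # 0.32, an improvement over 0.5
```

The tree greedily picks the split that lowers this weighted impurity the most.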
Some common complexity restrictions are:
* Minimum samples for a node split
* Minimum samples for a terminal node (leaf)
* Maximum depth of tree (vertical depth)
* Maximum number of terminal nodes
* Maximum features to consider for split
![](media/m8_regtree1.png){width=750px}
Likewise, there is a variety of tunable hyperparameters across different applications of this model family.
![](media/m8_regtree3.png){width=750px}
#### Application
Ok, let's run it! We here use `rpart`, but there is a huge variety of implementations available. This one has only one tunable hyperparameter, the complexity parameter `cp`, with which we can restrict the tree's complexity. It represents the complexity cost of every split: a further split is only allowed if it decreases the model loss by at least this threshold.
```{r}
library(rpart)
tune.dt = expand.grid(cp = c(0.001, 0.005 ,0.010, 0.020, 0.040))
fit.dt <- train(x = x_train,
y = y_train,
trControl = ctrl,
metric = metric,
method = "rpart", tuneGrid = tune.dt)
fit.dt
ggplot(fit.dt)
```
We directly see that in this case, increasing complexity costs lead to decreasing model performance. Such results are somewhat typical for large datasets, where high complexity costs prevent the tree from fully exploiting the richness of information. Therefore, we settle for a minimal `cp` of 0.001.
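To build some intuition for how `cp` gates splits, here is a small stand-alone sketch on the built-in `iris` data (unrelated to our churn setting), growing a deep tree and then pruning it with a higher complexity cost:

```{r}
library(rpart)

# Grow a deliberately deep tree by setting a very low complexity cost
fit <- rpart(Species ~ ., data = iris, cp = 0.001)
printcp(fit)  # table of cross-validated error vs. number of splits

# Pruning with a higher cp removes all splits that do not reduce the
# relative loss by at least that amount
pruned <- prune(fit, cp = 0.5)
nrow(pruned$frame) < nrow(fit$frame)  # TRUE: the pruned tree is smaller
```

The same mechanism is at work inside `train()`: each candidate `cp` value in the grid corresponds to one such pruned tree.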
We can also plot the final tree structure.
```{r,fig.width=15,fig.height=5}
require(rpart.plot)
rpart.plot(fit.dt$finalModel)
```
This plot is usually informative and gives us a nice intuition of how prediction works in classification trees. However, with that many dummies, it's a bit messy.
### Random Forest
![](media/m8_rf1.png){width=750px}
#### Introduction
Finally, we fit another class of models which has gained popularity in the last decade and proven to be a powerful and versatile prediction technique that performs well in almost every setting: a random forest. It is among the most popular non-neural-network ML algorithms, and considered by some to be a panacea for all data science problems. Some say: "when you can't think of any algorithm (irrespective of situation), use random forest!"
As a continuation of tree-based classification methods, random forests aim at reducing overfitting by introducing randomness via bootstrapping (bagging) and ensemble techniques. A random forest is an ensemble learning method, where a group of weak models combines to form a powerful one. The idea is to create an "ensemble of classification trees", each grown on a different bootstrap sample. Having grown a forest of trees, every tree makes a prediction, and the final model prediction is formed by a majority vote of all trees. This idea is close to Monte Carlo simulation approaches, tapping into the power of randomness.
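The bagging-plus-majority-vote part can be sketched in a few lines of base R with `rpart` trees. This is a toy illustration on the built-in `iris` data (restricted to two classes); note that a real random forest additionally samples a random subset of features at every split, which this sketch omits:

```{r}
library(rpart)
set.seed(1)

# Two-class toy problem
df <- iris[iris$Species != "setosa", ]
df$Species <- droplevels(df$Species)

n_trees <- 25
votes <- sapply(seq_len(n_trees), function(i) {
  boot <- df[sample(nrow(df), replace = TRUE), ]    # bootstrap sample
  tree <- rpart(Species ~ ., data = boot)           # grow one tree on it
  as.character(predict(tree, newdata = df, type = "class"))  # its vote
})

# Final prediction: majority vote across all trees
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == df$Species)  # in-sample accuracy of the ensemble
```

Each column of `votes` is one tree's prediction; the row-wise majority is the forest's prediction.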
![](media/m8_rf2.png){width=750px}
#### Application
In this case, we draw from a number of tunable hyperparameters. First, we tune the number of randomly selected features which are available as candidates for every split, on a range $[1, k-1]$, where lower values introduce a higher level of randomness into every split. Our second hyperparameter is the minimal number of observations that have to fall into every terminal node, where lower numbers increase the potential precision of splits, but also the risk of overfitting. Finally, we also use the general split rule as a hyperparameter, where the choice is between i.) a traditional split according to the optimization of the Gini coefficient of the distribution of classes in every split, and ii.) the "extremely randomized trees" (`extratrees`) procedure by Geurts et al. (2006), where additional randomness is introduced into the selection of split points.
Note: We here use the `ranger` instead of the more popular `randomForest` package. For those who wonder: it basically does the same, only faster.
```{r}
library(ranger)
tune.rf <- expand.grid(mtry = round(seq(1, ncol(x_train)-1, length = 3)),
min.node.size = c(10, 50, 100),
splitrule = c("gini", "extratrees") )
fit.rf <- train(x = x_train,
y = y_train,
trControl = ctrl,
metric = metric,
method = "ranger",
importance = "impurity", # To define how to measure variable importance (later more)
num.trees = 25,
tuneGrid = tune.rf)
fit.rf
ggplot(fit.rf)
```
We see that a number of randomly selected features per split of roughly half of all available features maximizes model performance in all cases. The same goes for a high minimal number of observations (100) per node. Finally, the extratrees procedure first underperforms for a minimal number of randomly selected features, but outperforms the traditional Gini-based split rule when the number of available features increases. Such results are typical for large samples, where a high amount of injected randomness tends to make model predictions more robust.
## Final prediction
The additional `caretEnsemble` package also has the `caretList()` function, with which one can fit a set of different models jointly. In addition, it takes care of the indexing for the resamples, so that they are the same for all models. Neat, isn't it? We could have done just that from the very beginning; we only fitted separate models before for illustration. Note: Be prepared to wait some time.
```{r}
library(caretEnsemble)
models <- caretList(x = x_train,
y = y_train,
trControl = ctrl,
metric = metric,
continue_on_fail = T,
tuneList = list(logit = caretModelSpec(method = "glm",
family = "binomial"),
elasticnet = caretModelSpec(method = "glmnet",
family = "binomial",
tuneGrid = tune.glmnet),
dt = caretModelSpec(method = "rpart",
tuneGrid = tune.dt),
rf = caretModelSpec(method = "ranger",
importance = "impurity",
num.trees = 25,
tuneGrid = tune.rf)
)
)
```
## Performance Evaluation
And done. Now, let's evaluate the results. This evaluation is based on the average prediction across the different validation folds within the train data (the final evaluation on the test set still comes later). Let's take a look. We can also plot the model comparison.
```{r}
models
```
The last step is to collect the results from each model and compare them. We here use the `resamples()` function of caret. This is possible because we ran all models within a `caretList`, which keeps track of all resamples across the different models and enables us to compare them. Let's check how the models perform.
```{r}
results <- resamples(models)
results$values %>%
select(-Resample) %>%
tidy()
bwplot(results)
```
Let's see how similarly the models predict on the validation sample.
```{r}
modelCor(results)
```
## Evaluation via final out-of-sample prediction
Ok, to be as confident as possible, let's do the final evaluation. Generally, many ML exercises are done without it, and it is common practice to go with the performance in the k-fold cross-validation. However, I believe this final step should be done if the data permits. Only with this final prediction do we expose our model to data it has not yet been exposed to, neither directly nor indirectly.
We first let all models do their prediction on the test set. We create two objects, one for the probability prediction (`type = "prob"`), and one for the predicted class of the outcome (positive, meaning "Yes", when $P(\text{Yes}) > 0.5$). We do so because we need both for the different summaries to come.
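As a minimal illustration of how the two predictions relate (using made-up probabilities, not our model output), the class prediction for a two-class problem amounts to thresholding the positive-class probability at 0.5:

```{r}
# Hypothetical predicted probabilities for the positive class "Yes";
# the class prediction picks the class with the higher probability,
# i.e. "Yes" whenever P(Yes) > 0.5
prob_yes <- c(0.10, 0.49, 0.51, 0.90)
ifelse(prob_yes > 0.5, "Yes", "No")  # "No" "No" "Yes" "Yes"
```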
```{r}
models.preds <- data.frame(lapply(models, predict, newdata = x_test))
models.preds.prob <- data.frame(lapply(models, predict, newdata = x_test, type = "prob"))
glimpse(models.preds)
glimpse(models.preds.prob)
```
Let's plot the corresponding confusion matrix for all models
```{r,fig.height=10,fig.width=10}
cm <- list()
cm$logit <- confusionMatrix(factor(models.preds$logit), y_test, positive = "Yes")
cm$elasticnet <- confusionMatrix(factor(models.preds$elasticnet), y_test, positive = "Yes")
cm$dt <- confusionMatrix(factor(models.preds$dt), y_test, positive = "Yes")
cm$rf <- confusionMatrix(factor(models.preds$rf), y_test, positive = "Yes")
par(mfrow=c(2,2)) # Note: One day find a neat way to make this prettier
for(i in 1:length(cm)) {fourfoldplot(cm[[i]]$table, color = c("darkred", "darkgreen") ); title(main = names(cm[i]), cex.main = 1.5)}
```
And also the corresponding ROC curve...
```{r,fig.height=10,fig.width=10}
library(caTools)
# Note: Is kind of ugly. At one point I will try making it prettier with the library "plotROC"
ROC <- list()
ROC$logit <- colAUC(models.preds.prob$logit.Yes, y_test, plotROC = F)
ROC$elasticnet <- colAUC(models.preds.prob$elasticnet.Yes, y_test, plotROC = F)
ROC$dt <- colAUC(models.preds.prob$dt.Yes, y_test, plotROC = F)
ROC$rf <- colAUC(models.preds.prob$rf.Yes, y_test, plotROC = F)
par(mfrow=c(2,2))
for(i in 1:length(ROC)) {colAUC(models.preds.prob[[i*2]], y_test, plotROC = T); title(main = names(ROC[i]), cex.main = 1.5)}
```
Finally, one nice summary table for the sake of overview.
```{r}
model_eval <- tidy(cm$logit$overall) %>%
select(names) %>%
mutate(Logit = round(cm$logit$overall,3) ) %>%
mutate(ElasticNet = round(cm$elasticnet$overall,3)) %>%
mutate(ClassTree = round(cm$dt$overall,3)) %>%
mutate(RandForest = round(cm$rf$overall,3))
model_eval2 <- tidy(cm$logit$byClass) %>%
select(names) %>%
mutate(Logit = round(cm$logit$byClass,3) ) %>%
mutate(ElasticNet = round(cm$elasticnet$byClass,3)) %>%
mutate(ClassTree = round(cm$dt$byClass,3)) %>%
mutate(RandForest = round(cm$rf$byClass,3)) %>%
filter( !(names %in% c("AccuracyPValue", "McnemarPValue")) )
model_eval_all <- model_eval %>%
rbind(model_eval2) %>%
rbind(c("AUC", round(ROC$logit,3), round(ROC$elasticnet,3), round(ROC$dt,3), round(ROC$rf,3) ))
model_eval_all
rm(model_eval, model_eval2)
```
# Your turn
So, it's time to predict something: [---> HERE <---](https://www.kaggle.com/danielhain/sds-2018-m1-8-supervised-ml-exercise-1). More info can be found [---> HERE <---](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009) (yes, it's a Kaggle dataset). Go and find a clever way to predict wine quality! :)