# Datasets and Models {#dataSetsIntro}
```{r, echo=FALSE, warning=FALSE }
source("code_snippets/ema_init.R")
```
We will illustrate the methods presented in this book by using three datasets related to:
* predicting the probability of survival for passengers of the *RMS Titanic*;
* predicting the prices of *apartments in Warsaw*;
* predicting the value of football players based on the *FIFA* dataset.
The first dataset will be used to illustrate the application of the techniques in the case of a predictive (classification) model for a binary dependent variable. It is mainly used in the examples presented in the second part of the book. The second dataset will be used to illustrate the exploration of prediction models for a continuous dependent variable. It is mainly used in the examples in the third part of this book. The third dataset will be introduced in Chapter \@ref(UseCaseFIFA) and will be used to illustrate the use of all of the techniques introduced in the book.
In this chapter, we provide a short description of the first two datasets, together with results of exploratory analyses. We also introduce models that will be used for illustration purposes in subsequent chapters.
## Sinking of the RMS Titanic {#TitanicDataset}
The sinking of the RMS Titanic is one of the deadliest peacetime maritime disasters in history. Over 1500 people died as a consequence of a collision with an iceberg. Projects like *Encyclopedia titanica* (https://www.encyclopedia-titanica.org/) are a source of rich and precise data about Titanic's passengers. \index{dataset ! RMS Titanic}
The `stablelearner` package in R includes a data frame with information about passengers' characteristics. The dataset, after some data cleaning and variable transformations, is also available in the `DALEX` package for R and in the `dalex` library for Python. In particular, the `titanic` data frame contains 2207 observations (for 1317 passengers and 890 crew members) and nine variables:\index{package | stablelearner}
* *gender*, person's (passenger's or crew member's) gender, a factor (categorical variable) with two levels (categories): "male" (78%) and "female" (22%);
* *age*, person's age, a numerical variable given in (integer) years, in the range of 0--74;
* *class*, the class in which the passenger travelled, or the duty class of a crew member; a factor with seven levels: "1st" (14.7%), "2nd" (12.9%), "3rd" (32.1%), "deck crew" (3%), "engineering crew" (14.7%), "restaurant staff" (3.1%), and "victualling crew" (19.5%);
* *embarked*, the harbor in which the person embarked on the ship, a factor with four levels: "Belfast" (8.9%), "Cherbourg" (12.3%), "Queenstown" (5.6%), and "Southampton" (73.2%);
* *country*, person's home country, a factor with 48 levels; the most common levels are "England" (51%), "United States" (12%), "Ireland" (6.2%), and "Sweden" (4.8%);
* *fare*, the price of the ticket (only available for passengers; 0 for crew members), a numerical variable in the range of 0--512;
* *sibsp*, the number of siblings/spouses aboard the ship, a numerical variable in the range of 0--8;
* *parch*, the number of parents/children aboard the ship, a numerical variable in the range of 0--9;
* *survived*, a factor with two levels: "yes" (32.2%) and "no" (67.8%), indicating whether the person survived or not.
The first six rows of this dataset are presented in the table below.
```{r, warning=FALSE, message=FALSE, echo=FALSE}
library("DALEX")
library("knitr")
library("kableExtra")
#ktab <- kable(head(titanic), "html")
ktab <- kable(head(titanic)[,-5])
kable_styling(ktab, bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```
Models considered for this dataset will use *survived* as the (binary) dependent variable.
### Data exploration {#exploration-titanic}
As discussed in Chapter \@ref(modelDevelopmentProcess), it is always advisable to explore data before modelling. However, as this book is focused on model exploration, we will limit the data exploration part.
Before exploring the data, we first conduct some pre-processing. In particular, the values of the variables *age*, *country*, *sibsp*, *parch*, and *fare* are missing for a limited number of observations (2, 81, 10, 10, and 26, respectively). Analyzing data with missing values\index{Missing value} is a topic on its own [@Schafer1997; @LittleRubin2002; @MolKen2007]. An often-used approach is to impute the missing values. Toward this end, multiple imputation\index{Imputation ! multiple} should be considered [@Schafer1997; @MolKen2007; @vanBuuren2012]. However, given the limited number of missing values and the intended illustrative use of the dataset, we will limit ourselves to single imputation, which is admittedly inferior.\index{Imputation ! single} In particular, we replace the missing *age* values by the mean of the observed ones, i.e., 30. Missing *country* is encoded by `"X"`. For *sibsp* and *parch*, we replace the missing values by the most frequently observed value, i.e., 0. Finally, for *fare*, we use the mean fare for a given *class*, i.e., 0 pounds for crew, 89 pounds for the first, 22 pounds for the second, and 13 pounds for the third class.
```{r, warning=FALSE, message=FALSE, echo=FALSE}
titanic$age[is.na(titanic$age)] = 30
```
```{r, warning=FALSE, message=FALSE, echo=FALSE}
titanic$country <- as.character(titanic$country)
titanic$country[is.na(titanic$country)] = "X"
titanic$country <- factor(titanic$country)
```
```{r, warning=FALSE, message=FALSE, echo=FALSE}
titanic$sibsp[is.na(titanic$sibsp)] = 0
titanic$parch[is.na(titanic$parch)] = 0
```
```{r, warning=FALSE, message=FALSE, echo=FALSE}
titanic$fare[is.na(titanic$fare) & titanic$class == "1st"] = 89
titanic$fare[is.na(titanic$fare) & titanic$class == "2nd"] = 22
titanic$fare[is.na(titanic$fare) & titanic$class == "3rd"] = 13
```
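A quick sanity check, using base-R functions on the imputed data, confirms that no missing values remain:
```{r, eval=FALSE}
# Count the remaining missing values per variable
colSums(is.na(titanic))
```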
After imputing the missing values, we investigate the association between survival status and the other variables. Most variables in the Titanic dataset are categorical, except for *age* and *fare*. Figure \@ref(fig:titanicExplorationHistograms) shows histograms for the latter two variables. In order to keep the exploration uniform, we transform the two variables into categorical ones. In particular, *age* is discretized into five categories by using cutoffs equal to 5, 10, 20, and 30, while *fare* is discretized by applying cutoffs equal to 1, 10, 25, and 50.
(ref:titanicExplorationHistogramsCaption) Histograms for variables *age* and *fare* from the Titanic data.
```{r titanicExplorationHistograms, warning=FALSE, message=FALSE, echo=FALSE, fig.width=9.5, fig.height=5, fig.cap='(ref:titanicExplorationHistogramsCaption)', out.width = '100%', fig.align='center'}
library("ggplot2")
library("ggmosaic")
library("patchwork")
library("forcats")
titanic$age_cat <- cut(titanic$age, c(0,5,10,20,30,100), c("0-5","6-10","11-20","21-30","30+"), include.lowest = TRUE)
titanic$parch_cat <- cut(titanic$parch, c(-1, 0, 1, 2, 100), labels = c("0", "1", "2", "3+"))
titanic$sibsp_cat <- cut(titanic$sibsp, c(-1, 0, 1, 2, 100), labels = c("0", "1", "2", "3+"))
titanic$country_cat <- fct_lump(titanic$country, 8)
titanic$fare_cat <- cut(titanic$fare, c(-1,0,10,25,50,520), c("0","1-10","10-24","25-50","50+"), include.lowest = TRUE)
pl01 <- ggplot(titanic, aes(age)) +
geom_histogram(binwidth = 5, color = "white") +
theme_drwhy() + ggtitle("Histogram for age") +
theme(legend.position = "none",
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
axis.title = element_text(size = 12),
panel.grid = element_blank()) + theme_ema
pl02 <- ggplot(titanic, aes(fare)) +
geom_histogram(binwidth = 10, color = "white") +
theme_drwhy() + ggtitle("Histogram for fare") +
theme(legend.position = "none",
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
axis.title = element_text(size = 12),
panel.grid = element_blank()) + theme_ema
pl01 + pl02
```
Figures \@ref(fig:titanicExplorationGenderAge)--\@ref(fig:titanicExplorationCountryHarbor) present graphically, with the help of mosaic plots, the proportion of non-survivors and survivors for different levels of the other variables. The width of the bars (on the x-axis) reflects the marginal distribution (proportions) of the observed levels of the variable. On the other hand, the height of the bars (on the y-axis) provides information about the proportion of non-survivors and survivors. The graphs for *age* and *fare* were constructed by using the categorized versions of the variables.
Figure \@ref(fig:titanicExplorationGenderAge) indicates that the proportion of survivors was larger for females and children below 5 years of age. This is most likely the result of the "women and children first" principle that is often invoked in situations that require the evacuation of persons whose life is in danger.
(ref:titanicExplorationGenderAgeCaption) Survival according to gender and age category in the Titanic data.
```{r titanicExplorationGenderAge, warning=FALSE, message=FALSE, echo=FALSE, fig.width=9.5, fig.height=5, fig.cap='(ref:titanicExplorationGenderAgeCaption)', out.width = '100%', fig.align='center'}
theme_mos <- theme_drwhy() %+replace%
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, size = 12, vjust=1, hjust=1),
axis.text.y = element_text(size = 12),
axis.title = element_text(size = 12),
panel.grid = element_blank())
theme_mos2 <- theme_drwhy() %+replace%
theme(legend.position = "none",
axis.text.x = element_text(angle = 0, size = 12, vjust=1, hjust=1),
axis.text.y = element_text(size = 12),
axis.title = element_text(size = 12))
pl1 <- ggplot(data = titanic) +
geom_mosaic(aes(x = product(survived, gender), fill=survived)) +
labs(x="Gender", y="Survived?", title='Survival by gender') +
scale_fill_manual(values = colors_discrete_drwhy(2)) +
theme_mos + theme_ema
pl2 <- ggplot(data = titanic) +
geom_mosaic(aes(x = product(survived, age_cat), fill=survived)) +
labs(x="Age", y="Survived?", title='Survival by age category') +
scale_fill_manual(values = colors_discrete_drwhy(2)) +
theme_mos + theme_ema
pl1 + pl2
```
The principle can, perhaps, partially explain the trend seen in Figure \@ref(fig:titanicExplorationParch), i.e., a higher proportion of survivors among those with 1--2 parents/children and 1--2 siblings/spouses aboard.
(ref:titanicExplorationParchCaption) Survival according to the number of parents/children and siblings/spouses in the Titanic data.
```{r titanicExplorationParch, warning=FALSE, message=FALSE, echo=FALSE, fig.width=9.5, fig.height=5, fig.cap='(ref:titanicExplorationParchCaption)', out.width = '100%', fig.align='center'}
pl3 <- ggplot(data = titanic) +
geom_mosaic(aes(x = product(survived, parch_cat), fill=survived)) +
labs(x="Number of parents/children aboard", y="Survived?", title='Survival by no. parents/children') +
scale_fill_manual(values = colors_discrete_drwhy(2)) +
theme_mos +
theme(axis.text.x = element_text(angle = 0, size = 12)) + theme_ema
pl4 <- ggplot(data = titanic) +
geom_mosaic(aes(x = product(survived, sibsp_cat), fill=survived)) +
labs(x="Number of siblings/spouses aboard", y="Survived?", title='Survival by no. siblings/spouses') +
scale_fill_manual(values = colors_discrete_drwhy(2)) +
theme_mos +
theme(axis.text.x = element_text(angle = 0, size = 12)) + theme_ema
pl3 + pl4
```
Figure \@ref(fig:titanicExplorationClassFare) indicates that passengers travelling in the first and second class had a higher chance of survival, perhaps due to the proximity of their cabins to the deck. Interestingly, the proportion of survivors among the deck crew was similar to that among the first-class passengers. The figure also shows that the proportion of survivors increased with the fare, which is consistent with the fact that the proportion was higher for passengers travelling in the first and second class.
(ref:titanicExplorationClassFareCaption) Survival according to travel class and ticket fare in the Titanic data.
```{r titanicExplorationClassFare, warning=FALSE, message=FALSE, echo=FALSE, fig.width=9.5, fig.height=5, fig.cap='(ref:titanicExplorationClassFareCaption)', out.width = '100%', fig.align='center'}
pl5 <- ggplot(data = titanic) +
geom_mosaic(aes(x = product(survived, class), fill=survived)) +
labs(x="Passenger class", y="Survived?", title='Survival by class') +
scale_fill_manual(values = colors_discrete_drwhy(2)) +
theme_mos + theme_ema
pl6 <- ggplot(data = titanic) +
geom_mosaic(aes(x = product(survived, fare_cat), fill=survived)) +
labs(x="Fare", y="Survived?", title='Survival by fare category') +
scale_fill_manual(values = colors_discrete_drwhy(2)) +
theme_mos + theme_ema
pl5 + pl6
```
Finally, Figure \@ref(fig:titanicExplorationCountryHarbor) does not suggest any noteworthy trends.
(ref:titanicExplorationCountryHarborCaption) Survival according to the harbor of embarkation and country in the Titanic data.
```{r titanicExplorationCountryHarbor, warning=FALSE, message=FALSE, echo=FALSE, fig.width=9.5, fig.height=5, fig.cap='(ref:titanicExplorationCountryHarborCaption)', out.width = '100%', fig.align='center'}
pl7 <- ggplot(data = titanic) +
geom_mosaic(aes(x = product(survived, embarked), fill=survived)) +
labs(x="Embarked harbor", y="Survived?", title='Survival by harbor') +
scale_fill_manual(values = colors_discrete_drwhy(2)) +
theme_mos + theme_ema
pl8 <- ggplot(data = titanic) +
geom_mosaic(aes(x = product(survived, country_cat), fill=survived)) +
labs(x="Country", y="Survived?", title='Survival by country') +
scale_fill_manual(values = colors_discrete_drwhy(2)) +
theme_mos + theme_ema
pl7 + pl8
```
## Models for RMS Titanic, snippets for R
### Logistic regression model {#model-titanic-lmr}
The dependent variable of interest, *survived*, is binary. Thus, a natural choice is to start the predictive modelling with a logistic regression model. As there is no reason to expect a linear relationship between age and odds of survival, we use linear tail-restricted cubic splines, available in the `rcs()` function of the `rms` package [@rms], to model the effect of age. We also do not expect a linear relation for the *fare* variable but, because of its skewness (see Figure \@ref(fig:titanicExplorationHistograms)), we do not use splines for this variable. The results of the model are stored in model-object `titanic_lmr`, which will be used in subsequent chapters.\index{Logistic regression}
```{r, warning=FALSE, message=FALSE}
library("rms")
titanic_lmr <- lrm(survived == "yes" ~ gender + rcs(age) + class +
sibsp + parch + fare + embarked, titanic)
```
Note that we are interested not so much in the assessment of the model's predictive performance as in understanding how the model yields its predictions. This is why we do not split the data into training and testing subsets. Instead, the model is fitted to the entire dataset and will be examined on the same dataset.
### Random forest model {#model-titanic-rf}
As an alternative to the logistic regression model we consider a random forest model. Random forest modelling is known for good predictive performance, ability to grasp low-order variable interactions, and stability [@randomForestBreiman]. To fit the model, we apply the `randomForest()` function, with default settings, from the package with the same name [@randomForest]. In particular, we fit a model with the same set of explanatory variables as the logistic regression model (see Section \@ref(model-titanic-lmr)). The results of the random forest model are stored in model-object `titanic_rf`.\index{Random forest model}
```{r titanicRandomForest01, warning=FALSE, message=FALSE}
library("randomForest")
set.seed(1313)
titanic_rf <- randomForest(survived ~ class + gender + age +
sibsp + parch + fare + embarked, data = titanic)
```
<!-- For comparison purposes, we also consider a model with only three explanatory variables: *class*, *gender*, and *age*. The results of the model are stored in model-object `titanic_rf_v3`. -->
```{r titanicRandomForest02, warning=FALSE, message=FALSE, eval=FALSE, echo=FALSE}
set.seed(1313)
titanic_rf_v3 <- randomForest(survived ~ class + gender + age, data = titanic)
titanic_rf_v3
```
### Gradient boosting model {#model-titanic-gbm}
Additionally, we consider the gradient boosting model [@Friedman00greedyfunction]. Tree-based boosting models are known for being able to accommodate higher-order interactions between variables. We use the same set of six explanatory variables as for the logistic regression model (see Section \@ref(model-titanic-lmr)). To fit the gradient boosting model, we use function `gbm()` from the `gbm` package [@gbm]. The results of the model are stored in model-object `titanic_gbm`.\index{GBM| see {Gradient boosting machines}}\index{Gradient boosting machines}\index{Gradient boosting | model}
```{r titanicGBM01, warning=FALSE, message=FALSE}
library("gbm")
set.seed(1313)
titanic_gbm <- gbm(survived == "yes" ~ class + gender + age +
sibsp + parch + fare + embarked, data = titanic,
n.trees = 15000, distribution = "bernoulli")
```
### Support vector machine model {#model-titanic-svm}
Finally, we also consider a support vector machine (SVM) model [@svm95vapnik]. We use the C-classification mode. Again, we fit a model with the same set of explanatory variables as in the logistic regression model (see Section \@ref(model-titanic-lmr)). To fit the model, we use function `svm()` from the `e1071` package [@e1071]. The results of the model are stored in model-object `titanic_svm`.\index{package | e1071}\index{SVM| see {Support vector machine}}\index{Support vector machine}
```{r titanicSVM01, warning=FALSE, message=FALSE}
library("e1071")
titanic_svm <- svm(survived == "yes" ~ class + gender + age +
sibsp + parch + fare + embarked, data = titanic,
type = "C-classification", probability = TRUE)
```
### Models' predictions {#predictions-titanic}
Let us now compare predictions that are obtained from the different models. In particular, we compute the predicted probability of survival for Johnny D, an 8-year-old boy who embarked in Southampton and travelled in the first class with no parents or siblings, and with a ticket costing 72 pounds.
First, we create a data frame `johnny_d` that contains the data describing the passenger.
```{r titanicPred01, warning=FALSE, message=FALSE}
johnny_d <- data.frame(
class = factor("1st", levels = c("1st", "2nd", "3rd",
"deck crew", "engineering crew",
"restaurant staff", "victualling crew")),
gender = factor("male", levels = c("female", "male")),
age = 8, sibsp = 0, parch = 0, fare = 72,
embarked = factor("Southampton", levels = c("Belfast",
"Cherbourg","Queenstown","Southampton")))
```
Subsequently, we use the generic function `predict()` to obtain the predicted probability of survival for the logistic regression model.
```{r, warning=FALSE, message=FALSE}
(pred_lmr <- predict(titanic_lmr, johnny_d, type = "fitted"))
```
The predicted probability is equal to `r round(pred_lmr, 2)`.
We do the same for the remaining three models.
```{r, warning=FALSE, message=FALSE}
(pred_rf <- predict(titanic_rf, johnny_d, type = "prob"))
(pred_gbm <- predict(titanic_gbm, johnny_d, type = "response",
n.trees = 15000))
(pred_svm <- predict(titanic_svm, johnny_d, probability = TRUE))
```
As a result, we obtain the predicted probabilities of `r round(pred_rf[1,2], 2)`, `r round(pred_gbm, 2)`, and `r round(attr(pred_svm,"probabilities")[,2], 2)` for the random forest, gradient boosting, and SVM models, respectively. The models lead to different probabilities. Thus, it might be of interest to understand the reason for the differences, as it could help us decide which of the predictions we might want to trust. We will investigate this issue in the subsequent chapters.
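For convenience, the four predicted probabilities can also be collected in a single named vector. The sketch below reuses the prediction objects created above; as in the computations reported here, the second column of the probability matrix returned by the random forest and SVM models corresponds to survival.
```{r, eval=FALSE}
# Gather the predicted probabilities of survival for Johnny D
c(lmr = as.numeric(pred_lmr),
  rf  = as.numeric(pred_rf[1, 2]),
  gbm = as.numeric(pred_gbm),
  svm = as.numeric(attr(pred_svm, "probabilities")[, 2]))
```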
Note that, for some examples later in the book, we will use another observation (instance). We will call this passenger Henry.
```{r, warning=FALSE, message=FALSE}
henry <- data.frame(
class = factor("1st", levels = c("1st", "2nd", "3rd",
"deck crew", "engineering crew",
"restaurant staff", "victualling crew")),
gender = factor("male", levels = c("female", "male")),
age = 47, sibsp = 0, parch = 0, fare = 25,
embarked = factor("Cherbourg", levels = c("Belfast",
"Cherbourg","Queenstown","Southampton")))
```
For Henry, the predicted probability of survival is lower than for Johnny D.
```{r, warning=FALSE, message=FALSE}
predict(titanic_lmr, henry, type = "fitted")
predict(titanic_rf, henry, type = "prob")[,2]
predict(titanic_gbm, henry, type = "response", n.trees = 15000)
attr(predict(titanic_svm, henry, probability = TRUE),"probabilities")[,2]
```
### Models' explainers {#ExplainersTitanicRCode}
Model-objects created with different libraries may have different internal structures. Thus, we first have to create an "explainer,"\index{Explainer} i.e., an object that provides a uniform interface for different models. Toward this end, we use the `explain()` function from the `DALEX` package [@DALEX]. As mentioned in Section \@ref(infoDALEX), there is only one argument that is required by the function, i.e., `model`. The argument is used to specify the model-object with the fitted form of the model. However, the function allows additional arguments that extend its functionalities. In particular, the list of arguments includes the following: \index{package | DALEX}
* `data`, a data frame or matrix providing data to which the model is to be applied; if not provided (`data = NULL` by default), the data are extracted from the model-object. Note that the data object should not, in principle, contain the dependent variable.
* `y`, observed values of the dependent variable corresponding to the data given in the `data` object; if not provided (`y = NULL` by default), the values are extracted from the model-object;
* `predict_function`, a function that returns prediction scores; if not specified (`predict_function = NULL` by default), then a default `predict()` function is used (note that this may lead to errors);
* `residual_function`, a function that returns model residuals; if not specified (`residual_function = NULL` by default), then model residuals defined in equation \@ref(eq:modelResiduals) are calculated;
* `verbose`, a logical argument (`verbose = TRUE` by default) indicating whether diagnostic messages are to be printed;
* `precalculate`, a logical argument (`precalculate = TRUE` by default) indicating whether predicted values and residuals are to be calculated when the explainer is created. Note that this will also happen if `verbose = TRUE`. To skip the calculations, both `verbose` and `precalculate` should be set to `FALSE`;
* `model_info`, a named list (with components `package`, `version`, and `type`) providing information about the model; if not specified (`model_info = NULL` by default), `DALEX` tries to extract the information on its own;
* `type`, information about the type of the model, either `"classification"` (for a binary dependent variable) or `"regression"` (for a continuous dependent variable); if not specified (`type = NULL` by default), then the value of the argument is extracted from `model_info`;
* `label`, a unique name of the model; if not specified (`label = NULL` by default), then it is extracted from `class(model)`.
Application of function `explain()` provides an object of class `explainer`. It is a list of many components that include:
* `model`, the explained model;
* `data`, the data to which the model is applied;
* `y`, observed values of the dependent variable corresponding to `data`;
* `y_hat`, predictions obtained by applying `model` to `data`;
* `residuals`, residuals computed based on `y` and `y_hat`;
* `predict_function`, the function used to obtain the model's predictions;
* `residual_function`, the function used to obtain residuals;
* `class`, class/classes of the model;
* `label`, label of the model/explainer;
* `model_info`, a named list (with components `package`, `version`, and `type`) providing information about the model.
Thus, each explainer-object contains all elements needed to create a model explanation. The code below creates explainers for the models (see Sections \@ref(model-titanic-lmr)--\@ref(model-titanic-svm)) fitted to the Titanic data. Note that, in the `data` argument, we indicate the `titanic` data frame without the ninth column, i.e., without the *survived* variable. The variable is used in the `y` argument to explicitly define the binary dependent variable equal to 1 for survivors and 0 for passengers who did not survive.
```{r, warning=FALSE, message=FALSE, eval=FALSE}
titanic_lmr_exp <- explain(model = titanic_lmr,
data = titanic[, -9],
y = titanic$survived == "yes",
label = "Logistic Regression",
type = "classification")
titanic_rf_exp <- explain(model = titanic_rf,
data = titanic[, -9],
y = titanic$survived == "yes",
label = "Random Forest")
titanic_gbm_exp <- explain(model = titanic_gbm,
data = titanic[, -9],
y = titanic$survived == "yes",
label = "Generalized Boosted Regression")
titanic_svm_exp <- explain(model = titanic_svm,
data = titanic[, -9],
y = titanic$survived == "yes",
label = "Support Vector Machine")
```
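Selected components of an explainer can be accessed directly with the `$` operator. The sketch below, assuming the explainers above have been created, inspects a few components of `titanic_rf_exp`:
```{r, eval=FALSE}
titanic_rf_exp$label            # label of the explainer
head(titanic_rf_exp$y_hat)      # predictions obtained for `data`
head(titanic_rf_exp$residuals)  # residuals based on `y` and `y_hat`
```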
```{r eval=FALSE, echo=FALSE}
# saveToLocalRepo(model_titanic_lmr, repoDir = "models")
# "56d8a46955e91f0472243e1af8021b96"
# saveToLocalRepo(explain_titanic_lmr, repoDir = "models")
# "ff1cd6221c34ea70a9e033b5725c9585"
# saveToLocalRepo(titanic_rf, repoDir = "models")
# "31570ec57a3b72d3ec83a5f9b22cbaaa"
# saveToLocalRepo(explain_titanic_rf, repoDir = "models")
# "6ed54968790fbbe291acceb7dd6bc2ad"
# saveToLocalRepo(titanic_rf_v3, repoDir = "models")
# "855c117e1d08e793b820da14ccfdc7a5"
# saveToLocalRepo(explain_titanic_rf_v3, repoDir = "models")
# "5b32a9ed8ce5a1d63833706dbe38e221"
# saveToLocalRepo(titanic_gbm, repoDir = "models")
# "0854469e60467cb25c3b7c48da5fd3dd"
# saveToLocalRepo(explain_titanic_gbm, repoDir = "models")
# "87271254388a263c90ef3fa3a7a806ee"
# saveToLocalRepo(titanic_svm, repoDir = "models")
# "be26e200fd6453088f5db791ac07471c"
# saveToLocalRepo(explain_titanic_svm, repoDir = "models")
# "21966045bff86bba565814e8f1a18384"
```
### List of model-objects {#ListOfModelsTitanic}
In the previous sections, we have built four predictive models for the Titanic dataset. The models will be used in the rest of the book to illustrate model-explanation methods and tools.
For the ease of reference, we summarize the models in Table \@ref(tab:archivistHooksOfModelsTitanic). The binary model-objects can be downloaded by using the indicated `archivist` hooks [@archivist]. By calling a function specified in the last column of the table, one can restore a selected model in the local R environment, as illustrated in the example below the table.\index{package | archivist}
Table: (\#tab:archivistHooksOfModelsTitanic) Predictive models created for the `titanic` dataset. All models are fitted by using the following variables: *gender*, *age*, *class*, *sibsp*, *parch*, *fare*, *embarked*.
| Model name / library | Link to this object |
|-----------------------|------------|
| `titanic_lmr` | Get the model: `archivist::` |
| `rms:: lrm` v.5.1.3 | `aread("pbiecek/models/58b24")`. |
| `titanic_rf` | Get the model: `archivist::` |
| `randomForest:: randomForest` v.4.6.14 | `aread("pbiecek/models/4e0fc")`. |
| `titanic_gbm` | Get the model: `archivist::` |
| `gbm:: gbm` v.2.1.5 | `aread("pbiecek/models/b7078")`. |
| `titanic_svm` | Get the model: `archivist::` |
| `e1071:: svm` v.1.7.3 | `aread("pbiecek/models/9c27f")`. |
<!--
Get the explainer: `archivist:: aread("pbiecek/models/ff1cd")`
Get the explainer: `archivist:: aread("pbiecek/models/6ed54")`
Get the explainer: `archivist:: aread("pbiecek/models/87271")`
Get the explainer: `archivist:: aread("pbiecek/models/21966")`
| `titanic_rf_v3` | `randomForest:: randomForest` v.4.6.14 | gender, age, class | Get the model: `archivist:: aread("pbiecek/models/293e8")`. Get the explainer: `archivist:: aread("pbiecek/models/5b32a")` |
-->
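For example, the logistic regression model can be restored by combining the two parts of the corresponding hook from the table; the sketch below assumes that the `archivist` package is installed and the remote repository is accessible.
```{r, eval=FALSE}
# Restore the model-object from the remote archivist repository
titanic_lmr <- archivist::aread("pbiecek/models/58b24")
```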
Table \@ref(tab:archivistHooksOfDataFramesTitanic) summarizes the data frames that will be used in examples in the subsequent chapters.
Table: (\#tab:archivistHooksOfDataFramesTitanic) Data frames created for the Titanic use-case. All frames include the following variables: *gender*, *age*, *class*, *embarked*, *country*, *fare*, *sibsp*, *parch*. The `titanic` data frame also includes the *survived* variable.
| Description | Link to this object |
|--------------|-----------------------|
| `titanic` dataset with 2207 observations with imputed missing values | `archivist:: aread("pbiecek/models/27e5c")` |
| `johnny_d` 8-year-old boy from the 1st class without parents, paid 72 pounds, embarked in Southampton | `archivist:: aread("pbiecek/models/e3596")` |
| `henry` 47-year-old male from the 1st class, travelled alone, paid 25 pounds, embarked in Cherbourg | `archivist:: aread("pbiecek/models/a6538")` |
## Models for RMS Titanic, snippets for Python
Titanic data are provided in the `titanic` dataset, which is available in the `dalex` library. The values of the dependent binary variable are given in the `survived` column; the remaining columns give the values of the explanatory variables that are used to construct the classifiers.
The following instructions load the `titanic` dataset and split it into the dependent variable `y` and the explanatory variables `X`. Note that, for the purpose of this example, we do not divide the data into training and testing sets. Instructions on how to analyze a model on data other than those used for training will be presented in subsequent chapters.
```{python, eval=FALSE}
import dalex as dx
titanic = dx.datasets.load_titanic()
X = titanic.drop(columns='survived')
y = titanic.survived
```
Dataset `X` contains numeric variables with different ranges (for instance, *age* and *fare*) and categorical variables. Machine-learning algorithms in the `sklearn` library require data in a numeric form. Therefore, before modelling, we use a pipeline that performs data pre-processing. In particular, we scale the continuous variables (*age*, *fare*, *parch*, and *sibsp*) and one-hot-encode the categorical variables (*gender*, *class*, *embarked*).
```{python, eval=FALSE}
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
preprocess = make_column_transformer(
(StandardScaler(), ['age', 'fare', 'parch', 'sibsp']),
(OneHotEncoder(), ['gender', 'class', 'embarked']))
```
### Logistic regression model {#model-titanic-python-lr}
To fit the logistic regression model (see Section \@ref(model-titanic-lmr)), we use the `LogisticRegression` algorithm from the `sklearn` library. By default, the implementation uses the ridge penalty, defined in \@ref(eq:ridgePenalty). For this reason it is important to scale continuous variables like `age` and `fare`.
The fitted model is stored in object `titanic_lr`, which will be used in subsequent chapters.
```{python, eval=FALSE}
from sklearn.linear_model import LogisticRegression
titanic_lr = make_pipeline(
preprocess,
LogisticRegression(penalty = 'l2'))
titanic_lr.fit(X, y)
```
<!----Note that we are not very much interested in the assessment of the model's predictive performance, but rather on understanding how does the model yield its predictions. This is why we do not split the data into the training and testing subsets. Instead, the model is fitted to the entire dataset and will be examined on the same dataset.
--->
### Random forest model {#model-titanic-python-rf}
To fit the random forest model (see Section \@ref(model-titanic-rf)), we use the `RandomForestClassifier` algorithm from the `sklearn` library. We use the default settings, except that trees are limited to a maximum depth of three levels and the number of trees is set to 500. The fitted model is stored in object `titanic_rf`.
```{python, eval=FALSE}
from sklearn.ensemble import RandomForestClassifier
titanic_rf = make_pipeline(
preprocess,
RandomForestClassifier(max_depth = 3, n_estimators = 500))
titanic_rf.fit(X, y)
```
### Gradient boosting model {#model-titanic-python-gbm}
To fit the gradient boosting model (see Section \@ref(model-titanic-gbm)), we use the `GradientBoostingClassifier` algorithm from the `sklearn` library. We use the default settings, with the number of trees in the ensemble set to 100. The fitted model is stored in object `titanic_gbc`.
```{python, eval=FALSE}
from sklearn.ensemble import GradientBoostingClassifier
titanic_gbc = make_pipeline(
preprocess,
GradientBoostingClassifier(n_estimators = 100))
titanic_gbc.fit(X, y)
```
### Support vector machine model {#model-titanic-python-svm}
Finally, to fit the SVM model in the C-Support Vector Classification mode (see Section \@ref(model-titanic-svm)), we use the `SVC` algorithm from the `sklearn` library, which is based on `libsvm`. The fitted model is stored in object `titanic_svm`. \index{package | libsvm}
```{python, eval=FALSE}
from sklearn.svm import SVC
titanic_svm = make_pipeline(
preprocess,
SVC(probability = True))
titanic_svm.fit(X, y)
```
### Models' predictions {#predictions-titanic-python}
Let us now compare predictions that are obtained from the different models. In particular, we compute the predicted probability of survival for Johnny D, an 8-year-old boy who embarked in Southampton and travelled in the first class with no parents or siblings, and with a ticket costing 72 pounds (see Section \@ref(predictions-titanic)).
First, we create a data frame `johnny_d` that contains the data describing the passenger.
```{python, eval=FALSE}
import pandas as pd
johnny_d = pd.DataFrame({'gender': ['male'],
'age' : [8],
'class' : ['1st'],
'embarked': ['Southampton'],
'fare' : [72],
'sibsp' : [0],
'parch' : [0]},
index = ['JohnnyD'])
```
Subsequently, we use the method `predict_proba()` to obtain the predicted probability of survival for the logistic regression model.
```{python, eval=FALSE}
titanic_lr.predict_proba(johnny_d)
# array([[0.35884528, 0.64115472]])
```
We do the same for the three remaining models.
```{python, eval=FALSE}
titanic_rf.predict_proba(johnny_d)
# array([[0.63028556, 0.36971444]])
titanic_gbc.predict_proba(johnny_d)
# array([[0.1567194, 0.8432806]])
titanic_svm.predict_proba(johnny_d)
# array([[0.78308146, 0.21691854]])
```
We also create a data frame for passenger Henry (see Section \@ref(predictions-titanic)) and compute his predicted probability of survival.
```{python, eval=FALSE}
henry = pd.DataFrame({'gender' : ['male'],
'age' : [47],
'class' : ['1st'],
'embarked': ['Cherbourg'],
'fare' : [25],
'sibsp' : [0],
'parch' : [0]},
index = ['Henry'])
titanic_lr.predict_proba(henry)
# array([[0.56798421, 0.43201579]])
titanic_rf.predict_proba(henry)
# array([[0.69917845, 0.30082155]])
titanic_gbc.predict_proba(henry)
# array([[0.78542886, 0.21457114]])
titanic_svm.predict_proba(henry)
# array([[0.81725832, 0.18274168]])
```
### Models' explainers {#ExplainersTitanicPythonCode}
The Python-code examples shown above use functions from the `sklearn` library, which facilitates working with models in a uniform way. However, we may want to, or have to, work with models built by using other libraries. To simplify the task, the `dalex` library wraps models in objects of class `Explainer` that contain, in a uniform way, all the functions necessary for working with models.
There is only one argument that is required by the `Explainer()` constructor, i.e., `model`. However, the constructor allows additional arguments that extend its functionalities. In particular, the list of arguments includes the following:
* `data`, a data frame or `numpy.ndarray` providing data to which the model is to be applied; if it is not an object of the `pandas.DataFrame` class, it will be converted to one;
* `y`, values of the dependent variable/target variable corresponding to the data given in the `data` object;
* `predict_function`, a function that returns prediction scores; if not specified, then `dalex` will make a guess which function should be used (`predict()`, `predict_proba()`, or something else). Note that this function should work on `pandas.DataFrame` objects; if it works only on `numpy.ndarray` then an appropriate conversion should also be included in `predict_function`.
* `residual_function`, a function that returns model residuals;
* `label`, a unique name of the model;
* `model_class`, the class of actual model;
* `verbose`, a logical argument (`verbose = True` by default) indicating whether diagnostic messages are to be printed;
* `model_type`, information about the type of the model, either `"classification"` (for a binary dependent variable) or `"regression"` (for a continuous dependent variable);
* `model_info`, a dictionary with additional information about the model.
Application of constructor `Explainer()` provides an object of class `Explainer`. It is an object with many components that include:
* `model`, the explained model;
* `data`, the data to which the model is applied;
* `y`, observed values of the dependent variable corresponding to `data`;
* `y_hat`, predictions obtained by applying `model` to `data`;
* `residuals`, residuals computed based on `y` and `y_hat`;
* `predict_function`, the function used to obtain the model's predictions;
* `residual_function`, the function used to obtain residuals;
* `class`, class/classes of the model;
* `label`, label of the model/explainer;
* `model_info`, a dictionary (with components `package`, `version`, and `type`) providing information about the model.
Thus, each explainer-object contains all elements needed to create a model explanation. The code below creates explainers for the models (see Sections \@ref(model-titanic-python-lr)--\@ref(model-titanic-python-svm)) fitted to the Titanic data.
```{python, eval=FALSE}
titanic_rf_exp = dx.Explainer(titanic_rf,
X, y, label = "Titanic RF Pipeline")
titanic_lr_exp = dx.Explainer(titanic_lr,
X, y, label = "Titanic LR Pipeline")
titanic_gbc_exp = dx.Explainer(titanic_gbc,
X, y, label = "Titanic GBC Pipeline")
titanic_svm_exp = dx.Explainer(titanic_svm,
X, y, label = "Titanic SVM Pipeline")
```
When an explainer is created, the specified model and data are tested for consistency.
Diagnostic information is printed on the screen.
The following output shows diagnostic information for the `titanic_rf` model.
```
Preparation of a new explainer is initiated
-> data : 2207 rows 7 cols
-> target variable : Argument 'y' was converted to a numpy.ndarray.
-> target variable : 2207 values
-> model_class : sklearn.pipeline.Pipeline (default)
-> label : Titanic RF Pipeline
-> predict function : <yhat_proba> will be used (default)
-> predicted values : min = 0.171, mean = 0.322, max = 0.893
-> residual function : difference between y and yhat (default)
-> residuals : min = -0.826, mean = 4.89e-05, max = 0.826
-> model_info : package sklearn
A new explainer has been created!
```
## Apartment prices {#ApartmentDataset}
Predicting house prices is a common exercise used in machine-learning courses. Various datasets for house prices are available at websites like [Kaggle](https://www.kaggle.com) or [UCI Machine Learning Repository](https://archive.ics.uci.edu). \index{dataset ! Apartments in Warsaw}
In this book, we will work with an interesting variant of this problem. The `apartments` dataset contains simulated data that match key characteristics of real apartments in Warsaw, the capital of Poland. However, the dataset is created in a way that two very different models, namely linear regression and random forest, offer almost exactly the same overall accuracy of predictions. The natural question is then: which model should we choose? We will show that model-explanation tools provide important insight into the key model characteristics and are helpful in model selection.
The dataset is available in the `DALEX` package in R and the `dalex` library in Python. It contains 1000 observations (apartments) and six variables:
* *m2.price*, apartment's price per square meter (in EUR), a numerical variable in the range of 1607--6595;
* *construction.year*, the year of construction of the block of flats in which the apartment is located, a numerical variable in the range of 1920--2010;
* *surface*, apartment's total surface in square meters, a numerical variable in the range of 20--150;
* *floor*, the floor on which the apartment is located (ground floor taken to be the first floor), a numerical integer variable with values ranging from 1 to 10;
* *no.rooms*, the total number of rooms, a numerical integer variable with values ranging from 1 to 6;
* *district*, a factor with 10 levels indicating the district of Warsaw where the apartment is located.
The first six rows of this dataset are presented in the table below.
```{r, warning=FALSE, message=FALSE, echo=FALSE}
#ktab <- kable(head(apartments), "html")
ktab <- kable(head(apartments))
kable_styling(ktab, bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```
Models considered for this dataset will use *m2.price* as the (continuous) dependent variable. Models' predictions will be validated on a set of 9000 apartments included in the data frame `apartments_test`.
Note that, usually, the training dataset is larger than the testing one. In this example, we deliberately use a small training set, so that model selection may be more challenging.
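Both data frames are included in the `DALEX` package, so the testing data can be previewed directly, as in the sketch below:
```{r, eval=FALSE}
library("DALEX")
dim(apartments_test)  # 9000 observations, six variables
head(apartments_test)
```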
### Data exploration {#exploration-apartments}
Note that `apartments` is an artificial dataset created to illustrate and explain differences between random forest and linear regression. Hence, the structure of the data, the form and strength of association between variables, the plausibility of distributional assumptions, etc., are less problematic than in a real-life dataset. In fact, all these characteristics of the data are known. Nevertheless, we present some data exploration below to illustrate the important aspects of the data.
The variable of interest is *m2.price*, the price per square meter. The histogram presented in Figure \@ref(fig:apartmentsExplorationMi2) indicates that the distribution of the variable is slightly skewed to the right.
(ref:apartmentsExplorationMi2Caption) Distribution of the price per square meter in the apartment-prices data.
```{r apartmentsExplorationMi2, warning=FALSE, message=FALSE, echo=FALSE, fig.width=5.5, fig.height=5, fig.cap='(ref:apartmentsExplorationMi2Caption)', out.width='60%', fig.align='center'}
ggplot(data = apartments) +
geom_histogram(aes(m2.price), binwidth = 100, color = "white") +
labs(x="Price per square meter", title='Histogram of apartment prices') +
theme_mos2 + theme_ema
```
Figure \@ref(fig:apartmentsMi2Construction) suggests (possibly) a non-linear relationship between *construction.year* and *m2.price* and a linear relation between *surface* and *m2.price*.
(ref:apartmentsMi2ConstructionCaption) Apartment-prices data. Price per square meter vs. year of construction (left-hand-side panel) and vs. surface (right-hand-side panel).
```{r apartmentsMi2Construction, warning=FALSE, message=FALSE, echo=FALSE, fig.width=9.5, fig.height=5, fig.cap='(ref:apartmentsMi2ConstructionCaption)', out.width = '100%', fig.align='center'}
pa1 <- ggplot(data = apartments, aes(construction.year, m2.price)) +
geom_point(size = 0.3) +
geom_smooth(se = FALSE, size=1, color = "#371ea3") +
labs(y="Price per square meter", x = "Construction year",
title='Apartment price per m2 vs. construction year') + theme_mos2 + theme_ema
pa2 <- ggplot(data = apartments, aes(surface, m2.price)) +
geom_point(size = 0.3) +
geom_smooth(se = FALSE, size=1, color = "#371ea3") +
labs(y="Price per square meter", x = "Surface (square meter)",
title='Apartment price per m2 vs. surface') + theme_mos2 + theme_ema
pa1 + pa2
```
Figure \@ref(fig:apartmentsMi2Floor) indicates that the relationship between *floor* and *m2.price* is also close to linear, as is the association between *no.rooms* and *m2.price*.
(ref:apartmentsMi2FloorCaption) Apartment-prices data. Price per square meter vs. floor (left-hand-side panel) and vs. number of rooms (right-hand-side panel).
```{r apartmentsMi2Floor, warning=FALSE, message=FALSE, echo=FALSE, fig.width=9.5, fig.height=5, fig.cap='(ref:apartmentsMi2FloorCaption)', out.width = '100%', fig.align='center'}
pa3 <- ggplot(data = apartments, aes(floor, m2.price)) +
geom_boxplot(aes(group = floor), se = FALSE, size=0.5, fill = "#371ea3", color = "white", alpha=0.3) +
geom_jitter(size = 0.3, width = 0.15, height = 0) +
# geom_smooth(se = FALSE, size=1, color = "#371ea3") +
labs(y="Price per square meter", x = "Floor", title='Apartment price per m2 vs. floor') + theme_mos2 + scale_x_continuous(breaks = 1:10) + theme_ema
pa4 <- ggplot(data = apartments, aes(no.rooms, m2.price, group = no.rooms)) +
geom_boxplot(aes(group = no.rooms), se = FALSE, size=0.5, fill = "#371ea3", color = "white", alpha=0.3) +
geom_jitter(size = 0.3, width = 0.15, height = 0) +
labs(y="Price per square meter", x = "Number of rooms", title='Apartment price per m2 vs. number of rooms') + theme_mos2 + scale_x_continuous(breaks = 1:6) + theme_ema
pa3 + pa4
```
Figure \@ref(fig:apartmentsSurfaceNorooms) shows that *surface* and *number of rooms* are positively associated and that prices depend on the district. In particular, box plots in Figure \@ref(fig:apartmentsSurfaceNorooms) indicate that the highest prices per square meter are observed in Srodmiescie (Downtown).
(ref:apartmentsSurfaceNoroomsCaption) Apartment-prices data. Surface vs. number of rooms (left-hand-side panel) and price per square meter for different districts (right-hand-side panel).
```{r apartmentsSurfaceNorooms, warning=FALSE, message=FALSE, echo=FALSE, fig.width=9.5, fig.height=5, fig.cap='(ref:apartmentsSurfaceNoroomsCaption)', out.width = '100%', fig.align='center'}
pa5 <- ggplot(data = apartments, aes(no.rooms, surface, group = no.rooms)) +
geom_boxplot(aes(group = no.rooms), se = FALSE, size=0.5, fill = "#371ea3", color = "white", alpha=0.3) +
geom_jitter(size = 0.3, width = 0.15, height = 0) +
labs(y="Surface (square meter)", x = "Number of rooms", title='Surface vs. number of rooms') + theme_mos2 + scale_x_continuous(breaks = 1:6) + theme_ema
apartments$district <- reorder(apartments$district, apartments$m2.price, mean)
pa6 <- ggplot(data = apartments, aes(district, m2.price)) +
geom_boxplot(aes(group = district), se = FALSE, size=0.5, fill = "#371ea3", color = "white", alpha=0.3) +
geom_jitter(size = 0.3, width = 0.15, height = 0) +
labs(y="Price per square meter", x = "", title='Price per square meter for districts') + theme_drwhy_vertical() %+replace%
theme(legend.position = "none",
axis.text.x = element_text(angle = 0, size = 12, vjust=1, hjust=1),
axis.text.y = element_text(size = 12),
axis.title = element_text(size = 12)) + coord_flip() + theme_ema
pa5 + pa6
```
## Models for apartment prices, snippets for R
### Linear regression model {#model-Apartments-lr}
The dependent variable of interest, *m2.price*, is continuous. Thus, a natural choice to build a predictive model is linear regression. We treat all the other variables in the `apartments` data frame as explanatory and include them in the model. To fit the model, we apply the `lm()` function. The results of the model are stored in model-object `apartments_lm`.\index{Regression}\index{Regression|model}
```{r, warning=FALSE, message=FALSE}
library("DALEX")
apartments_lm <- lm(m2.price ~ ., data = apartments)
anova(apartments_lm)
```
### Random forest model {#model-Apartments-rf}
As an alternative to linear regression, we consider a random forest model. Again, we treat all the variables in the `apartments` data frame other than *m2.price* as explanatory and include them in the model. To fit the model, we apply the `randomForest()` function, with default settings, from the package with the same name [@randomForest]. The results of the model are stored in model-object `apartments_rf`.
```{r, warning=FALSE, message=FALSE, echo = TRUE, eval = FALSE}
library("randomForest")
set.seed(72)
apartments_rf <- randomForest(m2.price ~ ., data = apartments)
```
<!--
```{r, warning=FALSE, message=FALSE, echo=FALSE}
apartments_rf
```
-->
### Support vector machine model {#model-Apartments-svm}
Finally, we consider an SVM model, with all the variables in the `apartments` data frame other than *m2.price* treated as explanatory. To fit the model, we use the `svm()` function, with default settings, from package `e1071` [@e1071]. The results of the model are stored in model-object `apartments_svm`.
```{r, warning=FALSE, message=FALSE, eval = FALSE}
library("e1071")
apartments_svm <- svm(m2.price ~ construction.year + surface + floor +
no.rooms + district, data = apartments)
```
<!--
```{r, warning=FALSE, message=FALSE, echo=FALSE}
apartments_svm
```
-->
### Models' predictions {#predictionsApartments}
The `predict()` function calculates predictions for a specific model. In the example below, we use model-objects `apartments_lm`, `apartments_rf`, and `apartments_svm` to calculate predictions for prices of the apartments from the `apartments_test` data frame. Note that, for brevity's sake, we compute the predictions only for the first six observations from the data frame.
The actual prices for the first six observations from `apartments_test` are provided below.
```{r, warning=FALSE, message=FALSE}
apartments_test$m2.price[1:6]
```
Predicted prices for the linear regression model are as follows:
```{r, warning=FALSE, message=FALSE}
predict(apartments_lm, apartments_test[1:6,])
```
Predicted prices for the random forest model take the following values:
```{r, warning=FALSE, message=FALSE}
predict(apartments_rf, apartments_test[1:6,])
```
Predicted prices for the SVM model are as follows:
```{r, warning=FALSE, message=FALSE}
predict(apartments_svm, apartments_test[1:6,])
```
By using the code presented below, we summarize the predictive performance of the linear regression and random forest models by computing the square root of the mean-squared-error (RMSE). For a "perfect" predictive model, which would predict all observations exactly, RMSE should be equal to 0. More information about RMSE can be found in Section \@ref(modelPerformanceMethodCont). \index{Mean | squared-error}
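For a test set with $n$ observations, RMSE is defined as
$$
\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\widehat{y}_i - y_i\right)^2},
$$
where $y_i$ and $\widehat{y}_i$ denote the observed and predicted values of the dependent variable for the $i$-th observation, respectively. This is the quantity computed by the code below.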
```{r, warning=FALSE, message=FALSE}
predicted_apartments_lm <- predict(apartments_lm, apartments_test)
sqrt(mean((predicted_apartments_lm - apartments_test$m2.price)^2))
predicted_apartments_rf <- predict(apartments_rf, apartments_test)
sqrt(mean((predicted_apartments_rf - apartments_test$m2.price)^2))
```
```{r, warning=FALSE, message=FALSE, echo=FALSE}
predicted_apartments_lm <- predict(apartments_lm, apartments_test)
rmsd_lm <- sqrt(mean((predicted_apartments_lm - apartments_test$m2.price)^2))
predicted_apartments_rf <- predict(apartments_rf, apartments_test)
rmsd_rf <- sqrt(mean((predicted_apartments_rf - apartments_test$m2.price)^2))
```
For the random forest model, RMSE is equal to `r round(rmsd_rf, 1)`. It is almost identical to the RMSE for the linear regression model, which is equal to `r round(rmsd_lm, 1)`. Thus, the question we may face is: should we choose the more complex but flexible random forest model, or the simpler and easier to interpret linear regression model? In the subsequent chapters, we will try to provide an answer to this question. In particular, we will show that a proper model exploration may help to discover weak and strong sides of any of the models and, in consequence, allow the creation of a new model, with better performance than either of the two.
### Models' explainers {#ExplainersApartmentsRCode}
The code presented below creates explainers for the models (see Sections \@ref(model-Apartments-lr)--\@ref(model-Apartments-svm)) fitted to the apartment-prices data. Note that we use the `apartments_test` data frame without the first column, i.e., the *m2.price* variable, in the `data` argument. This will be the dataset to which the model will be applied (see Section \@ref(ExplainersTitanicRCode)). The *m2.price* variable is explicitly specified as the dependent variable in the `y` argument (see Section \@ref(ExplainersTitanicRCode)).
```{r, warning=FALSE, message=FALSE, eval=FALSE}
apartments_lm_exp <- explain(model = apartments_lm,
data = apartments_test[,-1],
y = apartments_test$m2.price,
label = "Linear Regression")
apartments_rf_exp <- explain(model = apartments_rf,
data = apartments_test[,-1],
y = apartments_test$m2.price,
label = "Random Forest")
apartments_svm_exp <- explain(model = apartments_svm,
data = apartments_test[,-1],
y = apartments_test$m2.price,
label = "Support Vector Machine")
```
### List of model-objects {#ListOfModelsApartments}
In Sections \@ref(model-Apartments-lr)--\@ref(model-Apartments-svm), we have built three predictive models for the `apartments` dataset. The models will be used in the rest of the book to illustrate the model-explanation methods and tools.
For the ease of reference, we summarize the models in Table \@ref(tab:archivistHooksOfModelsApartments). The binary model-objects can be downloaded by using the indicated `archivist` hooks [@archivist]. By calling a function specified in the last column of the table, one can restore a selected model in a local R environment.\index{package | stats}
Table: (\#tab:archivistHooksOfModelsApartments) Predictive models created for the dataset Apartment prices. All models are fitted by using *construction.year*, *surface*, *floor*, *no.rooms*, and *district* as explanatory variables.
| Model name / library | Link to this object |
|-----------------------|------------|
| `apartments_lm` | Get the model: `archivist::` |
| `stats:: lm` v.3.5.3 | `aread("pbiecek/models/55f19")`. |
| `apartments_rf` | Get the model: `archivist::` |
| `randomForest:: randomForest` v.4.6.14 | `aread("pbiecek/models/fe7a5")`. |
| `apartments_svm` | Get the model: `archivist::` |
| `e1071:: svm` v.1.7.3 | `aread("pbiecek/models/d2ca0")`. |
<!--
Get the explainer: `archivist:: aread("pbiecek/models/78d4e")`
Get the explainer: `archivist:: aread("pbiecek/models/b1739")`
Get the explainer: `archivist:: aread("pbiecek/models/16602")`
-->
## Models for apartment prices, snippets for Python
Apartment-prices data are provided in the `apartments` dataset, which is available in the `dalex` library. The values of the continuous dependent variable are given in the `m2_price` column; the remaining columns give the values of the explanatory variables that are used to construct the predictive models.
The following instructions load the `apartments` dataset and split it into the dependent variable `y` and the explanatory variables `X`.
```{python, eval=FALSE}
import dalex as dx
apartments = dx.datasets.load_apartments()
X = apartments.drop(columns='m2_price')
y = apartments['m2_price']
```
Dataset `X` contains numeric variables with different ranges (for instance, *surface* and *no_rooms*) and a categorical variable (*district*). Machine-learning algorithms in the `sklearn` library require data in a numeric form. Therefore, before modelling, we use a pipeline that performs data pre-processing. In particular, we scale the continuous variables (*construction_year*, *surface*, *floor*, and *no_rooms*) and one-hot-encode the categorical variable (*district*).
```{python, eval=FALSE}
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
preprocess = make_column_transformer(
(StandardScaler(), ['construction_year', 'surface', 'floor', 'no_rooms']),
(OneHotEncoder(), ['district']))
```
### Linear regression model {#model-Apartments-python-lr}
To fit the linear regression model (see Section \@ref(model-Apartments-lr)), we use the `LinearRegression` algorithm from the `sklearn` library. The fitted model is stored in object `apartments_lm`, which will be used in subsequent chapters.
```{python, eval=FALSE}
from sklearn.linear_model import LinearRegression
apartments_lm = make_pipeline(
preprocess,
LinearRegression())
apartments_lm.fit(X, y)
```
### Random forest model {#model-Apartments-python-rf}
To fit the random forest model (see Section \@ref(model-Apartments-rf)), we use the `RandomForestRegressor` algorithm from the `sklearn` library. We apply the default settings, except that trees are limited to a maximum depth of three levels and the number of trees in the random forest is set to 500. The fitted model is stored in object `apartments_rf` for the purpose of illustrations in subsequent chapters.
```{python, eval=FALSE}
from sklearn.ensemble import RandomForestRegressor
apartments_rf = make_pipeline(
preprocess,
RandomForestRegressor(max_depth = 3, n_estimators = 500))
apartments_rf.fit(X, y)
```
### Support vector machine model {#model-Apartments-python-svm}
Finally, to fit the SVM model (see Section \@ref(model-Apartments-svm)), we use the `SVR` algorithm from the `sklearn` library. The fitted model is stored in object `apartments_svm`, which will be used in subsequent chapters.
```{python, eval=FALSE}
from sklearn.svm import SVR
apartments_svm = make_pipeline(
preprocess,
SVR())
apartments_svm.fit(X, y)
```
### Models' predictions {#predictions-apartments-python}
Let us now compare predictions that are obtained from the different models for the `apartments_test` data. In the code below, we use the `predict()` method to obtain the predicted price per square meter for the linear regression model.
```{python, eval=FALSE}
apartments_test = dx.datasets.load_apartments_test()
apartments_test = apartments_test.drop(columns='m2_price')
apartments_lm.predict(apartments_test)
# array([4820.00943156, 3292.67756996, 2717.90972101, ..., 4836.44370353,
# 3191.69063189, 5157.93680175])
```
In a similar way, we obtain the predictions for the two remaining models.
```{python, eval=FALSE}
apartments_rf.predict(apartments_test)
# array([4708, 3819, 2273, ..., 4708, 4336, 4916])
```
```{python, eval=FALSE}
apartments_svm.predict(apartments_test)
# array([3344.48570564, 3323.01215313, 3321.97053977, ..., 3353.19750146,
# 3383.51743883, 3376.31070911])
```
### Models' explainers {#ExplainersApartmentsPythonCode}
The Python-code examples presented for the models for the apartment-prices dataset use functions from the `sklearn` library, which facilitates working with models in a uniform way. However, we may want to, or have to, work with models built by using other libraries. To simplify the task, the `dalex` library wraps models in objects of class `Explainer` that contain, in a uniform way, all the functions necessary for working with models (see Section \@ref(ExplainersTitanicPythonCode)). The code below creates explainer-objects for the models (see Sections \@ref(model-Apartments-python-lr)--\@ref(model-Apartments-python-svm)) fitted to the apartment-prices data.
```{python, eval=FALSE}
apartments_lm_exp = dx.Explainer(apartments_lm, X, y,
label = "Apartments LM Pipeline")
apartments_rf_exp = dx.Explainer(apartments_rf, X, y,
label = "Apartments RF Pipeline")
apartments_svm_exp = dx.Explainer(apartments_svm, X, y,
label = "Apartments SVM Pipeline")
```
When an explainer is created, the specified model and data are tested for consistency.
Diagnostic information is printed on the screen.
The following output shows diagnostic information for the `apartments_lm` model.
```
Preparation of a new explainer is initiated
-> data : 1000 rows 5 cols
-> target variable : Argument 'y' converted to a numpy.ndarray.
-> target variable : 1000 values
-> model_class : sklearn.pipeline.Pipeline (default)
-> label : Apartments LM Pipeline
-> predict function : <yhat at 0x117090840> will be used (default)
-> predicted values : min = 1.78e+03, mean = 3.49e+03, max = 6.18e+03
-> residual function : difference between y and yhat (default)
-> residuals : min = -2.47e+02, mean = 2.06e-13, max = 4.69e+02
-> model_info : package sklearn
A new explainer has been created!
```