forked from KimmoVehkalahti/IODS-project
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathchapter3.Rmd
731 lines (564 loc) · 35.6 KB
/
chapter3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
---
title: "IODS"
subtitle: "Assignment 3"
author: "Rong Guang"
date: "16/11/2022"
output:
html_document:
fig_caption: yes
theme: flatly
highlight: haddock
toc: true
toc_depth: 3
toc_float: true
number_section: false
bibliography: citations.bib
---
# **Chapter 3: Logistic regression**
# 1 Preparing
## 1.1 read the data set
```{r}
library(tidyverse)
alc <- read_csv(file = "data/alc.csv")
```
## 1.2 check the data set
```{r}
glimpse(alc)
```
This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).
# 2 Hypothesis
## 2.1 introduction
Despite the health risk and public harm associated with heavy drinking, alcohol is the most commonly used substance in developed countries [@Flor2020]. A large scale longitudinal study has identified Finland as the only Nordic country whose alcohol-attributable harms has increased [@Room2013]. Given that alcohol consumption typically starts in late adolescence or early adulthood [@Lees2020], measures to detect alcohol misuse among young people, especially college students, should be a top public health priority. Identifying a comprehensive set of early life factors associated with college students' alcohol use disorders could be an important starting point.
## 2.2 literature review
College students typically spend a tremendous amount of time with their family members, emphasizing the influence of family quality on any type of habit acquisitions. Evidence has shown family relationship quality is strongly correlated with early alcohol use [@Kelly2011; @Brody1993], and the effect is interactive with gender [@Kelly2011]. Since studying also comprises an important part of college life, it is important to evaluate how college life and alcohol use interact with each other. An 21 year follow-up of 3,478 Australian since they were child has found level of academic performance predicts their drinking problems, independently of a selected group of individual and family con-founders [@Hayatbakhsh2011]. College students start to build up their social networks. An increased exposure to social communications is reasonably expected among them, which might incur alcohol involvement. A survey has found typical social drinking contexts were associated with men's average daily number of drinks and frequency of drunkenness, indicating social communications, interacted with gender, might have influence on college students' alcohol usage[@Senchak1998].
## 2.3 Proposing hypothesis
According to the literature review, 4 potential early-life factors is identified to predict excessive alcohol usage among college students. They are *a.* family relationship quality (interactive with gender); *b.* school performance; *c.* social communication (interactive with gender). I herein proposed a 3-factor alcohol high-use model for college students and test it using a secondary data set collected for other purposes.
In the data set, variables including gender, quality of family relationships ("famrel"), number of school absences ("absences"), weekly study time ("studytime") and frequency of going out with friends ("goout") could be candidate indicators for the current model. The variable "gender" and "famrel"'s relevance to the predictors are self-explanatory. School performance includes college students' in-class and off-class performance, which could be reflected by variables "absences" and "studytime", respectively. Variable "goout" captures the involvement of social activity, which is a good indicator to social communication. Note that base on the well-reported evidence introduced above, gender will not enter the model independently. Instead, it will comprise interaction terms with family relationship quality and social communication, respectively, and then enter the model.
# 3 Data exploration
## 3.1 The distribution of the chosen variables
```{r}
wrap.lab <- c("in-class performance (absences, smaller is better)",
"family relationship quality (famrel)",
"social (goout)",
"off-class performance (studytime)")
names(wrap.lab) <- c("absences", "studytime", "famrel", "goout")
alc %>%
select(absences, studytime, famrel, goout) %>%
pivot_longer(everything(),
names_to = "variable",
values_to = "value") %>%
ggplot(aes(x = value))+
geom_bar(width =1, fill = "white", color = "black")+
facet_wrap(~variable, scales = "free",
labeller = labeller(variable = wrap.lab))+
scale_fill_brewer(palette = "Greys") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "lightgrey"))+
labs(title = "Distribution of the interested variables",
x = "Values of each variable",
y = "Frequency")
```
The distribution of in-class performance reflected by class absences is skewed to right with very long tail. This indicates that most students have full or almost full attendance of class, while a small number of students might be absent for quite a number of classes. Other three variables are semi-numeric ordinal obtained by item with Likert-marks, with labeling value ranging from 1 to 5. It is found most Finnish students tend to be very social, with more around 2/3 of them being at the high end the choices. For off-class performance reflected by study time, it is found that most Finnish students spend 2~10 hours a week studying. For family relationship quality, it is surprising to find only roughly 1/3 of the students having good or very good family relationship quality, and none of them believe they have excellent family relationship quality.
## 3.2 exploring the association between faimly relationship quality and alcohol high-use
The variable "famrel" in original data set elicited quality of family relationships (numeric: from 1 - very bad to 5 - excellent). In the current analysis, it is selected as a candidate predictor for the model to reflect the same idea--quality of family relationship.
### 3.2.1 numerically explore the association
```{r}
alc %>% count(high_use, famrel)
```
It is found the absolute sample of participants with very bad (level 1) and/or bad (level 2) family quality is very small in number (n = 8). Caution should be taken about the potential large error.
### 3.2.2 graphically explore the association
```{r}
#adapt the titles for each wrapped graph
sex.labs <- c("Female", "Male")
names(sex.labs) <- c("F", "M")
#draw the bar plot
p1 <- alc %>%
ggplot(aes(x = factor(famrel), fill = high_use)) +
geom_bar(position = "fill", color ="black") +
facet_wrap(~sex, #warp by sex
labeller = labeller(sex = sex.labs)) + #label each
labs(x = "Family relationship quality (larger is better)",
y = "Proportion of high-user",
title =
"Proportion of alcohol high-use by family relationship quality and sex")+
theme(legend.position = "bottom")+ #adapt the legend position
guides(fill=guide_legend(title = "Alcohol high-use"))+ #define legend title
scale_fill_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))+ #define legend text
scale_fill_brewer(palette = "Greys") #define color theme
p1
```
The value of the variable "famrel" includes numbers from 1 - very bad to 5 - excellent. In the current study, I presume that the intervals between each consecutive pair of value is consistent, and hence see it as a numeric variable.
According to the bar plot of proportion, the hypothesis of using the current variable in model fitting is validated. It is found that with the increasing of family relationship quality, the proportion of alcohol high-use decreases, except for female from a very bad (level 1) relationship family, which had a proportion of high-users at zero. However, this low proportion suffers from a risk of error due to the small sample in the level (n = 8). The result should be interpreted with caution.
To facilitate understanding, the variable's name will be changed to family.quality according to the hypothesis.
### 3.2.3 re-code the variable of family relationship quality
```{r}
alc <- alc %>%
mutate(family.quality = famrel) #create a new variable family.quality
#it has the same value with famrel
```
## 3.3 exploring the association between school performance (absences) and alcohol high-use
The variable "studytime" in original data set captured participants' weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours). The variable "absences" in original data set captured participants' number of school absences (numeric: from 0 to 93). It is presumed in the current analysis that they reflect off-class and in-class school performance, respectively, and hence they are selected as candidate predictors.
### 3.3.1 exploring the association between in-class performance and alcohol high-use
### 3.3.1.1 numerically explore the association
```{r}
library(DT)
alc %>% group_by(high_use, sex) %>%
summarise(mean = mean(absences),
sd = sd(absences),
median = median(absences),
Q1 = quantile(absences, prob = 0.25),
Q3 = quantile(absences, prob = 0.75),
sampleSize = n()) %>%
datatable() %>%
formatRound(columns = c(3:4), digits = 2)
```
From the table, it is found that both the frequency and median(Q1,Q3) of class absences differed greatly between alcohol high-users and non-high-users, indicating its validity in entering the model.
### 3.3.1.2 graphically explore the association
```{r}
p2 <- alc %>%
ggplot(aes(x = high_use, y = absences, fill = high_use)) +
geom_boxplot() +
geom_jitter(width=0.25, alpha=0.5)+
facet_wrap(~sex, labeller = labeller(sex = sex.labs)) +
scale_fill_brewer(palette = "Blues")+
labs(x = "Alcohol high-user",
y = "Freuqncy of class absences",
title =
"Frequency of class absences by alcohol high-use and gender")+
theme(legend.position = "none")+
scale_x_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))
p2
```
The box plot showed similar information to the previous table. No noticeable difference in proportions of absences can be observed between genders, and hence their interaction would not be considered in fitting the model.
### 3.3.1.3 rename the variable
To facilitate understanding, the name of variable "absences" will be changed to in.class.performance according to the hypothesis of current study.
```{r}
alc <- alc %>%
mutate(in.class.performance = absences)
```
### 3.3.2 exploring the association between off-class performance (study time) and alcohol high-use
### 3.3.2.1 numerically explore the association
```{r}
alc %>% count(high_use, studytime)
```
From the table, it is found the sample of participants with long and very long (level 4 and 5) study time in alcohol high-user group is very small in number (n = 12). Caution should be taken about the potential large error.
### 3.3.2.2 graphically explore the association
```{r}
p3 <- alc %>%
ggplot(aes(x = factor(studytime), fill = high_use)) +
geom_bar(position = "fill", color = "black") +
facet_wrap(~sex,
labeller = labeller(sex = sex.labs)) +
labs(x = "Study time ranges (larger is longer)",
y = "Proportion of high-user",
title = "Proportion of alcohol high-use by study time ranges and sex")+
theme(legend.position = "bottom")+
guides(fill=guide_legend(title = "Alcohol high-use"))+
scale_fill_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))+
scale_fill_brewer(palette = "Greys")
p3
```
The levels of study time ranges include 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours. Their intervals are inconsistent, and hence it is not appropriate to enter a model as numeric variables, and it will be transformed into a categorical variable.
According to the bar plot of proportion, it is found that with the increasing of study time, the proportion of alcohol high-use decreases, indicating its validity in entering the model. However, male with long study time (level 3) is an exception, which had the lowest proportion of high-users across the levels. Notably, this low proportion suffers from a risk of error due to the small sample in the level. To address the risk of error, the levels of study time ranges will be re-coded as Long study(original level 3 + original level 4), Moderate study(original level 2) and Light study (original level 1). Besides, no noticeable difference in proportions of study length can be observed between genders, and hence their interaction would not be considered in fitting the model.
To facilitate understanding, the name of variable "studytime" will be changed to off.class.performance according to the hypothesis.
### 3.3.3 re-code the variable of study length
```{r}
alc <- alc %>%
mutate(off.class.performance =
case_when(studytime == 3 |studytime == 4~"Long study",
studytime == 2~"Moderate study",
studytime == 1~"Light study") %>%
factor(levels = c("Light study", "Moderate study", "Long study")))
```
## 3.4 exploring the association between social communication frequency and alcohol high-use
The variable "goout" in original data set captured participants' frequency of going out with friends (numeric: from 1 - very low to 5 - very high). It is presumed in the current analysis that it reflects social involvement, and hence it is selected as a candidate predictor.
### 3.4.1 numerically explore the association
```{r}
alc %>% count(high_use, goout)
```
From the table, it is found the sample of participants having very low frequency of social communication (level 1) is small in number (n = 21). Caution should be taken about the potential large error.
### 3.4.2 graphically explore the association
```{r}
p4 <- alc %>%
ggplot(aes(x = factor(goout), fill = high_use)) +
geom_bar(position = "fill", color = "black") +
facet_wrap(~sex,
labeller = labeller(sex = sex.labs)) +
labs(x = "Social Communication Frequency (larger is more frequent)",
y = "Proportion of alcohol high-users",
title =
"Proportion of social communication frequency ranges by alcohol high-use and sex")+
theme(legend.position = "bottom",
plot.title = element_text(size = 10))+
guides(fill=guide_legend(title = "Alcohol high-use"))+
scale_fill_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))+
scale_fill_brewer(palette = "Greys")
p4
```
According to the bar plot, the proportion of alcohol high-users changed tremendously across different levels of social communication, indicating good validity of our model hypothesis about this variable. There is a clear borderline between social communication levels 1-3 and levels 4-5, though the difference is varied across genders. The levels are hence re-coded into two--Infrequent (original level 1-3) and Frequent (original level 4+5). Its interaction with sex will also be considered in fitting the model. This corresponds to the finding of previous evidence[@Senchak1998].
### 3.4.3 re-code the variable of social communication
```{r}
alc <- alc %>%
mutate(social = goout>3)
alc <- alc %>%
mutate(social = social %>%
factor() %>%
fct_recode("Frequent" = "TRUE",
"Infrequent" = "FALSE"))
```
# 4 Model fitting
## 4.1 fitting base on the original hypothesis
```{r}
fit1 <- glm(high_use~ family.quality:sex + social:sex + off.class.performance + in.class.performance, data = alc, family = "binomial")
summary(fit1)
```
All of the hypothesized predictors have at least one level being significant in the model. Comparing to light study participants, moderate study participants is not significant in predicting alcohol high-use. Hence,this variable will be dichotomized into Light study and moderate to long study for better model performance and parsimony of levels. The reason why it is not dichotomized into long study and moderate to short study is because the sample of long study category is extremely small, risking introducing error in our model.
## 4.2 re-code variable with insignificant levels
```{r}
alc <- alc %>%
mutate(off.class.performance =
case_when(off.class.performance == "Light study"~ "Light study",
TRUE~ "Moderate to long study") %>%
factor(levels = c("Light study",
"Moderate to long study")))
```
## 4.3 fitting the model again
```{r}
fit2 <- glm(high_use~ family.quality:sex + social:sex + off.class.performance + in.class.performance, data = alc, family = "binomial")
summary(fit2)
```
Now all of the hypothesized predictors are significant in predicting alcohol high-use, except for off-class performance, which has a a _p_ value of 0.05489, being very close to 0.05. An increase in sample size would very possibly make it significant. I hence keep this predictor in the model. Consequently, fit2 will be our final model.
## 4.4 interpreting the model results
### 4.4.1 transforming the coeficients to ORs
```{r}
OR <- coef(fit2) %>% exp()
CI <- confint(fit2) %>% exp()
ORCI <- cbind(OR,CI)
print(ORCI, digits = 2)
```
Our hypothesis that *a.* family relationship quality (interactive with gender); *b.* school performance; *c.* social communication (interactive with gender) could be predictors for alcohol high-use among college students is justified. According to the final model, comparing to participants who study less than 5 hours per week, those who study more than 5 hours have on average 0.56 (95%CI: 0.31~1.02) times less odds to be an alcohol high-user (95%CI: 0.31~1.02). Participants who have one more time of absence from class will on average have 1.07 (95%CI: 1.03~1.13) times more odds being an alcohol high-user. These findings about the predictive effect of academic performance on alcohol use is consistent with previous evidence[@Hayatbakhsh2011].For female college students, every one unit of family relationship quality increase would lead to 0.67 (95%CI: 0.49~9.90) times less odds being alcohol high-user. For male students, every one unit of family relationship quality increase would lead to 0.71 (95%CI: 0.53~0.94) times less odds being alcohol high-user. These indicate the predictive effects of family relationship on alcohol use are present and different across genders. This finding is consistent with previous evidence [@Kelly2011]. For female college students, comparing to students who do not have social involvement frequently, those who usually have social engagement have 2.77 (95%CI: 12.36~5.84) times more odds of being alcohol high-users. For male students, this effect is also present but the effect size goes as high as 12.36 times more odds of being alcohol high-users. These indicate the predictive effects of social engagement on alcohol use are present and tremendously different across genders. This finding is consistent with previous evidence[@Senchak1998].
### 4.4.2 exploring predictions
#### 4.4.2.1 cross tabulation of predcition versus the actual values
```{r}
prob <- predict(fit2, type = "response")
alc <- alc %>%
mutate(probability = prob)
alc <- alc %>%
mutate(prediction = probability>0.5)
high_use <- alc$high_use %>%
factor(level = c("TRUE", "FALSE"))
prediction <- alc$prediction %>%
factor(level = c("TRUE", "FALSE"))
accuracy.table <- table(high_use = high_use, prediction = prediction)%>%
addmargins
accuracy.table
```
```{r}
#generate proportion of predictive performance
table(high_use = high_use, prediction = prediction) %>%
prop.table %>%
addmargins %>%
print(digits = 2)
```
```{r}
#generate a function that calculates some indicators for sensitivity and specificity
my.fun <- function(array){
TP <- array[1,1] #true positive
FN <- array[1,2] #false negative
FP <- array[2,1] #false positive
TN <- array[2,2] #true negative
PP <- array[3,1] #positive
PN <- array[3,2] #negative
P <- array[1,3] # positive
N <- array[2,3] # negative
PPV <- TP/PP #positive predictive value
FOR <- FN/PN #false omission rate
FDR <- FP/PP # false discovery rate
NPV <- TN/PN # negative predictive value
TPR <- TP/P #true positive rate
FPR <- FP/N #false positive rate
FNR <- FN/P #false negative rate
TNR <- TN/N #true negative rate
a <- paste("Positive predictive value is", round(PPV,2))
b <- paste("False omission rate is", round(FOR,2))
c <- paste("False discovery rate is", round(FDR,2))
d <- paste("Negative predictive value is", round(NPV,2))
e <- paste("True positive rate is", round(TPR,2))
f <- paste("False positive rate is", round(FPR,2))
g <- paste("False negative rate is", round(FNR,2))
h <- paste("True negative rate is", round(TNR,2))
output <- list(a,b,c,d,e,f,g,h)
return(output)
}
my.fun(accuracy.table)
```
Among 259 participants who are not alcohol high-users, our model correctly predicts 236 (91%) of them (True negative rate). Among 111 participants who are alcohol high-users, our model correctly predicts 54 of them (49%) of them (True positive rate). In all, among the 370 predicts, 80(21.6%) were inaccurate.
### 4.4.2.2 scatter plot of the prediction versus the actual values
```{r}
library(dplyr); library(ggplot2)
p5 <- alc %>%
ggplot(aes(x = probability,
y = high_use,
color = prediction,
shape =factor(probability>0.5))) +
geom_point(position = position_jitter(0.01),
alpha = 0.8, size =2)
p5
```
### 4.4.2.3 comparing the model to the performance of random guess
```{r}
random.guess <- runif(n= nrow(alc), min = 0, max = 1)
alc <- alc %>%
mutate(random.guess = random.guess)
alc <- alc %>%
mutate(prediction.guess = random.guess>0.5)
high_use = alc$high_use %>%
factor(levels = c("TRUE", "FALSE"))
prediction.guess = alc$prediction.guess %>%
factor(levels = c("TRUE", "FALSE"))
accuracy.tab.rand<- table(high_use = high_use, prediction = prediction.guess) %>%
addmargins()
accuracy.tab.rand
```
```{r}
my.fun(accuracy.tab.rand)
```
Among 259 participants who are not alcohol high-users, random guess correctly guesses 126 (57%) of them. Among 111 participants who are alcohol high-users, random guess correctly guesses 55 of them (47%) of them. In all, among the 370 predicts, 181(49%) were inaccurate. Our model shows a tremendously better overall performance than random guess. However, its effect on correctly predicting the alcohol high-users is roughly equal to random guess, indicating the model is better applied in predicting who is a non-alcohol-high-user.
## 4.5 cross validation (Bonus)
### 4.5.1 define loss function
```{r}
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
```
### 4.5.2 compute prediction error base on traing data set
```{r}
# compute the average number of wrong predictions in the (training) data
training.error.full <- loss_func(alc$high_use, alc$probability)
training.error.full
```
The prediction error rate is 21.6%, outperforming the model in Exercise Set 3, which had about 26% error.
### 4.5.3 compute prediction error base on 10-fold cross validation
```{r}
# 10-fold cross-validation
set.seed(16)
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = fit2, K = 10)
cross.val.error.full <- cv$delta[1]
cross.val.error.full
```
According to the result of 10 fold cross-validation, the model has an average error rate of 22.2%, a bit larger than the results from training model, but the error rate is still notably lower than the model in Exercise.
# 5 Observing the relationship between prediction error and number of predictors (Super Bonus)
## 5.1 preparation
```{r}
library(utils)#install.packages("utils") library for generating all possible combinations of n elements
#pass the name of 4 predictors for our final model into an object
used.predictor <- c("family.quality:sex",
"social:sex",
"off.class.performance",
"in.class.performance")
```
## 5.2 generating prediction error for all possible combinations of subsets of selected predictors
```{r}
#define a list "mylist"
mylist <- list()
#define a matrix with 2 rows and 6 columns, "ct.error", which means
#cross-validation and training error
ct.error <- matrix(nrow=2, ncol = 6)
#start a loop that generate all possible combinations of the 4 used predictors,
#the combination could have 1-4 elements. Four each number of element, start a
#loop (i in 1:4); Within the loop, another loop is used to pass all the prediction
#error results from cross validation and training data set into a matrix. Each
#Matrix will have two rows saving results of cv and training data set, respectively.
#The number of columns will be dependent on how many combinations will be produced,
#with the maximum number being 6 (number of possible combinations of 2 predictors).
#Base on the number of i, 4 matrices will be generated, and saved in mylist.
for(i in 1:4){
combinations <- combn(used.predictor, i)
all.formula.text <- apply(combinations, 2,
function(x)paste("high_use~",
paste(x, collapse = "+")))
for(j in 1:length(all.formula.text)){
all.formula <- as.formula(all.formula.text[j])
model <- glm(all.formula, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K =10)
ct.error[1,j] <- cv$delta[1]
alc <- mutate(alc, probability = predict(model, type = "response"))
ct.error[2,j] <- loss_func(alc$high_use, alc$probability)
}
mylist[[i]] <- ct.error
ct.error <- matrix(nrow=2, ncol = 6)
}
#collapse the 4 matrices in mylist into 4 data frames.
for(w in 1:4){
assign(paste0("df",w), as.data.frame(mylist[[w]]))
}
#merge the 4 data frames into 1 by row.
all.error <- rbind(df1,df2,df3,df4) #name the data set as all.error
#add a new column in all.error, which reflects if the result of this row is
#from cross validation or training set
tag <- rep(c("pred_cv", "pred_training"), times = 4)
#add another new column in all.error, which reflects if the result of this row is
#base on 1, 2, 3 or 4 predictors.
predictor_number <- rep(c(1,2,3,4), each = 2)
all.error <- all.error %>%
mutate(tag = tag,
predictor_number = predictor_number)
#calculate the mean and sd for each row.
#note that the rows base on 4 predictor will not have sd, since there is only
#one combination.
all.error <- all.error %>%
mutate(mean = rowMeans(select(.,V1:V6), na.rm = T),
sd = apply(.[,1:6], 1, function(x)sd(x, na.rm=T)))
#check the all.error data set
all.error
```
## 5.3 plotting the trends of training&validation prediction errors by different number of predictors
```{r}
#plot all.error
#the error ribbon is 95% confidence interval
#4 predictors (the fitted model) do now have a error range because there is only one combination
all.error %>% ggplot(aes(x = factor(predictor_number), y = mean, group = tag), color = tag) +
geom_line(aes(color = tag))+
geom_point()+
geom_ribbon(aes(ymin = mean-1.96*sd/sqrt(rowSums(!is.na(select(all.error,V1:V6)))),
ymax = mean+1.96*sd/sqrt(rowSums(!is.na(select(all.error,V1:V6)))),
fill = tag), alpha =0.25,
position = position_dodge(0.05))+
guides(fill = guide_legend(title = "Training/Cross-validation", title.position = "top"),
color = guide_legend(title = "Training/Cross-validation", title.position = "top"))+
theme_bw()+
theme(legend.position = c(0.82,0.85), axis.text.x = element_text(size=12),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(x = "", y = "Prediction Error",
title = "Trends of training&validation prediction errors by different number of predictors")+
scale_x_discrete(labels = c("1 predictor", "2 predictors",
"3 predictors",
"4 predictors \n(fitted model)"))+
scale_fill_discrete(labels = c("pred_cv" = "error base on cross-validation",
"pred_training" = "error base on training set"))+
scale_color_discrete(labels = c("pred_cv" = "error base on cross-validation",
"pred_training" = "error base on training set"))+
annotate("text",
label =
"↓4-predictor model does \nnot have CI since it only \nhas one type of combination",
x = 3.9, y = 0.24, color = "red")+
geom_point(aes(x=4, y= 0.22), shape = 1, size = 10, color = "red")
```
According to the plot, the prediction error of our 4-predictor final model is low comparing to the mean prediction errors of the possible combinations of either of the 1-,2-, and 3- predictor models. However, this goodness is only statistically significant comparing to 1- predictor models.(falls with 95%ci of 2- and 3- predictor models). This might be due to the possible combinations of 4 predictors is very small in number, resulting in large error ranges.
# Super Bonus: Continued
Inspired by the bonus task, where different <4 number of predictors' influence on prediction error was observed, I started to get interested in how different number of random combinations of predictors added to the final model would affect the error. There are almost 30 variables that were not used in the final model. Twenty-three of them do not have direct relationship with the entered predictors, and hence they were selected to be a free predictor pool. One to 15 different predictors were randomly selected from the pool, each with 100 random repetitions (if all possible combinations of the number of predictors <100, then all possible combination will be used), resulting in 1423 models. The error rate base on training dataset and 10 fold cross validation were computed and plotted in a line chart. 95% confidence interval for each number of added predictors were also calculated and visualized.
The reason why only 15 maximum added predictors will be used instead of all 23 predictors is because the current sample size could not faithfully support model with more than 19 predictors, according to a rule of thumb that for each predictor used in model, a sample of 20 is required.
**Preparing the predictor pool**
```{r}
# The predictors used in final 4-factor model
fixed.predictor <- c("family.quality:sex",
"social:sex",
"off.class.performance",
"in.class.performance")
# The variables not used in final model
not.used.predictor <- c("sex", "famsize", "studytime", "famrel", "Dalc",
"Walc", "G1", "G2", "G3", "alc_use", "high_use",
"family.quality", "social", "probability",
"prediction", "random.guess", "prediction.guess",
"goout")
#The set of free predictor pool
free.predictor<- setdiff(names(alc), fixed.predictor)
free.predictor<- setdiff(free.predictor, not.used.predictor)
```
**Building a loop that generates the result of 1423 models with 5~19 predictors (4 predictors in final model are fixed)**
```{r, cache = T}
mylist <- list()
ct.error <- matrix(nrow=2, ncol = 100)
for(i in 1:15){
combinations <- combn(free.predictor, i)
if(choose(23,i)>100){
ss = 100
}else{
ss = choose(23,i)
}
for(j in 1:ss){
rn <- round(runif(1,min = 1, max = choose(23,i)), 0)
sample.comb <- combinations[,rn]
formula.text <- paste(
"high_use ~ family.quality:sex + social:sex + off.class.performance + in.class.performance+",
paste(sample.comb, collapse = "+"))
model <- glm(formula.text, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K =10)
ct.error[1,j] <- cv$delta[1]
alc <- mutate(alc, probability = predict(model, type = "response"))
ct.error[2,j] <- loss_func(alc$high_use, alc$probability)
}
mylist[[i]] <- ct.error
ct.error <- matrix(nrow=2, ncol = 100)
}
```
**Collapsing the results into different data frame and merge them**
```{r}
for(w in 1:15){
assign(paste0("df",w), as.data.frame(mylist[[w]]))
}
all.error <- rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12,df13,df14,df15)
tag <- rep(c("pred_cv", "pred_training"), times = 15)
#add another new column in all.error, which reflects if the result of this row is
#base on 1-15 predictors.
predictor_number <- rep(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), each = 2)
all.error <- all.error %>%
mutate(tag = tag,
predictor_number = predictor_number)
#calculate the mean and sd for each row.
all.error <- all.error %>%
mutate(mean = rowMeans(select(.,V1:V100), na.rm = T),
sd = apply(.[,1:100], 1, function(x)sd(x, na.rm=T)))
```
**Plotting**
```{r}
#plot all.error
#the error ribbon is 95% confidence interval
#4 predictors (the fitted model) do now have a error range because there is only one combination
all.error %>% ggplot(aes(x = factor(predictor_number), y = mean, group = tag)) +
geom_line()+
geom_point()+
geom_ribbon(aes(ymin =
mean-1.96*sd/sqrt(rowSums(!is.na(select(all.error,V1:V100)))),
ymax =
mean+1.96*sd/sqrt(rowSums(!is.na(select(all.error,V1:V100)))),
fill = tag), alpha =0.25,
position = position_dodge(0.05))+
guides(fill = guide_legend(title = "Training/Cross-validation",
title.position = "top"),
color = guide_legend(title = "Training/Cross-validation",
title.position = "top"))+
theme_bw()+
theme(legend.position = c(0.2,0.15),
axis.text.x = element_text(size=12),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(x = "Number of random predictors added to final model(4 predictors)",
y = "Prediction Error",
title="Trends of training&validation error rates of final model plus \ndifferent number of random predictors")+
scale_fill_discrete(labels = c("pred_cv" = "error base on cross-validation",
"pred_training" = "error base on training set"))+
scale_color_discrete(labels = c("pred_cv" = "error base on cross-validation",
"pred_training" = "error base on training set"))+
geom_line(aes(y = 0.22), color = "coral", size = 0.2, alpha = 0.8)+
geom_line(aes(y = 0.21), color = "cyan3", size = 0.2, alpha =1)+
annotate("text",
label = "↓Error rate of final model base on cross validation",
x = 11, y = 0.22, vjust = -0.5)+
annotate("text",
label = "↓Error rate of final model base on training dataset",
x = 11, y = 0.21, vjust = -0.5)
```
It is found in the plot that the prediction error rate of the 4-predictor final model by cross validation is always lower than the mean prediction error (and their lower ends of confidence interval) of the final model plus 1 to 15 randomly selected predictors, indicating the goodness of our final model.
It is also interesting to observe that the more predictors introduced to the model, the error rate by training data set keeps decreasing, indicating more predictors produce better models. However, the results of error rate by cross validation show an opposite effect, where the error rates generally increase with more predictors (though some flucutations are present). Put together, it can be infered that measuring model error rates using training data set itself would lead to over-estimation of the model goodness when more predictors enter the model.
******
**Reference**