---
title: "**Introduction to Open Data Science, Exercise Set 3**"
subtitle: "**Logistic regression**"
output:
  html_document:
    theme: flatly
    highlight: haddock
    toc: true
    toc_depth: 2
    number_sections: false
---
This set consists of a few numbered exercises. Go to each exercise in turn and do as follows:

1. Read the brief description of the exercise.
2. Run the (possible) pre-exercise-code chunk.
3. Follow the instructions to fix the R code!
## 3.0 INSTALL THE REQUIRED PACKAGES FIRST!
One or more extra packages (in addition to `tidyverse`) will be needed below.
```{r}
# Select (with mouse or arrow keys) the install.packages("...") and
# run it (by Ctrl+Enter / Cmd+Enter):
# install.packages("boot")
# install.packages("readr")
```
## 3.1 More datasets
We will be combining, wrangling and analysing two new data sets retrieved from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html), a great source for open data.
The data are from two identical questionnaires related to secondary school student alcohol consumption in Portugal. Read about the data and the variables [here](https://archive.ics.uci.edu/ml/datasets/Student+Performance).
R offers the convenient `paste()` function which makes it easy to combine characters. Let's utilize it to get our hands on the data!
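For example, `paste()` inserts the value of its `sep` argument between the pieces it combines (a small illustration with made-up strings, not part of the exercise):
```
paste("folder", "file.csv", sep = "/")  # result: "folder/file.csv"
paste0("folder/", "file.csv")           # paste0() uses sep = "", same result
```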
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
url <- "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets"
```
### Instructions
- Create and print out the object `url_math`.
- Create object `math` by reading the math class questionnaire data from the web address defined in `url_math`.
- Create and print out `url_por`.
- Adjust the code: similarly to `url_math`, make `url_por` into a valid web address using `paste()` and the `url` object.
- Create object `por` by reading the Portuguese class questionnaire data from the web address defined in `url_por`.
- Print out the names of the columns in both data sets.
Hint:
- You can see the `paste()` function's help page with `?paste` or `help(paste)`
### R code
```{r}
# This is a code chunk in RStudio editor.
# Work with the exercise in this chunk, step-by-step. Fix the R code!
url <- "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/"
# web address for math class data
url_math <- paste(url, "student-mat.csv", sep = "/")
# print out the address
url_math
# read the math class questionnaire data into memory
math <- read.table(url_math, sep = ";", header = TRUE)
# web address for Portuguese class data
url_por <- paste(url, "student-por.csv", sep = "/")
# print out the address
url_por
# read the Portuguese class questionnaire data into memory
por <- read.table(url_por, sep = ";", header = TRUE)
# look at the column names of both data sets
colnames(math); colnames(por)
# check the dimensions of both data sets
dim(math); dim(por)
```
## 3.2 Joining two datasets
Our two datasets contain multiple students who answered both questionnaires. Unfortunately we do not have a single identification variable to identify these students, but we can use a combination of background variables for identification.
Combining two data sets is easy if the data have a mutual identifier column or if a combination of mutual columns can be used as identifiers. (That's not always the case.)
Here we'll use `inner_join()` function from the dplyr package to combine the data - remember the [dplyr etc. cheatsheets!](https://www.rstudio.com/resources/cheatsheets/). This means that we'll only keep the students who answered the questionnaire in both math and Portuguese classes.
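To see what `inner_join()` does, here is a minimal sketch with two made-up data frames (not part of the exercise):
```
d1 <- data.frame(id = c("a", "b", "c"), x = 1:3)
d2 <- data.frame(id = c("b", "c", "d"), y = 4:6)
# only the rows whose 'id' appears in BOTH data frames are kept
dplyr::inner_join(d1, d2, by = "id")
#   id x y
# 1  b 2 4
# 2  c 3 5
```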
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
math <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/student-mat.csv", sep=";", header=TRUE)
por <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/student-por.csv", sep=";", header=TRUE)
```
### Instructions
- Access the dplyr library and create the objects `free_cols` and `join_cols`:
- The first one (`free_cols`) will be a vector of the names of the six columns that vary in the data sets, namely: failures, paid, absences, G1, G2, and G3.
- The other one (`join_cols`) will be a vector of the names of the rest of the variables; use the handy `setdiff()` function to obtain that vector, based on `free_cols`.
- Adjust the code: define the argument `by` in the `inner_join()` function to join the `math` and `por` data frames. Use the columns defined in `join_cols`.
- Print out the column names of the joined data set.
- Adjust the code again: add the argument `suffix` to `inner_join()` and give it a vector of two strings: ".math" and ".por".
- Join the data sets again and print out the new column names.
- Use the `glimpse()` function (from dplyr) to look at the joined data. Which data types are present?
Hints:
- You can create a vector with `c()`. Comma will separate the values.
- Remember to use quotes when creating string or character objects.
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# math and por data sets are available
# access the dplyr package
library(dplyr)
# give the columns that vary in the two data sets
colnames(math); colnames(por)
free_cols <- c("failures", "paid", "absences", "G1", "G2", "G3")
# the rest of the columns are common identifiers used for joining the data sets
join_cols <- setdiff(colnames(por), free_cols)
join_cols
# join the two data sets by the selected identifiers
math_por <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))
# look at the column names of the joined data set
colnames(math_por)
# glimpse at the joined data set
glimpse(math_por)
```
## 3.3 The if-else structure
The `math_por` data frame now contains - in addition to the background variables used for joining `por` and `math` - two possibly different answers to the same questions for each student. To fix this, you'll use programming to combine these 'duplicated' answers by either:
- taking the rounded average (**if** the two variables are numeric)
- simply choosing the first answer (**else**).
You'll do this by using a combination of a `for`-loop and an `if`-`else` structure.
The `if` statement takes a single logical condition and performs an action only if that condition is true. `if` can then be combined with `else`, which handles the cases where the condition is false. Schematically:
```
if(condition) {
do something
} else {
do something else
}
```
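For instance, with a made-up numeric vector the first branch runs:
```
x <- c(1, 2, 3)
if(is.numeric(x)) {
  mean(x)   # this branch runs, since x is numeric
} else {
  x         # this branch would run otherwise
}
```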
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
library(dplyr)
math <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/student-mat.csv", sep=";", header=TRUE)
por <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/student-por.csv", sep=";", header=TRUE)
free_cols <- c("failures","paid","absences","G1","G2","G3")
join_cols <- setdiff(colnames(por), free_cols)
math_por <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))
```
### Instructions
- Print out the column names of `math_por` (use `colnames()`)
- Adjust the code: Create the data frame `alc` by selecting only the columns in `math_por` which were used for joining the two questionnaires. The names of those columns are available in the `join_cols` object.
- Print out the object that you created earlier for the columns that varied (`free_cols`).
- Execute the `for` loop (don't mind the "change me!").
- Take a `glimpse()` at the `alc` data frame. As you can see, it's not ready yet...
- Adjust the code inside the `for` loop: if the first of the two selected columns is not numeric, add the first column to the `alc` data frame.
- Execute the modified `for` loop and `glimpse()` at the new data again.
Hint:
- Inside the for-loop, use `first_col` in place of the "change me!".
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# dplyr, math_por, and join_cols are available
# print out the column names of 'math_por'
colnames(math_por)
# create a new data frame with only the joined columns
alc <- select(math_por, all_of(join_cols))
colnames(alc)
# print out the columns not used for joining (those that varied in the two data sets)
select(math_por, -all_of(join_cols))
# for every column name not used for joining...
for(col_name in free_cols) {
# select two columns from 'math_por' with the same original name
two_cols <- select(math_por, starts_with(col_name))
# select the first column vector of those two columns
first_col <- select(two_cols, 1)[[1]]
# then, enter the if-else structure!
# if that first column vector is numeric...
if(is.numeric(first_col)) {
# take a rounded average of each row of the two columns and
# add the resulting vector to the alc data frame
alc[col_name] <- round(rowMeans(two_cols))
} else { # else (if the first column vector was not numeric)...
# add the first column vector to the alc data frame
alc[col_name] <- first_col
}
}
# glimpse at the new combined data
glimpse(alc)
# a closer look at one example: the two 'paid' columns from the two questionnaires
two_cols <- select(math_por, starts_with("paid"))
two_cols
# ...and the first of them, which is what the loop keeps for non-numeric answers
first <- select(two_cols, 1)
first
```
## 3.4 Mutations
Mutating a data frame means adding new variables as mutations of the existing ones. The `mutate()` function is also from the dplyr package, which belongs to the tidyverse packages. The tidyverse includes several packages that work well together, such as dplyr and ggplot2.
The tidyverse functions have a lot of similarities. For example, the first argument of the tidyverse functions is usually `data`. They also have other consistent features which makes them work well together and easy to use.
Let's now create some new variables in the joined data set!
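As a minimal sketch of `mutate()` with a made-up data frame (not part of the exercise):
```
library(dplyr)
d <- data.frame(a = c(1, 2), b = c(3, 4))
# mutate() adds new columns computed from the existing ones
mutate(d, mean_ab = (a + b) / 2)
#   a b mean_ab
# 1 1 3       2
# 2 2 4       3
```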
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
library(dplyr)
math <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/student-mat.csv", sep=";", header=TRUE)
por <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/student-por.csv", sep=";", header=TRUE)
free_cols <- c("failures","paid","absences","G1","G2","G3")
join_cols <- setdiff(colnames(por), free_cols)
math_por <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))
alc <- select(math_por, all_of(join_cols))
for(col_name in free_cols) {
two_cols <- select(math_por, starts_with(col_name))
first_col <- select(two_cols, 1)[[1]]
if(is.numeric(first_col)) {
alc[col_name] <- round(rowMeans(two_cols))
} else {
alc[col_name] <- first_col
}
}
library(ggplot2)
```
### Instructions
- Mutate `alc` by creating the new column `alc_use` by averaging weekday and weekend alcohol consumption.
- Draw a bar plot of `alc_use`.
- Define a new aesthetic element for the bar plot of `alc_use` by defining `fill = sex`. Draw the plot again.
- Adjust the code: Mutate `alc` by creating a new column `high_use`, which is true if `alc_use` is greater than 2 and false otherwise.
- Initialize a ggplot object with `high_use` on the x-axis and then draw a bar plot.
- Add this element to the latter plot (using `+`): `facet_wrap("sex")`.
Hints:
- Use the `>` operator to test if the values of a vector are greater than some value.
- Add the argument `aes(x = high_use)` to the `ggplot()` function to initialize a plot with `high_use` on the x-axis.
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# alc is available
# access the tidyverse packages dplyr and ggplot2
library(dplyr); library(ggplot2)
# define a new column alc_use by combining weekday and weekend alcohol use
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
# initialize a plot of alcohol use
g1 <- ggplot(data = alc, aes(x = alc_use))
# define the plot as a bar plot and draw it
g1 + geom_bar()
# define a new logical column 'high_use'
alc <- mutate(alc, high_use = alc_use > 2)
# initialize a plot of 'high_use'
g2 <- ggplot(data = alc)
# draw a bar plot of high_use by sex
g2 + aes(x = factor(high_use, levels = c("TRUE", "FALSE"))) + geom_bar() +
  facet_wrap(~factor(sex, levels = c("M", "F")))
```
## 3.5 So many plots
You are probably curious to find out what the distributions of some of the other variables in the data look like. Well, why don't we visualize all of them!
You'll also meet another new tidyverse toy, the pipe-operator: `%>%`.
The pipe (`%>%`) takes the result produced on its left side and uses it as the first argument in the function on its right side. Since the first argument of the tidyverse functions is usually `data`, this allows for some cool chaining of commands.
We'll look at `%>%` more closely in the next exercise. But now, let's draw some plots.
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
```
### Instructions
- Access the tidyverse libraries tidyr, dplyr and ggplot2
- Take a glimpse at the `alc` data
- Apply the `gather()` function on `alc` and then take a glimpse at the resulting data directly after, utilizing the pipe (`%>%`). What does gather do?
- Take a more detailed look similarly, but using the `View()` function, and browse the data. Can you now see better what happens when you use the `gather()` function?
- Draw a plot of each variable in the `alc` data by first changing the values into names-value pairs and then visualizing them with ggplot. Define the plots as bar plots by adding the element `geom_bar()`, (using `+`).
Hint:
- Add the code `+ geom_bar()` to the line where the plots are drawn.
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# alc is available
# access the tidyverse libraries tidyr, dplyr, ggplot2
library(tidyr); library(dplyr); library(ggplot2)
# glimpse at the alc data
alc %>% glimpse()
# use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(alc) %>% glimpse()
# it may help to take a closer look with View() and browse the data
gather(alc) %>% View()
# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + geom_bar() + facet_wrap(~key, scales = "free")
```
## 3.6 The pipe: summarising by group
The pipe operator, `%>%`, takes the result of the left-hand side and uses it as the first argument of the function on the right-hand side. For example:
```
1:10 %>% mean() # result: 5.5
```
The parentheses of the 'target' function (here mean) can be dropped unless one wants to specify more arguments for it.
```
1:10 %>% mean # result: 5.5
```
Chaining operations with the pipe is great fun, so let's try it!
Utilizing the pipe, you'll apply the functions `group_by()` and `summarise()` to your data. The first one splits the data into groups according to a grouping variable (a factor, for example). The latter can be combined with any summary function, such as `mean()`, `min()` or `max()`, to summarize the data.
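Schematically, with a made-up data frame:
```
library(dplyr)
d <- data.frame(g = c("A", "A", "B"), v = c(1, 3, 5))
# split the rows into groups by 'g', then compute summaries within each group
d %>% group_by(g) %>% summarise(count = n(), mean_v = mean(v))
# g     count mean_v
# A         2      2
# B         1      5
```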
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
```
### Instructions
- Access the tidyverse libraries dplyr and ggplot2
- Execute the sample code to see the counts of males and females in the data
- Adjust the code to calculate means of the grades of the students: inside `summarise()`, after the definition of `count`, define `mean_grade` by using `mean()` on the variable `G3`.
- Adjust the code: After `sex`, add `high_use` as another grouping variable. Execute the code again.
Hints:
- Remember to separate inputs inside functions with a comma. Here the first input of `summarise` is `count`, and the second one should be `mean_grade`.
- Also separate the grouping variables with a comma.
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# alc is available
# access the tidyverse libraries dplyr and ggplot2
library(dplyr); library(ggplot2)
# produce summary statistics by group
alc %>% group_by(sex) %>% summarise(count = n())
# add the mean of the final grade G3 to the summary
alc %>% group_by(sex) %>% summarise(count = n(), mean_grade = mean(G3))
# group also by high_use
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
```
## 3.7 Box plots by groups
[Box plots](https://en.wikipedia.org/wiki/Box_plot) are an excellent way of displaying and comparing distributions. A box plot visualizes the 25th, 50th and 75th percentiles of a variable (the box), the typical range (the whiskers) and the outliers.
The whiskers extending from the box can be computed by several techniques. The default (in base R and ggplot) is to extend them to the most extreme data point that is no more than 1.5*IQR away from the box, where IQR is the interquartile range defined as
`IQR = 75th percentile - 25th percentile`
Values outside the whiskers can be considered outliers: unusually distant observations. For more information on IQR, see [Wikipedia](https://en.wikipedia.org/wiki/Interquartile_range), for example.
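As a quick sketch of the whisker rule on a made-up vector:
```
x <- c(1, 2, 3, 4, 5, 6, 7, 100)   # 100 is unusually distant
q <- quantile(x, c(0.25, 0.75))    # the box: 25th and 75th percentiles
iqr <- q[2] - q[1]                 # same as IQR(x)
upper_limit <- q[2] + 1.5 * iqr    # the upper whisker reaches at most this far
x[x > upper_limit]                 # 100 falls outside -> drawn as an outlier
```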
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
```
### Instructions
- Initialize a plot of student grades (`G3`), with `high_use` grouping the grade distributions on the x-axis. Draw the plot as a box plot.
- Add an aesthetic element to the plot by defining `col = sex` inside `aes()`
- Define a similar (box) plot of the variable `absences` grouped by `high_use` on the x-axis and the aesthetic `col = sex`.
- Add a main title to the last plot with `ggtitle("title here")`. Use "Student absences by alcohol consumption and sex" as a title, for example.
- Does high use of alcohol have a connection to school absences?
Hints:
- In ggplot, you can add stuff to the initialized plot with the `+` operator, e.g. `+ ylab("text here")`. Same goes for titles.
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
library(ggplot2)
# initialize a plot of high_use and G3, with sex as colour
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
# define the plot as a box plot and draw it
g1 + geom_boxplot() + ylab("grade")
# initialize a plot of high_use and absences, with sex as colour
g2 <- alc %>% ggplot(aes(x = high_use, y = absences, col = sex))
# define the plot as a box plot and draw it, with a title
g2 + geom_boxplot() + ylab("absences from school") +
  ggtitle("Student absences by alcohol consumption and sex")
```
## 3.8 Learning a logistic regression model
We will now use [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) to identify factors related to higher than average student alcohol consumption. You will also try to predict which students consume high amounts of alcohol, using background variables and school performance.
Because logistic regression can be used to classify observations into one of two groups (by giving the group probability), it is a [binary classification](https://en.wikipedia.org/wiki/Binary_classification) method. You will meet more classification methods in next week's exercises.
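With explanatory variables $x_1, \ldots, x_k$, the logistic regression model relates the probability $p$ of belonging to the target group (here: being a high user of alcohol) to the predictors through the log of odds:
$$\log\left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$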
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
library(dplyr)
```
### Instructions
- Use `glm()` to fit a logistic regression model with `high_use` as the target variable and `failures` and `absences` as the predictors.
- Print out a summary of the model.
- Add another explanatory variable to the model after absences: 'sex'. Repeat the above.
- Use `coef()` on the model object to print out the coefficients of the model.
Hint:
- Use the `summary()` function to print out a summary.
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# alc is available
# find the model with glm()
m <- glm(high_use ~ failures + absences, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
# print out the coefficients of the model (two equivalent ways)
m$coefficients
coef(m)
# exponentiating the coefficients gives odds ratios (more on these in the next exercise)
exp(coef(m))
```
## 3.9 From coefficients to odds ratios
From the fact that the computational target variable in the logistic regression model is the log of odds, it follows that applying the exponent function to the modeled values gives the odds:
$$\exp \left( \log\left( \frac{p}{1 - p} \right) \right) = \frac{p}{1 - p}.$$
For this reason, the exponents of the coefficients of a logistic regression model can be interpreted as odds ratios: the ratio of the odds after a unit change in the corresponding explanatory variable to the odds before the change.
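More precisely, if $\beta_1$ is the coefficient of $x_1$, then
$$\exp(\beta_1) = \frac{\text{odds}(x_1 + 1)}{\text{odds}(x_1)},$$
the ratio of the odds after a one-unit increase in $x_1$ to the odds before it, holding the other predictors fixed.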
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
library(dplyr)
```
### Instructions
- Use `glm()` to fit a logistic regression model.
- Create the object `OR`: use `coef()` on the model object to extract the coefficients of the model and then apply the `exp` function to the coefficients.
- Use `confint()` on the model object to compute confidence intervals for the coefficients. Exponentiate the values and assign the results to the object `CI`. (R does this quite fast, despite the "Waiting.." message)
- Combine and print out the odds ratios and their confidence intervals. Which predictor has the widest interval? Does any of the intervals contain 1 and why would that matter?
Hints:
- You can get the confidence intervals with `confint(*model_object*)`
- The logistic regression model is saved in the object `m`.
- `coef(m) %>% exp` is the same as `exp( coef(m) )`
- You get odds ratios by exponentiating the logistic regression coefficients.
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# alc is available
# find the model with glm()
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# compute odds ratios (OR)
OR <- coef(m) %>% exp
OR
# compute confidence intervals (CI) for the coefficients and exponentiate them to the odds-ratio scale
CI <- exp(confint(m))
CI
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
```
## 3.10 Binary predictions (1)
Once you have a model, you can make predictions. A very basic question is, of course, how well does our model actually predict the target variable? Let's take a look!
The `predict()` function can be used to make predictions with a model object. If `predict()` is not given any new data, it will use the data used for finding (fitting, learning, training) the model to make predictions.
In the case of a binary response variable, the 'type' argument of `predict()` can be used to get the predictions as probabilities (instead of log of odds, the default).
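In other words, the default (`type = "link"`) gives predictions on the log-odds scale, and `type = "response"` transforms them through the inverse logit. A small sketch (assuming a fitted binomial model `m`, as in the chunk below):
```
head(predict(m))                     # log-odds (the default, type = "link")
head(predict(m, type = "response"))  # probabilities
head(plogis(predict(m)))             # plogis() is the inverse logit: same as above
```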
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
library(dplyr)
```
### Instructions
- Fit the logistic regression model with `glm()`.
- Create object `probabilities` by using `predict()` on the model object.
- Mutate the alc data: add a column 'probability' with the predicted probabilities.
- Mutate the data again: add a column 'prediction' which is true if the value of 'probability' is greater than 0.5.
- Look at the last ten observations of the data, along with the predictions.
- Use `table()` to create a cross table of the columns 'high_use' versus 'prediction' in `alc`. This is sometimes called a 'confusion matrix'.
Hints:
- Remember to use the [dplyr cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) by RStudio
- The `$` sign can be used to access columns of a data frame
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# alc is available
# fit the model
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
```
## 3.11 Binary predictions (2)
Let's continue to explore the predictions of our model.
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
library(dplyr)
m <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
```
### Instructions
- Initialize the ggplot object and define `probability` as the x axis and `high_use` as the y axis.
- Use `geom_point()` to draw the plot.
- Add the aesthetic element `col = prediction` and draw the plot again.
- Use `table()` to create a cross table of 'high_use' versus 'prediction'
- Adjust the code: Use `%>%` to apply the `prop.table()` function on the output of `table()`
- Adjust the code: Use `%>%` to apply the `addmargins()` function on the output of `prop.table()`
Hint:
- Recall that the pipe (`%>%`) passes the output of its left side as the first argument to the function on its right side. The idea is to chain the three commands `table`, `prop.table` and `addmargins` (in that order).
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# alc is available
# access dplyr and ggplot2
library(dplyr); library(ggplot2)
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use))
# define the geom as points and draw the plot
g + aes(col = prediction, shape = prediction) + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
# the same table as proportions, with margins added
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
```
## 3.12 Accuracy and loss functions
A simple measure of performance in binary classification is accuracy: the proportion of correctly classified observations.
Classification methods such as logistic regression aim to (approximately) minimize the number of incorrectly classified observations. The proportion of incorrectly classified observations can be thought of as a penalty (loss) function for the classifier. Less penalty = good.
Since we know how to make predictions with our model, we can also compute this proportion of incorrect predictions.
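As a tiny made-up example of what such a loss function computes:
```
class <- c(1, 1, 0, 0)          # true classes
prob <- c(0.9, 0.3, 0.2, 0.6)   # predicted probabilities
# the prediction is 'wrong' when it differs from the truth by more than 0.5,
# which happens here for observations 2 and 4
mean(abs(class - prob) > 0.5)   # result: 0.5
```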
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
library(dplyr)
m <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
```
### Instructions
- Define the loss function `loss_func`
- Execute the call to the loss function with `prob = 0`, meaning you define the probability of `high_use` as zero for each individual. What is the interpretation of the resulting proportion?
- Adjust the code: change the `prob` argument in the loss function to `prob = 1`. What kind of prediction does this correspond to? What is the interpretation of the resulting proportion?
- Adjust the code again: change the `prob` argument by giving it the prediction probabilities in `alc` (the column `probability`). What is the interpretation of the resulting proportion?
Hints:
- Select the whole code of `loss_func` to execute it.
- You can access the `probability` column in `alc` by using the `$`-mark.
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# the logistic regression model m and dataset alc with predictions are available
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func with prob = 0: the proportion of wrong predictions when 'high use' is predicted for no one
loss_func(class = alc$high_use, prob = 0)
# ...and with prob = 1: the proportion of wrong predictions when 'high use' is predicted for everyone
loss_func(class = alc$high_use, prob = 1)
# call loss_func with the predicted probabilities to get the mean prediction error in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
```
## 3.13 Cross-validation
[Cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) is a method of testing a predictive model on unseen data. In cross-validation, the value of a penalty (loss) function (mean prediction error) is computed on data not used for finding the model. Low value = good.
Cross-validation gives a good estimate of the actual predictive power of the model. It can also be used to compare different models or classification methods.
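Conceptually, K-fold cross-validation does something like the following hand-written sketch (assuming `alc` and the loss function `loss_func` from the chunk below; the `cv.glm()` function used in the exercise handles all of this for you):
```
# assign each row to one of 10 folds at random
folds <- sample(rep(1:10, length.out = nrow(alc)))
errors <- sapply(1:10, function(k) {
  train <- alc[folds != k, ]   # fit on everything except fold k
  test <- alc[folds == k, ]    # ...and test on fold k
  m_k <- glm(high_use ~ sex + failures + absences, data = train, family = "binomial")
  p_k <- predict(m_k, newdata = test, type = "response")
  loss_func(class = test$high_use, prob = p_k)
})
mean(errors)                   # the cross-validated mean prediction error
```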
```{r, echo=FALSE}
# Pre-exercise-code (Run this code chunk first! Do NOT edit it.)
# Click the green arrow ("Run Current Chunk") in the upper-right corner of this chunk. This will initialize the R objects needed in the exercise. Then move to Instructions of the exercise to start working.
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
library(dplyr)
m <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
```
### Instructions
- Define the loss function `loss_func` and compute the mean prediction error for the training data: The `high_use` column in `alc` is the target and the `probability` column has the predictions.
- Perform leave-one-out cross-validation and print out the mean prediction error for the testing data. (`nrow(alc)` gives the observation count in `alc` and using `K = nrow(alc)` defines the leave-one-out method. The `cv.glm` function from the 'boot' library computes the error and stores it in `delta`. See `?cv.glm` for more information.)
- Adjust the code: Perform 10-fold cross validation. Print out the mean prediction error for the testing data. Is the prediction error higher or lower on the testing data compared to the training data? Why?
Hint:
- The `K` argument in `cv.glm` tells how many folds you will have.
### R code
```{r}
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# the logistic regression model m and dataset alc (with predictions) are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
# K-fold cross-validation with the 'boot' library
library(boot)
# leave-one-out cross-validation: as many folds as there are observations
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
# average number of wrong predictions in the cross-validation
cv$delta[1]
# 10-fold cross-validation
cv10 <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv10$delta[1]
```
**GOOD JOB!**