# Classification
**Learning objectives:**
- Compare and contrast **classification** with linear regression.
- Perform classification using **logistic regression**.
- Perform classification using **linear discriminant analysis (LDA)**.
- Perform classification using **quadratic discriminant analysis (QDA)**.
- Perform classification using **naive Bayes**.
- Identify the **strengths and weaknesses** of the various classification models.
- Model count data using **Poisson regression**.
## An Overview of Classification
- **Classification**: approaches for making inferences about and/or predicting a qualitative (categorical) response variable
- A few common classification techniques (classifiers):
- logistic regression
- linear discriminant analysis (LDA)
- quadratic discriminant analysis (QDA)
- naive Bayes
- K-nearest neighbors
<br>
- **Examples of classification problems:**
<br>
1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
- Predictor variable: Symptoms
- Response variable: Type of medical conditions
<br>
2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
- Predictor variable: User's IP address, past transaction history, etc
- Response variable: Fraudulent activity (Yes/No)
<br>
3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
- Predictor variable: DNA sequence data
- Response variable: Presence of deleterious gene (Yes/No)
<br>
- In the following section, we explore the `Default` dataset. Annual income ($X_1$ = `income`) and monthly credit card balance ($X_2$ = `balance`) are used to predict whether an individual will default on his or her credit card payment.
```{r fig4-1, cache=FALSE, echo=FALSE, fig.align="center", fig.cap="The distribution of balance and income split by the binary default variable respectively; Note. Defaulters represented as orange plus sign; non-defaulters represented as blue circle"}
knitr::include_graphics("./images/fig4_1.jpg", error = FALSE)
```
## Why NOT Linear Regression?
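- there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression; a coding such as the following implies an ordering of the outcomes, and equal spacing between them, that is usually not justified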
$$Y = \left\{ \begin{array}{ll}
1 & \mbox{if stroke};\\
2 & \mbox{if epileptic seizure};\\
3 & \mbox{if drug overdose}.\end{array} \right.$$
- even with just two classes (coded 0/1), a regression method will not provide meaningful estimates of $Pr(Y|X)$: some of the estimates might fall outside the [0, 1] probability interval
```{r fig4-2, cache=FALSE, eval=FALSE, echo=FALSE, fig.align="center", fig.cap="Classification using the Default data. Left: Estimated probability of default using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for default(No or Yes). Right: Predicted probabilities of default using logistic regression. All probabilities lie between 0 and 1."}
knitr::include_graphics("./images/fig4_2.jpg", error = FALSE)
```
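The second point can be checked numerically. The chunk below is a minimal sketch on simulated data: `balance` and `default` are made-up stand-ins that borrow the variable names of the `Default` data, and the coefficients used to simulate them are assumptions rather than estimates from the real dataset.

```{r}
# Simulated stand-in for the Default data (assumption: not the real dataset).
set.seed(1)
n <- 1000
balance <- runif(n, 0, 2500)
# The true default probability follows a logistic curve in balance.
p_true <- plogis(-10 + 0.0055 * balance)
default <- rbinom(n, size = 1, prob = p_true)

# Linear regression on the 0/1 response versus logistic regression.
fit_lm <- lm(default ~ balance)
fit_glm <- glm(default ~ balance, family = binomial)

p_lm <- fitted(fit_lm)
p_glm <- fitted(fit_glm)

range(p_lm)   # the least squares fit can leave the [0, 1] interval
range(p_glm)  # the logistic fit stays strictly between 0 and 1
```

In this simulation the least squares line gives negative "probabilities" for small balances, while the logistic fit cannot.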
## Logistic Regression
- **Logistic regression**: models the probability that Y belongs to a particular category, given X
- Y is binary (0/1)
$$p(X) = β_0 + β_1X \space \Longrightarrow {Linear \space regression}$$
$$p (X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \space \Longrightarrow {Logistic \space function}$$
$$odds = \frac{p(X)}{1- p(X)} = e^{\beta_{0} + \beta_{1}X} \Longrightarrow {odds \space value \space [0, ∞)}$$
By taking the logarithm of both sides, we get
$$\log \biggl(\frac{p(X)}{1- p(X)}\biggr) = \beta_{0} + \beta_{1}X \Longrightarrow {log \space odds/logit}$$
To estimate the regression coefficients, we use **maximum likelihood (ML)**.
***Likelihood Function***
$$ℓ (\beta_{0}, \beta_{1}) = \prod_{i: y_{i}= 1} p (x_i) \prod_{i': y_{i'}= 0} (1- p (x_{i'})) \Longrightarrow {Likelihood \space function}$$
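As a rough check of the likelihood function, the sketch below evaluates its logarithm by hand at the `glm()` estimates and compares the result with `logLik()`. The data are simulated, and `log_lik()` is a helper written only for this illustration.

```{r}
set.seed(2)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))

fit <- glm(y ~ x, family = binomial)

# Log of the likelihood: sum of log p(x_i) over y_i = 1
# plus sum of log(1 - p(x_i)) over y_i = 0.
log_lik <- function(beta, x, y) {
  p <- plogis(beta[1] + beta[2] * x)
  sum(log(p[y == 1])) + sum(log(1 - p[y == 0]))
}

ll_hat <- log_lik(coef(fit), x, y)
ll_glm <- as.numeric(logLik(fit))
c(by_hand = ll_hat, from_glm = ll_glm)

# Moving away from the estimates lowers the log-likelihood.
ll_shifted <- log_lik(coef(fit) + c(0.1, -0.1), x, y)
```

The two log-likelihood values agree, and the shifted coefficients give a smaller value, which is what maximizing the likelihood means.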
### Multiple Logistic Regression
$$\log \biggl(\frac{p(X)}{1- p(X)}\biggr) = \beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p \\ \Downarrow \\ p(X) = \frac{e^{\beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p}}{1 + e^{\beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p}}$$
```{r fig4-3, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Confounding in the Default data. Left: Default rates are shown for students (orange) and non-students (blue). The solid lines display default rate as a function of balance, while the horizontal broken lines display the overall default rates. Right: Boxplots of balance for students (orange) and non-students (blue) are shown."}
knitr::include_graphics("./images/fig4_3.jpg", error = FALSE)
```
### Multinomial Logistic Regression
- This is used in settings with K > 2 classes. In multinomial logistic regression, we select a single class to serve as the baseline.
- However, the interpretation of the coefficients in a multinomial logistic regression model must be done with care, since it is tied to the choice of baseline.
- Alternatively, you can use _softmax coding_, where we _treat all K classes symmetrically_ rather than selecting a baseline, and assume that for k = 1, . . . , K, $$Pr(Y = k|X = x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + ... + \beta_{kp}x_p}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_{l1}x_1 + ... + \beta_{lp}x_p}}$$ This means we estimate coefficients for all K classes, rather than for K − 1 classes.
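The relationship between the two codings can be sketched in a few lines. The coefficients below are made-up values for K = 3 classes and a single predictor, and `softmax()` is a small helper defined here, not a function from a package.

```{r}
# Softmax: turn K linear scores into K probabilities that sum to 1.
softmax <- function(scores) {
  z <- exp(scores - max(scores))  # subtract the max for numerical stability
  z / sum(z)
}

# Hypothetical coefficients for K = 3 classes and one predictor.
beta0 <- c(stroke = 0.5, seizure = -1.0, overdose = 0.2)
beta1 <- c(stroke = 1.5, seizure = 0.3, overdose = -0.8)
x <- 0.7

p_softmax <- softmax(beta0 + beta1 * x)

# Baseline coding: subtract the baseline class's coefficients from every class,
# so that the baseline ("overdose") has all coefficients equal to 0.
b0_base <- beta0 - beta0["overdose"]
b1_base <- beta1 - beta1["overdose"]
p_baseline <- softmax(b0_base + b1_base * x)

# Same fitted probabilities; only the coefficients (and their meaning) differ.
all.equal(p_softmax, p_baseline)
```

This is why predictions do not depend on the choice of baseline, while the interpretation of individual coefficients does.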
## Generative Models for Classification
**Why is logistic regression not always ideal?**
- When there is substantial separation between the two classes, the
parameter estimates for the logistic regression model are surprisingly
unstable.
- If the distribution of the predictors X is approximately normal in
each of the classes and the sample size is small, then the generative modelling may be more accurate than logistic regression.
- Generative modelling can be naturally extended to the case
of more than two response classes.
<br>
**Common notations:**
<br>
- K $\Longrightarrow$ number of response classes
- $π_k \Longrightarrow$ overall or _prior_ probability that a randomly chosen observation comes from the kth class; can be estimated from a random sample from the population
- $f_k(X) ≡ Pr(X|Y = k) \Longrightarrow$ the _density function_ of X for an observation that comes from the kth class; requires some underlying assumption to estimate
<br>
Bayes’ theorem states that
$$Pr(Y = k|X = x) = \frac {π_k f_k(x)}{\sum_{l =1}^{K} π_lf_l(x)}$$
- $p_k(x) = Pr(Y = k|X = x) \Longrightarrow$ _posterior probability_ that an observation X = x belongs to the kth class; computed from $π_k$ and $f_k(x)$
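Bayes' theorem translates directly into code. The sketch below assumes a toy set-up with two classes, normal class densities, and unequal priors; `posterior()` is a helper written for this illustration and all numbers are made up.

```{r}
# Posterior class probabilities from priors pi_k and class densities f_k(x).
posterior <- function(x, prior, dens) {
  # dens: list of K functions, each returning f_k(x)
  num <- mapply(function(p, f) p * f(x), prior, dens)
  num / sum(num)
}

# Assumed toy set-up: two classes with normal densities and unequal priors.
prior <- c(0.7, 0.3)
dens <- list(
  function(x) dnorm(x, mean = -1, sd = 1),
  function(x) dnorm(x, mean = 1.5, sd = 1)
)

post <- posterior(0.5, prior, dens)
post
which.max(post)  # class with the largest posterior probability
```

At x = 0.5 the density of class 2 is higher, but the larger prior of class 1 still gives class 1 the larger posterior probability.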
## A Comparison of Classification Methods
Each of the classifiers below uses different estimates of $f_k(x)$.
- linear discriminant analysis;
- quadratic discriminant analysis;
- naive Bayes
### Linear Discriminant Analysis for p = 1
- one predictor
- classify an observation to the class for which $p_k(x)$ is greatest
**Assumptions:**
- we assume that $f_k(x)$ is normal or Gaussian with a class-specific mean, and
- a shared variance term across all K classes [$σ^2_1 = · · · = σ^2_K$]
The normal density takes the form
$$f_k(x) = \frac{1}{\sqrt{2π}σ_k}\exp\left(- \frac{1}{2σ^2_k}(x- \mu_k)^2\right)$$
Then, writing $σ^2$ for the shared variance, the posterior probability (probability that the observation belongs to the kth class, given the predictor value for that observation) is
$$p_k(x) = \frac{π_k \frac{1}{\sqrt{2π}σ}\exp\left(- \frac{1}{2σ^2}(x- \mu_k)^2\right)}{\sum^K_{l=1} π_l \frac{1}{\sqrt{2π}σ}\exp\left(- \frac{1}{2σ^2}(x- \mu_l)^2\right)}$$
**Additional mathematical formula**
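After you take the log of the above equation and rearrange, you will get the following formula: assigning an observation to the class with the largest $p_k(x)$ is equivalent to assigning it to the class for which $δ_k(x)$ is largest.
$$δ_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(π_k) \Longrightarrow {Equation \space 4.18}$$
For example, if K = 2 and $π_1 = π_2$, the Bayes classifier assigns an observation to class 1 if $2x (μ_1 − μ_2) > μ_1^2 − μ_2^2$, and to class 2 otherwise.
The Bayes decision boundary is the point for which $δ_1(x) = δ_2(x)$
$$x = \frac{μ_1^2 − μ_2^2}{2(μ_1 − μ_2)} = \frac{μ_1 + μ_2}{2}$$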
```{r fig4-4, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Left: Two one-dimensional normal density functions are shown. The dashed vertical line represents the Bayes decision boundary. Right: 20 observations were drawn from each of the two classes, and are shown as histograms. The Bayes decision boundary is again shown as a dashed vertical line. The solid vertical line represents the LDA decision boundary estimated from the training data."}
knitr::include_graphics("./images/fig4_4.jpg", error = FALSE)
```
The **linear discriminant analysis (LDA)** method approximates the Bayes classifier by plugging estimates for $π_k$, $μ_k$, and $σ^2$ into equation 4.18.
$\hat μ_k$ is the average of all the training observations from the kth class
$$\hat{\mu}_{k} = \frac{1}{n_{k}}\sum_{i: y_{i}= k} x_{i}$$
$\hat σ^2$ is the weighted average of the sample variances for each of the K classes
$$\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k = 1}^{K} \sum_{i: y_{i}= k} (x_{i} - \hat{\mu}_{k})^2$$
Note.
n = total number of training observations,
$n_k$ = number of training observations in the kth class
$π_k$ is estimated by the proportion of the training observations that belong to the kth class:
$\hat π_k = \frac{n_k}{n}$
The LDA classifier assigns an observation X = x to the class for which $\hat δ_k(x)$ is largest.
$$δ_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(π_k) \Longrightarrow {Equation \space 4.18} \\ \Downarrow \\ \hat δ_k(x) = x \cdot \frac{\hat \mu_k}{\hat \sigma^2} - \frac{\hat \mu_k^2}{2\hat \sigma^2} + \log(\hat π_k)$$
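The plug-in estimates are simple enough to compute by hand. The sketch below uses simulated data with two classes and a shared standard deviation (the means −1.25 and 1.25 are assumptions chosen for the illustration), and `delta_hat()` is a helper defined here.

```{r}
# Simulated training data (assumption): two classes, shared standard deviation.
set.seed(3)
n_k <- c(50, 50)
y <- rep(1:2, times = n_k)
x <- rnorm(sum(n_k), mean = c(-1.25, 1.25)[y], sd = 1)

n <- length(x)
K <- 2
pi_hat <- as.numeric(table(y)) / n                 # class proportions
mu_hat <- as.numeric(tapply(x, y, mean))           # class means
sigma2_hat <- sum((x - mu_hat[y])^2) / (n - K)     # pooled variance

# Estimated discriminant for class k at the point(s) x0.
delta_hat <- function(x0, k) {
  x0 * mu_hat[k] / sigma2_hat - mu_hat[k]^2 / (2 * sigma2_hat) + log(pi_hat[k])
}

# Assign each observation to the class with the larger discriminant.
pred <- ifelse(delta_hat(x, 1) > delta_hat(x, 2), 1, 2)
mean(pred != y)  # training error rate

# With equal estimated priors, the LDA boundary is the midpoint of the means.
boundary <- mean(mu_hat)
```

Because the two classes have the same number of observations here, the estimated priors are equal and the two discriminants coincide at `boundary`.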
### Linear Discriminant Analysis for p > 1
- multiple predictors; p > 1 predictors
- observations come from a multivariate Gaussian (or multivariate normal) distribution, with a **class-specific mean vector** and a common **covariance matrix**: $X ∼ N(μ_k,Σ)$
**Assumptions:**
- each individual predictor follows a one-dimensional normal distribution, with predictors having some correlation
```{r fig4-5, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Two multivariate Gaussian density functions are shown, with p = 2. Left: The two predictors are uncorrelated, with Var(X_1) = Var(X_2) and Cor(X_1,X_2) = 0, and the density has a circular base; Right: The two predictors have a correlation of 0.7, and the density has an elliptical base"}
knitr::include_graphics("./images/fig4_5.jpg", error = FALSE)
```
The multivariate Gaussian density is defined as:
$$f(x) = \frac{1}{(2π)^{\frac{p}{2}}|Σ|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x - \mu)^T Σ^{-1}(x - μ)\right)$$
The Bayes classifier assigns an observation X = x to the class for which $δ_k(x)$ is largest.
$$δ_k(x) = x^T Σ^{-1}μ_k - \frac{1}{2}μ_k^T Σ^{-1} μ_k + \log π_k \Longrightarrow vector/matrix \space version \\ δ_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(π_k) \Longrightarrow {Equation \space 4.18}$$
```{r fig4-6, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "An example with three classes. The observations from each class are drawn from a multivariate Gaussian distribution with p = 2, with a class-specific mean vector and a common covariance matrix. Left: Ellipses that contain 95% of the probability for each of the three classes are shown. The dashed lines are the Bayes decision boundaries. Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are indicated using solid black lines. The Bayes decision boundaries are once again shown as dashed lines. Overall, the LDA decision boundaries are pretty close to the Bayes decision boundaries, shown again as dashed lines. The test error rates for the Bayes and LDA classifiers are 0.0746 and 0.0770, respectively."}
knitr::include_graphics("./images/fig4_6.jpg", error = FALSE)
```
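A three-class LDA fit can be sketched with matrix algebra alone. The chunk below uses the built-in `iris` data (p = 4, K = 3) purely as a convenient example, which is a choice made here and not a dataset used in the chapter; it plugs estimates into the vector/matrix discriminant rather than calling a modelling package.

```{r}
# Built-in iris data: p = 4 predictors, K = 3 classes.
X <- as.matrix(iris[, 1:4])
y <- iris$Species
n <- nrow(X)
K <- nlevels(y)

pi_hat <- as.numeric(table(y)) / n
# Class mean vectors, one row per class.
mu_hat <- t(sapply(levels(y), function(k) colMeans(X[y == k, , drop = FALSE])))
# Pooled (common) covariance matrix with divisor n - K.
centered <- X - mu_hat[as.character(y), ]
Sigma_hat <- crossprod(centered) / (n - K)
Sigma_inv <- solve(Sigma_hat)

# delta_k(x) = x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log(pi_k), for all x and k.
lin <- X %*% Sigma_inv %*% t(mu_hat)                  # n x K matrix
const <- -0.5 * diag(mu_hat %*% Sigma_inv %*% t(mu_hat)) + log(pi_hat)
delta <- sweep(lin, 2, const, "+")

pred <- factor(levels(y)[max.col(delta, ties.method = "first")],
               levels = levels(y))
table(Predicted = pred, True = y)  # confusion matrix on the training data
mean(pred != y)                    # training error rate
```

The error rate printed here is a training error rate, so it is likely to be optimistic compared with the test error.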
All classification models have a training error rate; the individual errors can be displayed with a **confusion matrix**.
**Caveats of error rate:**
- training error rates will usually be lower than test error rates, which are the real quantity of interest. The higher the ratio of parameters _p_ to number of samples _n_, the more we expect this _overfitting_ to play a role.
- on the `Default` data, where only a small fraction of individuals default, the trivial null classifier (always predict "no default") will achieve an error rate that is only a bit higher than the LDA training set error rate
- a binary classifier such as this one can make two types of errors (Type I and II): incorrectly assigning a non-defaulter to the default class, or incorrectly assigning a defaulter to the no-default class
- Class-specific performance _(sensitivity and specificity)_ is important in certain fields (e.g., medicine)
LDA has low sensitivity on the `Default` data because
1. LDA is trying to approximate the Bayes classifier, which has the lowest total error rate out of all classifiers
2. In the process, the Bayes classifier will yield the smallest possible total number of misclassified observations, regardless of the class from which the errors stem.
3. It also uses a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. Lowering the threshold, e.g. to 20%, assigns more observations to the default class and so raises sensitivity.
$$Pr(default = Yes|X = x) > 0.5 \space {(default \space threshold)} \\ Pr(default = Yes|X = x) > 0.2 \space {(lowered \space threshold)}$$
```{r fig4-7, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "The figure illustrates the trade-off that results from modifying the threshold value for the posterior probability of default. For the Default data set, error rates are shown as a function of the threshold value for the posterior probability that is used to perform the assignment. The black solid line displays the overall error rate. The blue dashed line represents the fraction of defaulting customers that are incorrectly classified, and the orange dotted line indicates the fraction of errors among the non-defaulting customers."}
knitr::include_graphics("./images/fig4_7.jpg", error = FALSE)
```
- As the threshold is reduced, the error rate among individuals who default decreases steadily, but the error rate among the individuals who do not default increases. The decision on the threshold must be based on **domain knowledge** (e.g., detailed information about the costs associated with default)
- The ROC curve is a way to illustrate the two types of errors at all possible thresholds.
```{r fig4-8, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "The true positive rate is the sensitivity: the fraction of defaulters that are correctly identified, using a given threshold value. The false positive rate is 1-specificity: the fraction of non-defaulters that we classify incorrectly as defaulters, using that same threshold value. The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate. The dotted line represents the “no information” classifier; this is what we would expect if student status and credit card balance are not associated with probability of default."}
knitr::include_graphics("./images/fig4_8.jpg", error = FALSE)
```
An ideal ROC curve will hug the top left corner, so the larger **area under the ROC curve (AUC)**, the better the classifier.
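The ROC curve and its AUC can be traced by hand on a small example. The scores and labels below are made up for illustration; the chunk sweeps the threshold, records one (false positive rate, true positive rate) point per threshold, and integrates with the trapezoidal rule.

```{r}
# Hypothetical scores (posterior probabilities) and true labels.
score <- c(0.90, 0.70, 0.55, 0.40, 0.35, 0.30, 0.25, 0.15, 0.10, 0.05)
truth <- c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0)

# One ROC point per threshold: classify as positive when score >= threshold.
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(score[truth == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(score[truth == 0] >= t))

# Area under the curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

With no tied scores, this area equals the fraction of (positive, negative) pairs in which the positive observation receives the higher score, here 19 out of 24.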
```{r tbl4_6, cache=FALSE, echo=FALSE, fig.align="center", fig.cap="Possible results when applying a classifier or diagnostic test to a population"}
library("htmlTable")
library("magrittr")
matrix(c("True Neg. (TN)", "False Pos. (FP)", "N", "False Neg. (FN)", "True Pos. (TP)", "P", "N∗", "P∗", ""),
ncol = 3,
dimnames = list("Predicted class" = c(" − or Null", " + or Non-null", "Total"),
"True class" = c("Neg. or Null", "Pos. or Non-null", "Total"))) %>%
addHtmlTableStyle(align = "lcr") %>%
htmlTable
```
Important measures for classification and diagnostic testing:
- **False Positive rate (FP/N)** $\Longrightarrow$ Type I error, 1−Specificity
- **True Positive rate (TP/P)** $\Longrightarrow$ 1−Type II error, power, sensitivity, recall
- **Pos. Predicted value (TP/P∗)** $\Longrightarrow$ Precision, 1−false discovery proportion
- **Neg. Predicted value (TN/N∗)**
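These measures, and the effect of moving the threshold, can be sketched on a toy example. The probabilities and labels below are made up, and `metrics()` is a helper written for this illustration.

```{r}
# Hypothetical posterior probabilities of the positive class and true labels.
prob <- c(0.90, 0.70, 0.55, 0.40, 0.35, 0.30, 0.25, 0.15, 0.10, 0.05)
truth <- c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0)

# Classification measures at a given threshold.
metrics <- function(prob, truth, threshold) {
  pred <- as.integer(prob > threshold)
  TP <- sum(pred == 1 & truth == 1)
  FP <- sum(pred == 1 & truth == 0)
  TN <- sum(pred == 0 & truth == 0)
  FN <- sum(pred == 0 & truth == 1)
  c(
    sensitivity = TP / (TP + FN),   # true positive rate, TP / P
    specificity = TN / (TN + FP),   # 1 - false positive rate
    precision   = TP / (TP + FP),   # positive predictive value, TP / P*
    error_rate  = (FP + FN) / length(truth)
  )
}

metrics(prob, truth, threshold = 0.5)
metrics(prob, truth, threshold = 0.2)
```

In this toy example, lowering the threshold from 0.5 to 0.2 raises sensitivity from 0.5 to 0.75, but specificity falls from 5/6 to 2/6 and the overall error rate rises from 0.3 to 0.5.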
### Quadratic Discriminant Analysis (QDA)
- QDA makes assumptions similar to LDA: observations from each class are drawn from a Gaussian distribution, and estimates for the parameters are plugged into Bayes’ theorem in order to perform prediction
- Unlike LDA, QDA assumes that each class has its own covariance matrix
$$X ∼ N(μ_k,Σ_k) \Longrightarrow {Σ_k \space is \space the \space covariance \space matrix \space for \space the \space kth \space class}$$
**Bayes classifier**
$$δ_k(x) = - \frac{1}{2}(x - \mu_k)^T Σ_k^{-1}(x - \mu_k) - \frac{1}{2}\log|Σ_k| + \log(π_k) \\ \Downarrow \\ δ_k(x) = - \frac{1}{2}x^T Σ_k^{-1}x + x^T Σ_k^{-1} \mu_k - \frac{1}{2}μ_k^T Σ_k^{-1} μ_k - \frac{1}{2}\log|Σ_k| + \log π_k$$
The QDA classifier involves plugging estimates for **$Σ_k$, $μ_k$, and $π_k$** into the above equation, and then assigning an observation X = x to the class for which this quantity is **largest**.
The quantity x appears as a quadratic function, hence the name.
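The QDA discriminant can be evaluated with base R functions. As a sketch, the chunk below again borrows the built-in `iris` data (an example chosen here, not one from the chapter) and uses `mahalanobis()` for the quadratic form.

```{r}
X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)

# Estimated QDA discriminant for every observation, one column per class.
delta <- sapply(classes, function(k) {
  Xk <- X[y == k, , drop = FALSE]
  mu_k <- colMeans(Xk)
  Sigma_k <- cov(Xk)                 # class-specific covariance matrix
  pi_k <- nrow(Xk) / nrow(X)
  # Quadratic form (x - mu_k)' Sigma_k^-1 (x - mu_k) for each row x of X.
  quad <- mahalanobis(X, center = mu_k, cov = Sigma_k)
  log_det <- as.numeric(determinant(Sigma_k, logarithm = TRUE)$modulus)
  -0.5 * quad - 0.5 * log_det + log(pi_k)
})

pred <- factor(classes[max.col(delta, ties.method = "first")], levels = classes)
mean(pred != y)  # training error rate
```

The only change relative to LDA is that each class gets its own `Sigma_k`, which is what makes the decision boundaries quadratic.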
<br>
**When is LDA preferred to QDA, or vice versa?**
<br>
1. **Bias-variance trade-off**
<br>
- Pro LDA: LDA assumes that the K classes share a common covariance matrix, so the discriminant is linear in x and there are only $Kp$ linear coefficients to estimate, whereas QDA estimates a separate covariance matrix for each class, for a total of $Kp(p+1)/2$ parameters. LDA is therefore a much less flexible classifier than QDA and has substantially *lower variance*, which can improve prediction performance.
- Con LDA: If the assumption that the K classes share a common covariance matrix is badly off, LDA can suffer from *high bias*
- Conclusion: Use LDA when there are relatively few training observations; use QDA when the training set is very large or the assumption of a common covariance matrix is clearly untenable.
```{r fig4-9, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Left: The Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem with Σ1 = Σ2. The shading indicates the QDA decision rule. Since the Bayes decision boundary is linear, it is more accurately approximated by LDA than by QDA. Right: Details are as given in the left-hand panel, except that Σ1 ≠ Σ2. Since the Bayes decision boundary is non-linear, it is more accurately approximated by QDA than by LDA."}
knitr::include_graphics("./images/fig4_9.jpg", error = FALSE)
```
### Naive Bayes
- Estimating a p-dimensional density function is challenging; naive Bayes makes a different assumption than LDA and QDA.
- an alternative to LDA that does not assume that the predictors follow a multivariate normal distribution
$$f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p),$$
where $f_{kj}$ is the density function of the jth predictor among observations in the kth class
*Within the kth class, the p predictors are independent.*
**Why is naive Bayes powerful?**
1. By assuming that the p covariates are independent within each class, we assume that there is no association between the predictors! Estimating a p-dimensional density function is difficult because we must consider not only the *marginal distribution* of each predictor but also the *joint distribution* of the predictors; the independence assumption removes the need to estimate the joint distribution.
2. Although the p covariates might not be independent within each class, the assumption is convenient and often gives pretty decent results, especially when n is small relative to p.
3. It reduces variance, though it introduces some bias (bias-variance trade-off)
**Options to estimate the one-dimensional density function $f_{kj}$ using training data**
1. [For Quantitative $X_j$] -> We assume $X_j |Y = k ∼ N(μ_{jk},σ_{jk}^2)$, where within each class, the jth predictor is drawn from a (univariate) normal distribution. It is **QDA-like with a diagonal class-specific covariance matrix**
2. [For Quantitative $X_j$] -> Use a *non-parametric estimate* for $f_{kj}$: make a histogram of the jth predictor for the observations within each class, and estimate $f_{kj}(x_j)$ as the fraction of training observations in the kth class that fall in the same bin as $x_j$. Or else, use a **kernel density estimator**.
3. [For Qualitative $X_j$] -> Count the proportion of training observations in the kth class that fall in each level of the jth predictor.
Note: On the `Default` data, at a fixed threshold, naive Bayes has a slightly higher overall error rate than LDA, but it correctly identifies a higher fraction of the true defaulters (higher sensitivity).
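The first option (a univariate normal for each predictor within each class) can be sketched as follows. The built-in `iris` data is used only as a convenient example chosen here, and everything is computed on the log scale to avoid multiplying many small densities.

```{r}
X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)

# Log of pi_k * f_k1(x_1) * ... * f_kp(x_p), with each f_kj a univariate normal.
log_post <- sapply(classes, function(k) {
  Xk <- X[y == k, , drop = FALSE]
  mu <- colMeans(Xk)
  sdev <- apply(Xk, 2, sd)
  log_dens <- sapply(seq_len(ncol(X)), function(j) {
    dnorm(X[, j], mean = mu[j], sd = sdev[j], log = TRUE)
  })
  rowSums(log_dens) + log(nrow(Xk) / nrow(X))
})

pred <- factor(classes[max.col(log_post, ties.method = "first")],
               levels = classes)
mean(pred != y)  # training error rate
```

Only p means and p standard deviations are estimated per class, compared with a full covariance matrix per class in QDA.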
## Summary of the classification methods
### An Analytical Comparison
- **LDA** and **logistic regression** assume that the log odds of the posterior probabilities is _linear_ in x.
- **QDA** assumes that the log odds of the posterior probabilities is _quadratic_ in x.
- **LDA** is simply a restricted version of QDA with $Σ_1 = · · · = Σ_K = Σ$
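- **LDA** is a special case of naive Bayes and vice versa, under suitable restrictions: any classifier with a linear decision boundary can be written in the additive naive Bayes form, while naive Bayes with Gaussian $f_{kj}$ and variances shared across classes is LDA with a diagonal $Σ$.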
- **LDA** assumes that the features are normally distributed with a common within-class covariance matrix, and naive Bayes instead assumes _independence_ of the features.
- **Naive Bayes** can produce a more _flexible_ fit.
- **QDA** might be more accurate in settings where interactions among the predictors are important in discriminating between classes.
- **LDA > logistic regression** when the observations in each of the K classes are approximately normally distributed.
- **K-nearest neighbors (KNN)** will be a better classifier when the decision boundary is non-linear, n is large, and p is small.
- **KNN** has low bias but large variance; as such, KNN requires a lot of observations relative to the number of predictors.
- If the decision boundary is non-linear but n is only modest, or p is not very small, then QDA may be preferred to KNN.
- KNN does not tell us which predictors are important!
<br>
_Final note._ The choice of method depends on (1) the true distribution of the predictors in each of the K classes, and (2) the values of n and p, which govern the bias-variance trade-off
### An Empirical Comparison
```{r fig4-11, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Boxplots of the test error rates for each of the linear scenarios described in the main text."}
knitr::include_graphics("./images/fig4_11.jpg", error = FALSE)
```
**When Bayes decision boundary is linear,**
_Scenario 1_: Binary class response, equal observations in each class, uncorrelated predictors
_Scenario 2_: Similar to Scenario 1, but the predictors had a correlation of −0.5.
_Scenario 3_: Predictors had a negative correlation, t-distribution (more extreme points at the tails)
```{r fig4-12, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Boxplots of the test error rates for each of the non-linear scenarios described in the main text"}
knitr::include_graphics("./images/fig4_12.jpg", error = FALSE)
```
**When Bayes decision boundary is non-linear,**
_Scenario 4_: Normal distribution, correlation of 0.5 between the predictors in the first class, and correlation of −0.5 between the predictors in the second class.
_Scenario 5_: Normal distribution, uncorrelated predictors
_Scenario 6_: Normal distribution, different diagonal covariance matrix for each class, small n
## Generalized Linear Models
**Count data** (e.g. number of bikers per hour) is neither quantitative nor qualitative
=> neither linear regression nor the classification approaches considered so far are applicable.
## Linear regression with count data - negative values
Fitting a least squares regression model to the `Bikeshare` data provides some reasonable results:
* as weather progressively worsens, the number of bikers decreases (_coefficients become negative wrt baseline_)
* the coefficients associated with season and time of day match expected patterns (_lowest in winter, and highest during peak commute times_)
```{r tab4-10, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_Results for a least squares linear model fit to predict bikers in the Bikeshare data. For the qualitative variable weathersit, the baseline level corresponds to clear skies._"}
knitr::include_graphics("./images/tab4_10.jpg", error = FALSE)
```
```{r fig4-13, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_A least squares linear regression model was fit to predict bikers in the Bikeshare data set. Left: The coefficients associated with the month of the year. Bike usage is highest in the spring and fall, and lowest in the winter. Right: The coefficients associated with the hour of the day. Bike usage is highest during peak commute times, and lowest overnight._"}
knitr::include_graphics("./images/fig4_13.jpg", error = FALSE)
```
***Problem 1***: <mark>*model predicts negative numbers of bikers at times*</mark>
## Linear regression with count data - heteroscedasticity
In this example, the variance of biker numbers changes as the mean number changes:
* during worse conditions, there are few bikers, and little variation in the number of bikers
* during better conditions, there are many bikers on average, but also larger variation in the number of bikers
```{r fig4-14, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_Left: On the Bikeshare dataset, the number of bikers is displayed on the y-axis, and the hour of the day is displayed on the x-axis. For the most part, as the mean number of bikers increases, so does the variance in the number of bikers. A smoothing spline fit is shown in green. Right: The log of the number of bikers is displayed on the y-axis._"}
knitr::include_graphics("./images/fig4_14.jpg", error = FALSE)
```
***Problem 2***: <mark>*observed heteroscedasticity is a violation of linear model assumptions*</mark>
$$Y = \beta_{0} + \sum_{j=1}^p X_j\beta_{j} + \epsilon$$
where $\epsilon$ is a mean-zero error term with a constant variance
Log-transforming the response reduces the heteroscedasticity, but cannot be used where the response can take on a 0 value.
Log transformation also results in challenges in interpretation:
e.g. “_a one-unit increase in $X_j$ is associated with an increase in the mean of the log of $Y$ by an amount $β_j$_”
## Problems with linear regression of count data
***Problem 1***: <mark>*model predicts negative numbers of bikers at times*</mark>
***Problem 2***: <mark>*observed heteroscedasticity is a violation of linear model assumptions*</mark>
***Problem 3***: <mark>*integer values (bikers) predicted using a continuous response $Y$*</mark>
"_[A] Poisson regression model provides a much more natural and elegant approach for this task._"
## Poisson distribution
A count response variable $Y$ (which takes on non-negative integer values) can be modeled using the **Poisson distribution**, where the probability that $Y$ takes on a given count value $k$ can be calculated as:
$Pr(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}$ for $k$ = 0, 1, 2, ...
where $\lambda$ represents both the expected value (mean) and variance of $Y$:
$\lambda = E(Y) = Var(Y)$
=> "_[I]f $Y$ follows the Poisson distribution, then the larger the mean of $Y$, the larger its variance._"
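```{r fig.cap= "_Plots of Poisson Distributions with different lambda values, showing how variance increases with increasing lambda. Note all values are non-negative integer values, suitable for modelling counts, k._"}
# Plot the Poisson probability mass function for lambda = 1, ..., 4
old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels
lambda <- 1:4
k <- 0:10                        # count values at which to evaluate Pr(Y = k)
for (lam in lambda) {
  # Pr(Y = k) = exp(-lambda) * lambda^k / k!
  Prk <- (exp(-lam) * lam^k) / factorial(k)
  plot(k, Prk, type = "b", ylim = c(0, 0.4), main = paste("lambda =", lam))
}
par(old_par)                     # restore the previous graphics settings
```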
## Poisson Regression Model mean (lambda)
"_[R]ather than modeling [a count response variable], $Y$, as a Poisson distribution with a fixed mean value like $\lambda$ = 5, we would like to allow the mean to vary as a function of the covariates._"
The mean $\lambda$ can be modeled as a function of the predictor variables as follows:
$\log(\lambda(X_1, ..., X_p)) = \beta_0 + \beta_1X_1 + ... + \beta_pX_p$
NB: modeling the log of $\lambda$ ensures that $\lambda$ can only take positive values.
This is equivalent to representing the mean $\lambda$ as follows:
$\lambda = \text{E}(Y) = \lambda(X_1, ..., X_p) = e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}$
## Estimating the Poisson Regression parameters
The calculation of $\lambda$ can then be used in the formula of the Poisson Distribution, allowing the Maximum Likelihood approach to be used in estimating the parameters, $\beta_0$, $\beta_1$,..., $\beta_p$:
Poisson Distribution Formula: $Pr(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}$ for $k$ = 0, 1, 2, ...
Likelihood function: $l(\beta_0, \beta_1, ..., \beta_p) = \prod_{i=1}^n\frac{e^{-\lambda(x_i)}\lambda(x_i)^{y_i}}{y_i!}$
where $\lambda(x_i) = e^{\beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip}}$
Coefficients that maximize the likelihood $l(\beta_0, \beta_1, ..., \beta_p)$ (make the observed data as likely as possible) are chosen.
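In R, this maximization is what `glm()` does with `family = poisson`. The sketch below uses simulated counts as a stand-in for `Bikeshare`: the predictors `temp` and `rain` and the true coefficients are assumptions made up for the illustration.

```{r}
# Simulated counts (assumption: a stand-in for Bikeshare, not the real data).
set.seed(4)
n <- 2000
temp <- runif(n)                            # hypothetical scaled temperature
rain <- rbinom(n, size = 1, prob = 0.3)     # hypothetical rain indicator
lambda <- exp(1 + 1.5 * temp - 0.6 * rain)  # true mean, linear on the log scale
bikers <- rpois(n, lambda)

fit <- glm(bikers ~ temp + rain, family = poisson)
coef(fit)               # should be close to 1, 1.5 and -0.6
exp(coef(fit)["rain"])  # multiplicative change in the mean when rain = 1

# Fitted means are strictly positive, unlike least squares predictions.
range(fitted(fit))
```

The estimates land near the values used to simulate the data, and every fitted mean is positive because it is the exponential of the linear predictor.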
## Interpreting Poisson Regression
An increase in $X_j$ by one unit is associated with a change in $E(Y) = \lambda$ by a factor of $exp(\beta_j)$
```{r tab4-11, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_Results for Poisson regression model fit to predict bikers in the Bikeshare data. For the qualitative variable weathersit, the baseline level corresponds to clear skies._"}
knitr::include_graphics("./images/tab4_11.jpg", error = FALSE)
```
A change in weather from clear to cloudy skies is associated with a change in mean bike usage by a factor of
exp(-0.08) = 0.923
i.e. on average, only 92.3% as many people will use bikes compared to when it is clear (baseline weather).
## Advantages of Poisson Regression
Poisson regression has several advantages in modeling count data:
**Mean-variance relationship** By modeling bike usage with a Poisson distribution, we implicitly assume that mean bike usage in a given hour equals the variance of bike usage during that hour (cf. the constant variance assumed in linear regression).
**Non-negative fitted values** There are no negative predictions using the Poisson regression model.
## Generalized Linear Models
Generalized linear models (GLMs) all follow the same 'recipe':
* use a set of predictors $X_1$, ..., $X_p$ to predict a response $Y$
* model the response $Y$ as coming from a particular distribution
e.g. Poisson Distribution, for Poisson regression
* transform the mean of the response (via a _link function_ $\eta$) so that the transformed mean is a linear function of the predictors
e.g. for Poisson regression, $\log(\lambda(X_1, ..., X_p)) = \beta_0 + \beta_1X_1 + ... + \beta_pX_p$
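The recipe shows up directly in R's family objects, which pair a response distribution with a default link function. The short sketch below inspects the three families covered in this chapter; the value `eta = 0.3` is an arbitrary linear predictor chosen for illustration.

```{r}
# Each GLM family pairs a response distribution with a default link function.
links <- c(
  linear   = gaussian()$link,
  logistic = binomial()$link,
  poisson  = poisson()$link
)
links

# The link maps the mean to the linear predictor; its inverse maps back.
eta <- 0.3
inv <- c(
  linear   = gaussian()$linkinv(eta),  # identity: eta
  logistic = binomial()$linkinv(eta),  # exp(eta) / (1 + exp(eta))
  poisson  = poisson()$linkinv(eta)    # exp(eta)
)
inv
```

Linear, logistic, and Poisson regression therefore differ only in the `family` argument passed to `glm()`.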
## Lab: Classification Methods
## Meeting Videos
### Cohort 1
`r knitr::include_url("https://www.youtube.com/embed/_W0MGfmpzYY")`
<details>
<summary> Meeting chat log </summary>
```
00:08:27 Kim M: 👋😁
00:09:40 SriRam: Yay, summer time in SA
00:12:51 SriRam: lol
00:13:42 SriRam: Can someone help me with zoom, my Audio doesn’t work. I tried many microphones 🙁, all kinds of settings
00:14:01 SriRam: Any expert advice?
00:14:05 Raymond Balise: Mac or Windows?
00:14:10 SriRam: Mac
00:14:16 Raymond Balise: I supper common
00:14:27 Raymond Balise: there is an up arrow next to the mic
00:14:28 SriRam: 🙁
00:14:30 Raymond Balise: pick the mic there
00:14:40 Raymond Balise: Zoom usually guesses worng
00:14:53 SriRam: It says “same as system”
00:15:24 August: document html knit open that
00:15:24 Ryan Metcalf: Ctrl + Shift + B for building. Or Knit the single file.
00:15:29 Kim M: No other option? I see 'same as system' but the other option ('Microphone Array...') is selected
00:15:36 Raymond Balise: are you using an external mic
00:15:55 SriRam: Yes < I tried wired and wireless
00:36:18 Jon Harmon (jonthegeek): FYI: No newline between $$ and the equation to get it to render properly in HTML output. So:
$$Pr(Y = k|X = x) … {(x)}$$ works (even if it's multi-line in-between)
01:08:09 Raymond Balise: Lovely work. I need to get to another meeting.
01:12:36 Ryan Metcalf: Great job Mei Ling!
01:13:04 Kim M: Me. But I'm not sure I'll be able to get through.
01:13:14 Kim M: I'll try.
01:13:48 Laura Rose: sounds good!
01:14:05 Kim M: Ciao
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/3PriAsFD5Ps")`
<details>
<summary> Meeting chat log </summary>
```
00:22:00 Wayne Defreitas: YES tidymodels
00:31:06 August: cs = cubic spline
00:32:57 August: some additional stuff about splines in r
00:32:58 August: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-019-0666-3
00:42:58 Wayne Defreitas: #can create custom function [metric_set()] to specify which model metrics we want.
#available functions: accuracy(), kap(), sens(),spec(), ppv(), npv(), mcc(), j_index(), bal_accuracy, detection_prevalence(), precision(), recall(), f_meas()
custom_metrics <- metric_set (accuracy, sens, spec)
00:44:24 Wayne Defreitas: custom_metrics(leads_results,
truth=purchased,
estimate=.pred_class)
00:44:43 August: https://yardstick.tidymodels.org/reference/metric_set.html
01:02:54 Wayne Defreitas: Running to another meeting…thanks everyone
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/ZjFSODq2Ryk")`
<details>
<summary> Meeting chat log </summary>
```
00:20:03 Raymond Balise: nicely done
00:21:33 Mei Ling Soh: https://www.theanalysisfactor.com/count-data-considered-continuous/#:~:text=The%20issue%20with%20count%20variables,model%2C%20which%20require%20continuous%20data.&text=Treating%20that%20count%20variable%20as,in%20your%20particular%20data%20set.
00:30:04 SriRam: Just realised that GLM is not in the first edition (the book I have) :(
00:30:54 August: https://www.statlearning.com/
00:31:05 August: link at the bottom of page for 2nd edition
00:31:24 SriRam: Thank you August & Mei
00:32:25 August: 👍 anytime
00:42:27 Federica Gazzelloni: Thanks Raymond!!
```
</details>
### Cohort 2
`r knitr::include_url("https://www.youtube.com/embed/URL")`
<details>
<summary> Meeting chat log </summary>
```
ADD LOG HERE
```
</details>