\providecommand{\E}{\operatorname{E}}
\providecommand{\V}{\operatorname{Var}}
\providecommand{\Cov}{\operatorname{Cov}}
\providecommand{\se}{\operatorname{se}}
\providecommand{\logit}{\operatorname{logit}}
\providecommand{\iid}{\; \stackrel{\text{iid}}{\sim}\;}
\providecommand{\asim}{\; \stackrel{.}{\sim}\;}
\providecommand{\xs}{x_1, x_2, \ldots, x_n}
\providecommand{\Xs}{X_1, X_2, \ldots, X_n}
\providecommand{\bB}{\boldsymbol{B}}
\providecommand{\bb}{\boldsymbol{\beta}}
\providecommand{\bx}{\boldsymbol{x}}
\providecommand{\bX}{\boldsymbol{X}}
\providecommand{\by}{\boldsymbol{y}}
\providecommand{\bY}{\boldsymbol{Y}}
\providecommand{\bz}{\boldsymbol{z}}
\providecommand{\bZ}{\boldsymbol{Z}}
\providecommand{\be}{\boldsymbol{e}}
\providecommand{\bE}{\boldsymbol{E}}
\providecommand{\bs}{\boldsymbol{s}}
\providecommand{\bS}{\boldsymbol{S}}
\providecommand{\bP}{\boldsymbol{P}}
\providecommand{\bI}{\boldsymbol{I}}
\providecommand{\bD}{\boldsymbol{D}}
\providecommand{\bd}{\boldsymbol{d}}
\providecommand{\bW}{\boldsymbol{W}}
\providecommand{\bw}{\boldsymbol{w}}
\providecommand{\bM}{\boldsymbol{M}}
\providecommand{\bPhi}{\boldsymbol{\Phi}}
\providecommand{\bphi}{\boldsymbol{\phi}}
\providecommand{\bN}{\boldsymbol{N}}
\providecommand{\bR}{\boldsymbol{R}}
\providecommand{\bu}{\boldsymbol{u}}
\providecommand{\bU}{\boldsymbol{U}}
\providecommand{\bv}{\boldsymbol{v}}
\providecommand{\bV}{\boldsymbol{V}}
\providecommand{\bO}{\boldsymbol{0}}
\providecommand{\bOmega}{\boldsymbol{\Omega}}
\providecommand{\bLambda}{\boldsymbol{\Lambda}}
\providecommand{\bSig}{\boldsymbol{\Sigma}}
\providecommand{\bSigma}{\boldsymbol{\Sigma}}
\providecommand{\bt}{\boldsymbol{\theta}}
\providecommand{\bT}{\boldsymbol{\Theta}}
\providecommand{\bpi}{\boldsymbol{\pi}}
\providecommand{\argmax}{\text{argmax}}
\providecommand{\KL}{\text{KL}}
\providecommand{\fdr}{{\rm FDR}}
\providecommand{\pfdr}{{\rm pFDR}}
\providecommand{\mfdr}{{\rm mFDR}}
\providecommand{\bh}{\hat}
\providecommand{\dd}{\lambda}
\providecommand{\q}{\operatorname{q}}
```{r, message=FALSE, echo=FALSE, cache=FALSE}
source("./customization/knitr_options.R")
```
```{r, message=FALSE, echo=FALSE, cache=FALSE}
library(broom)
```
# Logistic Regression
## Goal
Logistic regression models a Bernoulli distributed response variable in terms of linear combinations of explanatory variables.
This extends least squares regression to the case where the response variable captures a "success" or "failure" type outcome.
## Bernoulli as EFD
If $Y \sim \mbox{Bernoulli}(p)$, then its pmf is:
$$
\begin{aligned}
f(y; p) & = p^{y} (1-p)^{1-y} \\
& = \exp\left\{ \log\left(\frac{p}{1-p}\right)y + \log(1-p) \right\}
\end{aligned}
$$
In exponential family distribution (EFD) notation,
$$
\eta(p) = \log\left(\frac{p}{1-p}\right) \equiv \logit(p),
$$
$A(\eta(p)) = \log(1 + \exp(\eta)) = -\log(1-p)$, and $y$ is the sufficient statistic.
## Model
$(\bX_1, Y_1), (\bX_2, Y_2), \ldots, (\bX_n, Y_n)$ are distributed so that $Y_i | \bX_i \sim \mbox{Bernoulli}(p_i)$, where $\{Y_i | \bX_i\}_{i=1}^n$ are jointly independent and
$$
\logit\left(\E[Y_i | \bX_i]\right) = \log\left( \frac{\Pr(Y_i = 1 | \bX_i)}{\Pr(Y_i = 0 | \bX_i)} \right) = \bX_i \bb.
$$
From this it follows that
$$
p_i = \frac{\exp\left(\bX_i \bb\right)}{1 + \exp\left(\bX_i \bb\right)}.
$$
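As a quick numerical check of this relationship, note that in base R the logit and its inverse are available as `qlogis()` and `plogis()`; the particular probability value below is chosen only for illustration.
```{r}
# logit and inverse logit in base R
eta <- qlogis(0.7)   # log(0.7 / 0.3)
plogis(eta)          # recovers 0.7 = exp(eta) / (1 + exp(eta))
```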
## Maximum Likelihood Estimation
The MLE $\hat{\bb}$ is calculated by maximizing the log-likelihood:
$$
\begin{aligned}
\ell\left(\bb ; \by, \bX \right) & = \sum_{i=1}^n \log\left(\frac{p_i}{1-p_i}\right) y_i + \log(1-p_i) \\
& = \sum_{i=1}^n (\bx_i \bb) y_i - \log\left(1 + \exp\left(\bx_i \bb\right) \right)
\end{aligned}
$$
## Iteratively Reweighted Least Squares
1. Initialize $\bb^{(1)}$.
2. For each iteration $t=1, 2, \ldots$, set
$$
p_i^{(t)} = \logit^{-1}\left( \bx_i \bb^{(t)} \right), \ \ \ \
z_i^{(t)} = \logit\left(p_i^{(t)}\right) + \frac{y_i - p_i^{(t)}}{p_i^{(t)}(1-p_i^{(t)})}
$$
and let $\bz^{(t)} = \left\{z_i^{(t)}\right\}_{i=1}^n$.
3. Form $n \times n$ diagonal matrix $\boldsymbol{W}^{(t)}$ with $(i, i)$ entry equal to $p_i^{(t)}(1-p_i^{(t)})$.
4. Obtain $\bb^{(t+1)}$ by performing the weighted least squares regression (see [GLS](#/generalized-least-squares-1) from earlier)
$$
\bb^{(t+1)} = \left(\bX^T \boldsymbol{W}^{(t)} \bX \right)^{-1} \bX^T \boldsymbol{W}^{(t)} \bz^{(t)}.
$$
5. Iterate Steps 2-4 over $t=1, 2, 3, \ldots$ until convergence, setting $\hat{\bb} = \bb^{(\infty)}$.
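Below is a minimal R sketch of this algorithm on simulated data; the simulated design matrix, convergence tolerance, and variable names are illustrative choices, and the result is compared to `glm()` as a sanity check.
```{r}
# illustrative IRLS for logistic regression on simulated data
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))                       # design matrix with an intercept
beta_true <- c(-0.5, 1)
y <- rbinom(n, size = 1, prob = plogis(X %*% beta_true))

beta <- rep(0, ncol(X))                       # step 1: initialize
for (t in 1:25) {                             # steps 2-5
  p <- as.vector(plogis(X %*% beta))          # p_i = logit^{-1}(x_i beta)
  W <- diag(p * (1 - p))                      # step 3: diagonal weight matrix
  z <- as.vector(X %*% beta) + (y - p) / (p * (1 - p))              # step 2: working response
  beta_new <- as.vector(solve(t(X) %*% W %*% X, t(X) %*% W %*% z))  # step 4: WLS update
  converged <- max(abs(beta_new - beta)) < 1e-8
  beta <- beta_new
  if (converged) break
}
cbind(irls = beta, glm = coef(glm(y ~ X - 1, family = binomial)))
```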
## GLMs
For exponential family distribution response variables, the **generalized linear model** is
$$
\eta\left(\E[Y | \bX]\right) = \bX \bb
$$
where $\eta(\theta)$ is the function that maps the expected value $\theta$ to the natural parameter. This is called the **canonical link function** in the GLM setting.
The **iteratively reweighted least squares** algorithm presented above for calculating (local) maximum likelihood estimates of $\bb$ generalizes to a large class of exponential family distribution response variables.
# `glm()` Function in R
## Example: Grad School Admissions
```{r ucladata, cache=TRUE}
mydata <-
read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
dim(mydata)
head(mydata)
```
Data and analysis courtesy of <http://www.ats.ucla.edu/stat/r/dae/logit.htm>.
## Explore the Data
```{r}
apply(mydata, 2, mean)
apply(mydata, 2, sd)
table(mydata$admit, mydata$rank)
```
```{r}
ggplot(data=mydata) +
geom_boxplot(aes(x=as.factor(admit), y=gre))
```
```{r}
ggplot(data=mydata) +
geom_boxplot(aes(x=as.factor(admit), y=gpa))
```
## Logistic Regression in R
```{r}
mydata$rank <- factor(mydata$rank, levels=c(1, 2, 3, 4))
myfit <- glm(admit ~ gre + gpa + rank,
data = mydata, family = "binomial")
myfit
```
## Summary of Fit
```{r}
summary(myfit)
```
## ANOVA of Fit
```{r}
anova(myfit, test="Chisq")
anova(myfit, test="LRT")
```
## Example: Contraceptive Use
```{r wws509data, cache=TRUE}
cuse <-
read.table("http://data.princeton.edu/wws509/datasets/cuse.dat",
header=TRUE)
dim(cuse)
head(cuse)
```
Data and analysis courtesy of <http://data.princeton.edu/R/glms.html>.
## A Different Format
Note that in this data set there are multiple observations per explanatory variable configuration.
The last two columns of the data frame count the successes and failures per configuration.
```{r}
head(cuse)
```
## Fitting the Model
When this is the case, we call the `glm()` function slightly differently.
```{r}
myfit <- glm(cbind(using, notUsing) ~ age + education + wantsMore,
data=cuse, family = binomial)
myfit
```
## Summary of Fit
```{r}
summary(myfit)
```
## ANOVA of Fit
```{r}
anova(myfit)
```
## More on this Data Set
See <http://data.princeton.edu/R/glms.html> for more on fitting logistic regression to this data set.
A number of interesting choices are made that reveal more about the data.
# Generalized Linear Models
## Definition
The **generalized linear model** (GLM) builds from OLS and GLS to allow for the case where $Y | \bX$ is distributed according to an exponential family distribution. The estimated model is
$$
g\left(\E[Y | \bX]\right) = \bX \bb
$$
where $g(\cdot)$ is called the **link function**. This model is typically fit by numerical methods to calculate the maximum likelihood estimate of $\bb$.
## Exponential Family Distributions
Recall that if $Y$ follows an EFD then it has pdf of the form
$$f(y ; \boldsymbol{\theta}) =
h(y) \exp \left\{ \sum_{k=1}^d \eta_k(\boldsymbol{\theta}) T_k(y) - A(\boldsymbol{\eta}) \right\}
$$
where $\boldsymbol{\theta}$ is a vector of parameters, $\{T_k(y)\}$ are sufficient statistics, and $A(\boldsymbol{\eta})$ is the cumulant generating function.
The functions $\eta_k(\boldsymbol{\theta})$ for $k=1, \ldots, d$ map the usual parameters $\bt$ (often moments of the rv $Y$) to the *natural parameters* or *canonical parameters*.
$\{T_k(y)\}$ are sufficient statistics for $\{\eta_k\}$ due to the factorization theorem.
$A(\boldsymbol{\eta})$ is sometimes called the *log normalizer* because
$$A(\boldsymbol{\eta}) = \log \int h(y) \exp \left\{ \sum_{k=1}^d \eta_k(\boldsymbol{\theta}) T_k(y) \right\} dy.$$
## Natural Single Parameter EFD
A natural single parameter EFD simplifies to the scenario where $d=1$ and $T(y) = y$
$$f(y ; \eta) =
h(y) \exp \left\{ \eta(\theta) y - A(\eta(\theta)) \right\}
$$
where without loss of generality we can write $\E[Y] = \theta$.
## Dispersion EFDs
The family of distributions for which GLMs are most typically developed are dispersion EFDs. An example of a dispersion EFD that extends the natural single parameter EFD is
$$f(y ; \eta) =
h(y, \phi) \exp \left\{ \frac{\eta(\theta) y - A(\eta(\theta))}{\phi} \right\}
$$
where $\phi$ is the dispersion parameter.
## Example: Normal
Let $Y \sim \mbox{Normal}(\mu, \sigma^2)$. Then:
$$
\theta = \mu, \eta(\mu) = \mu
$$
$$
\phi = \sigma^2
$$
$$
A(\mu) = \frac{\mu^2}{2}
$$
$$
h(y, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{y^2}{\sigma^2}}
$$
## EFD for GLMs
There has been a very broad development of GLMs and extensions. A common setting for introducing GLMs is the dispersion EFD with a general link function $g(\cdot)$.
See the classic text *Generalized Linear Models*, by McCullagh and Nelder, for such a development.
## Components of a GLM
1. *Random*: The particular exponential family distribution.
$$
Y \sim f(y ; \eta, \phi)
$$
2. *Systematic*: The determination of each $\eta_i$ from $\bX_i$ and $\bb$.
$$
\eta_i = \bX_i \bb
$$
3. *Parametric Link*: The connection between $E[Y_i|\bX_i]$ and $\bX_i \bb$.
$$
g(E[Y_i|\bX_i]) = \bX_i \bb
$$
## Link Functions
Even though the link function $g(\cdot)$ can be considered in a fairly general framework, the **canonical link function** $\eta(\cdot)$ is often utilized.
The canonical link function is the function that maps the expected value into the natural parameter.
In this case, $Y | \bX$ is distributed according to an exponential family distribution with
$$
\eta \left(\E[Y | \bX]\right) = \bX \bb.
$$
## Calculating MLEs
Given the model $g\left(\E[Y | \bX]\right) = \bX \bb$, the EFD should be fully parameterized. The Newton-Raphson method or Fisher's scoring method can be utilized to find the MLE of $\bb$.
## Newton-Raphson
1. Initialize $\bb^{(0)}$. For $t = 1, 2, \ldots$
2. Calculate the score $s(\bb^{(t)}) = \nabla \ell(\bb; \bX, \bY) \mid_{\bb = \bb^{(t)}}$ and observed Fisher information $$H(\bb^{(t)}) = - \nabla \nabla^T \ell(\bb; \bX, \bY) \mid_{\bb = \bb^{(t)}}$$. Note that the observed Fisher information is also the negative Hessian matrix.
3. Update $\bb^{(t+1)} = \bb^{(t)} + H(\bb^{(t)})^{-1} s(\bb^{(t)})$.
4. Iterate until convergence, and set $\hat{\bb} = \bb^{(\infty)}$.
## Fisher's scoring
1. Initialize $\bb^{(0)}$. For $t = 1, 2, \ldots$
2. Calculate the score $s(\bb^{(t)}) = \nabla \ell(\bb; \bX, \bY) \mid_{\bb = \bb^{(t)}}$ and expected Fisher information $$I(\bb^{(t)}) = - \E\left[\nabla \nabla^T \ell(\bb; \bX, \bY) \mid_{\bb = \bb^{(t)}} \right].$$
3. Update $\bb^{(t+1)} = \bb^{(t)} + I(\bb^{(t)})^{-1} s(\bb^{(t)})$.
4. Iterate until convergence, and set $\hat{\bb} = \bb^{(\infty)}$.
When the canonical link function is used, the Newton-Raphson algorithm and Fisher's scoring algorithm are equivalent.
Exercise: Prove this.
## Iteratively Reweighted Least Squares
For the canonical link, Fisher's scoring method can be written as an iteratively reweighted least squares algorithm, as shown earlier for logistic regression. Note that the Fisher information is
$$
I(\bb^{(t)}) = \bX^T \bW \bX
$$
where $\bW$ is an $n \times n$ diagonal matrix with $(i, i)$ entry equal to $\V(Y_i | \bX; \bb^{(t)})$.
The score function is
$$
s(\bb^{(t)}) = \bX^T \left( \bY - \E[\bY | \bX; \bb^{(t)}] \right)
$$
and the current coefficient value $\bb^{(t)}$ can be written as
$$
\bb^{(t)} = (\bX^T \bW \bX)^{-1} \bX^T \bW \bX \bb^{(t)}.
$$
Putting this together we get
$$
\bb^{(t)} + I(\bb^{(t)})^{-1} s(\bb^{(t)}) = (\bX^T \bW \bX)^{-1} \bX^T \bW \bz^{(t)}
$$
where
$$
\bz^{(t)} = \bX \bb^{(t)} + \bW^{-1} \left( \bY - \E[\bY | \bX; \bb^{(t)}] \right).
$$
This is a generalization of the iteratively reweighted least squares algorithm we showed earlier for logistic regression.
## Estimating Dispersion
For the simple dispersion model above, it is typically straightforward to calculate the MLE $\hat{\phi}$ once $\hat{\bb}$ has been calculated.
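In R, `summary.glm()` reports a dispersion value: it is fixed at 1 for the binomial and Poisson families, and for families with a free dispersion parameter it uses a moment-based estimate rather than the MLE. The Gaussian fit below on the built-in `cars` data is only for illustration.
```{r}
summary(myfit)$dispersion                      # binomial fit from above: fixed at 1
gfit <- glm(dist ~ speed, data = cars, family = gaussian)
summary(gfit)$dispersion                       # estimated dispersion (sigma^2)
```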
## CLT Applied to the MLE
Given that $\hat{\bb}$ is a maximum likelihood estimate, we have the following CLT result on its distribution as $n \rightarrow \infty$:
$$
\sqrt{n} (\hat{\bb} - \bb) \stackrel{D}{\longrightarrow} \mbox{MVN}_{p}(\bO, \phi (\bX^T \bW \bX)^{-1})
$$
## Approximately Pivotal Statistics
The previous CLT gives us the following two approximately pivotal statistics. The first statistic facilitates getting an overall measure of uncertainty on the estimate $\hat{\bb}$.
$$
\hat{\phi}^{-1} (\hat{\bb} - \bb)^T (\bX^T \hat{\bW} \bX) (\hat{\bb} - \bb) \asim \chi^2_p
$$
This second pivotal statistic allows for performing a Wald test or forming a confidence interval on each coefficient, $\beta_j$, for $j=1, \ldots, p$.
$$
\frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\phi} [(\bX^T \hat{\bW} \bX)^{-1}]_{jj}}} \asim \mbox{Normal}(0,1)
$$
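These Wald statistics are what `summary.glm()` reports for each coefficient, and approximate Wald confidence intervals can be formed directly from the estimates and standard errors; the sketch below uses the contraceptive-use fit from above (`confint.default()` returns the same Wald intervals, whereas `confint()` on a `glm` object profiles the likelihood instead).
```{r}
est <- coef(summary(myfit))[, "Estimate"]
se  <- coef(summary(myfit))[, "Std. Error"]
cbind(lower = est - qnorm(0.975) * se,
      upper = est + qnorm(0.975) * se)
confint.default(myfit)                          # same Wald intervals
```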
## Deviance
Let $\hat{\boldsymbol{\eta}}$ be the estimated natural parameters from a GLM. For example, $\hat{\boldsymbol{\eta}} = \bX \hat{\bb}$ when the canonical link function is used.
Let $\hat{\boldsymbol{\eta}}_n$ be the **saturated model** where $Y_i$ is directly used to estimate $\eta_i$ without model constraints. For example, in the Bernoulli logistic regression model $\hat{\boldsymbol{\eta}}_n = \bY$, the observed outcomes.
The **deviance** for the model is defined to be
$$
D\left(\hat{\boldsymbol{\eta}}\right) = 2 \ell(\hat{\boldsymbol{\eta}}_n; \bX, \bY) - 2 \ell(\hat{\boldsymbol{\eta}}; \bX, \bY)
$$
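In R, the residual deviance of a fitted `glm` (and the deviance of the intercept-only null model) are stored on the fitted object, as illustrated below with the fit from above.
```{r}
myfit$deviance          # residual deviance of the fitted model
myfit$null.deviance     # deviance of the intercept-only model
deviance(myfit)         # same as myfit$deviance
```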
## Generalized LRT
Let $\bX_0$ be a subset of $p_0$ columns of $\bX$ and let $\bX_1$ be a subset of $p_1$ columns, where $1 \leq p_0 < p_1 \leq p$. Also, assume that the columns of $\bX_0$ are a subset of $\bX_1$.
Without loss of generality, suppose that $\bb_0 = (\beta_1, \beta_2, \ldots, \beta_{p_0})^T$ and $\bb_1 = (\beta_1, \beta_2, \ldots, \beta_{p_1})^T$.
Suppose we wish to test $H_0: (\beta_{p_0+1}, \beta_{p_0 + 2}, \ldots, \beta_{p_1}) = \bO$ vs $H_1: (\beta_{p_0+1}, \beta_{p_0 + 2}, \ldots, \beta_{p_1}) \not= \bO$
We can form $\hat{\boldsymbol{\eta}}_0 = \bX_0 \hat{\bb}_0$ from the GLM model $g\left(\E[Y | \bX_0]\right) = \bX_0 \bb_0$. We can analogously form $\hat{\boldsymbol{\eta}}_1 = \bX_1 \hat{\bb}_1$ from the GLM model $g\left(\E[Y | \bX_1]\right) = \bX_1 \bb_1$.
The $2\log$ generalized LRT can then be formed from the two deviance statistics
$$
2 \log \lambda(\bX, \bY) = 2 \log \frac{L(\hat{\boldsymbol{\eta}}_1; \bX, \bY)}{L(\hat{\boldsymbol{\eta}}_0; \bX, \bY)} = D\left(\hat{\boldsymbol{\eta}}_0\right) - D\left(\hat{\boldsymbol{\eta}}_1\right)
$$
where the null distribution is $\chi^2_{p_1-p_0}$.
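A small sketch of this test in R on the contraceptive-use data: fit the nested and full models, take the difference in deviances, and compare it to the $\chi^2_{p_1-p_0}$ null, either by hand with `pchisq()` or via `anova()` with `test = "LRT"` (the particular nested model below is chosen only for illustration).
```{r}
fit0 <- glm(cbind(using, notUsing) ~ age, data = cuse, family = binomial)
fit1 <- glm(cbind(using, notUsing) ~ age + education + wantsMore,
            data = cuse, family = binomial)
stat    <- fit0$deviance - fit1$deviance          # D(eta_0) - D(eta_1)
df_diff <- fit0$df.residual - fit1$df.residual    # p_1 - p_0
pchisq(stat, df = df_diff, lower.tail = FALSE)
anova(fit0, fit1, test = "LRT")                   # same test
```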
## Example: Grad School Admissions
Let's revisit a logistic regression example now that we know how the various statistics are calculated.
```{r, eval=FALSE}
mydata <-
read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
dim(mydata)
head(mydata)
```
Fit the model with basic output. Note the argument `family = "binomial"`.
```{r}
mydata$rank <- factor(mydata$rank, levels=c(1, 2, 3, 4))
myfit <- glm(admit ~ gre + gpa + rank,
data = mydata, family = "binomial")
myfit
```
This shows the fitted coefficient values, which are on the link function scale -- the logit, i.e., log odds, here. Also, a Wald test is performed for each coefficient.
```{r}
summary(myfit)
```
Here we perform a generalized LRT on each variable. Note the `rank` variable is now tested as a single factor variable as opposed to each dummy variable.
```{r}
anova(myfit, test="LRT")
```
```{r}
mydata <- data.frame(mydata, probs = myfit$fitted.values)
ggplot(mydata) + geom_point(aes(x=gpa, y=probs, color=rank)) +
geom_jitter(aes(x=gpa, y=admit), width=0, height=0.01, alpha=0.3)
```
```{r}
ggplot(mydata) + geom_point(aes(x=gre, y=probs, color=rank)) +
geom_jitter(aes(x=gre, y=admit), width=0, height=0.01, alpha=0.3)
```
```{r}
ggplot(mydata) + geom_boxplot(aes(x=rank, y=probs)) +
geom_jitter(aes(x=rank, y=probs), width=0.1, height=0.01, alpha=0.3)
```
## `glm()` Function
The `glm()` function has many different options available to the user.
```
glm(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, contrasts = NULL, ...)
```
To see the different link functions available, type:
```
help(family)
```
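For instance, a non-canonical link can be requested by passing a `family` object with a `link` argument; the probit refit of the grad-school model below is a small illustration only.
```{r}
probitfit <- glm(admit ~ gre + gpa + rank,
                 data = mydata, family = binomial(link = "probit"))
coef(probitfit)
```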
# Nonparametric Regression
## Simple Linear Regression
Recall the set up for simple linear regression. For random variables $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, **simple linear regression** estimates the model
$$
Y_i = \beta_1 + \beta_2 X_i + E_i
$$
where $\E[E_i | X_i] = 0$, $\V(E_i | X_i) = \sigma^2$, and $\Cov(E_i, E_j | X_i, X_j) = 0$ for all $1 \leq i, j \leq n$ and $i \not= j$.
Note that in this model $\E[Y | X] = \beta_1 + \beta_2 X.$
## Simple Nonparametric Regression
In **simple nonparametric regression**, we define a similar model while eliminating the linear assumption:
$$
Y_i = s(X_i) + E_i
$$
for some function $s(\cdot)$ with the same assumptions on the distribution of $\bE | \bX$. In this model, we also have
$$
\E[Y | X] = s(X).
$$
## Smooth Functions
Suppose we consider fitting the model $Y_i = s(X_i) + E_i$ with the restriction that $s \in C^2$, the class of functions with continuous second derivatives. We can set up an objective function that regularizes how smooth vs wiggly $s$ is.
Specifically, suppose for a given set of observed data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ we wish to identify a function $s \in C^2$ that minimizes for some $\lambda$
$$
\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int |s''(x)|^2 dx
$$
## Smoothness Parameter $\lambda$
When minimizing
$$
\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int |s^{\prime\prime}(x)|^2 dx
$$
it follows that if $\lambda=0$ then any function $s \in C^2$ that interpolates the data is a solution.
As $\lambda \rightarrow \infty$, the minimizing function approaches the simple linear regression solution.
## The Solution
For an observed data set $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ where $n \geq 4$ and a fixed value $\lambda$, there is an exact solution to minimizing
$$
\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int |s^{\prime\prime}(x)|^2 dx.
$$
The solution is called a **natural cubic spline**, which is constructed to have knots at $x_1, x_2, \ldots, x_n$.
## Natural Cubic Splines
Suppose without loss of generality that we have ordered $x_1 < x_2 < \cdots < x_n$. We assume all $x_i$ are unique to simplify the explanation here, but ties can be dealt with.
A **natural cubic spline** (NCS) is a function constructed from a set of piecewise cubic functions over the range $[x_1, x_n]$ joined at the knots so that the second derivative is continuous at the knots. Outside of the range ($< x_1$ or $> x_n$), the spline is linear and it has continuous second derivatives at the endpoint knots.
## Basis Functions
Depending on the value of $\lambda$, a different NCS will be constructed, but the entire family of NCS solutions over $0 < \lambda < \infty$ can be constructed from a common set of basis functions.
We construct $n$ basis functions $N_1(x), N_2(x), \ldots, N_n(x)$ with coefficients $\theta_1(\lambda), \theta_2(\lambda), \ldots, \theta_n(\lambda)$. The NCS takes the form
$$
s(x) = \sum_{i=1}^n \theta_i(\lambda) N_i(x).
$$
Define $N_1(x) = 1$ and $N_2(x) = x$. For $i = 3, \ldots, n$, define $N_i(x) = d_{i-1}(x) - d_{i-2}(x)$ where
$$
d_{i}(x) = \frac{(x - x_i)^3_+ - (x - x_n)^3_+}{x_n - x_i}
$$
and $(u)_+ = \max(u, 0)$ denotes the positive part.
Recall that we've labeled indices so that $x_1 < x_2 < \cdots < x_n$.
## Calculating the Solution
Let $\bt_\lambda = (\theta_1(\lambda), \theta_2(\lambda), \ldots, \theta_n(\lambda))^T$ and let $\bN$ be the $n \times n$ matrix with $(i, j)$ entry equal to $N_j(x_i)$. Finally, let $\bOmega$ be the $n \times n$ matrix with $(i,j)$ entry equal to $\int N_i^{\prime\prime}(x) N_j^{\prime\prime}(x) dx$.
The solution $\hat{\bt}_\lambda$ is the value of $\bt$ that minimizes
$$
(\by - \bN \bt)^T (\by - \bN \bt) + \lambda \bt^T \bOmega \bt.
$$
This results in
$$
\hat{\bt}_\lambda = (\bN^T \bN + \lambda \bOmega)^{-1} \bN^T \by.
$$
## Linear Operator
Letting
$$
\bS_\lambda = \bN (\bN^T \bN + \lambda \bOmega)^{-1} \bN^T
$$
it follows that the fitted values are
$$
\hat{\by} = \bS_\lambda \by.
$$
Thus, the fitted values from an NCS are constructed by taking linear combinations of the response variable values $y_1, y_2, \ldots, y_n$.
## Degrees of Freedom
Recall that in OLS, we formed projection matrix $\bP = \bX (\bX^T \bX)^{-1} \bX^T$ and noted that the number of columns $p$ of $\bX$ is also found in the trace of $\bP$ where $\operatorname{tr}(\bP) = p$.
The effective degrees of freedom for a model fit by a linear operator is calculated as the trace of the operator.
Therefore, for a given $\lambda$, the effective degrees of freedom is
$$
\operatorname{df}_\lambda = \operatorname{tr}(\bS_\lambda).
$$
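In R, `smooth.spline()` parameterizes the amount of smoothing through exactly this effective degrees of freedom: you can request a value via its `df` argument, and the fitted object stores the resulting `df` and smoothing parameter `lambda`. The simulated data below are illustrative.
```{r}
set.seed(101)
xx <- sort(runif(50, 0, 10))
yy <- sin(xx) + rnorm(50, sd = 0.3)
fit_smooth <- smooth.spline(xx, yy, df = 6)   # request df_lambda = 6
fit_smooth$df                                  # effective degrees of freedom
fit_smooth$lambda                              # corresponding smoothing parameter
```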
## Bayesian Interpretation
Minimizing
$$
\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int |s^{\prime\prime}(x)|^2 dx
$$
is equivalent to maximizing
$$
\exp\left\{ -\frac{\sum_{i=1}^n (y_i - s(x_i))^2}{2\sigma^2} \right\} \exp\left\{-\frac{\lambda}{2\sigma^2} \int |s^{\prime\prime}(x)|^2 dx\right\}.
$$
Therefore, the NCS solution can be interpreted as calculating the MAP estimate where $Y | X$ is Normal and there is an Exponential prior on the roughness $\int |s^{\prime\prime}(x)|^2 dx$ of $s$.
## Bias and Variance Trade-off
Typically we will choose some $0 < \lambda < \infty$ in an effort to balance the bias and variance. Let $\hat{Y} = \hat{s}(X; \lambda)$, where $\hat{s}(\cdot; \lambda)$ has been fit, for some chosen $\lambda$, by minimizing the criterion above on an independent data set. Then
$$
\begin{aligned}
\E\left[\left(Y - \hat{Y}\right)^2\right] & = {\rm E}\left[\left(s(x) + E - \hat{s}(x; \lambda)\right)^2 \right] \\
\ & = {\rm E}\left[\left( s(x) - \hat{s}(x; \lambda) \right)^2 \right] + {\rm Var}(E) \\
\ & = \left( s(x) - {\rm E}[\hat{s}(x; \lambda)]\right)^2 + {\rm Var}\left(\hat{s}(x; \lambda)\right) + {\rm Var}(E) \\
\ & = \mbox{bias}^2_{\lambda} + \mbox{variance}_{\lambda} + {\rm Var}(E)
\end{aligned}
$$
where all of the above calculations are conditioned on $X=x$.
In minimizing
$$
\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int |s^{\prime\prime}(x)|^2 dx
$$
the relationship is such that:
$$
\uparrow \lambda \Longrightarrow \mbox{bias}^2 \uparrow, \mbox{variance} \downarrow
$$
$$
\downarrow \lambda \Longrightarrow \mbox{bias}^2 \downarrow, \mbox{variance} \uparrow
$$
## Choosing $\lambda$
There are several approaches that are commonly used to identify a value of $\lambda$, including:
- Scientific knowledge that guides the acceptable value of $\operatorname{df}_\lambda$
- Cross-validation or some other prediction quality measure
- Model selection measures, such as Akaike information criterion (AIC) or Mallows $C_p$
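As a small sketch of the cross-validation option from the list above, `smooth.spline()` chooses $\lambda$ by generalized cross-validation by default (`cv = FALSE`) or by ordinary leave-one-out cross-validation with `cv = TRUE`; the simulated data are illustrative.
```{r}
set.seed(202)
xx <- sort(runif(100, 0, 10))
yy <- 4 * log(xx + 1) + sin(xx) + rnorm(100)
fit_gcv   <- smooth.spline(xx, yy, cv = FALSE)  # generalized cross-validation
fit_loocv <- smooth.spline(xx, yy, cv = TRUE)   # leave-one-out cross-validation
c(gcv_df = fit_gcv$df, loocv_df = fit_loocv$df)
```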
## Smoothers and Spline Models
We investigated one type of nonparametric regression model here, the NCS. However, in general there are many such "smoother" methods available in the simple nonparametric regression scenario.
Splines are particularly popular since they are constructed by piecing together polynomials and are therefore usually tractable to compute and analyze.
## Smoothers in R
There are several functions and packages available in R for computing smoothers and tuning smoothness parameters. Examples include:
- `splines` library
- `smooth.spline()`
- `loess()`
- `lowess()`
## Example
```{r, echo=FALSE}
set.seed(777)
x <- seq(1, 10, length.out=25)
n <- length(x)
f <- 4*log(x) + sin(x)
y <- f + rnorm(n)
plot(x, y, pch=19, ylim=c(0, 12))
lines(x, f, lty=3, lw=2)
```
```{r}
y2 <- smooth.spline(x=x, y=y, df=2)
y5 <- smooth.spline(x=x, y=y, df=5)
y25 <- smooth.spline(x=x, y=y, df=25)
ycv <- smooth.spline(x=x, y=y)
ycv
```
```{r, echo=FALSE}
plot(x, y, pch=19, ylim=c(0, 12))
lines(x, f, lty=3, lw=2)
lines(y2, col="red", lw=2)
lines(predict(y5, x=seq(1,10,0.01)), col="purple", lw=2)
lines(ycv, col="blue", lw=2)
lines(predict(y25, x=seq(1,10,0.01)), col="grey40", lw=2)
```
# Generalized Additive Models
## Ordinary Least Squares
Recall that OLS estimates the model
$$
\begin{aligned}
Y_i & = \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip} + E_i \\
& = \bX_i \bb + E_i
\end{aligned}
$$
where $\E[\bE | \bX] = \bO$ and $\Cov(\bE | \bX) = \sigma^2 \bI$.
## Additive Models
The **additive model** (which could also be called "ordinary nonparametric additive regression") is of the form
$$
\begin{aligned}
Y_i & = s_1(X_{i1}) + s_2(X_{i2}) + \ldots + s_p(X_{ip}) + E_i \\
& = \sum_{j=1}^p s_j(X_{ij}) + E_i
\end{aligned}
$$
where the $s_j(\cdot)$ for $j = 1, \ldots, p$ are a set of nonparametric (or flexible) functions. Again, we assume that $\E[\bE | \bX] = \bO$ and $\Cov(\bE | \bX) = \sigma^2 \bI$.
## Backfitting
The additive model can be fit through a technique called **backfitting**.
1. Initialize $s^{(0)}_j(x)$ for $j = 1, \ldots, p$.
2. For $t=1, 2, \ldots$, fit $s_j^{(t)}(x)$ on response variable $$y_i - \sum_{k \not= j} s_k^{(t-1)}(x_{ik}).$$
3. Repeat until convergence.
Note that some extra steps have to be taken to deal with the intercept.
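Below is a hedged sketch of backfitting with two explanatory variables, using `smooth.spline()` as the single-variable smoother and handling the intercept by centering; the simulated data, number of iterations, and variable names are all illustrative.
```{r}
set.seed(303)
n   <- 200
x1b <- runif(n, 0, 10)
x2b <- runif(n, -3, 3)
yb  <- 2 + 3 * log(x1b + 1) + sin(x2b) + rnorm(n)

alpha <- mean(yb)                  # intercept handled by centering
s1 <- rep(0, n); s2 <- rep(0, n)   # step 1: initialize the smoothers at zero
for (t in 1:20) {                  # step 2: cycle until (approximate) convergence
  # update each smoother on the response with the other components removed
  s1 <- predict(smooth.spline(x1b, yb - alpha - s2), x = x1b)$y
  s1 <- s1 - mean(s1)              # keep each fitted smoother centered
  s2 <- predict(smooth.spline(x2b, yb - alpha - s1), x = x2b)$y
  s2 <- s2 - mean(s2)
}
plot(x1b, s1, pch = 19, cex = 0.5) # estimated s_1, up to centering
```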
## GAM Definition
$Y | \bX$ is distributed according to an exponential family distribution. The extension of additive models to this family of response variable is called **generalized additive models** (GAMs). The model is of the form
$$
g\left(\E[Y_i | \bX_i]\right) = \sum_{j=1}^p s_j(X_{ij})
$$
where $g(\cdot)$ is the link function and the $s_j(\cdot)$ are flexible and/or nonparametric functions.
## Overview of Fitting GAMs
Fitting GAMs involves putting together the following three tools:
1. We know how to fit a GLM via IRLS
2. We know how to fit a smoother of a single explanatory variable via a least squares solution, as seen for the NCS
3. We know how to combine additive smoothers by backfitting
## GAMs in R
Three common ways to fit GAMs in R:
1. Utilize `glm()` on explanatory variables constructed from `ns()` or `bs()`
2. The `gam` library
3. The `mgcv` library
## Example
```{r}
set.seed(508)
x1 <- seq(1, 10, length.out=50)
n <- length(x1)
x2 <- rnorm(n)
f <- 4*log(x1) + sin(x1) - 7 + 0.5*x2
p <- exp(f)/(1+exp(f))
summary(p)
y <- rbinom(n, size=1, prob=p)
mean(y)
df <- data.frame(x1=x1, x2=x2, y=y)
```
Here, we use the `gam()` function from the `mgcv` library. It automates choosing the smoothness of the splines.
```{r, message=FALSE}
library(mgcv)
mygam <- gam(y ~ s(x1) + s(x2), family = binomial(), data=df)
library(broom)
tidy(mygam)
```
```{r}
summary(mygam)
```
True probabilities vs. estimated probabilities.
```{r}
plot(p, mygam$fitted.values, pch=19); abline(0,1)
```
Smoother estimated for `x1`.
```{r}
plot(mygam, select=1)
```
Smoother estimated for `x2`.
```{r}
plot(mygam, select=2)
```
Here, we use the `glm()` function and include as an explanatory variable a NCS built from the `ns()` function from the `splines` library. We include a `df` argument in the `ns()` call.
```{r, message=FALSE}
library(splines)
myglm <- glm(y ~ ns(x1, df=2) + x2, family = binomial(), data=df)
tidy(myglm)
```
The spline basis evaluated at `x1` values.
```{r}
ns(x1, df=2)
```
Plot of basis function values vs `x1`.
```{r, echo=FALSE}
plot(x1, ns(x1,df=2)[,2], type="l", col="blue", ylab="basis function")
lines(x1, ns(x1,df=2)[,1], col="red")
```
```{r}
summary(myglm)
```
```{r}
anova(myglm, test="LRT")
```
True probabilities vs. estimated probabilities.
```{r}
plot(p, myglm$fitted.values, pch=19); abline(0,1)
```
# Bootstrap for Statistical Models
## Homoskedastic Models
Let's first discuss how one can utilize the bootstrap on any of the three homoskedastic models:
- Simple linear regression
- Ordinary least squares
- Additive models
## Residuals
In each of these scenarios we sample data $(\bX_1, Y_1), (\bX_2, Y_2), \ldots, (\bX_n, Y_n)$. Let's suppose we calculate fitted values $\hat{Y}_i$ and that they are unbiased:
$$
\E[\hat{Y}_i | \bX] = \E[Y_i | \bX].
$$
We can calculate residuals $\hat{E}_i = Y_i - \hat{Y}_i$ for $i=1, 2, \ldots, n$.
## Studentized Residuals
One complication is that the residuals are correlated and have unequal variances. For example, in OLS we showed that
$$
\Cov(\hat{\bE}) = \sigma^2 (\bI - \bP)
$$
where $\bP = \bX (\bX^T \bX)^{-1} \bX^T$.
To correct for this induced heteroskedasticity, we studentize the residuals by calculating
$$
R_i = \frac{\hat{E}_i}{\sqrt{1-P_{ii}}}
$$
which gives $\V(R_i) = \sigma^2$ for each $i$.
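In R, the diagonal entries $P_{ii}$ of the projection matrix are available via `hatvalues()`, so these studentized residuals can be computed directly; the built-in `cars` data set below is used only for illustration (note that `rstandard()` additionally divides by the estimate of $\sigma$).
```{r}
fit_ols <- lm(dist ~ speed, data = cars)
e <- residuals(fit_ols)
h <- hatvalues(fit_ols)            # diagonal of P = X (X^T X)^{-1} X^T
r <- e / sqrt(1 - h)               # studentized residuals R_i
head(cbind(residual = e, studentized = r))
```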
## Confidence Intervals
The following is a bootstrap procedure for calculating a confidence interval on some statistic $\hat{\theta}$ calculated from a homoskedastic model fit. An example is $\hat{\beta}_j$ in an OLS.
1. Fit the model to obtain fitted values $\hat{Y}_i$, studentized residuals $R_i$, and the statistic of interest $\hat{\theta}$.
For $b = 1, 2, \ldots, B$.
2. Sample $n$ observations with replacement from $\{R_i\}_{i=1}^n$ to obtain bootstrap residuals $R_1^{*}, R_2^{*}, \ldots, R_n^{*}$.
3. Form new response variables $Y_i^{*} = \hat{Y}_i + R_i^{*}$.
4. Fit the model to obtain $\hat{Y}^{*}_i$ and all other fitted parameters.
5. Calculate statistic of interest $\hat{\theta}^{*(b)}$.
The bootstrap statistics $\hat{\theta}^{*(1)}, \hat{\theta}^{*(2)}, \ldots, \hat{\theta}^{*(B)}$ are then utilized through one of the techniques discussed earlier (percentile, pivotal, studentized pivotal) to calculate a bootstrap CI.
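Below is a hedged sketch of this residual bootstrap for a single OLS coefficient, reported with a percentile interval; the `cars` data set, the number of bootstrap samples, and the choice of coefficient are illustrative.
```{r}
set.seed(404)
fit_ols <- lm(dist ~ speed, data = cars)                 # step 1: fit the model
yhat <- fitted(fit_ols)
r <- residuals(fit_ols) / sqrt(1 - hatvalues(fit_ols))   # studentized residuals
r <- r - mean(r)                                         # center the residuals
boot_df <- cars
theta_star <- replicate(2000, {
  boot_df$dist <- yhat + sample(r, replace = TRUE)       # steps 2-3: new responses
  coef(lm(dist ~ speed, data = boot_df))["speed"]        # steps 4-5: refit, extract
})
quantile(theta_star, c(0.025, 0.975))                    # percentile bootstrap CI
confint(fit_ols)["speed", ]                              # t-based CI, for comparison
```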
## Hypothesis Testing