forked from abecode/631-rtopics
lecture-13.Rmd
---
title: "Lecture 13"
---
### Lecture handout:
chp9-handout.pdf
### Textbook:
Chapter 9: Multiple and Logistic Regression
### R Topics
#### generating a regression problem
```{r}
# first generate x, explanatory variable
x <- rnorm(100, mean=50, sd=25)
# set population slope and intercept
B0 <- 100
B1 <- -10
# generate error/residuals
err <- rnorm(100, mean=0, sd=100)
# finally generate y, response variable
y <- B0 + B1*x + err
# now find the sample estimates
R <- cor(x,y)
b1 <- cor(x,y)*sd(y)/sd(x)
# or
b1 <- lm(y~x)$coefficients['x']
# solve for intercept given slope and mean(x), mean(y)
b0 <- -b1*mean(x)+mean(y)
# or
b0 <- lm(y~x)$coefficients['(Intercept)']
# predicted y values
yhat <- b0 + b1*x
# base R plot
plot(x,y)
lines(x,yhat)
# ggplot2 plot
library(ggplot2)
ggplot(data.frame(x=x,y=y,yhat=yhat)) + geom_point(aes(x=x,y=y), alpha = .2) + geom_line(aes(x=x,y=yhat))
ggplot(data.frame(x=x,y=y,yhat=yhat,err=err)) + geom_point(aes(x=x,y=y,alpha=abs(err))) + geom_line(aes(x=x,y=yhat))
ggplot(data.frame(x=x,y=y,yhat=yhat,err=err)) + geom_point(aes(x=x,y=y,alpha=abs(err))) + geom_line(aes(x=x,y=yhat)) + geom_segment(aes(x=x, y=y, xend=x, yend=yhat, alpha=abs(err) ))
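# examine SStot, SSreg, SSerr
# note: var() returns each sum of squares divided by (n-1);
# the common factor cancels in the sums and ratios below
SStot <- var(y)
SSreg <- var(yhat)
SSerr <- var(y-yhat)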
SStot == SSreg + SSerr # not exact
all.equal(SStot, SSreg + SSerr) # TRUE
# R^2 equalities
all.equal(R^2, 1 - SSerr/SStot)
all.equal(R^2, SSreg/SStot)
```
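As a cross-check on the R^2 equalities above, the same quantity can be read from the fitted model object. This is only a minimal sketch: it regenerates data the same way as the chunk above, and the seed value and the names `fit` and `r2_lm` are assumptions made for the illustration, not part of the lecture.

```{r}
# regenerate a regression problem as above (the seed is an arbitrary choice)
set.seed(631)
x <- rnorm(100, mean = 50, sd = 25)
y <- 100 - 10 * x + rnorm(100, mean = 0, sd = 100)

fit <- lm(y ~ x)
r2_lm <- summary(fit)$r.squared   # R^2 as reported by lm

# actual sums of squares (no division by n - 1)
SStot <- sum((y - mean(y))^2)
SSreg <- sum((fitted(fit) - mean(y))^2)
SSerr <- sum(resid(fit)^2)

all.equal(SStot, SSreg + SSerr)
all.equal(r2_lm, cor(x, y)^2)
all.equal(r2_lm, 1 - SSerr / SStot)
```

Using `fitted()` and `resid()` avoids recomputing `yhat` by hand, and working with undivided sums of squares shows that the decomposition does not depend on the 1/(n-1) factor.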
#### Extra discussion on correlation and covariance
```{r}
all.equal(cor(x,y), cov(x,y)/sd(x)/sd(y))
```
Recall the variance:

$$var\left(x\right) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$
we can rewrite this using expectations (E[]):
$$
\begin{aligned}
var\left(x\right) &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i-E[x])^2 \\
&= E\left[(x_i-E[x])^2\right] \\
&= E\left[ x^2 -xE[x] -E[x]x + E[x]^2 \right] \\
&= E[x^2] -E[x]E[x] -E[x]E[x] + E[x]^2 \\
&= E[x^2] -E[x]^2
\end{aligned}
$$
```{r}
var(x)
sum(x^2 - mean(x)^2)/(length(x)-1)
```
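The last line of the derivation, E[x^2] - E[x]^2, treats E[] as a plain average (dividing by n), while R's `var()` divides by n-1. A small sketch of how the two relate, with made-up data and a hypothetical name `pop_var`:

```{r}
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- rnorm(100, mean = 50, sd = 25)
n <- length(x)

# E[x^2] - E[x]^2 with E[] taken as the sample mean (divides by n)
pop_var <- mean(x^2) - mean(x)^2

# rescale from 1/n to 1/(n-1) to match var()
all.equal(var(x), pop_var * n / (n - 1))
```

For large n the factor n/(n-1) is close to 1, so the two versions nearly coincide.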
Covariance
$$cov\left(x,y\right) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$
we can rewrite this using expectations (E[]):
$$
\begin{aligned}
cov\left(x,y\right) &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i-E[x])(y_i-E[y]) \\
&= E\left[(x_i-E[x])(y_i-E[y])\right] \\
&= E\left[ xy -xE[y] -E[x]y + E[x]E[y] \right] \\
&= E[xy] -E[x]E[y] -E[x]E[y] + E[x]E[y] \\
&= E[xy] -E[x]E[y]
\end{aligned}
$$
```{r}
cov(x,y)
sum(x*y - mean(x)*mean(y))/(length(x)-1)
# covariance and correlation as matrices
X <- matrix(cbind(x,y),ncol=2)
# covariance
cov(x,y)
cov(X)
var(X)
N <- dim(X)[1]
Xmean <- matrix(rep(colMeans(X),N),nrow=N, byrow=T)
Xc <- X - Xmean # a "centered" version of X
S <- t(Xc) %*% (Xc) /(N-1) # covariance via multiplying centered matrix w/ itself
S
# correlation
cor(x,y)
cor(X)
Xsd <- matrix(rep(sqrt(diag(S)),N),nrow=N, byrow=T) # the sd of each column of X repeated N times
Xs <- Xc/Xsd # a "scaled" and centered version of X
(t(Xs) %*% Xs) / (N-1)
```
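The manual centering and scaling above can also be written with base R helpers. The following is a sketch under the assumption that `scale()`, `crossprod()`, and `cov2cor()` are acceptable substitutes for the explicit matrix construction; the data are regenerated with an arbitrary seed so the chunk runs on its own.

```{r}
set.seed(2)                                    # arbitrary seed
x <- rnorm(100, mean = 50, sd = 25)
y <- 100 - 10 * x + rnorm(100, mean = 0, sd = 100)
X <- cbind(x, y)
N <- nrow(X)

Xc <- scale(X, center = TRUE, scale = FALSE)   # subtract column means
S  <- crossprod(Xc) / (N - 1)                  # same as t(Xc) %*% Xc / (N - 1)
all.equal(S, cov(X), check.attributes = FALSE)

Xs <- scale(X)                                 # center, then divide by column sd
R  <- crossprod(Xs) / (N - 1)
all.equal(R, cor(X), check.attributes = FALSE)

# a covariance matrix can also be converted to a correlation matrix directly
all.equal(cov2cor(S), cor(X), check.attributes = FALSE)
```

`scale()` replaces the repeated-row matrices `Xmean` and `Xsd`, which keeps the code shorter at the cost of hiding the arithmetic the original chunk spells out.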
#### version control and projects
* saving your workspace as various types of projects (project, package, shiny webapp, various R+cpp formats, and RMarkdown website) via File->New Project
* loading experimental (development) versions of packages from GitHub with `devtools::install_github("r-lib/devtools")` instead of installing the released version with `install.packages("devtools")`
#### pairs plot
* http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
* https://www.r-bloggers.com/scatterplot-matrices-pair-plots-with-cdata-and-ggplot2/
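A minimal base R sketch of a pairs plot (scatter-plot matrix), in the spirit of the first link. The choice of the built-in `iris` data and of the plotting options is an assumption for illustration only.

```{r}
# scatter-plot matrix of the four numeric columns of iris,
# with points colored by species (integer codes 1, 2, 3)
num_cols <- iris[, 1:4]
pairs(num_cols, col = as.integer(iris$Species), pch = 19)
```

Each off-diagonal panel plots one pair of variables, so the matrix gives a quick visual counterpart to `cor(num_cols)`.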