# Linear Models
```{r setup, include = FALSE}
library(tidyverse)
library(emmeans)
library(broom)
library(knitr)
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
df_trial <-
read_csv(
file = "data/linear/drug_trial.csv",
col_types = cols(drug = col_factor(),
pre = col_double(),
post = col_double(),
sex = col_factor()))
df_disease <-
read_csv(
file = "data/linear/disease.csv",
col_types = cols(drug = col_factor(),
disease = col_factor(),
y = col_double()))
```
## Topic 1 (BV)
### Getting Started {-}
## Sums of Squares Types
### Getting Started {-}
To demonstrate the various types of sums of squares, we'll use a data frame called `df_disease`, taken from the SAS documentation (__reference__). The first few rows and a summary of the data are shown below.
```{r, echo=FALSE}
df_disease %>% head(3)
summary(df_disease)
```
### The Model {-}
For this example, we're testing for a significant difference in `y` across the levels of `drug` and `disease` using ANOVA. In R, we fit the model with `lm()` and then use `broom::glance()` and `broom::tidy()` to view the results in table form.
```{r}
lm_model <- lm(y ~ drug + disease + drug*disease, df_disease)
```
The `glance` function gives us a summary of the model diagnostic values.
```{r}
lm_model %>% glance()
```
The `tidy` function gives a summary of the model results.
```{r}
lm_model %>% tidy()
```
### The Results {-}
You'll see that R prints individual results for each level of the drug and disease interaction. We can get the combined F table in R by calling the `anova()` function on the model object.
```{r}
lm_model %>%
anova() %>%
tidy() %>%
kable()
```
And with some extra work, we can get a `Total` row to match the F table output by SAS.
```{r}
lm_model %>%
anova() %>%
tidy() %>%
add_row(term = "Total", df = sum(.$df), sumsq = sum(.$sumsq)) %>%
kable()
```
Comparing this to the output in SAS, we see the following.
```{r, eval=FALSE}
proc glm;
class drug disease;
model y=drug disease drug*disease / ss1 ss2 ss3 ss4;
run;
```
```{r echo=FALSE, fig.align='center', out.width="90%"}
knitr::include_graphics("images/linear/sas-f-table.png")
```
### Sums of Squares Tables {-}
In SAS, it is easy to find the tables for the various types of sums of squares calculations. Unfortunately, it is not easy to match this output using functions from base R. However, the `rstatix` package offers a way to produce these sums of squares tables. Note that there does not appear to be a `Type IV SS` equivalent in R.
#### Type I {-}
In R,
```{r message=FALSE, warning=FALSE}
df_disease %>%
rstatix::anova_test(
y ~ drug + disease + drug*disease,
type = 1,
detailed = TRUE) %>%
rstatix::get_anova_table() %>%
kable()
```
And in SAS,
```{r, echo=FALSE, fig.align='center', out.width="75%"}
knitr::include_graphics("images/linear/sas-ss-type-1.png")
```
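Type I sums of squares are sequential: each term is adjusted only for the terms listed before it, so in an unbalanced design the table depends on the order of the terms in the formula. The sketch below illustrates this with base R's `anova()`, which reports sequential sums of squares for `lm` fits. The data frame `toy_seq` and its columns are hypothetical and are not the disease data used above.

```{r}
# Hypothetical unbalanced two-factor data (illustration only)
set.seed(1)
toy_seq <- data.frame(
  a = factor(rep(c("a1", "a2"), times = c(8, 4))),
  b = factor(c(rep("b1", 6), rep("b2", 2), rep("b1", 1), rep("b2", 3)))
)
toy_seq$y <- rnorm(nrow(toy_seq)) +
  as.integer(toy_seq$a) + 2 * as.integer(toy_seq$b)

# Same model, terms entered in a different order
tab_a_first <- anova(lm(y ~ a + b, data = toy_seq))
tab_b_first <- anova(lm(y ~ b + a, data = toy_seq))

# Sequential SS for `a`: unadjusted when entered first,
# adjusted for `b` when entered second
c(a_first  = tab_a_first["a", "Sum Sq"],
  a_second = tab_b_first["a", "Sum Sq"])
```

The two values differ, while the rows of each table still add up to the same total sum of squares. This is why the order of terms should match the SAS `model` statement when comparing Type I tables.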
#### Type II {-}
In R,
```{r message=FALSE, warning=FALSE}
df_disease %>%
rstatix::anova_test(
y ~ drug + disease + drug*disease,
type = 2,
detailed = TRUE) %>%
rstatix::get_anova_table() %>%
kable()
```
And in SAS,
```{r, echo=FALSE, fig.align='center', out.width="75%"}
knitr::include_graphics("images/linear/sas-ss-type-2.png")
```
#### Type III {-}
In R,
```{r message=FALSE, warning=FALSE}
df_disease %>%
rstatix::anova_test(
y ~ drug + disease + drug*disease,
type = 3,
detailed = TRUE) %>%
rstatix::get_anova_table() %>%
kable()
```
And in SAS,
```{r, echo=FALSE, fig.align='center', out.width="75%"}
knitr::include_graphics("images/linear/sas-ss-type-3.png")
```
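For readers who want to stay in base R, Type III-style tests are commonly obtained by fitting the model with sum-to-zero contrasts and then dropping each term's columns from the full model matrix, which `drop1()` does when it is given an explicit scope. The following is a minimal sketch on hypothetical data (`toy_t3` is not the disease data), and we have not checked it against the SAS table above.

```{r}
# Hypothetical unbalanced two-factor data (illustration only)
set.seed(2)
toy_t3 <- data.frame(
  a = factor(rep(c("a1", "a2"), times = c(8, 4))),
  b = factor(c(rep("b1", 6), rep("b2", 2), rep("b1", 1), rep("b2", 3)))
)
toy_t3$y <- rnorm(nrow(toy_t3)) + as.integer(toy_t3$a) * as.integer(toy_t3$b)

# Sum-to-zero coding is needed for the main-effect rows to be meaningful
fit_t3 <- lm(y ~ a * b, data = toy_t3,
             contrasts = list(a = contr.sum, b = contr.sum))

# Drop each term in turn from the full model, ignoring marginality
tab_t3 <- drop1(fit_t3, scope = . ~ ., test = "F")
tab_t3
```

The interaction row agrees with the last row of `anova(fit_t3)`, since the highest-order term is adjusted for everything else under every type. The main-effect rows are the ones that change with the coding, which is why the `contrasts` argument matters here.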
#### Type IV {-}
In SAS,
```{r, echo=FALSE, fig.align='center', out.width="75%"}
knitr::include_graphics("images/linear/sas-ss-type-4.png")
```
In R there is no equivalent operation to the `Type IV` sums of squares calculation in SAS.
## Contrasts
### Getting Started {-}
To demonstrate contrasts, we'll create a data frame called `df_trial`. We see that the `drug` variable has three levels, _A_, _C_ and _E_.
```{r, echo=FALSE}
levels_drug <- levels(df_trial$drug)
summary(df_trial)
glimpse(df_trial)
```
In order to work with these levels as contrasts, we can use one of the pre-existing R functions to create the contrast matrix for this factor variable. `contr.treatment` is the default definition that R uses for unordered factors, while `contr.SAS` is the definition that SAS uses by default.
```{r}
contr.treatment(levels_drug)
contr.SAS(levels_drug)
```
R also has the following definitions ready for use: `contr.sum`, `contr.poly`, and `contr.helmert`.
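As a quick illustration, the chunk below prints these three codings for a three-level factor. The vector `toy_levels` is a hypothetical stand-in for the drug levels named above, and the order of the levels in `df_trial` may differ.

```{r}
# Hypothetical stand-in for the three drug levels
toy_levels <- c("A", "C", "E")

contr.sum(toy_levels)      # deviation coding: each column sums to zero
contr.helmert(toy_levels)  # 2nd level vs 1st, 3rd level vs mean of first two
contr.poly(3)              # orthogonal polynomial (linear, quadratic) coding
```

Note that the Helmert columns, `(-1, 1, 0)` and `(-1, -1, 2)`, are the same weights used in the SAS `estimate` statements later in this chapter.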
### Using Contrasts in R {-}
There are many ways to work with contrasts in R, but here we're only going to focus on defining contrasts in the modeling function. In this example, we fit the same linear model twice, predicting `post` from `pre` and `drug`. The only difference between the two fits is the `contrasts` argument: in one case we use the R default, and in the other we use the SAS default.
```{r}
model_trt <-
lm(post ~ pre + drug, data = df_trial,
contrasts = list(drug = contr.treatment))
model_sas <-
lm(post ~ pre + drug, data = df_trial,
contrasts = list(drug = contr.SAS))
```
Comparing the output for these two models, we see that they differ in which drug level is used as the reference baseline.
```{r}
tidy(model_trt)
tidy(model_sas)
```
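The two codings describe the same fitted model and only move the reference level. A minimal sketch on hypothetical data (`toy_ref` reuses the variable names from the text but is simulated) makes this concrete. Coefficients are picked by position, an assumption that relies on the term order `(Intercept)`, `pre`, then the two `drug` columns.

```{r}
# Hypothetical toy data with the same variable names as the text
set.seed(3)
toy_ref <- data.frame(
  drug = factor(rep(c("A", "C", "E"), each = 5)),
  pre  = rnorm(15, mean = 10)
)
toy_ref$post <- toy_ref$pre + as.integer(toy_ref$drug) + rnorm(15)

fit_trt <- lm(post ~ pre + drug, data = toy_ref,
              contrasts = list(drug = contr.treatment))
fit_sas <- lm(post ~ pre + drug, data = toy_ref,
              contrasts = list(drug = contr.SAS))

# Same fit, different parameterisation
all.equal(fitted(fit_trt), fitted(fit_sas))

# Coefficients by position: intercept, pre, first drug column, second drug column
b_trt <- unname(coef(fit_trt))
b_sas <- unname(coef(fit_sas))

# The SAS-style intercept (baseline E) equals the treatment-style
# intercept (baseline A) plus the E vs A effect
c(sas_intercept = b_sas[1], trt_shifted = b_trt[1] + b_trt[4])
```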
We can define a custom contrast as well. In this case, we are comparing _A_ to _E_, where _E_ is the baseline group. As is typical of contrasts, the values must sum to 0, where negative numbers represent the baseline groups.
```{r, eval=FALSE}
lm(post ~ pre + drug, data = df_trial,
   contrasts = list(drug = c("A" = 1, "C" = 0, "E" = -1)))
```
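A factor with three levels needs two contrast columns, so when only one vector is supplied, R keeps it as the first column and fills in the remaining column itself. The short sketch below shows the resulting coding matrix for a hypothetical factor `toy_drug`.

```{r}
# Hypothetical factor with the three drug levels named in the text
toy_drug <- factor(c("A", "C", "E"))

# Assign a single contrast vector (A vs E); R completes the coding matrix
contrasts(toy_drug) <- c(1, 0, -1)
contrasts(toy_drug)
```

The first column is the vector we supplied. The second is generated by R to be orthogonal to both the first column and the constant term, so its coefficient has no direct interpretation unless we define it ourselves.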
It is possible to combine contrasts using `cbind`. Notice here that contrast definitions don't require level names to be assigned, as long as the order of the levels is maintained.
```{r, eval=FALSE}
contrast_1 <- c(1, 0, -1)
contrast_2 <- c(1, -2, 1)
contrast_c <- cbind(contrast_1, contrast_2)
lm(post ~ pre + drug, data = df_trial,
   contrasts = list(drug = contrast_c))
```
### Easy Contrasts in R with `emmeans` {-}
The `emmeans` package makes working with contrasts much easier. In fact, this is the method that we recommend when trying to match contrast output between SAS and R.
We begin by defining some models.
```{r}
model_lm <- lm(post ~ pre + drug + sex, data = df_trial)
model_av <- aov(post ~ pre + drug + sex, data = df_trial)
model_gm <- glm(post ~ pre + drug + sex, data = df_trial)
```
Then we convert these models to `emmeans` models. In the `emmeans` function we can specify which variables we want to display estimated marginal means for.
```{r, eval=FALSE}
model_lm %>% emmeans(specs = "drug")
model_av %>% emmeans(specs = "drug", by = "sex")
model_gm %>% emmeans(specs = ~ drug | sex)
```
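To give a sense of what an estimated marginal mean is, the chunk below computes one by hand in base R: predictions are made on a grid of all factor combinations with the covariate held at its mean, and then averaged with equal weights over the factor we are not interested in. This is a rough sketch of the idea on hypothetical data (`toy_emm`), not the `emmeans` implementation.

```{r}
# Hypothetical toy data with the same variable names as the text
set.seed(4)
toy_emm <- data.frame(
  drug = factor(rep(c("A", "C", "E"), each = 6)),
  sex  = factor(rep(c("F", "M"), times = 9)),
  pre  = rnorm(18, mean = 10)
)
toy_emm$post <- toy_emm$pre + as.integer(toy_emm$drug) + rnorm(18)

fit_emm <- lm(post ~ pre + drug + sex, data = toy_emm)

# Reference grid: every drug x sex combination, covariate held at its mean
toy_grid <- expand.grid(
  drug = levels(toy_emm$drug),
  sex  = levels(toy_emm$sex),
  pre  = mean(toy_emm$pre)
)
toy_grid$pred <- predict(fit_emm, newdata = toy_grid)

# Average the predictions over sex with equal weights, one value per drug
emm_by_hand <- tapply(toy_grid$pred, toy_grid$drug, mean)
emm_by_hand
```

Because this model has no interactions, the difference between two of these means is just the corresponding `drug` coefficient from the fit.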
Or, we can define some common contrasts with the `contrast` function.
```{r, eval=FALSE}
# All Pairwise Comparisons
model_lm %>%
emmeans("drug") %>%
contrast(method = "pairwise")
# Treatment v Control Comparison
model_gm %>%
emmeans("drug") %>%
contrast(method = "trt.vs.ctrl")
```
We can also control the reference group, or reverse the contrast order using the arguments in the `contrast` function.
```{r, eval=FALSE}
model_lm %>%
emmeans("drug") %>%
contrast(method = "trt.vs.ctrl", ref = 2)
model_lm %>%
emmeans("drug") %>%
  contrast(method = "trt.vs.ctrl", reverse = TRUE)
```
Custom contrasts can be defined as well.
```{r, eval=FALSE}
model_lm %>%
emmeans("drug") %>%
contrast(method = list(
"A v E" = c("A" = 1, "C" = 0, "E" = -1),
"AE v C" = c(1, -2, 1),
"A" = c(1, 0, 0)
))
```
### Matching Contrasts: R and SAS {-}
We recommend using the `emmeans` package when attempting to match contrasts between R and SAS. In SAS, all contrasts must be manually defined, whereas in R, we have many ways to use pre-existing contrast definitions. The `emmeans` package simplifies this process and provides syntax that is similar to that of SAS.
This is how we would define a contrast in SAS.
```{r, eval=FALSE}
# In SAS
proc glm data=work.mycsv;
class drug;
model post = drug pre / solution;
estimate 'C vs A' drug -1 1 0;
estimate 'E vs CA' drug -1 -1 2;
run;
```
And this is how we would define the same contrast in R, using the `emmeans` package.
```{r, eval=FALSE}
lm(formula = post ~ pre + drug, data = df_trial) %>%
emmeans("drug") %>%
contrast(method = list(
"C vs A" = c(-1, 1, 0),
"E vs CA" = c(-1, -1, 2)
))
```
Note, however, that in some cases the scale of the parameter estimates differs between SAS and R, even though the test statistics and p-values are identical. In these cases, we can adjust the SAS code to include a divisor. As far as we can tell, this difference only occurs when using the predefined base R contrast methods like `contr.helmert`.
```{r, eval=FALSE}
proc glm data=work.mycsv;
class drug;
model post = drug pre / solution;
estimate 'C vs A' drug -1 1 0 / divisor = 2;
estimate 'E vs CA' drug -1 -1 2 / divisor = 6;
run;
```
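One plausible reading of the divisors 2 and 6 is that they come from the Helmert coding itself. When `contr.helmert` is used as the coding matrix in `lm()`, the coefficients are the contrast weights applied to the group means and then divided by 2 and 6. The sketch below checks this on hypothetical one-way data (`toy_h`). It leaves out the `pre` covariate for simplicity, so the group means here are raw rather than adjusted means.

```{r}
# Hypothetical one-way toy data with three groups
set.seed(5)
toy_h <- data.frame(
  drug = factor(rep(c("A", "C", "E"), each = 4)),
  post = rnorm(12, mean = rep(c(5, 7, 10), each = 4))
)

fit_h <- lm(post ~ drug, data = toy_h,
            contrasts = list(drug = contr.helmert))

# Group means for A, C, E
m_h <- tapply(toy_h$post, toy_h$drug, mean)

# Helmert coefficients vs the contrast weights divided by 2 and 6
coef_h <- unname(coef(fit_h))[2:3]
hand_h <- c((-m_h[["A"]] + m_h[["C"]]) / 2,
            (-m_h[["A"]] - m_h[["C"]] + 2 * m_h[["E"]]) / 6)
rbind(coefficient = coef_h, by_hand = hand_h)
```

The rescaling changes the estimate and its standard error by the same factor, which is consistent with the test statistics and p-values being unaffected.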