```{r, include = FALSE}
source("global_stuff.R")
```
# WYOR
## Reading
Chapter 22 from @abdiExperimentalDesignAnalysis2009.
## Overview
WYOR refers to "Writing Your Own statistical Recipes". A goal of this course has been to explore principles of statistical analysis to the point where you would be able to use those general principles to craft statistical analyses tailored to your designs of interest. In this final lab, the practical section discusses general aspects of R formula syntax for declaring ANOVA and linear regression models, which can help you analyze many designs similar to those discussed in class over the last two semesters. The conceptual section is a final example of a simulated statistical analysis, along with a few parting thoughts.
## Practical I: R formula
Throughout this course we have used the `lm()` and `aov()` functions to conduct linear regressions and ANOVAs. For specific designs, we also demonstrated how to use the formula syntax to declare a design of interest. However, I did not present a more general description of the formula syntax and how it can be used to declare many different kinds of designs. This practical section provides a quick look at the formula syntax. As an additional resource, check out this blog post on writing formulas in R: <https://conjugateprior.org/2013/01/formulae-in-r-anova/>.
### Read the formula() help file
It turns out that there is a help file for the formula syntax that is actually pretty helpful. So, make sure you read it.
```{r}
?formula
```
### Formula basics
We normally see a formula inside `aov()` or `lm()`, as follows. The column name of the dependent variable of interest (DV) goes on the left, followed by a tilde (`~`), followed by the independent variable(s) of interest, along with a pointer to your data frame.
```{r, eval = FALSE}
aov(DV ~ IV, data = your_data)
lm(DV ~ IV, data = your_data)
```
Formulas can also be declared outside of these functions. For example, we can assign a formula to a named object.
```{r}
my_formula <- DV ~ IV
```
And, we can see that the class of this object is "formula".
```{r}
class(my_formula)
```
Entering the name of the object alone will print the formula to the console.
```{r}
my_formula
```
There are also helper functions for formula objects that allow you to inspect the individual terms in the formula.
```{r}
terms(my_formula)
```
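Base R also provides `all.vars()`, which extracts the variable names from a formula, and `update()`, which modifies an existing formula (the `.` stands in for the left- or right-hand side of the old formula). For example:
```{r}
# extract the variable names from a formula
all.vars(my_formula)
# add a term to the right-hand side of an existing formula
update(my_formula, . ~ . + B)
```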
If you have assigned a formula to an object, you can then use the object in place of the formula in `aov()` or `lm()`.
```{r}
library(tibble)
some_data <- tibble(DV = rnorm(20, 0, 1),
                    IV = rep(c("A", "B"), each = 10))
summary(aov(DV ~ IV, some_data))
# is the same as
my_formula <- DV ~ IV
summary(aov(my_formula, some_data))
```
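Because a formula is just another R object, one can also be built from a string using `as.formula()`. This is a handy trick for writing your own recipes that construct models programmatically (e.g., pasting together model strings inside a loop). A minimal sketch, using the `some_data` tibble from above:
```{r}
# build a formula from a string
string_formula <- as.formula("DV ~ IV")
class(string_formula)
summary(aov(string_formula, data = some_data))
```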
### Formula operators
There are several formula operators to be aware of, including `~`, `+`, `:`, `*`, `^`, `%in%`, `/`, and `-`.
The tilde `~` operator separates the dependent variable from the "model", or set of independent variables used to account for variation in the dependent variable.
```{r, eval=FALSE}
# examples of DV ~ model
# one factor
DV ~ A
# two factors no interaction
DV ~ A+B
# three factors, all possible interactions
DV ~ A*B*C
```
The plus `+` operator is used to "add" specific terms to the model. For example, if you had one, two, or three independent variables, you could have different models that contain one, two, or three of those variables.
```{r, eval=FALSE}
DV ~ A
DV ~ A+B
DV ~ A+B+C
```
Importantly, the `+` sign adds individual terms only and nothing more. For example, a factorial design with three independent variables (A, B, C) has several two-way interactions and a three-way interaction. However, when using the plus operator, these interaction terms will not be added unless they are explicitly declared.
For example, we can inspect the terms for the formula `DV ~ A+B+C`, and see that there are only three terms, one for each independent variable.
```{r}
attributes(terms(DV ~ A+B+C))$factors
```
Interaction terms can be declared using `:`. For example, `A:B` specifies an interaction between A and B. Using the `+` operator, we can add individual interaction terms declared with the `:` operator.
```{r}
## add the two-way interaction between A and B
DV ~ A + B + C + A:B
attributes(terms(DV ~ A + B + C + A:B))$factors
## add all two-way interactions
DV ~ A + B + C + A:B + A:C + B:C
attributes(terms(DV ~ A + B + C + A:B + A:C + B:C))$factors
## add all interactions
DV ~ A + B + C + A:B + A:C + B:C + A:B:C
attributes(terms(DV ~ A + B + C + A:B + A:C + B:C + A:B:C))$factors
```
The `*` operator signifies the crossing of factors. For example, in a 2x2 design, the two levels of A are fully crossed with the two levels of B. This design has two main effects (A and B), and one interaction term (A:B). The `*` operator is a shortcut to include all of the terms (main effects and interactions) in a crossed design, without having to specify each of them individually.
```{r}
DV ~ A*B
attributes(terms(DV ~ A*B))$factors
# is the same as
DV ~ A + B + A:B
# 3-factor crossed
DV ~ A*B*C
attributes(terms(DV ~ A*B*C))$factors
# is the same as
DV ~ A + B + C + A:B + A:C + B:C + A:B:C
```
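The `^` operator, which appeared in the list of operators above but was not covered, crosses factors up to a specified order (this behavior is documented in `?formula`). For example, `(A+B+C)^2` includes all main effects and all two-way interactions, but leaves out the three-way interaction:
```{r}
# main effects and two-way interactions only
DV ~ (A+B+C)^2
attributes(terms(DV ~ (A+B+C)^2))$factors
# is the same as
DV ~ A + B + C + A:B + A:C + B:C
```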
The `-` operator can subtract terms from the model. For example, let's say you had a design with three crossed factors, but you did not want to include the 3-way interaction:
```{r}
# omit the three-way interaction
DV ~ A*B*C - A:B:C
attributes(terms(DV ~ A*B*C - A:B:C))$factors
# omit the two-way interactions
DV ~ A*B*C - A:B - A:C - B:C
```
The `%in%` operator is used to declare nesting, as in `A %in% B`, where the term on the left is nested within the term on the right. The `/` operator also indicates nesting, but from the other side: `A/B` expands to `A + A:B`, declaring B as nested within A.
```{r}
DV ~ A + A%in%B
attributes(terms(DV ~ A + A%in%B))$factors
DV ~ A + A/B
attributes(terms(DV ~ A + A/B))$factors
```
### Error()
The above formulas apply to between-subjects variables, or "fixed" effects. It is also possible to include random effects; however, `aov()` assumes that the designs are balanced. The `Error()` term is used to "specify error strata" (according to the help file for `aov()`). We have used `Error()` in class for designs with repeated measures. Here are some examples:
#### One random factor
```{r}
DV ~ Error(A)
```
#### One factor, with repeated measures
Note that `subject` refers to the column coding the subject variable in your data frame.
```{r}
DV ~ A + Error(subject)
```
#### Two factors, both repeated measures
```{r}
DV ~ A*B + Error(subject/(A*B))
```
#### Two factors, A between-subjects (fixed), B repeated measures
```{r}
DV ~ A*B + Error(subject/B)
```
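The formulas above are templates rather than runnable examples, but we can verify that one of them works by simulating a small balanced repeated-measures data set. The data below are a minimal sketch; the column names `subject`, `A`, and `DV` are placeholders:
```{r}
library(tibble)
# 10 subjects, each measured in both levels of A
rm_data <- tibble(subject = as.factor(rep(1:10, each = 2)),
                  A = as.factor(rep(c("a1", "a2"), times = 10)),
                  DV = rnorm(20, 0, 1))
# one factor, with repeated measures
summary(aov(DV ~ A + Error(subject), data = rm_data))
```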
### Conclusion...when in doubt, reproduce a textbook example
Specifying formulas in base R for a design of interest can be challenging, especially as designs become more complicated. If you have unbalanced designs with multiple IVs that combine fixed, random, nested, and repeated-measures factors, then you will have more work to do. For example, you may have to learn about R packages for mixed models such as `lme4` or `nlme`, along with the slightly different formula syntax they use. To the extent possible, when using unfamiliar statistics software or packages, it can be very helpful to find a textbook example that you can trust (e.g., a fully worked example analysis for a design of interest), and then attempt to reproduce that example with the software as a way to confirm expected behavior.
## Conceptual I: Simulating statistics
Throughout the course we conducted several statistical simulations, including simulations for power analysis. The simulations had several purposes. First, they give you some experience with general coding in R (e.g., running loops and storing data from the simulations). Second, they concretely illustrate concepts like random sampling and sampling distributions and show how simulated results should converge on analytic procedures over the long run. Last, they are very flexible and can be used to anticipate possible outcomes of experimental designs, as well as to develop inferential models for interpreting those outcomes.
### Basic recipe for simulating designs in R
1. Declare a data frame that represents the structure of your design, down to the smallest detail you are interested in. For example, your data frame could include columns for the dependent variable and independent variables in the design. Each row could be an individual subject mean.
2. Make explicit assumptions about the distributions underlying your measurements. Populate the DV with values sampled from those distributions.
3. A null model assumes that the distribution of your DV does not change across the conditions/levels of the IV. Alternative models assume that the distributions differ between conditions/levels in some way. You get to choose what kind of simulation you are running, so you also specify the form of any differences between the distributions.
4. Once you have populated the data frame with possible data, analyze the simulated data to answer a question of interest. This could involve applying a standard inferential test to generate a test-statistic (e.g., t, F, r), or computing some other statistic of interest (e.g., a mean difference).
5. Repeat the process of randomly generating simulated data and analyzing it to arrive at the test-statistics of interest. Repeat roughly 10,000 times, saving the test-statistics every time, to produce a simulated sampling distribution for each test-statistic.
6. Use the simulated sampling distribution for inference, or for power analysis when planning a design. A single pass through these steps is sketched below.
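For a two-group between-subjects design under a null model, a single pass through steps 1 to 4 might look like the minimal sketch below; the fully looped version of the recipe appears in the next section.
```{r}
library(tibble)
# steps 1-3: declare the design and populate the DV from an assumed
# distribution (a null model: both groups sampled from the same normal)
sim_data <- tibble(subjects = as.factor(1:20),
                   IV = as.factor(rep(c("A", "B"), each = 10)),
                   DV = rnorm(20, 0, 1))
# step 4: analyze the simulated data
summary(aov(DV ~ IV, data = sim_data))
```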
### A conundrum: Concrete vs. abstract test-statistics
Many test-statistics, like z, t, F, and r, are somewhat abstract and opaque. For example, is an F-value of 3 large or small? Does it mean that you should care about a difference between two means? It depends on factors like the degrees of freedom. Alternatively, test-statistics can sometimes be more concrete. For example, if I told you one population had a mean height of 5 ft and another had a mean height of 6 ft, then we would have a mean difference of 1 ft. Because we are familiar with feet as a measure of height, a 1 ft difference is fairly concrete and, relative to an F value, fairly immediate to interpret.
Using simulation techniques, it is possible to evaluate such "concrete" statistics. For example, consider a simple one-factor between-subjects design. The purpose of this design is simply to determine whether there is a difference between the means of group A and group B.
The simulation below conducts such an experiment 10,000 times. It is a simulation of the null hypothesis, so measures are randomly drawn from a normal distribution with mean = 0 and sd = 1. I create simulated sampling distributions for several different test-statistics to illustrate how flexible the simulation procedure is. Then, we look at whether the conclusions based on the different null distributions agree.
```{r}
library(tibble)
library(dplyr)
# declare simulation parameters
N <- 10
effect_size <- 0
null_distribution <- tibble()
iterations <- 10000
# run simulation
for(i in 1:iterations){
  # create a random sample of data
  null_data <- tibble(subjects = as.factor(c(1:(N*2))),
                      IV = as.factor(rep(c("A", "B"), each = N)),
                      DV = c(rnorm(N, 0, 1), rnorm(N, effect_size, 1)))
  # run statistical analyses
  aov_summary <- summary(aov(DV ~ IV, data = null_data))
  SS_IV <- aov_summary[[1]]$`Sum Sq`[1]
  SS_residuals <- aov_summary[[1]]$`Sum Sq`[2]
  SS_total <- SS_IV + SS_residuals
  MS_IV <- aov_summary[[1]]$`Mean Sq`[1]
  MS_residuals <- aov_summary[[1]]$`Mean Sq`[2]
  F_val <- aov_summary[[1]]$`F value`[1]
  means <- null_data %>%
    group_by(IV) %>%
    summarize(meanDV = mean(DV),
              sdDV = sd(DV))
  mean_difference <- means[means$IV == "B",]$meanDV - means[means$IV == "A",]$meanDV
  abs_mean_diff <- abs(mean_difference)
  # the pooled SD is the square root of the mean of the two variances
  cohens_D <- mean_difference / sqrt(sum(means$sdDV^2)/2)
  t <- t.test(DV ~ IV, var.equal = TRUE, data = null_data)$statistic
  abs_t <- abs(t)
  # save all test-statistics
  sim_vals <- tibble(SS_IV,
                     SS_residuals,
                     SS_total,
                     MS_IV,
                     MS_residuals,
                     F_val,
                     mean_difference,
                     abs_mean_diff,
                     cohens_D,
                     t,
                     abs_t)
  # append saved test-statistics to the null-distribution tibble
  null_distribution <- rbind(null_distribution, sim_vals)
}
```
We have now created separate null-distributions for each of the statistics that we saved. Each null-distribution has 10,000 values.
Consider the simulated F-distribution that we created. Let's get the critical value from the simulated distribution, and compare it to the known critical value of F.
```{r}
# write a general function to get critical values from
# a simulated distribution (vector of values)
get_critical_value <- function(x, alpha){
  x_sorted <- sort(abs(x))
  ind <- round(length(x)*(1-alpha))
  return(x_sorted[ind])
}
# simulated F critical value
get_critical_value(null_distribution$F_val, alpha = .05)
# analytical F critical value
qf(.95, 1, 18)
library(ggplot2)
ggplot(null_distribution, aes(x = F_val)) +
  geom_histogram(bins = 100) +
  geom_vline(xintercept = get_critical_value(null_distribution$F_val,
                                             alpha = .05))
```
Now let's look instead at the sampling distribution of the mean difference between group A and B.
```{r}
ggplot(null_distribution, aes(x = mean_difference)) +
  geom_histogram(bins = 100)
```
Let's convert this distribution to the absolute value of the mean difference, and then find the critical value (assuming alpha = .05) associated with this distribution.
```{r}
get_critical_value(null_distribution$abs_mean_diff, alpha = .05)
ggplot(null_distribution, aes(x = abs_mean_diff)) +
  geom_histogram(bins = 100) +
  geom_vline(xintercept = get_critical_value(null_distribution$abs_mean_diff,
                                             alpha = .05))
```
### Which null is the true null?
We have created several true null distributions. Above, we looked at the values of F that can be produced by chance for this design, as well as the absolute mean differences that can be produced by chance. We found the critical values for both test-statistics, and we could use either of them for the purposes of "null-hypothesis testing".
However, the conundrum that I alluded to earlier is that the different null distributions don't necessarily agree. In this particular case, they agree fairly closely, but not perfectly.
For example, we can use the correlation coefficient to quickly check whether the F values are related to the absolute mean differences across the simulations.
```{r}
knitr::kable(round(cor(null_distribution),digits=2))
```
The table above shows the entire correlation matrix. We can see that there is a large positive correlation between F and the absolute mean difference. In other words, when large differences between the means are produced by chance, large F values are also produced.
However, because the two vectors are not perfectly correlated, they each have different opinions about what constitutes a type I error. This is easier to inspect in the scatterplot below, which shows the simulated F values against the simulated absolute mean differences, along with the critical values for each.
```{r}
# compute the two critical values once, rather than hardcoding them
crit_md <- get_critical_value(null_distribution$abs_mean_diff, alpha = .05)
crit_F <- get_critical_value(null_distribution$F_val, alpha = .05)
test <- null_distribution %>%
  mutate(significant = case_when(
    abs_mean_diff > crit_md & F_val > crit_F ~ "both",
    abs_mean_diff > crit_md & F_val <= crit_F ~ "Mean Difference",
    abs_mean_diff <= crit_md & F_val > crit_F ~ "F",
    TRUE ~ "neither"
  ))
ggplot(test, aes(x = abs_mean_diff,
                 y = F_val,
                 color = significant)) +
  geom_point() +
  geom_vline(xintercept = crit_md) +
  geom_hline(yintercept = crit_F)
```
The points labeled "both" are simulations that exceed both critical values; these would be significant by either measure, and they represent agreement between the two statistics about what type I errors look like. Note that this group is less than 5% of the simulations:
```{r}
test %>%
  group_by(significant) %>%
  summarize(counts = n(),
            proportion = n()/iterations)
```
## References