{{< include _setup.qmd >}}
# Sampling {#sec-sampling}
::: {.callout-note title="learning goals"}
* Discuss sampling theory and stratified sampling
* Reason about the limitations of different samples, especially convenience samples
* Consider sampling biases and how they affect your inferences
* Learn how to choose and justify an appropriate sample size for your experiment
:::
As we keep reminding you, experiments are designed to yield measurements of a causal effect. But a causal effect of what, and for whom? These are questions that are often given surprisingly little air time in our papers. Titles in our top journals read "Daxy thinking promotes fribbles," "Doing fonzy improves smoodling," or "Blicket practice produces more foozles than smonkers."^[Titles changed to protect the original authors. These researchers might very well have said more specific things in the text of their paper.] Each of these uses *generic language* to state a claim that is implied to be generally true [@dejesus2019],^[Generic language is a fascinating linguistic phenomenon. When we say things like "Mosquitoes transmit malaria," we don't mean that *all* mosquitoes do it, only something like "It's a valid and diagnostic generalization about mosquitoes in contrast to other relevant insects or other creatures that they are spreaders of malaria" [see @tessler2019].] but for each of these, we could reasonably ask, "For whom?" Is it everyone? Or a particular set of people? These are questions about our key theme, [generalizability]{.smallcaps}.
Let's focus on smoodling. We wouldn't let the authors get away with a fully universal version of their claim: "Doing [*any*] fonzy improves smoodling [*for everyone*]." The nongeneric version states a generalization that goes way beyond the evidence we actually have. But it seems that we are often okay with authors *implying* (with generic language) that their findings generalize broadly. Imagine for a second what the completely specific version of one of these titles might look like: "Reading one particular selection of fonzy for fifteen minutes in the lab improved 36 college students' smoodling scores on a questionnaire." This paper sounds pretty narrow in its applicability!
We've already run into generalizability in our treatment of statistical estimation and inference. When we estimated a particular quantity (say, the effect of fonzy), we did so in our own sample. But we then used inferential tools to reason about how the estimate in this **sample** related to the parameter in the **population** as a whole. How do we link up these *statistical* tools for generalization to the *scientific* questions we have about the generalizability of our findings? That's the question of this chapter.
A key set of decisions in experiment planning is what population to sample from and how to sample. We'll start by talking about the basics of **sampling theory**\index{sampling theory}: different ways of sampling and the generalizations they do and don't license. The second section of the chapter will then deal with **sampling biases**\index{sampling bias} that can compromise our effect estimates. A final set of key decisions is about **sample size** planning. In the third part of the chapter we'll address this issue, starting with classic **power analysis**\index{power analysis} but then introducing several other ways that an experimenter can plan and justify their sample size.
## Sampling theory
The basic idea of sampling is simple: you want to estimate some measurement for a large or infinite population by measuring a sample from that population.^[There are some tools for dealing with estimation in smaller populations where your sample is a substantial fraction of the population (e.g., a survey of your department where you get responses from half of the students). We won't discuss those here; our focus is on generalizing to large populations of humans.] Sampling strategies are split into two categories. **Probability sampling**\index{probability sampling} strategies are those in which each member of the population has some known, prespecified probability of being selected to be in the sample---think "generalizing to Japanese people by picking randomly from a list of everyone in Japan." **Non-probability sampling**\index{non-probability sampling} covers strategies in which probabilities are unknown or shifting, or in which some members of the population could never be included in the sample---think "generalizing to Germans by sending a survey to a German email list and asking people to forward the email to their family."
::: {.callout-note title="case study"}
### Is everyone bad at describing smells? {-}
Since Darwin,\index{Darwin, Charles} scientists have assumed that smell is a vestigial sense in humans---one that we don't even bother to encode in language. In English we don't even have consistent words for odors. We can say something is "stinky," "fragrant," or maybe "musty," but beyond these, most of our words for smells are about the *source* of the smell, not the qualities of it. Bananas, roses, and skunks all have distinctive smells, but we don't have any vocabulary for naming what is common or uncommon about them. And when we make up ad hoc vocabulary, it's typically quite inconsistent [@majid2014]. The same situation applies across many languages.
So, would it be a good generalization about human beings---all people---that olfaction as a sense is de-emphasized relative to, say, vision? This inference has a classic sample-to-population structure. Within several samples of participants using widely spoken languages, we observe limited and inconsistent vocabulary for smells, as well as poor discrimination. We use these samples to license an inference to the population---in this case, the entire human population.
```{r fig-sampling-majid2014}
#| fig-cap: Data from @majid2014 on the consistency of color and odor naming in English and Jahai speakers. Higher values indicate more consistent descriptions. Error bars show standard deviation.
#| fig-alt: A plot with means and error bars of codability by language; higher for color than odor in English but close in Jahai.
#| fig-width: 5
#| fig-height: 3
#| out-width: 55%
# data extracted using https://apps.automeris.io/wpd/
majid <- tribble(
  ~language, ~condition, ~stat,  ~value,
  "English", "color",    "cil",  0.29,
  "English", "color",    "mean", 0.59,
  "English", "color",    "ciu",  0.87,
  "English", "odor",     "cil",  0.00,
  "English", "odor",     "mean", 0.04,
  "English", "odor",     "ciu",  0.08,
  "Jahai",   "color",    "cil",  0.09,
  "Jahai",   "color",    "mean", 0.28,
  "Jahai",   "color",    "ciu",  0.45,
  "Jahai",   "odor",     "cil",  0.14,
  "Jahai",   "odor",     "mean", 0.23,
  "Jahai",   "odor",     "ciu",  0.34
)

majid |>
  pivot_wider(names_from = stat, values_from = value) |>
  ggplot(aes(x = condition, y = mean, colour = condition)) +
  facet_wrap(vars(language), strip.position = "bottom") +
  geom_pointrange(aes(ymin = cil, ymax = ciu)) +
  guides(color = "none") +
  labs(x = "", y = "Codability\n(Simpson's diversity index)") +
  theme(strip.placement = "outside",
        # strip.text.x.bottom = element_text(margin = margin(b = 0)),
        plot.margin = margin())
```
\vspace{-1em}
But these inferences about the universal lack of olfactory vocabulary are likely based on choosing nonrepresentative samples! Multiple hunter-gatherer groups appear to have large vocabularies for consistent smell description. For example, the Jahai, a hunter-gatherer group on the Malay Peninsula, have a vocabulary that includes at least twelve words for distinct odors, for example /cŋ$\symup{\varepsilon}$s/, which names odors with a "stinging smell" like gasoline, smoke, or bat droppings. When Jahai speakers are asked to name odors, they produce shorter and much more consistent descriptions than English speakers---in fact, their smell descriptions were as consistent as their color descriptions (@fig-sampling-majid2014). Further studies implicate the hunter-gatherer lifestyle as a factor: while several hunter-gatherer groups show good odor naming, nearby horticulturalist groups don't [@majid2018].
Generalizations about humans are tricky. If you want to estimate the average odor naming ability, you could take a random sample of humans and evaluate their odor naming. Most of the individuals in the sample would likely speak English, Mandarin, Hindi, or Spanish. Almost certainly, none of them would speak Jahai, which is spoken by only a little more than a thousand people and is listed as "endangered" by Ethnologue (<https://www.ethnologue.com/language/jhi>). Your estimate of low odor naming ability might be a good guess for the *majority* of the world's population, but would tell you little about the Jahai.
On the other hand, it's more complicated to jump from a statistical generalization about average ability to a richer claim like, "Humans have low olfactory naming ability." Such claims about universal aspects of the human experience require much more care and much stronger evidence [@piantadosi2014]. From a sampling perspective, human behavior and cognition show immense and complex **heterogeneity**\index{heterogeneity}---variability of individuals and variability across clusters. Put simply, if we want to know what people in general are like, we have to think carefully about which people we include in our studies.
:::
### Classical probability sampling\index{probability sampling}
In classical sampling theory,\index{sampling theory} there is some **sampling frame**\index{sampling frame} containing every member of the population---think of a giant list with every adult human's name in it. Then we use some kind of **sampling strategy**\index{sampling strategy}, maybe at the simplest just a completely random choice, to select $N$ humans from that sampling frame, and then we include them in our experiment. This scenario is the one that informs all of our statistical results about how sample means converge to the population mean (as in @sec-inference).
Unfortunately, we very rarely do sampling of this sort in psychological research. Gathering true probability samples from the large populations that we'd like to generalize to is far too difficult and expensive. Consider the problems involved in doing some experiment with a sample of *all adult humans*, or even *adult English-speaking humans who are located in the United States*. As soon as you start to think about what it would take to collect a probability sample of this kind of population, the complexities get overwhelming. How will you find their names---what if they aren't in the phone book? How will you contact them---what if they don't have email? How will they do your experiment---what if they don't have an up-to-date web browser? What if they don't want to participate at all?
Instead, the vast majority of psychology research has been conducted with **convenience samples**\index{convenience sample}: non-probability samples that feature individuals who can be recruited easily, such as college undergraduates or workers on crowdsourcing platforms like Amazon Mechanical Turk\index{Amazon Mechanical Turk} or Prolific Academic\index{Prolific Academic} (see @sec-collection). We'll turn to these below.
For survey research, on the other hand---think of election polling---there are many sophisticated techniques for dealing with sampling; although this field is still imperfect, it has advanced considerably in trying to predict complex and dynamic behaviors. One of the basic ideas is the construction of **representative samples**\index{representative sample}: samples that match the population in their representation of one or more sociodemographic characteristics like gender, income, race and ethnicity, age, or political orientation.
Representative samples\index{representative sample} can be constructed by probability sampling,\index{probability sampling} but they can also be constructed through non-probability methods like recruiting quotas of individuals from different groups via various different convenience methods. These methods are critical for much social science research, but they have been used less frequently in experimental psychology research and aren't necessarily a critical part of the beginning experimentalist's toolkit.^[Readers can come up with counter-examples of recent studies that focus on representative sampling, but our guess is that they will prove the rule more generally. For example, a recent study tested the generality of growth mindset interventions for US high school students using a national sample [@yeager2019]. This large-scale study sampled more than 100 high schools from a sampling frame\index{sampling frame} of all registered high schools in the US, then randomly assigned students within schools that agreed to participate. They then checked that the schools that agreed to participate were representative of the broader population of schools. This study is great stuff, but we hope you agree that if you find yourself in this kind of situation---planning a multi-investigator five-year consortium study on a national sample---you might want to consult with a statistician and not use an introductory book like this one.]
\clearpage
::: {.callout-note title="depth"}
### Representative samples\index{representative sample} and stratified sampling\index{stratified sampling} {-}
**Stratified sampling**\index{stratified sampling} is a cool method that can help you get more precise estimates of your experimental effect, if you think it varies across some grouping in your sample. Imagine you're interested in a particular measure in a population---say, attitudes toward tea drinking across US adults---but you think that this measure will vary with one or more characteristics such as whether the adults are frequent, infrequent, or non-coffee drinkers. Even worse, your measure might be more variable within one group: perhaps most frequent and infrequent coffee drinkers feel okay about tea, but as a group, non-coffee drinkers tend to hate it (most don't drink any caffeinated beverages).
A simple random sample\index{convenience sample} from this heterogeneous population *will* yield statistical estimates that converge asymptotically to the correct population average for tea-drinking attitudes. But it will do so more slowly than ideal because any given sample may over- or under-sample nondrinkers just by chance. In a small sample, if you happen to get too many non-coffee drinkers, your estimate of attitudes will be biased downward; if you happen to get too few, you will be biased upward. All of this will come out in the wash eventually, but any individual sample (especially a small one) will be noisier than ideal.
![An illustration of stratified sampling.\index{stratified sampling} The left panel shows the sampling frame. The upper frames show the sampling frame stratified by a participant characteristic and a stratified sample. The lower frame shows a simple random sample, which happens to omit one group completely by chance.](images/sampling/stratified-sample.png){#fig-sampling-stratified width="75%" fig-alt="A diagram where box with 50 points of 3 colors goes to stratified sampling frame and sample (upper) or random sample (lower)."}
\vspace{-1em}
But, if you know the proportion of frequent, infrequent, or non-coffee drinkers in the population, you can perform stratified sampling\index{stratified sampling} within those subpopulations to ensure that your sample is representative along this dimension [@neyman1992]. This situation is pictured in @fig-sampling-stratified, which shows how a particular sampling frame\index{sampling frame} can be broken up into groups for stratified sampling. The result is a sample that matches the population proportions on a particular characteristic. In contrast, a simple random sample can over- or under-sample the subgroups by chance.
Stratified sampling\index{stratified sampling} can lead to substantial gains in the precision of your estimate. These gains are most prominent when either the groups differ a lot in their mean or when they differ a lot in their variance. There are several important refinements of stratified sampling in case you think these methods are important for your problem. In particular, **optimal sampling**\index{optimal sampling} can help you figure out how to over-sample groups with higher variance. On the other hand, if the characteristic on which you stratify participants doesn't relate to your outcome at all, then estimates from stratified sampling\index{stratified sampling} converge just as fast as random sampling (though it's a bit more of a pain to implement).
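To make optimal sampling\index{optimal sampling} a bit more concrete, here is a minimal sketch of Neyman allocation, reusing the invented group proportions and standard deviations from the simulation below; the function and numbers are purely illustrative, not a recipe. The sample is allocated in proportion to each group's population share multiplied by its standard deviation.

```{r}
#| eval: false
# A sketch of Neyman (optimal) allocation with invented population shares and
# within-group SDs: allocate the sample in proportion to share * SD.
strata <- tibble(
  group      = c("frequent", "infrequent", "non-drinker"),
  proportion = c(0.5, 0.3, 0.2),   # assumed population shares
  sd         = c(0.25, 0.25, 2))   # assumed within-group SDs of tea attitudes
n_total <- 100
strata |>
  mutate(weight = proportion * sd,
         n_allocated = round(n_total * weight / sum(weight)))
```

Under these invented numbers, non-coffee drinkers receive about two-thirds of the sample even though they make up only 20% of the population, because their attitudes are assumed to be far more variable.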
@Fig-sampling-stratified-sim shows a simulation of the scenario in @fig-sampling-stratified, in which each coffee preference group has a different mean tea attitude and the smallest group has the biggest variance. Although the numbers here are invented, it's clear that estimation error is much smaller for the stratified samples and declines much more quickly as samples get larger.
```{r sampling-stratified-sim}
#| label: fig-sampling-stratified-sim
#| fig-cap: "A simulation showing the potential benefits of stratification.\\index{stratification} Each dot is an estimated mean for a sample of a particular size, sampled randomly or with stratification. Red points show the mean and standard deviation of sample estimates."
#| fig-alt: A plot with estimated mean over sample size for random and stratified samples; stratified has smaller standard deviations.
#| out-width: 70%
# assumed group means and total population size for the simulated tea-attitude measure
common_mean <- -1
rare_mean1 <- 0
rare_mean2 <- 2.5
pop <- 10000

# simulated population: three coffee-preference groups with different means;
# the smallest group ("rare2") also has much higher variance
population <- tibble(
  measure = c(rnorm(mean = common_mean, n = pop * .5, sd = .25),
              rnorm(mean = rare_mean1, n = pop * .3, sd = .25),
              rnorm(mean = rare_mean2, n = pop * .2, sd = 2)),
  group = c(rep("common", pop * .5),
            rep("rare1", pop * .3),
            rep("rare2", pop * .2)))

# for each sample size, draw 100 simple random samples and 100 stratified
# samples (matching population proportions), then compute each sample's mean
sims <- expand_grid(n_total = c(10, 25, 50, 100, 250),
                    sim = 1:100,
                    sample = c("Stratified", "Random")) |>
  rowwise() |>
  mutate(data = ifelse(
    sample == "Random",
    list(sample_n(population, n_total)),
    list(bind_rows(sample_n(filter(population, group == "common"),
                            n_total * .5),
                   sample_n(filter(population, group == "rare1"),
                            n_total * .3),
                   sample_n(filter(population, group == "rare2"),
                            n_total * .2))))) |>
  unnest(cols = c(data)) |>
  group_by(n_total, sim, sample) |>
  summarise(mean = mean(measure))

ggplot(sims, aes(x = n_total, y = mean)) +
  facet_grid(vars(sample)) +
  geom_jitter(width = 3, height = 0, alpha = .2, size = 0.5) +
  geom_hline(yintercept = mean(population$measure),
             linetype = "dashed", colour = pal$grey) +
  stat_summary(# color = pal$red,
               fun = mean,
               fun.min = \(x) mean(x) - sd(x),
               fun.max = \(x) mean(x) + sd(x)) +
  labs(x = "Total sample size (N)", y = "Mean")
```
\vspace{-1em}
Stratification\index{stratification} is everywhere, and it's useful even in convenience samples.\index{convenience sample} For example, researchers who are interested in development typically stratify their samples across ages (e.g., recruiting equal numbers of two- and three-year-olds for a study of preschoolers). You can estimate developmental change in a pure random sample, but you are guaranteed good coverage of the range of interest when you stratify.
If you have an outcome that you think varies with a particular characteristic, it's not a bad idea to consider stratification.\index{stratification} But don't go overboard---you can drive yourself to distraction finding the last left-handed nonbinary coffee drinker to complete your sample. Focus on stratifying when you know the measure varies with the characteristic of interest.
:::
## Convenience samples, generalizability, and the WEIRD problem
Now let's go back to the question of generalizability. How generalizable are the experimental effect estimates that we obtain in experiments that are conducted only with convenience samples?\index{convenience sample} We'll start by laying out the worst version of the problem of generalizability in experimental psychology. We'll then try to pull back from the brink and discuss some reasons not to despair, despite the generalizability issues that plague the psychology literature.
### The worst version of the problem
Psychology is the study of the human mind. But from a sampling theory\index{sampling theory} standpoint, not a single estimate in the published literature is based on a simple random sample from the human population. And the situation is worse than that. Here are three of the most severe issues that have been raised regarding the generalizability of psychology research.
1. **Convenience samples**\index{convenience sample}. Almost all research in experimental psychology is performed with convenience samples. This issue has led to the remark that "the existing science of human behavior is largely the science of the behavior of sophomores" [@mcnemar1946, p. 333; quoted in @rosenthal1984, p. 261]. The samples we have easy access to just don't represent the populations we want to describe! At some point there was a social media account devoted to finding biology papers that made big claims about curing diseases and appending the qualifier "in mice" to them. We might consider whether we need to do the same to psychology papers. Would "Doing fonzy improves smoodling *in sophomore college undergraduates in the Western US*" make it into a top journal?
2. **The WEIRD problem**.\index{WEIRD (Western, Educated, Industrialized, Rich, and Democratic)} Not only are the convenience samples\index{convenience sample} that we study not representative of the local or national contexts in which they are recruited but those local and national contexts are also unrepresentative of the broad range of human experiences. @henrich2010 coined the term WEIRD (Western, educated, industrialized, rich, and democratic) to sum up some of the ways that typical participants in psychology experiments differ from other humans. The vast over-representation of WEIRD participants in the literature has led some researchers to suggest that published results simply reflect "WEIRD psychology"---a small and idiosyncratic part of a much broader universe of human psychology.^[The term WEIRD has been very useful in drawing attention to the lack of representation of the breadth of human experiences in experimental psychology. But one negative consequence of this idea has been the response that what we need to do as a field is to sample more "non-WEIRD" people. It is not helpful to suggest that every culture outside the WEIRD moniker is the same [@syed2020]. A better starting point is to consider the way that cultural variation might guide our choices about sampling.]
3. **The item sampling issue**. As we discussed in @sec-models and [-@sec-design], we're typically not just trying to generalize to new people; we're also trying to generalize to new stimuli [@westfall2015]. The problem is that our experiments often use a very small set of items, constructed by experimenters in an ad hoc way rather than sampled as representatives of a broader population of stimuli that we hope to generalize to with our effect size estimate. What's more, our statistical analyses sometimes fail to take stimulus variation into account. Unless we know about the relationship of our items to the broader population of stimuli, our estimates may be based on unrepresentative samples in yet another way.
In sum, experiments in the psychology literature primarily measure effects from WEIRD\index{WEIRD (Western, Educated, Industrialized, Rich, and Democratic)} convenience samples\index{convenience sample} of people and unsystematic samples of experimental stimuli. Should we throw up our hands and resign ourselves to an ungeneralizable "science" of sample-specific anecdotes [@yarkoni2020]?
### Reasons for hope and ways forward
We think the situation isn't as bleak as the arguments above might have suggested. Underlying each of the arguments above is the notion of **heterogeneity**\index{heterogeneity}, the idea that particular effects vary in the population.
Let's think through a very simple version of this argument. Say we have an experiment that measures the smoodling effect, and it turns out that smoodling is completely universal and invariant throughout the human population. Now, if we want to get a precise estimate of smoodling, we can take *any* sample we want because everyone will show the same pattern. Because smoodling is homogeneous, a nonrepresentative sample will not cause problems. There are some phenomena like this! For example, the Stroop task produces a consistent and similar interference effect for almost everyone [@hedge2018].
![An illustration of the interaction of heterogeneity and convenience samples. Colors indicate arbitrary population subgroups. The left-hand panels show sample composition. Individual plots show the distribution of effect sizes in each subgroup.](images/sampling/heterogeneity.png){#fig-sampling-heterogeneity .margin-caption fig-alt="A diagram with representative sample having close population/sample means; convenience having bias if effects heterogeneous."}
@Fig-sampling-heterogeneity\index{heterogeneity} illustrates this argument more broadly. If you have a representative sample\index{representative sample} (top), then your sample mean and your population mean will converge to the same value, regardless of whether the effect is homogeneous (left) or heterogeneous (right). That's the beauty of sampling theory.\index{sampling theory} If you have a convenience sample,\index{convenience sample} one part of the population is overrepresented in the sample. The convenience sample doesn't cause problems if the size of your effect is homogeneous in the population---as with the case of smoodling or Stroop. The trouble comes when you have an effect that is heterogeneous. Because one group is overrepresented, you get systematic bias in the sample mean relative to the population mean.
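A quick back-of-the-envelope calculation, with entirely invented numbers, shows how this bias arises: if an effect differs across two subgroups and a convenience sample over-represents one of them, the sample average is pulled toward the over-represented group's effect.

```{r}
#| eval: false
# Invented effect sizes (in SD units) for two population subgroups
effects <- c(group_a = 0.2, group_b = 1.0)
population_mix  <- c(group_a = 0.5, group_b = 0.5)    # true population shares
convenience_mix <- c(group_a = 0.95, group_b = 0.05)  # convenience sample shares
sum(effects * population_mix)   # population-average effect: 0.6
sum(effects * convenience_mix)  # expected convenience-sample estimate: 0.24
```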
So the problems listed above---convenience samples,\index{convenience sample} WEIRD\index{WEIRD (Western, Educated, Industrialized, Rich, and Democratic)} samples, and narrow stimulus samples---only cause issues if effects are heterogeneous. Are they? The short answer is, *we don't know*. Convenience samples are fine in the presence of homogeneous effects, but we only use convenience samples, so we may not know which effects are homogeneous! Our metaphorical heads are in the sand.
We can't do better than this circularity without a theory of what should be variable and what should be consistent between individuals.^[Many people have theorized about the ways that culture and language in general might moderate psychological processes [e.g., @markus1991]. What we're talking about is related but slightly different---a theory not of what's different but of when there should be any difference and when there shouldn't be. As an example, Tsai's [-@tsai2007] "ideal affect" theory predicts that there should be more similarities in the distribution of actual affect across cultures, but that cultural differences should emerge in *ideal affect* (what people want to feel like) across cultures. This is a theory of when you should see homogeneity and when you should see heterogeneity.\index{heterogeneity}] As naive observers of human behavior, differences between people often loom large. We are keen observers of social characteristics like age, gender, race, class, and education. For this reason, our intuitive theories of psychology often foreground these characteristics as the primary locus for variation between people. Certainly these characteristics are important, but they fail to explain many of the *in*variances of human psychology as well. An alternative line of theorizing starts with the idea that "lower-level" parts of psychology---like perception---should be less variable than "higher-level" faculties like social cognition. This kind of theory sounds like a useful place to start, but there are also counter-examples in the literature, including cases of cultural variation in perception [@henrich2010].
Multi-lab, multi-nation studies can help to address questions about heterogeneity,\index{heterogeneity} breaking the circularity we described above. For example, ManyLabs\index{ManyLabs} 2 systematically investigated the replicability of a set of phenomena across cultures [@klein2018], finding limited variation in effects between WEIRD\index{WEIRD (Western, Educated, Industrialized, Rich, and Democratic)} sites and other sites. And in a study comparing a set of convenience and probability samples, @coppock2018 found limited demographic heterogeneity in another sample of experimental effects from across the social sciences. So there are at least some cases where we don't have to worry as much about heterogeneity. More generally, such large-scale studies offer the possibility of measuring and characterizing demographic and cultural variation---as well as how variation itself varies between phenomena.
## Biases in the sampling process
In fields like econometrics or epidemiology that use observational methods to estimate causal effects, reasoning about **sampling biases**\index{sampling bias} is a critical part of estimating generalizable effects. If your sample does not represent the population of interest,\index{population of interest} then your effect estimates will be biased.^[There is a deep literature on correcting these biases using causal inference frameworks. These techniques are well outside of the scope of this book, but if you're interested, you might look at some of the textbooks we recommended earlier, such as @cunningham2021.] In the kind of experimental work we are discussing, many of these issues are addressed by random assignment, including the first issue we treat: **collider bias**\index{collider bias}. Not so for the second one, **attrition bias**\index{attrition bias}, which is an issue even in randomized experiments.
### Collider bias\index{collider bias}
Imagine you want to measure the association between money and happiness through a (nonexperimental) survey. As we discussed in @sec-experiments, there are plenty of causal processes that could lead to this association. [Figure @fig-sampling-money] shows several of these scenarios. Money could truly cause happiness (1); happiness could cause you to make more money (2); or some third factor---say having lots of friends---could cause people to be happier *and* richer (3).
![Four reasons why money and happiness can be correlated in a particular sample: (1) causal relationship, (2) reverse causality, (3) confounding with friendship, and (4) collider bias\index{collider bias}. For this last scenario, we have to assume that our measurement is *conditioned* on being in this sample, meaning we only look at the association of money and happiness within the social services sample.](images/sampling/money.png){#fig-sampling-money .column-margin fig-alt="Four DAGs: arrow money M to happiness H; arrow H to M; arrows friends to M and H; arrows M and H to social services sample."}
But we can also create spurious associations if we are careless in our sampling. One prominent problem that we can induce is called **collider bias**\index{collider bias}. Suppose we recruited our sample from the clients of a social services agency. Unfortunately, both of our variables might affect presence in a social service agency (@fig-sampling-money, 4): people might be interacting with the agency for financial or benefits assistance, or else for psychological services (perhaps due to depression).
Being in a social services sample is called a **collider**\index{collider bias} variable because the two causal arrows *collide* into it (they both point to it). If we look just within the social services sample, we might see a *negative* association between wealth and happiness---on average the people coming for financial assistance would have less wealth and more happiness than the people coming for psychological services. The take-home here is that in observational research, you need to think carefully about the causal structure of your sampling process [@rohrer2018].
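If you want to see collider bias in action, here is a minimal simulation sketch (all numbers invented): money and happiness are generated independently in the population, but having either low money or low happiness makes someone more likely to appear in the social services sample.

```{r}
#| eval: false
set.seed(1)
n <- 100000
money <- rnorm(n)
happiness <- rnorm(n)  # independent of money in the population
# selection into the sample: seeking financial help (low money) OR
# psychological services (low happiness)
in_sample <- money < -1 | happiness < -1
cor(money, happiness)                        # ~0 in the full population
cor(money[in_sample], happiness[in_sample])  # negative within the sample
```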
If you are doing experimental research, you are mostly protected from this kind of bias: Random assignment still "works" even in subselected samples. If you run a money intervention within a social-services population using random assignment, you can still make an unbiased estimate of the effect of money on happiness. But that estimate will only be valid *for members of that subselected population*.
### Attrition bias
**Attrition**\index{attrition bias} is when people drop out of your study. You should do everything you can to improve participants' experiences (see @sec-collection), but sometimes---especially when a manipulation is onerous for participants or your experiment is longitudinal and requires tracking participants for some time---you will still have participants withdraw from the study.
Attrition on its own can be a threat to the generalizability of an experimental estimate. Imagine you do an experiment comparing a new very intense after-school math curriculum to a control curriculum in a sample of elementary school children over the course of a year. By the end of the year, suppose many of your participants have dropped out. The families who have stayed in the study are likely those who care most about math. Even if you see an effect of the curriculum intervention, this effect may generalize only to children in families who love math.
![Selective attrition\index{selective attrition} can lead to a bias even in the presence of random assignment. The dashed line indicates a causal relationship that is unobserved by the researcher.](images/sampling/attrition.png){#fig-sampling-attrition .column-margin fig-alt="A DAG with arrow from curriculum intervention to math scores and dashed arrow from high math interest to math scores."}
But there is a further problem with attrition, known as **selective attrition**\index{selective attrition}. If attrition is related to the outcome specifically within the treatment group (or for that matter, specifically within the control group), you can end up with a biased estimate, even in the presence of random assignment [@nunan2018]. Imagine that students in the control condition\index{control condition} of your math intervention experiment stayed in the sample, but the math intervention itself was so tough that most families dropped out except those who were very interested in math. Now, when you compare math scores at the end of the experiment, your estimate will be biased (@fig-sampling-attrition): scores in the math condition could be higher simply because of differences in who stuck around to the end.^[If you get deeper into drawing DAGs\index{directed acyclic graph (DAG)} like we are doing here, you will want to picture attrition as its own node in the graph, but that's beyond the scope of this book.]
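Here is a minimal simulation sketch of this scenario (all numbers invented): the curriculum has no true effect, but treatment-group families drop out unless their math interest is high, which makes the treatment group look better at the end.

```{r}
#| eval: false
set.seed(42)
n <- 1000
interest <- rnorm(n)  # latent math interest
condition <- rep(c("control", "treatment"), each = n / 2)
score <- 50 + 5 * interest + rnorm(n, sd = 5)  # scores depend on interest, not condition
# selective attrition: treatment families stay only if their interest is high
stayed <- ifelse(condition == "treatment", interest > 0, TRUE)
# the naive comparison among those who stayed shows a spurious "effect"
mean(score[stayed & condition == "treatment"]) -
  mean(score[stayed & condition == "control"])
```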
Unfortunately, it turns out that attrition bias\index{attrition bias} can be pretty common even in short studies, especially when they are conducted online, where a participant can drop out simply by closing a browser window. This bias can be serious enough to lead to false conclusions. For example, @zhou2016 ran an experiment in which they asked online participants to write about either four happy events (low difficulty) or 12 happy events (high difficulty) from the last year and then asked the participants to rate the difficulty of the task. Surprisingly, the high-difficulty task was rated as easier than the low-difficulty task! Selective attrition\index{selective attrition} was the culprit for this counterintuitive result: while only 26% of participants dropped out of the low-difficulty condition, a full 69% dropped out of the high-difficulty condition. The 31% who remained found it quite easy to generate 12 happy events, and so they rated the objectively harder task as less difficult.
Always try to track and report attrition information. That lets you---and others---understand whether attrition is leading to bias in your estimates or threats to the generalizability of your findings.^[If you get interested, there is a whole field of statistics that focuses on **missing data**\index{missing data} and provides models for reasoning about and dealing with cases where data might not be **missing completely at random**\index{missing completely at random (MCAR)} [@little2019 is the classic reference for these tools]. The causal inference frameworks referenced above also have very useful ways of thinking about this sort of bias.]
## Sample size planning
Now that you have spent some time considering your sample and what population it represents, how many people will your sample contain? Continuing to collect data until you observe $p < 0.05$ in an inferential test is a good way to end up with a false positive.\index{false positive} This practice, known as "optional stopping," invalidates $p$-values, much like the cases of analytic flexibility\index{analytic flexibility} discussed in @sec-replication and @sec-inference [@simmons2011].
Decisions about when to stop collecting data should not be data-dependent. Instead you should transparently declare your data collection **stopping rule**\index{stopping rule} in your study preregistration (see @sec-prereg). This step will reassure readers that there is no risk of bias from optional stopping. The simplest stopping rule is, "I'll collect data until I get to a target $N$"---all that's needed in this case is a value for $N$.
But how do you decide $N$? It's going to be dependent on the effect that you want to measure and how it varies in the population. Smaller effects will require larger sample sizes. Classically, $N$ was computed using **power analysis**\index{power analysis|(}, which can provide a sample size for which you have a good chance of rejecting the null hypothesis (given a particular expected effect size). We'll introduce this computation below.
Classical power analysis\index{power analysis} is not the only way to plan your sample size. There are a number of other useful strategies, some of which rely on the same kinds of computations as power analysis (@tbl-stopping-rules). Each of these can provide a valid justification for a particular sample size, but they are useful in different situations.
\clearpage
\footnotesize
\renewcommand{\arraystretch}{1.5}
{{< include md/010-sampling/stopping-rules.md >}}
\renewcommand{\arraystretch}{1}
\normalsize
\vspace{-1em}
### Power analysis\index{power analysis}
![A standard decision matrix for null-hypothesis significance testing.](images/sampling/power-alpha.png){#fig-sampling-neyman-pearson .column-margin fig-alt="A table of inference (reject/fail) vs reality (true/false): false positive, correct rejection, true positive, false negative."}
Let's start by reviewing the null-hypothesis significance testing paradigm that we introduced in @sec-inference. Recall that we introduced the Neyman-Pearson decision-theoretic view of testing in @sec-inference, shown again in @fig-sampling-neyman-pearson. The idea was that we've got some null hypothesis $H_0$ and some alternative $H_1$---something like "No effect" and "Yes, there is some effect with known size"---and we want to use data to decide which state we're in. $\alpha$ is our criterion for rejecting the null, conventionally set to $\alpha=0.05$.
But what if $H_0$ is actually false and the alternative $H_1$ is true? Not all experiments are equally well set up to reject the null in those cases. Imagine doing an experiment with $N = 3$. In that case, we'd almost always fail to reject the null, even if it were false. Our sample would almost certainly be too small to rule out sampling variation as the source of our observed data.
Let's try to quantify our willingness to *miss* the effect---the false negative rate.\index{false negative} We'll denote this probability with $\beta$. If $\beta$ is the probability of missing an effect (failing to reject the null when it's really false), then $1-\beta$ is the probability that we *correctly reject the null when it is false*. That's what we call the **statistical power**\index{statistical power} of the experiment.
![An illustration of how larger sample sizes lead to greater power.](images/sampling/power.png){#fig-sampling-power .margin-caption fig-alt="A plot of distributions under H_0/H_1; larger sample has lower significance threshold, greater proportion significant events."}
We can only compute power if we know the effect size for the alternative hypothesis. If the alternative hypothesis is a small effect, then the probability of rejecting the null will typically be low (unless the sample size is very large). In contrast, if the alternative hypothesis is a large effect, then the probability of rejecting the null will be higher.
The same dynamic holds with sample size: the same effect size will be easier to detect with a larger sample size than a smaller one. [Figure @fig-sampling-power] shows how this relationship works. A large sample size creates a tighter null distribution (right side) by reducing sampling error. A tighter null distribution means that more of the distribution of outcomes under a true effect falls beyond the significance threshold, so you reject the null more often. If your sample size is too small to detect your effect much of the time, we call this being **underpowered**.^[You can also refer to a design as **overpowered**, though we object slightly to this characterization, since the value of large datasets is typically not just to reject the null but also to measure an effect with high precision and to investigate how it is moderated by other characteristics of the sample.]
Classical power analysis\index{power analysis} involves computing the sample size $N$ that's necessary in order to achieve some level of power, given $\alpha$ and a known effect size.^[Our focus here is on giving you a conceptual introduction to power analysis, but we refer you to @cohen1992 for a more detailed introduction.] The mathematics of the relationship between $\alpha$, $\beta$, $N$, and effect size have been worked out for a variety of different statistical tests [@cohen2013] and codified in software like G*Power [@faul2007] and the `pwr` package for R [@champely2020]. For other cases (including mixed effects models), you may have to conduct a simulation in which you generate many simulated experimental runs under known assumptions and compute how many of these lead to a significant effect; luckily, R packages exist for this purpose as well, including the `simr` package [@green2016].
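For intuition about the simulation approach, here is a minimal sketch for a simple two-group design; the effect size, group size, and number of simulations are arbitrary choices for illustration. We simulate many experiments under an assumed true effect and count how often the test rejects the null.

```{r}
#| eval: false
# Simulation-based power sketch: assumed d = 0.5 and n = 50 per group
set.seed(123)
n_sims <- 1000
n_per_group <- 50
d <- 0.5
p_values <- replicate(n_sims, {
  control   <- rnorm(n_per_group, mean = 0)
  treatment <- rnorm(n_per_group, mean = d)
  t.test(treatment, control)$p.value
})
mean(p_values < .05)  # estimated power (roughly 0.70 for these settings)
```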
### Power analysis in practice
Let's do a power analysis\index{power analysis} for our hypothetical money and happiness experiment. Imagine the experiment is a simple two-group design in which participants from a convenience population are randomly assigned either to receive \$1,000 and some advice on saving money (experimental condition)\index{experimental condition} or to receive just the advice and no money (control condition)\index{control condition}. We then follow up a month later and collect self-reported happiness ratings. How many people should we have in our study in order to be able to reject the null? The answer to this question depends on our desired values of $\alpha$ and $\beta$ as well as our expected effect size for the intervention.
For $\alpha$ we will just set a conventional significance threshold of $\alpha = 0.05$. But what should be our desired level of power? The usual standard in the social sciences is to aim for power above 80% (i.e., $\beta < 0.20$); this gives you a four-in-five chance of detecting a true effect. But just like $\alpha = 0.05$, this is a conventional value that is perhaps a little bit too loose for modern standards---a strong test of a particular effect should probably have 90% or 95% power.^[Really, researchers interested in using power analysis in their work should give some thought to what sort of chance of a false negative\index{false negative} they are willing to accept. In exploratory research, perhaps a higher chance of missing an effect is reasonable; in contrast, in confirmatory research it might make sense to aim for a higher level of power.]
These choices are relatively easy, compared to the fundamental issue: our power analysis\index{power analysis} requires some expectation about our effect size. This is the *first fundamental problem of power analysis*: if you knew the effect size, you might not need to do the experiment!
So how are you supposed to get an estimate of effect size? Here are a few possibilities:
* **Meta-analysis**.\ If there is a good meta-analysis of the effect that you are trying to measure (or something closely related), then you are in luck. A strong meta-analysis will have not only a precise effect size estimate but also some diagnostics detecting and correcting potential publication bias\index{publication bias} in the literature (see @sec-meta). While these diagnostics are imperfect, they still can give you a sense for whether you can use the meta-analytic effect size estimate as the basis for a power analysis.\index{power analysis}
* **Specific prior study**. A more complicated scenario is when you have only one or a handful of prior studies that you would like to use as a guide. The trouble is that any individual effect in the literature is likely to be inflated by publication and other selective reporting\index{selective reporting} biases (see @sec-replication). Thus, using this estimate likely means your study will be underpowered---you might not get as lucky as a previous study did!
* **Pilot testing**. Many people (including us) at some point learned that one way to do a power analysis\index{power analysis} is to conduct a pilot study, estimate the effect size from the pilot, and then use this effect estimate for power analysis in the main study. We don't recommend this practice. The trouble is that your pilot study will have a small sample size, leading to a very imprecise estimate of effect size [@browne1995]. If you overestimate the effect size, your main study will be very underpowered. If you underestimate, the opposite will be true. Using a pilot for power analysis is a recipe for problems.
* **General expectations about an effect of interest**. In our view, perhaps the best way you can use power analysis\index{power analysis} (in the absence of a really strong meta-analysis,\ at least) is to start with a general idea about the size of effect you expect and would like to be able to detect. It is totally reasonable to say, "I don't know how big my effect is going to be, but let's see what my power would be if it were *medium-sized* (say $d=0.5$), since that's the kind of thing we're hoping for with our money intervention." This kind of power analysis can help you set your expectations about what range of effects you might be able to detect with a given sample size.
For our money study, using our general expectation of a medium-size effect, we can compute power for $d=0.5$. In this case, we'll simply use the two-sample $t$-test\index{t-test} introduced in @sec-inference, for which 80% power at $\alpha = 0.05$ and $d=0.5$ is achieved by having $N = `r ceiling(pwr::pwr.t.test(d = .5, power = .8, type = "two.sample", alternative = "two.sided")$n)`$ in each group.
::: {.callout-note title="code"}
Classic power analysis\index{power analysis} in R is quite simple using the `pwr` package. The package offers a set of test-specific functions like `pwr.t.test()`. For each, you supply three of the four parameters specifying effect size (`d`), number of observations (`n`), significance level (`sig.level`), and power (`power`); the function computes the fourth. For classic power analysis, we leave out `n`:
```{r, opts.label='code'}
pwr.t.test(d = .5,
power = .8,
sig.level = .05,
type = "two.sample",
alternative = "two.sided")
```
But it is also possible to use this same function to compute the power achieved at a combination of `n` and `d`, for example.
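For instance, leaving out `power` instead of `n` returns the power achieved with 50 participants per group at $d = 0.5$ (a minimal sketch, not evaluated here):

```{r}
#| eval: false
pwr.t.test(n = 50,
           d = .5,
           sig.level = .05,
           type = "two.sample",
           alternative = "two.sided")
```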
:::
\clearpage
There's a second issue, however. The *second fundamental problem of power analysis*\index{power analysis} is that the real effect size for an experiment may be zero. And in that case, *no* sample size will let you correctly reject the null. Going back to our discussion in @sec-inference, the null hypothesis significance testing\index{null hypothesis significance testing (NHST)} framework is just not set up to let you *accept* the null hypothesis. If you are interested in a bidirectional approach to hypothesis testing in which you can accept *and* reject the null, you may need to consider Bayes Factor\index{Bayes Factor (BF)} or **equivalence testing**\index{equivalence testing} approaches [@lakens2018], which don't fit the assumptions of classical power analysis.
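To give a flavor of what equivalence testing looks like in practice, here is a minimal sketch of the two one-sided tests (TOST) procedure with simulated data and an arbitrary equivalence bound; see @lakens2018 for proper tooling and guidance on choosing bounds.

```{r}
#| eval: false
# TOST sketch: test whether the group difference lies within +/- 0.3
# (an arbitrary smallest effect size of interest), using simulated data
set.seed(7)
treatment <- rnorm(100)
control <- rnorm(100)
bound <- 0.3
p_lower <- t.test(treatment, control, mu = -bound, alternative = "greater")$p.value
p_upper <- t.test(treatment, control, mu = bound, alternative = "less")$p.value
# if both one-sided p-values are below .05, we reject effects outside the bounds
c(p_lower = p_lower, p_upper = p_upper)
```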
### Alternative approaches to sample size planning
Let's now consider some alternatives to classic power analysis\index{power analysis} that can still yield reasonable sample size justifications.
1. **Resource constraint**. In some cases, there are fundamental resource constraints that limit data collection. For example, if you are doing fieldwork, sometimes the right stopping criterion for data collection is "when the field visit is over," since every additional datapoint is valuable. When prespecified, these kinds of sample size justifications can be quite reasonable, though they do not preclude being underpowered to test a particular hypothesis.
2. **Smallest effect size of interest (SESOI)**. SESOI\index{smallest effect size of interest (SESOI)} analysis is a variant on power analysis that includes some resource constraint planning. Instead of trying to intuit how big your target effect is, you instead choose a level below which you might not be interested in detecting the effect. This choice can be informed by theory (what is predicted), applied concerns (what sort of effect might be useful in a particular context), or resource constraints (how expensive or time-consuming it might be to run an experiment). In practice, SESOI analysis is simply a classic power analysis\index{power analysis} with a particular small effect as the target.
3. **Precision-based sample planning**. As we discussed in @sec-inference, the goal of research is not always to reject the null hypothesis! Sometimes---we'd argue that it should be most of the time---the goal is to estimate a particular causal effect of interest with a high level of precision, since these estimates are a prerequisite for building theories. If what you want is an estimate with known precision (say, a confidence interval\index{confidence interval (CI)} of a particular width), you can compute the sample size necessary to achieve that precision [@bland2009;@rothman2018], as sketched in the example after this list.^[In our experience, this kind of planning is most useful when you are attempting to gather measurements with sufficient precision to compare between computational models. Because the models make quantitative predictions that differ by some known amount, it's clear how tight your confidence intervals need to be.]
4. **Sequential analysis**. Your stopping rule\index{stopping rule} need not be a hard cutoff at a specific $N$. Instead, it's possible to plan a **sequential analysis**\index{sequential analysis} using either frequentist or Bayesian methods, in which you plan to stop collecting data once a particular inferential threshold is reached. For the frequentist version, the key thing that keeps sequential analysis from being $p$-hacking\index{p-hacking} is that you prespecify particular values of $N$ at which you will conduct tests and then correct your $p$-values for having tested multiple times [@lakens2014]. For Bayesian sequential analysis\index{Bayesian sequential analysis}, you can actually compute a running Bayes Factor\index{Bayes Factor (BF)} as you collect data and stop when you reach a prespecified level of evidence [@schonbrodt2017]. This latter alternative has the advantage of allowing you to collect evidence *for* the null as well as against it.^[Another interesting variant is sequential parameter estimation, in which you collect data until a desired level of precision is achieved [@kelley2018]; this approach combines some of the benefits of both precision-based analysis and sequential analysis.]
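As promised above, here is a minimal sketch of a precision-based calculation for a two-group design, using a normal approximation and an assumed outcome standard deviation: how many participants per group do we need so that the 95% confidence interval on the group difference has a chosen half-width?

```{r}
#| eval: false
# Precision-based planning sketch (all quantities assumed for illustration):
# outcome measured in standardized units, target 95% CI of +/- 0.2 on the
# difference between two independent group means
sd_outcome <- 1
half_width <- 0.2
z <- qnorm(.975)
n_per_group <- ceiling(2 * (z * sd_outcome / half_width)^2)
n_per_group  # roughly 193 per group under these assumptions
```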
In sum, there are many different ways of justifying your sample size or your stopping rule.\index{stopping rule} The most important things are (1) to prespecify your strategy and (2) to give a clear justification for your choice. [Table @tbl-sampling-justification] gives an example sample size justification that draws on several different concepts discussed here, using classical power computations as one part of the justification. A reviewer could easily follow the logic of this discussion and form their own conclusion about whether this study had an adequate sample size and whether it should have been conducted given the researchers' constraints.
\footnotesize
\renewcommand{\arraystretch}{1.5}
{{< include md/010-sampling/sampling-justification.md >}}
\renewcommand{\arraystretch}{1}
\normalsize
\clearpage
::: {.callout-note title="depth"}
### Sample sizes for replication studies {-}
Setting the sample size for a replication study has been a persistent issue in the metascience\index{metascience} literature. Naïvely speaking, it seems like you should be able to compute the effect size for the original study and then simply use that as the basis for a classical power analysis.\index{power analysis}
This naive approach has several flaws, however. First, the effect size from the original published paper is likely an overestimate of the true effect size due to publication bias [@nosek2021].\index{publication bias}
<!-- ^[Imagine the counterfactual world in which the same study had yielded a smaller and perhaps nonsignificant effect size. Would it still have been published in the same form, and would you still want to replicate? If not, then consider the fact that the authors "got lucky"---even if the effect is truly present, perhaps this particular experiment happened to observe a larger effect than the true value just by chance.] -->
Second, the power analysis\index{power analysis} will only yield the sample size at which the replication will have a particular chance of rejecting the null at some criterion. But it's quite possible that the original experiment could be $p<0.05$, the replication could be $p>0.05$, *and* the original experiment and the replication results are not significantly different from each other. So a statistically significant replication of the original effect size is not necessarily what you want to aim for.
Faced with these issues, a replication sample size can be planned in several other ways. First, replicators can use the standard strategies mentioned above such as SESOI\index{smallest effect size of interest (SESOI)} or resource-based planning to rule out large effects, either with high probability or within a known amount of time or money. If the SESOI is high or limited resources are allocated, these strategies can produce an inconclusive result, however. A conclusive answer can require a very substantial commitment of resources.
Second, @simonsohn2015 recommends the "small telescopes" approach. The idea is not to test whether there *is* an effect, but rather whether there is an effect *large enough that the original study could have detected it*. The analogy is to astronomy. If a birdwatcher points their binoculars at the sky and claims to have discovered a new planet, we want to ask not just whether there is a planet at that location but also whether there is any possibility that they could have seen it using binoculars---if not, perhaps they are right but for the wrong reasons! Simonsohn shows that, if a replicator collects 2.5 times as large a sample as the original, they have 80% power to detect any effect that was reasonably detectable by the original. This simple rule of thumb provides one good starting place for conservative replication studies.
Finally, replicators can make use of sequential Bayesian analysis,\index{Bayesian sequential analysis} in which they attempt to gather substantial evidence for $H_1$ *or* $H_0$. Sequential Bayes is an appealing option because it allows for efficient collection of data that reflects whether an effect is likely to be present in a particular sample, especially in the face of the sometimes prohibitively large samples necessary for SESOI\index{smallest effect size of interest (SESOI)} or "small telescopes" analyses.
:::
## Chapter summary: Sampling
Your goal as an experimenter is to estimate a causal effect.
But the effect for whom?
This chapter has tried to help you think about how you generalize from your experimental sample to some target population. It's very rare to be conducting an experiment based on a probability sample in which every member of the population has an equal chance of being selected. In the case that you are using a convenience sample,\index{convenience sample} you will need to consider how bias introduced by the sample could relate to the effect estimate you observed. Do you think this effect is likely to be very heterogeneous in the population? Are there theories that suggest that it might be larger or smaller for the convenience sample you recruited?
Questions about generalizability and sampling depend on the precise construct you are studying, and there is no mechanistic procedure for answering them. Instead, you simply have to ask yourself: How does my sampling procedure qualify the inference I want to make based on my data? Being transparent about your reasoning can be very helpful---both to you and to readers of your work who want to contextualize the generality of your findings.
::: {.callout-note title="discussion questions"}
1. We want to understand human cognition generally, but do you think it is a more efficient research strategy to start by studying certain features of cognition (perception, for example) in WEIRD\index{WEIRD (Western, Educated, Industrialized, Rich, and Democratic)} convenience populations and then later check our generalizations in non-WEIRD groups? What are the arguments against this efficiency-based strategy?
2. One alternative position regarding sampling is that the most influential experiments aren't generalizations of some number to a population; they are demonstration experiments that show that some particular effect is possible under some circumstances (think Milgram's obedience studies; see @sec-ethics). On this argument, the specifics of population sampling are often secondary. Do you think this position makes sense?
<!-- 3. Form an argument about this controversial position: Nothing you learn from US college undergraduates is likely to generalize to the US population as a whole, so we should dramatically decrease the use of this convenience population. -->
3. One line of argument says that we can't ever make generalizations about the human mind because so much of the historical human population is simply inaccessible to us (we can't do experiments on ancient Greek psychology). In other words, sampling from a particular population is *also* sampling a particular moment in time. How should we qualify our research interpretations to deal with this issue?
:::
::: {.callout-note title="readings"}
* The original polemic article on the WEIRD\index{WEIRD (Western, Educated, Industrialized, Rich, and Democratic)} problem: Henrich, Joseph, Steven J. Heine, and Ara Norenzayan [-@henrich2010]. "The WEIRDest People in the World?" *Behavioral and Brain Sciences* 33 (2–3): 61–83.
* A very accessible introduction to power analysis\index{power analysis|)} from its originator: Cohen, Jacob [-@cohen1992]. "A Power Primer." *Psychological Bulletin* 112 (1): 155.
* A thoughtful and in-depth discussion of generalizability issues: Yarkoni, Tal [-@yarkoni2020]. "The Generalizability Crisis." *Behavioral and Brain Sciences* 45:1–37.
:::