{{< include _setup.qmd >}}
{{< include helper/meta/meta_prelims.qmd >}}
# Meta-analysis {#sec-meta}
::: {.callout-note title="learning goals"}
- Discuss the benefits of synthesizing evidence across studies
- Conduct a simple fixed-effects or random-effects meta-analysis
- Reason about the role of within-study and across-study biases in meta-analysis
:::
Throughout this book, we have focused on how to design individual experiments that maximize [measurement precision]{.smallcaps} and minimize bias. But even when we do our best to get a precise, unbiased estimate in an individual experiment, one study can never be definitive. Variability in participant demographics, stimuli, and experimental methods may limit the [generalizability]{.smallcaps} of our findings. Additionally, even well-powered individual studies have some amount of statistical error, limiting their precision. Synthesizing evidence across studies is critical for developing a balanced and appropriately evolving view of the overall evidence on an effect of interest and for understanding sources of variation in the effect.
Synthesizing evidence rigorously takes more than putting a search term into Google Scholar, downloading articles that look topical or interesting, and qualitatively summarizing your impressions of those studies. While this ad hoc method can be an essential first step in performing a literature review [@grant2009typology], it is not systematic and doesn't provide a *quantitative* summary of a particular effect. Further, it doesn't tell you anything about potential biases in the literature---for example, a bias for the publication of positive effects.
To address these issues, a more systematic, quantitative review of the literature is often more informative. This chapter focuses on a specific type of quantitative review called **meta-analysis**:\ a method for combining effect sizes across different studies. (If you need a refresher on effect size, see @sec-estimation, where we introduce the concept.)^[We'll primarily be using Cohen's $d$,\index{Cohen's $d$} the standardized difference between means, which we introduced in @sec-estimation. There are many more varieties of effect size available, but we focus here on $d$ because it's common and easy to reason about in the context of the statistical tools we introduced in the earlier sections of the book.] We include a chapter on meta-analysis in *Experimentology* because we believe it's an important tool that can focus experimental researchers on issues of [measurement precision]{.smallcaps} and [bias reduction]{.smallcaps}, two of our key themes.
By combining information from multiple studies, meta-analysis\ often provides more precise estimates of an effect size than any single study. In addition, meta-analysis allows the researcher to look at the extent to which an effect varies across studies. If an effect does vary across studies, meta-analysis can also be used to test whether certain study characteristics systematically produce different results (e.g., whether an effect is larger in certain populations).
::: {.callout-note title="case study"}
### Towel reuse by hotel guests {-}
Imagine you are staying in a hotel and you have just taken a shower. Do you throw the towels on the floor or hang them back up again? In a widely cited study on the power of social norms, @goldstein2008room manipulated whether a sign encouraging guests to reuse towels focused on environmental impacts (e.g., "help reduce water use") or social norms (e.g., "most guests reuse their towels"). Across two studies, they found that guests were significantly more likely to reuse their towels after receiving the social norm message (Study 1: odds ratio [OR] = 1.46, 95% CI [1.00, 2.16], $p = 0.05$; Study 2: OR = 1.35, 95% CI [1.04, 1.77], $p = 0.03$).
However, five subsequent studies by other researchers did not find significant evidence that social norm messaging increased towel reuse. (ORs ranged from 0.22 to 1.34, and no hypothesis-consistent $p$-value was less than 0.05). This caused many researchers to wonder if there is any effect at all. To examine this question, @scheibehenne2016 statistically combined evidence across the studies via meta-analysis.\ This meta-analysis indicated that using social norm messages did significantly increase hotel towel reuse, on average (OR = 1.26, 95% CI [1.07, 1.46], $p < 0.005$). This [case study]{.smallcaps} demonstrates an important strength of meta-analysis: by pooling evidence from multiple studies, meta-analysis can generate more powerful insights than any one study alone. We will also see how meta-analysis can be used to assess variability in effects across studies.
:::
<!-- All ES's come from Wagenmakers et al. SI -->
Meta-analysis\ often teaches us something about a body of evidence that we do not intuitively grasp when we casually read through a bunch of articles. In the above [case study]{.smallcaps}, merely reading the individual studies might give the impression that social norm messages do not increase hotel towel reuse. But meta-analysis indicated that the average effect is beneficial, although there might be substantial variation in effect sizes across studies.^[Given the billions of hotel bookings worldwide every year, even a small effect might lead to a substantial environmental impact!]
## The basics of evidence synthesis
As we explore the details of conducting a meta-analysis,\ we'll turn to another running example: a meta-analysis of studies investigating the "contact hypothesis" on intergroup relations.
According to the contact hypothesis, prejudice toward members of minority groups can be reduced through intergroup contact interventions, in which members of majority and minority groups work together to pursue a common goal [@allport1954nature]. To aggregate the evidence on the contact hypothesis, @paluck2019contact meta-analyzed studies that tested the effects of randomized intergroup contact interventions on long-term prejudice-related outcomes.
Using a systematic literature search, @paluck2019contact searched for all papers that tested these effects and then extracted effect size estimates from each paper.^[This book will not cover the process of conducting a systematic literature search and extracting effect sizes, but these topics are critical to understand if you plan to conduct a meta-analysis or other evidence synthesis. Our experience is that extracting effect sizes from papers with inconsistent reporting standards can be especially tricky, so it can be helpful to talk to someone with experience in meta-analysis to get advice about this.] Because not every paper reports standardized effect sizes\index{standardized effect size}---or even means and standard deviations for every group---this process can often involve scraping information from plots, tables, and statistical tests to try to reconstruct effect sizes.^[For example, if the outcome variable is continuous, we could estimate Cohen's $d$\index{Cohen's $d$} from group means and standard deviations reported in the paper, even without having access to raw data.]
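::: {.callout-note title="code"}
As a small illustration of that last step, the `escalc()` function from the `metafor` package [@viechtbauer2010], which we use throughout this chapter, converts reported summary statistics into a standardized mean difference and its variance. This is only a sketch; the summary statistics below are made up for illustration:
```{r, opts.label='code'}
# hypothetical summary statistics reported in a paper:
# group means, standard deviations, and sample sizes
escalc(measure = "SMD",
       m1i = 2.1, sd1i = 1.0, n1i = 40,   # intervention group
       m2i = 2.5, sd2i = 1.1, n2i = 42)   # control group
```
The result is an effect size estimate (`yi`) and its variance (`vi`): the two ingredients a meta-analysis needs from each study.
:::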
Following best practices for meta-analysis (where there are almost never privacy concerns to worry about), @paluck2019contact shared their data openly. The first few lines are shown in @tbl-meta-dataset. We'll use these data as our running example throughout.
<!-- \vspace{1em} -->
\footnotesize
```{r meta-dataset}
#| label: tbl-meta-dataset
#| tbl-cap: "The first few lines of extracted effect sizes (d) and their variances (var\\_d) in the @paluck2019contact meta-analysis."
head(df) |>
select(name = name_short, pub_date, target = target_spelled_out, n_total, d, var_d)
```
\normalsize
\vspace{-1em}
As we've seen throughout this book, visualizing data before and after analysis helps benchmark and check our intuitions about the formal statistical results. In a meta-analysis,\ a common way to plot effect sizes is the **forest plot**\index{forest plot}, which depicts individual studies' estimates and confidence intervals\index{confidence interval (CI)}. In the forest plot\index{forest plot} in @fig-meta-forest,^[You can ignore for now the final line, "RE Model"; we will return to this later.] the larger squares correspond to more precise studies; notice how much narrower their confidence intervals are than those of less precise studies.
```{r meta-forest}
#| label: fig-meta-forest
#| fig-cap: "A forest plot\\index{forest plot} for the @paluck2019contact meta-analysis. Studies are ordered from smallest to largest standard error."
#| fig-alt: A forest plot with estimated standardized mean difference points and 95% confidence interval error bars for each study.
#| cap-location: margin
#| out-width: 80%
#| fig-width: 6
#| fig-height: 7
metafor::forest(re_model, header = TRUE, xlab = "Standardized mean difference",
fonts = .font)
```
::: {.callout-note title="code"}
In this chapter, we use the wonderful `metafor` package [@viechtbauer2010]. With this package, you must first fit your meta-analytic model. But once you've fit your model `mod`, you can simply call `forest(mod)` to create a plot like the one above.
:::
<!-- Cohen's $d$---which, if you recall from @sec-estimation, represents the standardized mean difference---was used as the effect size index. As we show in the remainder of the chapter, the meta-analytic tools @paluck2019contact used provide several useful insights about this proposed prejudice-reduction intervention. -->
### How not to synthesize evidence
Many people's first instinct in evidence synthesis is to count how many studies supported versus did not support the hypothesis under investigation. This technique usually amounts to counting the number of studies with "significant" $p$-values, since---for better or for worse---"significance" is largely what drives the take-home conclusions researchers report [@mcshane2017statistical; @nelson1986interpretation]. In meta-analysis,\ we call this practice of counting the number of significant $p$-values **vote-counting**\index{vote-counting} [@borenstein2021introduction]. For example, in the @paluck2019contact meta-analysis, almost all studies had a positive effect size, but only `r sum(df$sig)` of `r nrow(df)` were significant. So, based on this vote-count, we would have the impression that most studies do not support the contact hypothesis.
Many qualitative literature reviews use this vote-counting\index{vote-counting} approach, although often not explicitly. Despite its intuitive appeal, vote-counting can be very misleading because it characterizes evidence solely in terms of dichotomized $p$-values, while entirely ignoring effect sizes. In @sec-replication, we saw how fetishizing statistical significance can mislead us when we consider individual studies. These problems also apply when considering multiple studies.
For example, small studies may consistently produce nonsignificant effects due to their limited power. But when many such studies are combined in a meta-analysis,\ the meta-analysis may provide strong evidence of a positive average effect. Inversely, many studies might have statistically significant effects, but if their effect sizes are small, then a meta-analysis might indicate that the average effect size is too small to be practically meaningful. In these cases, vote-counting\index{vote-counting} based on statistical significance can lead us badly astray [@borenstein2021introduction]. To avoid these pitfalls, meta-analysis combines the effect size estimates from each study (not just their $p$-values), weighting them in a principled way.
### Fixed-effects meta-analysis\index{meta-analysis!fixed-effects}
If vote-counting\index{vote-counting} is a bad idea, how should we combine results across studies? Another intuitive approach might be to average effect sizes from each study. For example, in Paluck et al.'s meta-analysis, the mean of the studies' effect size estimates is `r round(mean(df$d), digits)`. This averaging approach is a step in the right direction, but it has an important limitation: averaging effect size estimates gives equal weight to each study. A small study [e.g., @clunies1989changing with $N = 30$] contributes as much to the mean effect size as a large study [e.g., @boisjoly2006empathy with $N = 1,243$]. Larger studies provide more precise estimates of effect sizes than small studies, so weighting all studies equally is not ideal. Instead, larger studies should carry more weight in the analysis.
To address this issue, **fixed-effects meta-analysis**\index{meta-analysis!fixed-effects} uses a **weighted average** approach. Larger, more precise studies are given more weight in the calculation of the overall effect size. Specifically, each study is weighted by the inverse of its variance (i.e., the inverse of its squared standard error). This makes sense because larger, more precise studies have smaller variances, and thus get more weight in the analysis.
In general terms, the fixed-effect pooled estimate is:
$$\widehat{\mu} = \frac{ \sum_{i=1}^k w_i \widehat{\theta}_i}{\sum_{i=1}^k w_i}$$
where $k$ is the number of studies, $\widehat{\theta}_i$ is the point estimate of the $i^{th}$ study, and $w_i = 1/\widehat{\sigma}^2_i$ is study $i$'s weight in the analysis (i.e., the inverse of its variance).^[If you are curious, the standard error of the fixed-effect $\widehat{\mu}$ is $\sqrt{\frac{1}{\sum_{i=1}^k w_i}}$. This standard error can be used to construct a confidence interval\index{confidence interval (CI)} or $p$-value, as described in @sec-inference.]
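::: {.callout-note title="code"}
To make the weighted-average formula concrete, here is a minimal "by hand" sketch of the fixed-effects computation, assuming a data frame `paluck` with columns `d` and `var_d` (the same names used in the code examples below):
```{r, opts.label='code'}
# inverse-variance weights: more precise studies get more weight
w <- 1 / paluck$var_d

# weighted average of the study-level estimates
mu_hat <- sum(w * paluck$d) / sum(w)

# standard error of the pooled estimate, and a 95% confidence interval
se_mu_hat <- sqrt(1 / sum(w))
mu_hat + c(-1, 1) * qnorm(0.975) * se_mu_hat
```
In practice, you would rarely do this by hand; the `metafor` code shown next performs the same computation (and much more).
:::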
<!-- In Paluck et al.'s meta-analysis, we would calculate the fixed-effect estimate, $\widehat{\mu}$, as: -->
<!-- <!-- hard-coded from df$d[1:2] and df$se_d[1:2] -->
<!-- $$\widehat{\mu} = \frac{ \frac{\widehat{\theta}_{study1}}{\widehat{\sigma}^2_{study1}} + \frac{\widehat{\theta}_{study2}}{\widehat{\sigma}^2_{study2}} + \cdots}{ \frac{1}{\widehat{\sigma}^2_{study1}} + \frac{1}{\widehat{\sigma}^2_{study2}} + \cdots } = -->
<!-- \frac{ \frac{0.03}{0.08^2} + \frac{0.30}{0.08^2} + \cdots }{ \frac{1}{0.08^2} + \frac{1}{0.08^2} + \cdots }$$ -->
\clearpage
Using the fixed-effects formula, we can estimate that the overall effect size in Paluck et al.'s meta-analysis\ is a standardized mean difference of $\widehat{\mu}$ = `r round(as.numeric(fe_model$b), 2)`; 95% confidence interval [`r paste0(round(fe_model$ci.lb, 2), ", ", round(fe_model$ci.ub, 2))`]; $p < 0.001$. Because Cohen's $d$\index{Cohen's $d$} is our effect size index, this estimate would suggest that intergroup contact decreased prejudice by `r round(as.numeric(fe_model$b), 2)` standard deviations.
::: {.callout-note title="code"}
Fitting meta-analytic models in `metafor` is quite simple. For example, for the fixed-effects model above, we simply ran the `rma()` function and specified that we wanted a fixed-effects analysis.
```{r, opts.label='code'}
fe_model <- rma(yi = d, vi = var_d, data = paluck, method = "FE")
```
Then `summary(fe_model)` gives us the relevant information about the fitted model.
:::
### Limitations of fixed-effects meta-analysis\index{meta-analysis!fixed-effects}
One of the limitations of fixed-effect meta-analysis is that it assumes that the true effect size is, well, *fixed*! In other words, fixed-effect meta-analysis assumes that there is a single effect size that all studies are estimating. This is a stringent assumption. It's easy to imagine that it could be violated. Imagine, for example, that intergroup contact decreased prejudice when the group succeeded at its joint goal but *increased* prejudice when the group failed. If we meta-analyzed two studies under these conditions---one in which intergroup contact substantially increased prejudice and one in which intergroup contact substantially decreased prejudice---it might appear that the true effect of intergroup contact was close to zero, when in fact both of the meta-analyzed studies had large effects.
In Paluck et al.'s meta-analysis, studies differed in several ways that could lead to different true effects. For example, some studies recruited adult participants while others recruited children. If intergroup contact is more or less effective for adults versus children, then it is misleading to talk about a single (i.e., "fixed") intergroup contact effect. Instead, we would say that the effects of intergroup contact vary across studies, an idea called **heterogeneity**\index{heterogeneity}.
Does the concept of heterogeneity\index{heterogeneity} remind you of anything from when we analyzed repeated-measures data in @sec-models on models? Recall that, with repeated-measures data, we had to deal with the possibility of heterogeneity across participants---and one of the ways we did so was by introducing participant-level random intercepts to our regression model. It turns out that we can do a similar thing in meta-analysis to deal with heterogeneity across studies.
### Random-effects meta-analysis\index{meta-analysis!random-effects}
While fixed-effect meta-analysis essentially assumes that all studies in the meta-analysis have the same population effect size, $\mu$, random-effects meta-analysis instead assumes that study effects come from a normal distribution\index{normal distribution} with mean $\mu$ and standard deviation $\tau$.^[Technically, other specifications of random-effects meta-analysis are possible. For example, robust variance estimation does not require making assumptions about the distribution of effects across studies [@hedges2010robust]. These approaches also have other substantial advantages, like their ability to handle effects that are clustered, e.g., because some papers contribute multiple estimates [@hedges2010robust; @pustejovsky2021meta], and their ability to provide better inference in meta-analyses with relatively few studies [@tipton2015small]. For these reasons, we often use these robust methods.] The larger the standard deviation, $\tau$, the more heterogeneous the effects are across studies. A random-effects model then estimates both $\mu$ and $\tau$, for example by maximum likelihood [@dersimonian1986meta; @brockwell2001comparison].
<!-- ^[A confidence interval and $p$-value for the random-effects estimate $\widehat{\mu}$ can be obtained using standard theory for maximum likelihood estimates with an additional adjustment that helps account for uncertainty in estimating $\tau$ [@knapp2003improved].] -->
Like fixed-effect meta-analysis,\index{meta-analysis!fixed-effects} the random-effects estimate\index{meta-analysis!random-effects} of $\widehat{\mu}$ is still a weighted average of studies' effect size estimates:
$$\widehat{\mu} = \frac{ \sum_{i=1}^k w_i \widehat{\theta}_i}{\sum_{i=1}^k w_i}$$
However, in random-effects meta-analysis, the inverse-variance weights now incorporate heterogeneity:\index{heterogeneity} $w_i = 1/\left(\widehat{\tau}^2 + \widehat{\sigma}^2_i \right)$. Where before we had one term in our weights, now we have two. That is because these weights represent the inverse of studies' *marginal* variances, taking into account both statistical error due to their finite sample sizes ($\widehat{\sigma}^2_i$ as before) and also genuine effect heterogeneity ($\widehat{\tau}^2$).
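::: {.callout-note title="code"}
The same "by hand" logic carries over to the random-effects weights. As a sketch, again assuming the data frame `paluck` and borrowing the heterogeneity estimate from the fitted model `re_model` described below:
```{r, opts.label='code'}
# random-effects weights combine within-study variance with the
# estimated between-study heterogeneity (tau-squared)
w_re <- 1 / (re_model$tau2 + paluck$var_d)

# weighted average of the study-level estimates
sum(w_re * paluck$d) / sum(w_re)
```
:::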
<!--^[The estimate of $\widehat{\tau}^2$ is a bit more complicated, but is essentially a weighted average of studies' squared residuals, $\widehat{\theta_i} - \widehat{\mu}$, while subtracting away variation due to statistical error, $\widehat{\sigma}^2_i$ [@dersimonian1986meta; @brockwell2001comparison].] -->
Conducting a random-effects meta-analysis\index{meta-analysis!random-effects} of Paluck et al.'s dataset yields $\widehat{\mu}$ = `r round(as.numeric(re_model$b), 2)`; 95% confidence interval [`r paste0(round(re_model$ci.lb, 2), ", ", round(re_model$ci.ub, 2))`]; $p < 0.001$. That is, *on average across studies*, intergroup contact was associated with a decrease in prejudice of `r round(as.numeric(re_model$b), 2)` standard deviations, substantially larger than the estimate from the fixed-effects model. This meta-analytic estimate is shown as the bottom line of @fig-meta-forest.
::: {.callout-note title="code"}
Fitting a random-effects model requires only a small change to the `method` argument of `rma()`. (We also include the `knha` flag, which adds a correction to the computation of standard errors and $p$-values.)
```{r, opts.label='code'}
re_model <- rma(yi = d, vi = var_d, data = paluck, method = "REML", knha = TRUE)
```
:::
\clearpage
```{r meta-densities}
#| label: fig-meta-densities
#| fig-cap: "Estimated distribution of population effects from random-effects meta-analysis of Paluck et al.'s dataset (heavy red curve) and estimated density of studies' point estimates (thin black curve)."
#| fig-alt: A plot where 2 bell curves representing observed & estimated distributions of effect sizes are similar but not identical.
#| fig-width: 3.5
#| fig-height: 2.25
#| column: margin
# red line: fitted density of *population* effects from meta-analysis
# gray line: just a dumb density estimate from the *point estimates*
ggplot(tibble(x = c(-2, 2)), aes(x = x)) +
# reference lines
geom_vline(xintercept = 0, linewidth = 1, color = pal$grey) +
geom_vline(xintercept = re_model$b, linetype = "dashed", linewidth = 1,
color = pal$red) +
# estimated density of estimates (nonparametric)
geom_density(aes(x = d), data = df, adjust = 1.2) +
# estimated density from meta-analysis (parametric)
stat_function(fun = dnorm, n = 101, linewidth = 1.2, color = pal$red,
args = list(mean = re_model$b, sd = sqrt(re_model$tau2) )) +
scale_x_continuous(breaks = seq(-2, 2, 0.5)) +
scale_y_continuous(breaks = NULL) +
labs(x = "Standardized mean difference", y = "")
```
Based on the random-effects model, intergroup contact effects appear to differ across studies. Paluck et al. estimated that the standard deviation of effects across studies was $\widehat{\tau}$ = `r tau_stats[1]`; 95% confidence interval [`r tau_stats[2]`, `r tau_stats[3]`]. This estimate indicates a substantial amount of heterogeneity!\index{heterogeneity} To visualize these results, we can plot the estimated density of the population effects, which is just a normal distribution\index{normal distribution} with mean $\widehat{\mu}$ and standard deviation $\widehat{\tau}$ (@fig-meta-densities).
This meta-analysis\ highlights an important point: that the overall effect size estimate $\widehat{\mu}$ represents only the *mean* population effect across studies. It tells us nothing about how much the effects *vary* across studies. Thus, we recommend always reporting the heterogeneity\index{heterogeneity} estimate $\widehat{\tau}$, preferably along with other related metrics that help summarize the distribution of effect sizes across studies [@riley2011interpretation; @wang2019simple; @mathur_mam; @npphat]. Reporting the heterogeneity helps readers know how consistent or inconsistent the effects are across studies, which may point to the need to investigate *moderators*\index{moderator} of the effect (i.e., factors that are associated with larger or smaller effects, such as whether participants were adults or children).
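::: {.callout-note title="code"}
One simple way to summarize the distribution of effects, sketched below under the random-effects model's normality assumption, is an approximate range covering 95% of population effects; `metafor`'s `predict()` additionally reports a prediction interval that accounts for uncertainty in $\widehat{\mu}$:
```{r, opts.label='code'}
# approximate range covering 95% of population effects, based on the
# point estimates of mu and tau
mu_hat <- as.numeric(re_model$b)
tau_hat <- sqrt(re_model$tau2)
mu_hat + c(-1, 1) * qnorm(0.975) * tau_hat

# prediction interval that also reflects uncertainty in the estimate of mu
predict(re_model)
```
:::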
One common approach to investigating moderators in meta-analysis is meta-regression, in which moderators are included as covariates in a random-effects meta-analysis model\index{meta-analysis!random-effects} [@thompson2002should]. As in standard regression, coefficients can then be estimated for each moderator, representing the mean difference in population effect between studies with versus without the moderator.
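::: {.callout-note title="code"}
In `metafor`, a meta-regression just adds a `mods` formula to the random-effects model. In this sketch, `adult_sample` is a hypothetical indicator for whether a study's participants were adults; it is not a column in the dataset shown above:
```{r, opts.label='code'}
# the coefficient for adult_sample estimates the difference in average
# population effect between studies with adult vs. child participants
meta_reg <- rma(yi = d, vi = var_d, mods = ~ adult_sample,
                data = paluck, method = "REML", knha = TRUE)
summary(meta_reg)
```
:::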
::: {.callout-note title="depth"}
### Single-paper meta-analysis and pooled analysis {-}
Thus far, we have described meta-analysis\ as a tool for summarizing results reported across multiple papers. However, some people have argued that meta-analysis should also be used to summarize the results of multiple studies reported in a single paper [@goh2016mini]. For instance, in a paper where you describe three different experiments on a hypothesis, you could (1) extract summary information (e.g., means and standard deviations) from each study, (2) compute the effect size, and then (3) combine the effect sizes in a meta-analysis.
Single-paper meta-analyses come with many of the same strengths and weaknesses we've discussed. One unique weakness, though, is that having a small number of studies means you typically have low power to detect heterogeneity\index{heterogeneity} and moderators.\index{moderator} This low power can lead researchers to claim there are no significant differences between their studies. But an alternative explanation is that there simply wasn't enough power to detect those differences!
As an alternative, you can also pool the actual data from the three studies, as opposed to just pooling summary statistics. For example, if you have data from 10 participants in each of the three experiments, you could pool them into a single dataset with 30 participants and include random effects of your condition manipulation across experiments (as described in @sec-models). This strategy is often referred to as **pooled**\index{pooled data analysis} or **integrative** data analysis (and occasionally as "mega-analysis," which sounds cool).\index{mega-analysis|see{pooled data analysis}}
![Example data from three individual studies (left) can be pooled into one data analysis (middle) or meta-analyzed as three effect sizes (right).](images/meta/meta_v_ipd.png){#fig-meta-v-ipd width=90% fig-alt="Spreadsheets where: 3 studies each have 10 rows of data; pooled analysis has all 30 rows; meta-analysis has a row per study."}
\vspace{-1em}
One of the benefits of pooled data analysis\index{pooled data analysis} is that it can give you more power to detect moderators.\index{moderator} For instance, imagine that the effect of an intergroup contact treatment is moderated by age. If we performed a traditional meta-analysis, we would only have three observations in our data set, yielding very low power. However, we have many more observations (and much more variation in the moderator) in the pooled data analysis, which can lead to higher power (@fig-meta-v-ipd).
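As a rough sketch of what such a pooled analysis might look like, using the mixed-effects models from @sec-models (all names here are hypothetical: a trial-level data frame `pooled_data` with columns `prejudice`, `condition`, and `experiment`):
```{r, opts.label='code'}
library(lme4)

# pooled ("mega") analysis: one model over all participants, with the
# condition effect allowed to vary across experiments
# (with very few experiments, a random-intercept-only model may be more stable)
pooled_model <- lmer(prejudice ~ condition + (condition | experiment),
                     data = pooled_data)
summary(pooled_model)
```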
Pooled data analysis\index{pooled data analysis} is not without its own limitations [@cooper2009relative]. And, of course, sometimes it doesn't make as much sense to pool datasets (e.g., when measures are different from one another). Nonetheless, we believe that pooled data analysis and meta-analysis\ are both useful tools to keep in mind for a paper reporting multiple experiments!
:::
## Bias in meta-analysis
Meta-analysis is a great tool for synthesizing evidence across studies, but the accuracy of a meta-analysis can be compromised by bias. We'll talk about two categories of bias here: **within-study**\index{within-study bias} and **across-study** biases\index{across-study bias}. Either type can lead to meta-analytic estimates that are too large, too small, or even in the wrong direction altogether.
<!-- We will now discuss examples of each type of bias as well as ways to address these biases when conducting a meta-analysis. This includes mitigating the biases at the outset through sound meta-analysis design and also assessing the robustness of the ultimate conclusions to possible remaining bias. -->
### Within-study biases
Within-study biases\index{within-study bias}---such as demand characteristics\index{demand characteristics}, confounds, and order effects, all discussed in @sec-design---impact not only the validity of individual studies but also any attempt to synthesize those studies. And of course, if individual study results are affected by analytic flexibility\index{analytic flexibility} ($p$-hacking),\index{p-hacking} meta-analyzing them will result in inflated effect size estimates. In other words: garbage in, garbage out.
For example, @paluck2019contact noted that early studies on intergroup contact almost exclusively used nonrandomized designs. Imagine a hypothetical study where researchers studied a completely ineffective intergroup contact intervention, and nonrandomly assigned low-prejudice people to the intergroup contact condition and high-prejudice people to the control condition. In a scenario like this, the researcher would of course find that prejudice was lower in the intergroup contact condition. This effect, however, would not be a true contact intervention effect but rather a spurious effect of nonrandom assignment (i.e., confounding). Now imagine meta-analyzing many studies with similarly poor designs. The meta-analyst might find impressive evidence of an intergroup contact effect, even if none existed.
To mitigate this problem, meta-analysts often exclude studies that may be especially affected by within-study bias.\index{within-study bias} [For example, @paluck2019contact excluded nonrandomized studies]. Of course, these decisions can't be made on the basis of their effects on the meta-analytic estimate or else this post hoc exclusion itself will lead to bias! For this reason, inclusion and exclusion criteria for meta-analyses should be preregistered whenever possible.
Sometimes certain sources of bias cannot be eliminated by excluding studies---often because studies in a particular domain share certain fundamental limitations (for example, attrition in drug trials). After data have been collected, meta-analysts should also assess studies' risks of bias qualitatively using established rating tools [@sterne2016robins]. Doing so allows the meta-analyst to communicate how much within-study bias\index{within-study bias} there may be.^[If you're interested in assessing within-study bias, you can take a look at the Risk of Bias tool (<https://sites.google.com/site/riskofbiastool/welcome/rob-2-0-tool>) developed by Cochrane, an organization devoted to evidence synthesis.]
<!-- @paluck2019contact did not use such tools, but they could have used it to communicate, for example, the extent to which participants might have differentially dropped out of the study.] -->
Meta-analysts can also conduct sensitivity analyses\index{sensitivity analysis} to assess how much results might be affected by different within-study biases\index{within-study bias} or by excluding certain types of studies [@art]. For example, if nonrandom assignment is a concern, a meta-analyst may run the analyses including only randomized studies, versus including all studies, in order to determine how much including nonrandomized studies changes the meta-analytic estimate.
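::: {.callout-note title="code"}
As a sketch of this kind of sensitivity analysis, assuming a hypothetical indicator column `randomized` in the data frame `paluck`:
```{r, opts.label='code'}
# random-effects estimate using all studies...
re_all <- rma(yi = d, vi = var_d, data = paluck,
              method = "REML", knha = TRUE)

# ...and using only the randomized studies (`randomized` is hypothetical)
re_randomized <- rma(yi = d, vi = var_d,
                     data = subset(paluck, randomized),
                     method = "REML", knha = TRUE)

# comparing the two estimates shows how much the conclusion depends on
# including nonrandomized studies
```
:::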
These two options parallel our discussion of experimental preregistration in @sec-prereg: to allay concerns about results-dependent meta-analysis, researchers can either preregister their analyses ahead of time or else be transparent about their choices after the fact. Sensitivity analyses\index{sensitivity analysis} can allay concerns that a specific choice of exclusion criteria is critically related to the reported results.
### Across-study biases
Across-study biases\index{across-study bias|(} occur if, for example, researchers **selectively report**\index{selective reporting} certain types of findings or selectively publish certain types of findings (publication bias,\index{publication bias} as discussed in @sec-replication and @sec-prereg). Often, these across-study biases favor statistically significant positive results, which means the meta-analytic estimate based on those studies will be inflated relative to the true effect.
::: {.callout-note title="accident report"}
### Quantifying publication bias in the social sciences {-}
It's typically very hard to quantify publication bias\index{publication bias} because you don't know how many studies are out there in researchers' "file drawers"---unpublished studies are by definition not available. But a recent study took advantage of a unique opportunity to try and quantify publication bias within a known pool of studies.
Time-sharing Experiments in the Social Sciences (TESS)\index{Time-sharing Experiments in the Social Sciences (TESS)} is an innovative project that lets researchers apply to run experiments on nationally representative samples\index{representative sample} in the US. In 2014, @franco2014 took advantage of this application process by examining the entire population of 221 studies conducted through TESS.
Using this information, Franco and colleagues examined the records of these studies to determine whether the researchers found statistically significant results, a mixture of statistically significant and nonsignificant results, or only nonsignificant results. Then, they examined the likelihood that these results were published in the scientific literature.
Over 60% of studies with statistically significant results were published, compared to a mere 25% of studies that produced only statistically nonsignificant results. This finding was important because it quantified how strong publication bias actually was, at least in one particular population of studies. This estimate may not be general: for example, perhaps TESS\index{Time-sharing Experiments in the Social Sciences (TESS)} studies were easier to put in the file drawer because they cost nothing for the researchers to run. But even a lower level of publication bias can have a substantial effect on a meta-analysis, meaning that it is crucial to check for---and potentially, correct for---publication bias.\index{publication bias}
:::
As with within-study biases,\index{within-study bias} meta-analysts often try to mitigate across-study biases\index{across-study bias} by being careful about which studies make it into the meta-analysis. Meta-analysts want to capture not only high-profile, published studies on their effect of interest but also studies published in low-profile journals and the so-called gray literature [i.e., unpublished dissertations and theses\; @lefebvre2019searching].^[Evidence is mixed regarding whether including gray literature actually reduces across-study biases in meta-analysis [@tsuji2020addressing; @sapbe], but it is still common practice to try to include this literature.]
There are also statistical methods to help assess how robust the results may be to across-study biases.\index{across-study bias} Among the most popular tools to assess and correct for publication bias is the **funnel plot**\index{funnel plot} [@duval2000trim; @egger1997bias]. A funnel plot shows the relationship between studies' effect estimates and their precision (usually their standard error). These plots are called "funnel plots" because, if there is no publication bias, then as precision increases, the effects "funnel" toward the meta-analytic estimate. As precision decreases, estimates spread out more because of greater measurement error.\index{measurement error} @Fig-meta-funnel-unbiased shows an example of one type of funnel plot\index{funnel plot} [@sapb] for a simulated meta-analysis of 100 studies with no publication bias.\index{publication bias}
```{r meta-funnel-unbiased}
#| label: fig-meta-funnel-unbiased
#| fig-cap: "A significance funnel plot\\index{funnel plot} for a meta-analysis simulated to have no publication bias. Orange points: studies with $p < 0.05$ and positive estimates. Grey points: studies with $p$ $\\ge$ $0.05$ or negative estimates. Black diamond: random-effects estimate of $\\widehat{\\mu}$."
#| fig-alt: A plot showing a funnel-shaped distribution of observed effect sizes against their corresponding standard errors.
#| cap-location: margin
#| out.width: 80%
# simulate meta with publication bias, but keep all studies
set.seed(451)
d_all <- sim_data2(
tibble(k = 100, per.cluster = 1, mu = .5, V = 1, V.gam = 0, # no clustering
sei.min = 0.05, sei.max = 3, true.dist = "norm",
SE.corr = FALSE, select.SE = FALSE, eta = 10), # set publication bias
keep.all.studies = TRUE)
# published studies only
dp <- d_all |> filter(publish == 1)
meta_all <- metafor::rma(yi = yi, vi = vi, data = d_all, method = "REML",
knha = TRUE)
PublicationBias::significance_funnel(yi = meta_all$yi, vi = meta_all$vi,
xmin = -8, xmax = 8) +
labs(color = "") +
theme_get() +
theme(legend.position = "top")
```
::: {.callout-note title="code"}
For this plot, we use the `PublicationBias` package [@braginsky2023pubbias] and the `significance_funnel()` function. (An alternative function is the `metafor` function `funnel()`, which results in a more "classic" funnel plot.\index{funnel plot}) We use our fitted model `re_model`:
```{r, opts.label='code'}
significance_funnel(yi = re_model$yi, vi = re_model$vi)
```
Because meta-analysis is such a well-established method, many of the relevant operations are "plug and play."
:::
As implied by the "funnel" moniker,\index{funnel plot} our plot looks a little like a funnel. Larger studies (those with smaller standard errors) cluster more closely around the mean of `r round(meta_all$b, digits)` than do smaller studies, but large and small studies alike have point estimates centered around the mean. That is, the funnel plot is symmetric.^[Classic funnel plots look more like @fig-meta-classic-funnel. Our version is different in a couple of ways. Most prominently, we don't have the vertical axis reversed (which we think is confusing). We also don't have the left boundary highlighted, because we think folks don't typically select for negative studies.]
\clearpage
```{r meta-classic-funnel}
#| label: fig-meta-classic-funnel
#| fig-cap: "A classic funnel plot.\\index{funnel plot}"
#| fig-alt: A plot showing an inverted funnel-shaped distribution of observed effect sizes against their corresponding standard errors.
#| fig-height: 5
#| cap-location: margin
#| out.width: 80%
meta_p <- metafor::rma(yi = yi, vi = vi, data = dp, method = "REML", knha = TRUE)
metafor::funnel(meta_all, xlab = "Standardized mean difference", family = .font)
```
Not all funnel plots\index{funnel plot} are symmetric! @Fig-meta-funnel-biased shows what happens to our hypothetical meta-analysis if all studies with $p<0.05$ and positive estimates are published, but only 10% of studies with $p \ge 0.05$ or negative estimates are published. The introduction of publication bias\index{publication bias} dramatically inflates the pooled estimate from `r round(meta_all$b, digits)` to `r round(meta_p$b, digits)`. Also, there appears to be a correlation between studies' estimates and their standard errors, such that smaller studies tend to have larger estimates than do larger studies. This correlation is often called funnel plot asymmetry because the funnel plot starts to look like a right triangle rather than a funnel. Funnel plot\index{funnel plot} asymmetry *can* be a diagnostic for publication bias, though it isn't always a perfect indicator, as we'll see in the next subsection.
```{r meta-funnel-biased}
#| label: fig-meta-funnel-biased
#| fig-cap: "A significance funnel plot\\index{funnel plot} for the same simulated meta-analysis after publication bias has occurred. Orange points: studies with $p < 0.05$ and positive estimates. Grey points: studies with $p$ $\\ge$ $0.05$ or negative estimates. Black diamond: random-effects estimate of $\\widehat{\\mu}$."
#| fig-alt: A plot showing a positively sloped linear distribution of observed effect sizes against their corresponding standard errors.
#| cap-location: margin
#| out.width: 80%
PublicationBias::significance_funnel(yi = meta_p$yi, vi = meta_p$vi,
xmin = -8, xmax = 8) +
labs(color = "") +
theme_get() +
theme(legend.position = "top")
```
### Across-study bias correction\index{across-study bias}
How do we identify and correct bias across studies? Given that some forms of publication bias\index{publication bias} induce a correlation between studies' point estimates and their standard errors, several popular statistical methods, such as trim-and-fill [@duval2000trim] and Egger's regression [@egger1997bias], are designed to quantify funnel plot\index{funnel plot} asymmetry.
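::: {.callout-note title="code"}
Both diagnostics are available in `metafor`. As a sketch, applied to the random-effects model we fit earlier for the @paluck2019contact data:
```{r, opts.label='code'}
# Egger-type regression test for funnel plot asymmetry
regtest(re_model)

# trim-and-fill: imputes putatively "missing" studies and re-estimates
# the pooled effect
trimfill(re_model)
```
:::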
Funnel plot asymmetry\index{funnel plot} does not always imply that there is publication bias, though. Nor does publication bias always cause funnel plot asymmetry. Sometimes funnel plot asymmetry is driven by genuine differences in the effects being studied in small and large studies [@egger1997bias; @lau2006case]. For example, in a meta-analysis of intervention studies, if the most effective interventions are also the most expensive or difficult to implement, these highly effective interventions might be used primarily in the smallest studies ("small study effects").
Funnel plots\index{funnel plot} and related methods are best suited to detecting publication bias in which (1) small studies with large positive point estimates are more likely to be published than small studies with small or negative point estimates; and (2) the largest studies are published regardless of the magnitude of their point estimates. That model of publication bias is sometimes what is happening, but not always!
A more flexible approach for detecting publication bias\index{publication bias} uses **selection models**\index{selection model}. These models can detect other forms of publication bias that funnel plots\index{funnel plot} may not detect, such as publication bias that favors *significant* results. We won't cover these methods in detail here, but we think they are a better approach to the question, along with related sensitivity analyses.\index{sensitivity analysis}^[High-level overviews of selection models are given in @mcshane2016adjusting and @smt. For more methodological detail, see @hedges1984estimation, @iyengar1988, and @vevea1995. For a tutorial on fitting and interpreting selection models, see @smt. For sensitivity analyses, see @sapb.]
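::: {.callout-note title="code"}
For readers who want to try a selection model, `metafor`'s `selmodel()` function fits several basic variants. Here is a minimal sketch with a single selection step at one-tailed $p = .025$ (i.e., allowing "significant" positive results to be published at a different rate than other results), following the form used in the `metafor` documentation:
```{r, opts.label='code'}
# selection models are typically fit on top of an ML-estimated model
plain_model <- rma(yi = d, vi = var_d, data = paluck, method = "ML")

# step-function selection model with one cutpoint at p = .025 (one-tailed)
selmodel(plain_model, type = "stepfun", steps = 0.025)
```
:::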
You may also have heard of "$p$-methods" to detect across-study biases\index{across-study bias} such as $p$-curve and $p$-uniform [@simonsohn2014p; @van2015meta]. These methods essentially assess whether the significant $p$-values "bunch up" just under 0.05, which is taken to indicate publication bias. These methods are increasingly popular in psychology and have their merits. However, they are actually simplified versions of selection models [e.g., @hedges1984estimation] that work only under considerably more restrictive settings than the original selection models\index{selection model} [for example, when there is no heterogeneity\index{heterogeneity} across studies\; @mcshane2016adjusting]. For this reason, it is usually (although not always) better to use selection models in place of the more restrictive $p$-methods.
Going back to our running example, Paluck et al. used a regression-based approach to assess and correct for publication bias. This approach provided significant evidence of a relationship between the standard error and effect size (i.e., an asymmetric funnel plot\index{funnel plot}). Again, this asymmetry could reflect publication bias or other sources of correlation between studies' estimates and their standard errors. Paluck et al. also used this same regression-based approach to try to correct for potential publication bias.\index{publication bias} Results from this model indicated that the bias-corrected effect size estimate was close to zero. In other words, even though all studies estimated that intergroup contact decreased prejudice, it is still possible that there are unpublished studies that did not find this (or found that intergroup contact increased prejudice).
::: {.callout-note title="accident report"}
### Garbage in, garbage out? Meta-analyzing potentially problematic research {-}
Botox can help eliminate wrinkles. But some researchers have suggested that, when used to paralyze the muscles associated with frowning, Botox may also help treat clinical depression. As surprising as this claim may sound, a quick examination of the literature would lead many to conclude that this treatment works. Studies that randomly assign depressed patients to receive either Botox or saline injections do indeed find that Botox recipients show decreased depression. And when you combine all available evidence in a meta-analysis, you find that this effect is quite large: *d* = 0.83, 95% CI [0.52, 1.14].
As @coles2019does argued, though, this estimated effect may be impacted by both within- and across-study bias. First, participants are not supposed to know whether they have been randomly assigned to receive Botox or a control saline injection. But only one of these treatments leads the upper half of your face to be paralyzed! After a couple weeks, you're pretty likely to know whether you received the Botox treatment or the control saline injection. Thus, the apparent effect of Botox on depression could instead be a placebo effect.\index{placebo effect}
Second, only 50% of the outcomes that researchers measured were reported in the final publications, raising concerns about selective reporting.\index{selective reporting} Perhaps researchers examining the effects of Botox on depression only reported the measures that showed a positive effect, not the ones that didn't.
Taken together, these two criticisms suggest that, despite the impressive meta-analytic estimate, the effect of Botox on depression is far from certain.
:::
\index{across-study bias|)}
## Chapter summary: Meta-analysis
Taken together, Paluck and colleagues' use of meta-analysis provided several important insights that would have been easy to miss in a nonquantitative review. First, despite a preponderance of nonsignificant findings, intergroup contact interventions were estimated to decrease prejudice by `r round(as.numeric(re_model$b), 2)` standard deviations on average. On the other hand, there was considerable heterogeneity\index{heterogeneity} in intergroup contact effects, suggesting important moderators\index{moderator} of the effectiveness of these interventions. And finally, publication bias\index{publication bias} was a substantial concern, indicating a need for follow-up research using a registered report\index{registered report} format that will be published regardless of whether the outcome is positive (@sec-prereg).
Overall, meta-analysis is a key technique for aggregating evidence across studies. Meta-analysis allows researchers to move beyond the bias of naive techniques like vote-counting and toward a more quantitative summary of an experimental effect. Unfortunately, a meta-analysis is only as good as the literature it's based on, so the aspiring meta-analyst must be aware of both within- and across-study biases!
::: {.callout-note title="discussion questions"}
1. Imagine that you read the following result in the abstract of a meta-analysis: "In 83 randomized studies of middle school children, replacing one hour of class time with mindfulness meditation significantly improved standardized test scores (standardized mean difference\index{standardized mean difference (SMD)} $\widehat{\mu} = 0.05$; 95% confidence interval: [$0.01, 0.09$]; $p<0.05$)." Why is this a problematic way to report on meta-analysis results? Suggest a better sentence to replace this one.
2. As you read the rest of the meta-analysis, you find that the authors conclude that "these findings demonstrate robust benefits of meditation for children, suggesting that test scores improve even when the meditation is introduced as a replacement for normal class time." You recall that the heterogeneity\index{heterogeneity} estimate was $\widehat{\tau} = 0.90$. Do you think that this result regarding the heterogeneity tends to support, or rather tends to undermine, the concluding sentence of the meta-analysis? Why?
3. What kinds of within-study biases\index{within-study bias} would concern you in the meta-analysis described in the prior two questions? How might you assess the credibility of the meta-analyzed studies and of the meta-analysis as a whole in light of these possible biases?
4. Imagine you conduct a meta-analysis on a literature in which statistically significant results in either direction are much more likely to be published than nonsignificant results. Draw the funnel plot\index{funnel plot} you would expect to see. Is the plot symmetric or asymmetric?
5. Why do you think small studies receive more weight in random-effects meta-analysis\index{meta-analysis!random-effects} than in fixed-effects meta-analysis?\index{meta-analysis!fixed-effects} Can you see why this is true mathematically based on the equations given above, and can you also explain the intuition in simple language?
:::
::: {.callout-note title="readings"}
* A nice, free textbook with lots of good code examples: Harrer, Mathias, Pim Cuijpers, Toshi A. Furukawa, and David D. Ebert [-@harrer2021]. *Doing Meta-Analysis with R: A Hands-On Guide*. Chapman & Hall/CRC Press. Available free online at <https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R>.
:::