diff --git a/_freeze/chapters/05-dags/execute-results/html.json b/_freeze/chapters/05-dags/execute-results/html.json index 8af2e566..4d43a3b4 100644 --- a/_freeze/chapters/05-dags/execute-results/html.json +++ b/_freeze/chapters/05-dags/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "601d2587c2ab39f18197db14c081c724", + "hash": "ef54ae368f094413395f1ad22d0328d5", "result": { - "markdown": "# Expressing causal questions as DAGs {#sec-dags}\n\n\n\n\n\n## Visualizing Causal Assumptions\n\n> Draw your assumptions before your conclusions --@hernan2021\n\nCausal diagrams are a tool to visualize your assumptions about the causal structure of the questions you're trying to answer.\nIn a randomized experiment, the causal structure is quite simple.\nWhile there may be many causes of an outcome, the only cause of the exposure is the randomization process itself (we hope!).\nIn many non-randomized settings, however, the structure of your question can be a complex web of causality.\nCausal diagrams help communicate what we think this structure looks like.\nIn addition to being open about what we think the causal structure is, causal diagrams have incredible mathematical properties that allow us to identify a way to estimate unbiased causal effects even with observational data.\n\nCausal diagrams are also increasingly common.\nData from a review of causal diagrams in applied health research papers show a drastic increase in use over time [@Tennant2021].\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Percentage of health research papers using causal diagrams over time.](05-dags_files/figure-html/fig-dag-usage-1.png){#fig-dag-usage width=672}\n:::\n:::\n\n\nThe causal diagrams we use are also called directed acyclic graphs (DAGs)[^1].\nThese graphs are directed because they include arrows going in a specific direction.\nThey're acyclic because they don't go in circles; a variable can't cause itself, for instance.\nDAGs are used for various problems, but we're specifically concerned with *causal* DAGs.\nThis class of DAGs is sometimes called Structural Causal Models (SCMs) because they are a model of the causal structure of a question [@hernan2021; @Pearl_Glymour_Jewell_2021].\n\n[^1]: An essential but rarely observed detail of DAGs is that dag is also an [affectionate Australian insult](https://en.wikipedia.org/wiki/Dag_(slang)) referring to the dung-caked fur of a sheep, a *daglock*.\n\nDAGs depict causal relationships between variables.\nVisually, they depict these relationships with *edges* and *nodes*.\nEdges are the arrows going from one variable to another, sometimes called arcs or just arrows.\nNodes are the variables themselves, sometimes called vertices, points, or just variables.\nIn @fig-dag-basic, there are two nodes, `x` and `y`, and one edge going from `x` to `y`.\nHere, we are saying that `x` causes `y`.\n`y` \"listens\" to `x` [@Pearl_Glymour_Jewell_2021].\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A causal directed acyclic graph (DAG). DAGs depict causal relationships. 
In this DAG, the assumption is that `x` causes `y`.](05-dags_files/figure-html/fig-dag-basic-1.png){#fig-dag-basic width=288}\n:::\n:::\n\n\nIf we're interested in the causal effect of `x` on `y`, we're trying to estimate a numeric representation of that arrow.\nUsually, though, there are many other variables and arrows in the causal structure of a given question.\nA series of arrows is called a *path*.\nThere are three types of paths you'll see in DAGs: forks, chains, and colliders (sometimes called inverse forks).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three types of causal relationships: forks, chains, and colliders. The direction of the arrows and the relationships of interest dictate which type of path a series of variables is. Forks represent a mutual cause, chains represent direct causes, and colliders represent a mutual descendant.](05-dags_files/figure-html/fig-dag-path-types-1.png){#fig-dag-path-types width=672}\n:::\n:::\n\n\nForks represent a common cause of two variables.\nHere, we're saying that `q` causes both `x` and `y`, the traditional definition of a confounder.\nThey're called forks because the arrows pointing to `x` and `y` branch out from `q` in different directions.\nChains, on the other hand, represent a series of arrows going in the same direction.\nHere, `q` is called a *mediator*: it is along the causal path from `x` to `y`.\nIn this diagram, the only path from `x` to `y` is mediated through `q`.\nFinally, a collider is a path where two arrowheads meet at a variable.\nBecause causality always goes forward in time, this naturally means that the collider variable is caused by two other variables.\nHere, we're saying that `x` and `y` both cause `q`.\n\n::: callout-tip\n## Are DAGs SEMs?\n\nIf you're familiar with structural equation models (SEMs), a modeling technique commonly used in psychology and other social science settings, you may notice some similarities between SEMs and DAGs.\nDAGs are a form of *non-parametric* SEM.\nSEMs estimate entire graphs using parametric assumptions.\nCausal DAGs, on the other hand, don't estimate anything; an arrow going from one variable to another says nothing about the strength or functional form of that relationship, only that we think it exists.\n:::\n\nOne of the significant benefits of DAGs is that they help us identify sources of bias and, often, provide clues on how to address them.\nHowever, talking about an unbiased effect estimate only makes sense when we have a specific causal question in mind.\nSince each arrow represents a cause, it's causality all the way down; no individual arrow is inherently problematic.\nHere, we're interested in the effect of `x` on `y`.\nThis question defines which paths we're interested in and which we're not.\n\nThese three types of paths have different implications for the statistical relationship between `x` and `y`.\nIf we only look at the correlation between the two variables under these assumptions:\n\n1. In the fork, `x` and `y` will be associated, despite there being no arrow from `x` to `y`.\n2. In the chain, `x` and `y` are related only through `q`.\n3. 
In the collider, `x` and `y` will *not* be related.\n\nPaths that transmit association are called *open paths*.\nPaths that do not transmit association are called *closed paths*.\nForks and chains are open, while colliders are closed.\n\nSo, should we adjust for `q`?\nThat depends on the nature of the path.\nForks are confounding paths.\nBecause `q` causes both `x` and `y`, `x` and `y` will have a spurious association.\nThey both contain information from `q`, their mutual cause.\nThat mutual causal relationship makes `x` and `y` associated statistically.\nAdjusting for `q` will *block* the bias from confounding and give us the true relationship between `x` and `y`.\n\n::: callout-tip\n## Adjustment\n\nWe can use a variety of techniques to account for a variable.\nWe use the term \"adjustment\" or \"controlling for\" to refer to any technique that removes the effect of variables we're not interested in.\n:::\n\n@fig-confounder-scatter depicts this effect visually.\nHere, `x` and `y` are continuous, and by definition of the DAG, they are unrelated.\n`q`, however, causes both.\nThe unadjusted effect is biased because it includes information about the open path from `x` to `y` via `q`.\nWithin levels of `q`, however, `x` and `y` are unrelated.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots of the relationship between `x` and `y`. With forks, the relationship is biased by `q`. When accounting for `q`, we see the true null relationship.](05-dags_files/figure-html/fig-confounder-scatter-1.png){#fig-confounder-scatter width=672}\n:::\n:::\n\n\nFor chains, whether or not we adjust for mediators depends on the research question.\nHere, adjusting for `q` would result in a null estimate of the effect of `x` on `y`.\nBecause the only effect of `x` on `y` is via `q`, no other effect remains.\nThe effect of `x` on `y` mediated by `q` is called the *indirect* effect, while the effect of `x` on `y` directly is called the *direct* effect.\nIf we're only interested in the direct effect, controlling for `q` might be what we want.\nIf we want to know about both effects, we shouldn't try to adjust for `q`.\nWe'll learn more about estimating these and other mediation effects in @sec-mediation.\n\n@fig-mediator-scatter shows this effect visually.\nThe unadjusted effect of `x` on `y` represents the total effect.\nSince the total effect is due entirely to the path mediated by `q`, when we adjust for `q`, no relationship remains.\nThis null effect is the direct effect.\nNeither of these effects is due to bias, but each answers a different research question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots of the relationship between `x` and `y`. With chains, whether and how we should account for `q` depends on the research question. Without doing so, we see the impact of the total effect of `x` and `y`, including the indirect effect via `q`. 
When accounting for `q`, we see the direct (null) effect of `x` on `y`.](05-dags_files/figure-html/fig-mediator-scatter-1.png){#fig-mediator-scatter width=672}\n:::\n:::\n\n\nColliders are different.\nIn the collider DAG of @fig-dag-path-types, `x` and `y` are *not* associated, but both cause `q`.\nAdjusting for `q` has the opposite effect of adjusting for a confounder: it *opens* a biasing pathway.\nSometimes, people draw an edge connecting `x` and `y` to represent the path opened up by conditioning on a collider.\n\nVisually, we can see this happen when `x` and `y` are continuous and `q` is binary.\nIn @fig-collider-scatter, when we don't include `q`, we find no relationship between `x` and `y`.\nThat's the correct result.\nHowever, when we include `q`, we can detect information about both `x` and `y`, and they appear correlated: across levels of `x`, those with `q = 0` have lower levels of `y`.\nAssociation seemingly flows back in time.\nOf course, that can't happen from a causal perspective, so controlling for `q` is the wrong thing to do.\nWe end up with a biased effect of `x` on `y`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots of the relationship between `x` and `y`. The unadjusted relationship between the two is unbiased. When accounting for `q`, we open a colliding backdoor path and bias the relationship between `x` and `y`.](05-dags_files/figure-html/fig-collider-scatter-1.png){#fig-collider-scatter width=672}\n:::\n:::\n\n\nHow can this be?\nSince `x` and `y` happen before `q`, `q` can't impact them.\nLet's turn the DAG on its side and consider @fig-collider-time.\nIf we break down the two time points, at time point 1, `q` hasn't happened yet, and `x` and `y` are unrelated.\nAt time point 2, `q` happens due to `x` and `y`.\n*But causality only goes forward in time*.\n`q` happening later can't change the fact that `x` and `y` happened independently in the past.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A collider relationship over two points in time. At time point one, there is no relationship between `x` and `y`. 
Both cause `q` by time point two, but this does not change what already happened at time point one.](05-dags_files/figure-html/fig-collider-time-1.png){#fig-collider-time width=672}\n:::\n:::\n\n\nCausality only goes forward.\nAssociation, however, is time-agnostic.\nIt's just an observation about the numerical relationships between variables.\nWhen we control for the future, we risk introducing bias.\nIt takes time to develop an intuition for this.\nConsider a case where `x` and `y` are the only causes of `q`, and all three variables are binary.\nWhen *either* `x` or `y` equals 1, then `q` happens.\nIf we know `q = 1` and `x = 0`, then logically it must be that `y = 1`.\nThus, once we know `q`, learning about `x` also gives us information about `y`.\nThis example is extreme, but it shows how this type of bias, sometimes called *collider-stratification bias* or *selection bias*, occurs: conditioning on `q` provides statistical information about `x` and `y` and distorts their relationship [@Banack2023].\n\n
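We can make this concrete with a toy simulation.\nHere's a minimal sketch of the binary example above (the code and numbers are purely illustrative, not an analysis from this chapter): marginally, `x` tells us nothing about `y`, but within the `q = 1` stratum, the two become negatively correlated.\n\n```r\n# a sketch of collider-stratification bias with three binary variables\nset.seed(1)\nn <- 1e4\nx <- rbinom(n, 1, 0.5)\ny <- rbinom(n, 1, 0.5) # independent of x\nq <- as.numeric(x == 1 | y == 1) # q happens when either x or y does\n\ncor(x, y) # approximately 0\ncor(x[q == 1], y[q == 1]) # approximately -0.5: stratifying on q distorts x and y\n```\n\n::: callout-tip\n## Exchangeability revisited\n\nWe commonly refer to exchangeability as the assumption of no confounding.\nActually, this isn't quite right.\nIt's the assumption of no *open, non-causal* paths [@hernan2021].\nMany times, these are confounding pathways.\nHowever, conditioning on a collider can also open paths.\nEven though these aren't confounders, doing so creates non-exchangeability between the two groups: they are different in a way that matters to the exposure and outcome.\n\nOpen, non-causal paths are also called *backdoor paths*.\nWe'll use this terminology often because it captures the idea well: these are any open paths biasing the effect we're interested in estimating.\n:::\n\nCorrectly identifying the causal structure between the exposure and outcome thus helps us 1) communicate the assumptions we're making about the relationships between variables and 2) identify sources of bias.\nImportantly, in doing 2), we are also often able to identify ways to prevent bias based on the assumptions in 1).\nIn the simple case of the three DAGs in @fig-dag-path-types, we know whether or not to control for `q` depending on the nature of the causal structure.\nThe set or sets of variables we need to adjust for is called the *adjustment set*.\nDAGs can help us identify adjustment sets even in complex settings [@vanderzander2019].\n\n::: callout-tip\n## What about interaction?\n\nDAGs don't make a statement about interaction or effect measure modification, even though they are an important part of inference.\nTechnically, interaction is a matter of the functional form of the relationships in the DAG.\nMuch as we don't need to specify how we will model a variable in the DAG (e.g., with splines), we don't need to determine how variables statistically interact.\nThat's a matter for the modeling stage.\n\nThere are several ways we use interactions in causal inference.\nIn one extreme, they are simply a matter of functional form: interaction terms are included in models but marginalized to get an overall causal effect.\nConversely, we're interested in *joint causal effects*, where the two variables interacting are both causal.\nIn between, we can use interaction terms to identify *heterogeneous causal effects*, which vary by a second variable that is not assumed to be causal.\nAs with many tools in causal inference, we use the same statistical technique in many ways to answer different questions.\nWe'll revisit this topic in detail in [Chapter -@sec-interaction].\n\nMany people have tried expressing 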
interaction in DAGs using different types of arcs, nodes, and other annotations, but no approach has taken off as the preferred way [@weinberg2007; @Nilsson2021].\n:::\n\nLet's take a look at an example in R.\nWe'll learn to build DAGs, visualize them, and identify important information like adjustment sets.\n\n## DAGs in R\n\nFirst, consider a research question: Does listening to a comedy podcast the morning before an exam improve graduate students' test scores?\nWe can diagram this using the method described in @sec-diag (@fig-diagram-podcast).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A sentence diagram for the question: Does listening to a comedy podcast the morning before an exam improve graduate student test scores? The population is graduate students. The start time is morning, and the outcome time is after the exam.](../images/podcast-diagram.png){#fig-diagram-podcast width=2267}\n:::\n:::\n\n\nThe tool we'll use for making DAGs is ggdag.\nggdag is a package that connects ggplot2, the most powerful visualization tool in R, to dagitty, an R package with sophisticated algorithms for querying DAGs.\n\nTo create a DAG object, we'll use the `dagify()` function. `dagify()` returns a `dagitty` object that works with both the dagitty and ggdag packages.\nThe `dagify()` function takes formulas, separated by commas, that specify causes and effects, with the left-hand side of the formula defining the effect and the right-hand side all of the factors that cause it.\nThis is just like the type of formula we specify for most regression models in R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n effect1 ~ cause1 + cause2 + cause3,\n effect2 ~ cause1 + cause4,\n ...\n)\n```\n:::\n\n\nWhat are all of the factors that cause graduate students to listen to a podcast the morning before an exam?\nWhat are all of the factors that could cause a graduate student to do well on a test?\nLet's posit some here.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggdag)\ndagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ndag {\nexam\nhumor\nmood\npodcast\nprepared\nhumor -> podcast\nmood -> exam\nmood -> podcast\nprepared -> exam\nprepared -> podcast\n}\n```\n\n\n:::\n:::\n\n\nIn the code above, we assume that:\n\n- a graduate student's mood, sense of humor, and how prepared they feel for the exam could influence whether they listened to a podcast the morning of the test\n- their mood and how prepared they are also influence their exam score\n\nNotice we *do not* see podcast in the exam equation; this means that we assume that there is **no** causal relationship between podcast and the exam score.\n\nThere are some other useful arguments you'll often find yourself supplying to `dagify()`:\n\n- `exposure` and `outcome`: Telling ggdag the variables that are the exposure and outcome of your research question is required for many of the most valuable queries we can make of DAGs.\n- `latent`: This argument lets us tell ggdag that some variables in the DAG are unmeasured. `latent` helps identify valid adjustment sets with the data we actually have.\n- `coords`: Coordinates for the variables. You can choose between algorithmic or manual layouts, as discussed below. 
We'll use `time_ordered_coords()` here.\n- `labels`: A character vector of labels for the variables.\n\nLet's create a DAG object, `podcast_dag`, with some of these attributes, then visualize the DAG with `ggdag()`.\n`ggdag()` returns a ggplot object, so we can add additional layers to the plot, like themes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared,\n coords = time_ordered_coords(\n list(\n # time point 1\n c(\"prepared\", \"humor\", \"mood\"),\n # time point 2\n \"podcast\",\n # time point 3\n \"exam\"\n )\n ),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n humor = \"humor\",\n prepared = \"prepared\"\n )\n)\nggdag(podcast_dag, use_labels = \"label\", text = FALSE) +\n theme_dag()\n```\n\n::: {.cell-output-display}\n![Proposed DAG to answer the question: Does listening to a comedy podcast the morning before an exam improve graduate students' test scores?](05-dags_files/figure-html/fig-dag-podcast-1.png){#fig-dag-podcast width=384}\n:::\n:::\n\n\n::: callout-note\nFor the rest of the chapter, we'll use `theme_dag()`, a ggplot theme from ggdag meant for DAGs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntheme_set(\n theme_dag() %+replace%\n # also add some additional styling\n theme(\n legend.position = \"bottom\",\n strip.text.x = element_text(margin = margin(2, 0, 2, 0, \"mm\"))\n )\n)\n```\n:::\n\n:::\n\n::: callout-tip\n## DAG coordinates\n\nYou don't need to specify coordinates to ggdag.\nIf you don't, it uses algorithms designed for automatic layouts.\nThere are many such algorithms, and they focus on different aspects of the layout, e.g., the shape, the space between the nodes, minimizing how many edges cross, etc.\nThese layout algorithms usually have a component of randomness, so it's good to use a seed if you want to get the same result.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# no coordinates specified\nset.seed(123)\npod_dag <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared\n)\n\n# automatically determine layouts\npod_dag |>\n ggdag(text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](05-dags_files/figure-html/unnamed-chunk-14-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nWe can also ask for a specific layout, e.g., the popular Sugiyama algorithm for DAGs [@sugiyama1981].\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npod_dag |>\n ggdag(layout = \"sugiyama\", text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](05-dags_files/figure-html/unnamed-chunk-15-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nFor causal DAGs, the time-ordered layout algorithm is often best, which we can specify with `time_ordered_coords()` or `layout = \"time_ordered\"`.\nWe'll discuss time ordering in greater detail below.\nEarlier, we explicitly told ggdag which variables were at which time points, but we don't need to.\nNotice, though, that the time ordering algorithm puts `podcast` and `exam` at the same time point since neither causes the other (and thus neither needs to predate the other).\nWe know that's not the case: listening to the podcast happened before taking the exam.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npod_dag |>\n ggdag(layout = \"time_ordered\", text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](05-dags_files/figure-html/unnamed-chunk-16-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nYou can manually specify coordinates using 
a list or data frame and provide them to the `coords` argument of `dagify()`.\nAdditionally, because ggdag is based on dagitty, you can use `dagitty.net` to create and organize a DAG using a graphical interface, then export the result as dagitty code for ggdag to consume.\n\nAlgorithmic layouts are lovely for fast visualization of DAGs or particularly complex graphs.\nOnce you want to share your DAG, it's usually best to be more intentional about the layout, perhaps by specifying the coordinates manually.\n
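For instance, here's a minimal sketch of a manual layout for the podcast DAG (the coordinate values are arbitrary choices for illustration):\n\n```r\n# a manual layout: one x and y position per node, chosen by hand\ndagify(\n  podcast ~ mood + humor + prepared,\n  exam ~ mood + prepared,\n  coords = list(\n    x = c(humor = 1, mood = 1, prepared = 1, podcast = 2, exam = 3),\n    y = c(humor = 1, mood = 0, prepared = -1, podcast = 0, exam = 0)\n  )\n) |>\n  ggdag(text_size = 2.8)\n```\n\n`time_ordered_coords()` is often the best of both worlds, and we'll use it for most DAGs in this book.\n:::\n\nWe've specified the DAG for this question and told ggdag what the exposure and outcome of interest are.\nAccording to the DAG, there is no direct causal relationship between listening to a podcast and exam scores.\nAre there any other open paths?\n`ggdag_paths()` takes a DAG and visualizes the open paths.\nIn @fig-paths-podcast, we see two open paths: `podcast <- mood -> exam` and `podcast <- prepared -> exam`. These are both forks---*confounding pathways*. Since there is no causal relationship between listening to a podcast and exam scores, the only open paths are *backdoor* paths, these two confounding pathways.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag |>\n # show the whole dag as a light gray \"shadow\"\n # rather than just the paths\n ggdag_paths(shadow = TRUE, text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![`ggdag_paths()` visualizes open paths in a DAG. There are two open paths in `podcast_dag`: the fork from `mood` and the fork from `prepared`.](05-dags_files/figure-html/fig-paths-podcast-1.png){#fig-paths-podcast width=672}\n:::\n:::\n\n\n::: callout-tip\n`dagify()` returns a `dagitty()` object, but underneath the hood, ggdag converts `dagitty` objects to tidy DAGs, a structure that holds both the `dagitty` object and a `dataframe` about the DAG.\nThis is handy if you want to manipulate the DAG programmatically.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag_tidy <- podcast_dag |>\n tidy_dagitty()\n\npodcast_dag_tidy\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A DAG with 5 nodes and 5 edges\n#\n# Exposure: podcast\n# Outcome: exam\n#\n# A tibble: 7 × 9\n name x y direction to xend yend\n <chr> <dbl> <dbl> <fct> <chr> <dbl> <dbl>\n1 exam 3 0 <NA> <NA> NA NA\n2 humor 1 0 -> podcast 2 0\n3 mood 1 1 -> exam 3 0\n4 mood 1 1 -> podcast 2 0\n5 podcast 2 0 <NA> <NA> NA NA\n6 prepared 1 -1 -> exam 3 0\n7 prepared 1 -1 -> podcast 2 0\n# ℹ 2 more variables: circular <lgl>, label <chr>\n```\n\n\n:::\n:::\n\n\nMost of the quick plotting functions transform the `dagitty` object to a tidy DAG if it's not already, then manipulate the data in some capacity.\nFor instance, `dag_paths()` underlies `ggdag_paths()`; it returns a tidy DAG with data about the paths.\nYou can use several dplyr functions on these objects directly.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag_tidy |>\n dag_paths() |>\n filter(set == 2, path == \"open path\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A DAG with 3 nodes and 2 edges\n#\n# Exposure: podcast\n# Outcome: exam\n#\n# A tibble: 4 × 11\n set name x y direction to xend yend\n <chr> <chr> <dbl> <dbl> <fct> <chr> <dbl> <dbl>\n1 2 exam 3 0 <NA> <NA> NA NA\n2 2 podcast 2 0 <NA> <NA> NA NA\n3 2 prepar… 1 -1 -> exam 3 0\n4 2 prepar… 1 -1 -> podc… 2 0\n# ℹ 3 more variables: circular <lgl>, label <chr>,\n# path <chr>\n```\n\n\n:::\n:::\n\n\nTidy DAGs are not pure data frames, but you can retrieve either the `dataframe` or `dagitty` object to work with them directly using `pull_dag_data()` or `pull_dag()`.\n`pull_dag()` can be useful 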
when you want to work with dagitty functions:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dagitty)\npodcast_dag_tidy |>\n pull_dag() |>\n paths()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$paths\n[1] \"podcast <- mood -> exam\" \n[2] \"podcast <- prepared -> exam\"\n\n$open\n[1] TRUE TRUE\n```\n\n\n:::\n:::\n\n:::\n\nBackdoor paths pollute the statistical association between `podcast` and `exam`, so we must account for them.\n`ggdag_adjustment_set()` visualizes any valid adjustment sets implied by the DAG.\n@fig-podcast-adustment-set shows adjusted variables as squares.\nAny arrows coming out of adjusted variables are removed from the DAG because the path is no longer open at that variable.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggdag_adjustment_set(\n podcast_dag,\n text = FALSE,\n use_labels = \"label\"\n)\n```\n\n::: {.cell-output-display}\n![A visualization of the minimal adjustment set for the podcast-exam DAG. If this DAG is correct, two variables are required to block the backdoor paths: `mood` and `prepared`.](05-dags_files/figure-html/fig-podcast-adustment-set-1.png){#fig-podcast-adustment-set fig-align='center' width=384}\n:::\n:::\n\n\n@fig-podcast-adustment-set shows the *minimal adjustment set*.\nBy default, ggdag returns the set(s) that can close all backdoor paths with the fewest number of variables possible.\nIn this DAG, that's just one set: `mood` and `prepared`.\nThis set makes sense because there are two backdoor paths, and the only other variables on them besides the exposure and outcome are these two variables.\nSo, at minimum, we must account for both to get a valid estimate.\n\n
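We can also verify a candidate set ourselves with dagitty's `isAdjustmentSet()`; here's a quick sketch (assuming the dagitty package is loaded, as below):\n\n```r\nlibrary(dagitty)\n# both variables together close both backdoor paths\nisAdjustmentSet(podcast_dag, c(\"mood\", \"prepared\"))\n# either one alone leaves a backdoor path open\nisAdjustmentSet(podcast_dag, \"mood\")\n```\n\n::: callout-tip\n`ggdag()` and friends usually use `tidy_dagitty()` and `dag_*()` or `node_*()` functions to change the underlying data frame.\nSimilarly, the quick plotting functions use ggdag's geoms to visualize the resulting DAG(s).\nIn other words, you can use the same data manipulation and visualization strategies that you use day-to-day directly with ggdag.\n\nHere's a condensed version of what `ggdag_adjustment_set()` is doing:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npodcast_dag_tidy |>\n # add adjustment sets to data\n dag_adjustment_sets() |>\n ggplot(aes(\n x = x,\n y = y,\n xend = xend,\n yend = yend,\n color = adjusted,\n shape = adjusted\n )) +\n # ggdag's custom geoms: add nodes, edges, and labels\n geom_dag_point() +\n # remove adjusted paths\n geom_dag_edges_link(data = \\(.df) filter(.df, adjusted != \"adjusted\")) +\n geom_dag_label_repel() +\n # you can use any ggplot function, too\n facet_wrap(~set) +\n scale_shape_manual(values = c(adjusted = 15, unadjusted = 19))\n```\n\n::: {.cell-output-display}\n![](05-dags_files/figure-html/unnamed-chunk-22-1.png){fig-align='center' width=432}\n:::\n:::\n\n:::\n\nMinimal adjustment sets are only one type of valid adjustment set [@vanderzander2019].\nSometimes, other combinations of variables can get us an unbiased effect estimate.\nTwo other options available in ggdag are full adjustment sets and canonical adjustment sets.\nFull adjustment sets are every combination of variables that result in a valid set.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggdag_adjustment_set(\n podcast_dag,\n text = FALSE,\n use_labels = \"label\",\n # get full adjustment sets\n type = \"all\"\n)\n```\n\n::: {.cell-output-display}\n![All valid adjustment sets for `podcast_dag`.](05-dags_files/figure-html/fig-adustment-set-all-1.png){#fig-adustment-set-all 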
fig-align='center' width=624}\n:::\n:::\n\n\nIt turns out that we can also control for `humor`.\n\nCanonical adjustment sets are a bit more complex: they are all possible ancestors of the exposure and outcome minus any likely descendants.\nIn fully saturated DAGs (DAGs where every node causes anything that comes after it in time), the canonical adjustment set is the minimal adjustment set.\n\n::: callout-tip\nMost of the functions in ggdag use dagitty underneath the hood.\nIt's often helpful to call dagitty functions directly.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nadjustmentSets(podcast_dag, type = \"canonical\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ humor, mood, prepared }\n```\n\n\n:::\n:::\n\n:::\n\nUsing our proposed DAG, let's simulate some data to see how accounting for the minimal adjustment set might occur in practice.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(10)\nsim_data <- podcast_dag |>\n simulate_data()\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsim_data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 500 × 5\n exam humor mood podcast prepared\n <dbl> <dbl> <dbl> <dbl> <dbl>\n 1 -1.17 -0.275 0.00523 0.555 -0.224 \n 2 -1.19 -0.308 0.224 -0.594 -0.980 \n 3 0.613 -1.93 -0.624 -0.0392 -0.801 \n 4 0.0643 -2.88 -0.253 0.802 0.957 \n 5 -0.376 2.35 0.738 0.0828 0.843 \n 6 0.833 -1.24 0.899 1.05 0.217 \n 7 -0.451 1.40 -0.422 0.125 -0.819 \n 8 2.12 -0.114 -0.895 -0.569 0.000869\n 9 0.938 -0.205 -0.299 0.230 0.191 \n10 -0.207 -0.733 1.22 -0.433 -0.873 \n# ℹ 490 more rows\n```\n\n\n:::\n:::\n\n\nSince we have simulated this data, we know that this is a case where *standard methods will succeed* (see @sec-standard) and, therefore, we can estimate the causal effect using a basic linear regression model.\n@fig-dag-sim shows a forest plot of the simulated data based on our DAG.\nNotice the model that only included the exposure resulted in a spurious effect (an estimate of -0.1 when we know the truth is 0).\nIn contrast, the estimate from the model that adjusted for the two variables suggested by `ggdag_adjustment_set()` is not spurious (much closer to 0).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Model that does not close backdoor paths\nlibrary(broom)\nunadjusted_model <- lm(exam ~ podcast, sim_data) |>\n tidy(conf.int = TRUE) |>\n filter(term == \"podcast\") |>\n mutate(formula = \"podcast\")\n\n## Model that closes backdoor paths\nadjusted_model <- lm(exam ~ podcast + mood + prepared, sim_data) |>\n tidy(conf.int = TRUE) |>\n filter(term == \"podcast\") |>\n mutate(formula = \"podcast + mood + prepared\")\n\nbind_rows(\n unadjusted_model,\n adjusted_model\n) |>\n ggplot(aes(x = estimate, y = formula, xmin = conf.low, xmax = conf.high)) +\n geom_vline(xintercept = 0, linewidth = 1, color = \"grey80\") +\n geom_pointrange(fatten = 3, size = 1) +\n theme_minimal(18) +\n labs(\n y = NULL,\n caption = \"correct effect size: 0\"\n )\n```\n\n::: {.cell-output-display}\n![Forest plot of simulated data based on the DAG described in @fig-dag-podcast.](05-dags_files/figure-html/fig-dag-sim-1.png){#fig-dag-sim width=672}\n:::\n:::\n\n\n## Structures of Causality\n\n### Advanced Confounding\n\nIn `podcast_dag`, `mood` and `prepared` were *direct* confounders: an arrow was going directly from them to `podcast` and `exam`.\nOften, backdoor paths are more complex.\nLet's consider such a case by adding two new variables: `alertness` and `skills_course`.\n`alertness` represents the feeling of alertness from a good mood, thus the arrow from `mood` to `alertness`.\n`skills_course` represents whether 
the student took a College Skills Course and learned time management techniques.\nNow, `skills_course` is what frees up the time both to listen to a podcast and to prepare for the exam.\n`mood` and `prepared` are no longer direct confounders: they are two variables along a more complex backdoor path.\nAdditionally, we've added an arrow going from `humor` to `mood`.\nLet's take a look at @fig-podcast_dag2.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag2 <- dagify(\n podcast ~ mood + humor + skills_course,\n alertness ~ mood,\n mood ~ humor,\n prepared ~ skills_course,\n exam ~ alertness + prepared,\n coords = time_ordered_coords(),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n alertness = \"alertness\",\n skills_course = \"college\\nskills course\",\n humor = \"humor\",\n prepared = \"prepared\"\n )\n)\n\nggdag(podcast_dag2, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![An expanded version of `podcast_dag` that includes two additional variables: `skills_course`, representing a College Skills Course, and `alertness`.](05-dags_files/figure-html/fig-podcast_dag2-1.png){#fig-podcast_dag2 width=480}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\nNow there are *three* backdoor paths we need to close: `podcast <- humor -> mood -> alertness -> exam`, `podcast <- mood -> alertness -> exam`, and `podcast <- skills_course -> prepared -> exam`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_paths(podcast_dag2, use_labels = \"label\", text = FALSE, shadow = TRUE)\n```\n\n::: {.cell-output-display}\n![Three open paths in `podcast_dag2`. Since there is no effect of `podcast` on `exam`, all three are backdoor paths that must be closed to get the correct effect.](05-dags_files/figure-html/fig-podcast_dag2-paths-1.png){#fig-podcast_dag2-paths width=1056}\n:::\n:::\n\n\nThere are four minimal adjustment sets to close all three paths (and eighteen full adjustment sets!).\nThe minimal adjustment sets are `alertness + prepared`, `alertness + skills_course`, `mood + prepared`, and `mood + skills_course`.\nWe can now block the open paths in several ways.\n`mood` and `prepared` still work, but we've got other options now.\nNotably, `prepared` and `alertness` could happen at the same time or even after `podcast`.\n`skills_course` and `mood` still happen before both `podcast` and `exam`, so the idea is still the same: the confounding pathway starts before the exposure and outcome.\n\n
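As a sketch, we can also list these sets by calling dagitty directly (the printed order of the sets may vary):\n\n```r\nlibrary(dagitty)\n# the four minimal adjustment sets, printed one per line\nadjustmentSets(podcast_dag2)\n# count every valid combination: eighteen full sets\nadjustmentSets(podcast_dag2, type = \"all\") |> length()\n```\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_adjustment_set(podcast_dag2, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![Valid minimal adjustment sets that will close the backdoor paths in @fig-podcast_dag2-paths.](05-dags_files/figure-html/fig-podcast_dag2-set-1.png){#fig-podcast_dag2-set width=672}\n:::\n:::\n\n\nDeciding between these adjustment sets is a matter of judgment: if all data are perfectly measured, the DAG is correct, and we've modeled them correctly, then it doesn't matter which we use.\nEach adjustment set will result in an unbiased estimate.\nAll three of those assumptions are usually untrue to some degree.\nLet's consider the path via `skills_course` and `prepared`.\nIt may be that we are better able to assess whether or not someone took the College Skills Course than how prepared for the exam they are.\nIn that case, an adjustment set with `skills_course` is a better option.\nBut perhaps we better understand the relationship between preparedness and exam results.\nIf we have it 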
measured, controlling for that might be better.\nWe could get the best of both worlds by including both variables: between the better measurement of `skills_course` and the better modeling of `prepared`, we might have a better chance of minimizing confounding from this path.\n\n### Selection Bias and Mediation\n\nSelection bias is another name for the type of bias that is induced by adjusting for a collider [@lu2022].\nIt's called \"selection bias\" because a common source of collider-induced bias is a variable inherently stratified upon by the design of the study---selection *into* the study.\nLet's consider a case based on the original `podcast_dag` but with one additional variable: whether or not the student showed up to the exam.\nNow, there is an indirect effect of `podcast` on `exam`: listening to a podcast influences whether or not the student attends the exam.\nThe true result of `exam` is missing for those who didn't show up; by studying the group of people who *did* show up, we are inherently stratifying on this variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag3 <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared + showed_up,\n showed_up ~ podcast + mood + prepared,\n coords = time_ordered_coords(\n list(\n # time point 1\n c(\"prepared\", \"humor\", \"mood\"),\n # time point 2\n \"podcast\",\n \"showed_up\",\n # time point 3\n \"exam\"\n )\n ),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n humor = \"humor\",\n prepared = \"prepared\",\n showed_up = \"showed up\"\n )\n)\nggdag(podcast_dag3, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![Another variant of `podcast_dag`, this time including the inherent stratification on those who appear for the exam. There is still no direct effect of `podcast` on `exam`, but there is an indirect effect via `showed_up`.](05-dags_files/figure-html/fig-podcast_dag3-1.png){#fig-podcast_dag3 width=432}\n:::\n:::\n\n\nThe problem is that `showed_up` is both a collider and a mediator: stratifying on it induces a relationship between many of the variables in the DAG but blocks the indirect effect of `podcast` on `exam`.\nLuckily, the adjustment sets can handle the first problem; because `showed_up` happens *before* `exam`, we're less at risk of collider bias between the exposure and outcome.\nUnfortunately, we cannot calculate the total effect of `podcast` on `exam` because part of the effect is missing: the indirect effect is closed at `showed_up`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag3 |>\n adjust_for(\"showed_up\") |>\n ggdag_adjustment_set(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![The adjustment set for `podcast_dag3` given that the data are inherently conditioned on showing up to the exam. 
In this case, there is no way to recover an unbiased estimate of the total effect of `podcast` on `exam`.](05-dags_files/figure-html/fig-podcast_dag3-as-1.png){#fig-podcast_dag3-as width=432}\n:::\n:::\n\n\nSometimes, you can still estimate effects in this situation by changing the effect you wish to estimate.\nWe can't calculate the total effect because we are missing the indirect effect, but we can still calculate the direct effect of `podcast` on `exam`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag3 |>\n adjust_for(\"showed_up\") |>\n ggdag_adjustment_set(effect = \"direct\", text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![The adjustment set for `podcast_dag3` when targeting a different effect. There is one minimal adjustment set that we can use to estimate the direct effect of `podcast` on `exam`.](05-dags_files/figure-html/fig-podcast_dag3-direct-1.png){#fig-podcast_dag3-direct width=432}\n:::\n:::\n\n\n#### M-Bias and Butterfly Bias {#sec-m-bias}\n\nA particular case of selection bias that you'll often see people talk about is *M-bias*.\nIt's called M-bias because it looks like an M when arranged top to bottom.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nm_bias() |>\n ggdag()\n```\n\n::: {.cell-output-display}\n![A DAG representing M-Bias, a situation where a collider predates the exposure and outcome.](05-dags_files/figure-html/fig-m-bias-1.png){#fig-m-bias width=384}\n:::\n:::\n\n\n::: callout-tip\nggdag has several quick-DAGs for demonstrating basic causal structures, including `confounder_triangle()`, `collider_triangle()`, `m_bias()`, and `butterfly_bias()`.\n:::\n\nWhat's theoretically interesting about M-bias is that `m` is a collider but occurs before `x` and `y`.\nRemember that association is blocked at a collider, so there is no open path between `x` and `y`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaths(m_bias())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$paths\n[1] \"x <- a -> m <- b -> y\"\n\n$open\n[1] FALSE\n```\n\n\n:::\n:::\n\n\n
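We can also ask dagitty what conditioning on `m` does; here's a sketch that passes the adjusted variable to the `Z` argument of `paths()`:\n\n```r\n# with m adjusted for, the same path should now be reported as open\npaths(m_bias(), Z = \"m\")\n```\n\nLet's focus on the `mood` path of the podcast-exam DAG.\nWhat if we were wrong about mood, and the actual relationship was M-shaped?\nLet's say that, rather than causing `podcast` and `exam`, `mood` was itself caused by two mutual causes of `podcast` and `exam`, `u1` and `u2`, as in @fig-podcast_dag4.\nWe don't know what `u1` and `u2` are, and we don't have them measured.\nAs above, there are no open paths in this subset of the DAG.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag4 <- dagify(\n podcast ~ u1,\n exam ~ u2,\n mood ~ u1 + u2,\n coords = time_ordered_coords(list(\n c(\"u1\", \"u2\"),\n \"mood\",\n \"podcast\",\n \"exam\"\n )),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n u1 = \"unmeasured\",\n u2 = \"unmeasured\"\n ),\n # we don't have them measured\n latent = c(\"u1\", \"u2\")\n)\n\nggdag(podcast_dag4, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![A reconfiguration of @fig-dag-podcast where `mood` is a collider on an M-shaped path.](05-dags_files/figure-html/fig-podcast_dag4-1.png){#fig-podcast_dag4 width=528}\n:::\n:::\n\n\nThe problem arises when we think our original DAG is the right DAG: `mood` is in the adjustment set, so we control for it.\nBut this induces bias!\nIt opens up a path between `u1` and `u2`, thus creating a path from `podcast` to `exam`.\nIf we had either `u1` or `u2` measured, we could adjust for them to close this path, but we don't.\nThere is 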
no way to close this open path.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag4 |>\n adjust_for(\"mood\") |>\n ggdag_adjustment_set(use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![The adjustment set where `mood` is a collider. If we control for `mood` and don't know about or have the unmeasured causes of `mood`, we have no means of closing the backdoor path opened by adjusting for a collider.](05-dags_files/figure-html/fig-podcast_dag4-as-1.png){#fig-podcast_dag4-as width=528}\n:::\n:::\n\n\nOf course, the best thing to do here is not control for `mood` at all.\nSometimes, though, that is not an option.\nImagine if, instead of `mood`, this turned out to be the real structure for `showed_up`: since we inherently control for `showed_up`, and we don't have the unmeasured variables, our study results will always be biased.\nIt's essential to recognize when we're in that situation so we can address it with sensitivity analyses to understand just how biased the effect would be.\n\nLet's consider a variation on M-bias where `mood` causes `podcast` and `exam` and `u1` and `u2` are mutual causes of `mood` and the exposure and outcome.\nThis arrangement is sometimes called butterfly or bowtie bias, again because of its shape.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbutterfly_bias(x = \"podcast\", y = \"exam\", m = \"mood\", a = \"u1\", b = \"u2\") |>\n ggdag(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![In butterfly bias, `mood` is both a collider and a confounder. Controlling for the bias induced by `mood` opens a new pathway because we've also conditioned on a collider. We can't properly close all backdoor paths without either `u1` or `u2`.](05-dags_files/figure-html/fig-butterfly_bias-1.png){#fig-butterfly_bias width=480}\n:::\n:::\n\n\nNow, we're in a challenging position: we need to control for `mood` because it's a confounder, but controlling for `mood` opens up the pathway from `u1` to `u2`.\nBecause we don't have either variable measured, we can't then close the path opened from conditioning on `mood`.\nWhat should we do?\nIt turns out that, when in doubt, controlling for `mood` is the better of the two options: confounding bias tends to be worse than collider bias, and M-shaped structures of colliders are sensitive to slight deviations (e.g., if this is not the exact structure, often the bias isn't as bad) [@DingMiratrix2015].\n\nAnother common form of selection bias is from *loss to follow-up*: people drop out of a study in a way that is related to the exposure and outcome.\nWe'll come back to this topic in [Chapter -@sec-longitudinal].\n\n### Causes of the exposure, causes of the outcome\n\nLet's consider one other type of causal structure that's important: causes of the exposure and not the outcome, and their opposites, causes of the outcome and not the exposure.\nLet's add a variable, `grader_mood`, to the original DAG.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag5 <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared + grader_mood,\n coords = time_ordered_coords(\n list(\n # time point 1\n c(\"prepared\", \"humor\", \"mood\"),\n # time point 2\n c(\"podcast\", \"grader_mood\"),\n # time point 3\n \"exam\"\n )\n ),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"student\\nmood\",\n humor = \"humor\",\n prepared = \"prepared\",\n grader_mood = \"grader\\nmood\"\n )\n)\nggdag(podcast_dag5, use_labels = \"label\", text = 
FALSE)\n```\n\n::: {.cell-output-display}\n![A DAG containing a cause of the exposure that is not the cause of the outcome (`humor`) and a cause of the outcome that is not a cause of the exposure (`grader_mood`).](05-dags_files/figure-html/fig-podcast_dag5-1.png){#fig-podcast_dag5 width=480}\n:::\n:::\n\n\nThere are now two variables that aren't related to *both* the exposure and the outcome: `humor`, which causes `podcast` but not `exam`, and `grader_mood`, which causes `exam` but not `podcast`.\nLet's start with `humor`.\n\nVariables that cause the exposure but not the outcome are also called *instrumental variables* (IVs).\nIVs are an unusual circumstance where, under certain conditions, controlling for them can make other types of bias worse.\nWhat's unique about this is that IVs can *also* be used to conduct an entirely different approach to estimating an unbiased effect of the exposure on the outcome.\nIVs are commonly used this way in econometrics and are increasingly popular in other areas.\nIn short, IV analysis allows us to estimate the causal effect using a different set of assumptions than the approaches we've talked about thus far.\nSometimes, a problem intractable using propensity score methods can be addressed using IVs and vice versa.\nWe'll talk more about IVs in @sec-iv-friends.\n\nSo, if you're *not* using IV methods, should you include an IV in a model meant to address confounding?\nIf you're unsure if the variable is an IV or not, you should probably add it to your model: it's more likely to be a confounder than an IV, and, it turns out, the bias from adding an IV is usually small in practice.\nAs with adjusting for a potential M-structure variable, the risk of bias from confounding is worse [@Myers2011].\n\nNow, let's talk about the opposite of an IV: a cause of the outcome that is not a cause of the exposure.\nThese variables are sometimes called *competing exposures* (because they also cause the outcome) or *precision variables* (because, as we'll see, they increase the precision of causal estimates).\nWe'll call them precision variables because we're concerned about the relationship to the research question at hand, not to another research question where they are exposures [@Brookhart2006].\n\nLike IVs, precision variables do not occur along paths from the exposure to the outcome.\nThus, including them is not necessary.\nUnlike IVs, including precision variables is beneficial.\nIncluding other causes of the outcome helps a statistical model capture some of its variation.\nThis doesn't impact the point estimate of the effect, but it does reduce the variance, resulting in smaller standard errors and narrower confidence intervals.\nThus, we recommend including them when possible.\n\n
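Here's a toy simulation of that precision gain (a sketch with made-up data-generating values, not an analysis from this chapter): the exposure's point estimate stays near the truth either way, but its standard error shrinks when we include the precision variable.\n\n```r\nset.seed(123)\nn <- 500\npodcast <- rnorm(n)\ngrader_mood <- rnorm(n)\nexam <- 0.5 * grader_mood + rnorm(n) # no true effect of podcast\n\n# same point estimate (about 0), smaller standard error in the second model\nsummary(lm(exam ~ podcast))$coefficients[\"podcast\", ]\nsummary(lm(exam ~ podcast + grader_mood))$coefficients[\"podcast\", ]\n```\n\nSo, even though we don't need to control for `grader_mood`, if we have it in the data set, we should.\nSimilarly, `humor` is not a good addition to the model unless we think it really might be a confounder; if it is a valid instrument, we might want to consider using IV methods to estimate the effect instead.\n\n### Measurement Error and Missingness\n\nDAGs can also help us understand the bias arising from mismeasurements in the data, including the worst mismeasurement: not measuring a variable at all.\nWe'll cover these topics in [Chapter -@sec-missingness], but the basic idea is that by separating the actual value from the observed value, we can better understand how such biases may behave [@Hernán2009].\nHere's a basic example of a bias called *recall bias*.\nRecall bias is when the outcome 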
influences a participant's memory of exposure, so it's a particular problem in retrospective studies where the earlier exposure is not recorded until after the outcome happens.\nAn example of when this can occur is a case-control study of cancer.\nSomeone *with* cancer may be more motivated to ruminate on their past exposures than someone *without* cancer.\nSo, their memory about a given exposure may be more refined than that of someone without.\nBy conditioning on the observed version of the exposure, we open up many collider paths.\nUnfortunately, there is no way to close them all.\nIf this is the case, we must investigate how severe the bias would be in practice.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nerror_dag <- dagify(\n exposure_observed ~ exposure_real + exposure_error,\n outcome_observed ~ outcome_real + outcome_error,\n outcome_real ~ exposure_real,\n exposure_error ~ outcome_real,\n labels = c(\n exposure_real = \"Exposure\\n(truth)\",\n exposure_error = \"Measurement Error\\n(exposure)\",\n exposure_observed = \"Exposure\\n(observed)\",\n outcome_real = \"Outcome\\n(truth)\",\n outcome_error = \"Measurement Error\\n(outcome)\",\n outcome_observed = \"Outcome\\n(observed)\"\n ),\n exposure = \"exposure_real\",\n outcome = \"outcome_real\",\n coords = time_ordered_coords()\n)\n\nerror_dag |>\n ggdag(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![A DAG representing measurement error in observing the exposure and outcome. In this case, the outcome impacts the participant's memory of the exposure, also known as recall bias.](05-dags_files/figure-html/fig-error_dag-1.png){#fig-error_dag width=528}\n:::\n:::\n\n\n
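To make recall bias concrete, here's a toy simulation (a sketch with made-up names and rates, not an analysis from this chapter): there is no true effect, but because cases recall their exposure more completely than controls do, the observed exposure appears associated with the outcome.\n\n```r\nset.seed(25)\nn <- 1e4\nexposure_real <- rbinom(n, 1, 0.3)\noutcome <- rbinom(n, 1, 0.1) # no true effect of exposure\n# cases always recall their exposure; controls forget it 25% of the time\nexposure_observed <- exposure_real * rbinom(n, 1, ifelse(outcome == 1, 1, 0.75))\n\n# the odds ratio for the observed exposure is biased away from 1\nglm(outcome ~ exposure_observed, family = binomial) |> coef() |> exp()\n```\n\n## Recommendations in building DAGs\n\nIn principle, using DAGs is easy: specify the causal relationships you think exist and then query the DAG for information like valid adjustment sets.\nIn practice, assembling DAGs takes considerable time and thought.\nNext to defining the research question itself, it's one of the most challenging steps in making causal inferences.\nVery little guidance exists on best practices in assembling DAGs.\n@Tennant2021 collected data on DAGs in applied health research to better understand how researchers used them.\n@tbl-dag-properties shows some information they collected: the median number of nodes and arcs in a DAG, their ratio, the saturation percent of the DAG, and how many were fully saturated.\nSaturating DAGs means adding all possible arrows going forward in time, e.g., in a fully saturated DAG, any given variable at time point 1 has arrows going to all variables in future time points, and so on.\nMost DAGs were only about half saturated, and very few were fully saturated.\n\nOnly about half of the papers using DAGs reported the adjustment set used.\nIn other words, researchers presented their assumptions about the research question but not the implications for how they should handle the modeling stage, nor whether they used a valid adjustment set.\nSimilarly, the majority of studies did not report the estimand of interest.\n\n::: callout-note\nThe estimand is the target of interest in terms of what we're trying to estimate, as discussed briefly in [Chapter -@sec-whole-game].\nWe'll discuss estimands in detail in [Chapter -@sec-estimands].\n:::\n\n\n::: {#tbl-dag-properties .cell tbl-cap='A table of DAG properties in applied health research. Number of nodes and arcs are the median number of variables and arrows in the analyzed DAGs, while the Node to Arc ratio is their ratio. 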
Saturation proportion is the proportion of all possible arrows going forward in time to other included variables. Fully saturated DAGs are those that include all such arrows. The researchers also analyzed whether studies reported their estimands and adjustment sets.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr><th>Characteristic</th><th>N = 144<sup>1</sup></th></tr>\n</thead>\n<tbody>\n<tr><td colspan=\"2\"><strong>DAG properties</strong></td></tr>\n<tr><td>Number of Nodes</td><td>12 (9, 16)</td></tr>\n<tr><td>Number of Arcs</td><td>29 (19, 41)</td></tr>\n<tr><td>Node to Arc Ratio</td><td>2.30 (1.78, 3.00)</td></tr>\n<tr><td>Saturation Proportion</td><td>0.46 (0.31, 0.67)</td></tr>\n<tr><td>Fully Saturated</td><td></td></tr>\n<tr><td>&nbsp;&nbsp;&nbsp;&nbsp;Yes</td><td>4 (3%)</td></tr>\n<tr><td>&nbsp;&nbsp;&nbsp;&nbsp;No</td><td>140 (97%)</td></tr>\n<tr><td colspan=\"2\"><strong>Reporting</strong></td></tr>\n<tr><td>Reported Estimand</td><td></td></tr>\n<tr><td>&nbsp;&nbsp;&nbsp;&nbsp;Yes</td><td>40 (28%)</td></tr>\n<tr><td>&nbsp;&nbsp;&nbsp;&nbsp;No</td><td>104 (72%)</td></tr>\n<tr><td>Reported Adjustment Set</td><td></td></tr>\n<tr><td>&nbsp;&nbsp;&nbsp;&nbsp;Yes</td><td>80 (56%)</td></tr>\n<tr><td>&nbsp;&nbsp;&nbsp;&nbsp;No</td><td>64 (44%)</td></tr>\n</tbody>\n<tfoot>\n<tr><td colspan=\"2\"><sup>1</sup> Median (IQR); n (%)</td></tr>\n</tfoot>\n</table>
\n```\n\n:::\n:::\n\n\nIn this section, we'll offer some advice from @Tennant2021 and our own experience assembling DAGs.\n\n### Iterate early and often\n\nOne of the best things you can do for the quality of your results is to make the DAG before you conduct the study, ideally before you even collect the data.\nIf you're already working with your data, at minimum, build your DAG before doing data analysis.\nThis advice is similar in spirit to pre-registered analysis plans: declaring your assumptions ahead of time can help clarify what you need to do, reduce the risk of overfitting (e.g., determining confounders incorrectly from the data), and give you time to get feedback on your DAG.\n\nThis last benefit is significant: you should ideally democratize your DAG.\nShare it early and often with others who are experts on the data, domain, and models.\nIt's natural to create a DAG, present it to your colleagues, and realize you have missed something important.\nSometimes, you will only agree on some details of the structure.\nThat's a good thing: you now know where there is uncertainty in your DAG.\nYou can then examine the results from multiple plausible DAGs or address the uncertainty with sensitivity analyses.\n\nIf you have more than one candidate DAG, check their adjustment sets.\nIf two DAGs have overlapping adjustment sets, focus on those sets; then, you can move forward in a way that satisfies the plausible assumptions you have.\n\n
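For example, here's a sketch of that workflow with two hypothetical candidate DAGs (simplified variations on the podcast question) that disagree about a single arrow:\n\n```r\nlibrary(dagitty)\n# candidate 1: prepared affects both podcast and exam\ndag_a <- dagify(\n  podcast ~ mood + prepared,\n  exam ~ mood + prepared,\n  exposure = \"podcast\",\n  outcome = \"exam\"\n)\n# candidate 2: prepared affects podcast only\ndag_b <- dagify(\n  podcast ~ mood + prepared,\n  exam ~ mood,\n  exposure = \"podcast\",\n  outcome = \"exam\"\n)\n\nadjustmentSets(dag_a) # { mood, prepared }\nadjustmentSets(dag_b) # { mood }; { mood, prepared } is also valid here\n# adjusting for both mood and prepared satisfies either candidate DAG\n```\n\n### Consider your question\n\nAs we saw in @fig-podcast_dag3, some questions can be challenging to answer with certain data, while others are more approachable.\nYou should consider precisely what it is you want to estimate.\nDefining your target estimate is an important topic and the subject of [Chapter -@sec-estimands].\n\nAnother important detail about how your DAG relates to your question is the population and time.\nMany causal structures are not static over time and space.\nConsider lung cancer: the distribution of causes of lung cancer was considerably different before the spread of smoking.\nIn medieval Japan, centuries before tobacco spread from the Americas, the causal structure for lung cancer would have been quite different from what it is in Japan today, both in terms of tobacco use and other factors (age of the population, etc.).\n\nThe same is true for confounders.\nEven if something *can* cause the exposure and outcome, if the prevalence of that thing is zero in the population you're analyzing, it's irrelevant to the causal question.\nIt may also be that, in some populations, it doesn't affect one of the two.\nThe reverse is also true: something might be unique to the target population.\nThe use of tobacco in North America several centuries ago was unique among the world population, even though ceremonial tobacco use was quite different from modern recreational use.\nMany changes won't happen as dramatically as across centuries, but sometimes, they do, e.g., if regulation in one country effectively eliminates the population's exposure to something.\n\n### Order nodes by time\n\nAs discussed earlier, we recommend ordering your variables by time, either left-to-right or top-to-bottom.\nThere are two reasons for this.\nFirst, time ordering is an integral part of your assumptions.\nAfter all, something happening before another thing is a requirement for it to be a cause.\nThinking this through carefully will clarify your DAG and the variables you need to address.\n\nSecond, after a certain level of complexity, it's easier to read 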
\n\n### Consider your question\n\nAs we saw in @fig-podcast_dag3, some questions can be challenging to answer with certain data, while others are more approachable.\nYou should consider precisely what it is you want to estimate.\nDefining your target estimate is an important topic and the subject of [Chapter -@sec-estimands].\n\nAnother important detail about how your DAG relates to your question is the population and time.\nMany causal structures are not static over time and space.\nConsider lung cancer: the distribution of causes of lung cancer was considerably different before the spread of smoking.\nIn medieval Japan, centuries before tobacco spread from the Americas, the causal structure for lung cancer would have been practically different from what it is in Japan today, both in terms of tobacco use and other factors (age of the population, etc.).\n\nThe same is true for confounders.\nEven if something *can* cause the exposure and outcome, if the prevalence of that thing is zero in the population you're analyzing, it's irrelevant to the causal question.\nIt may also be that, in some populations, it doesn't affect one of the two.\nThe reverse is also true: something might be unique to the target population.\nThe use of tobacco in North America several centuries ago was unique among the world population, even though ceremonial tobacco use was quite different from modern recreational use.\nMany changes won't happen as dramatically as across centuries, but sometimes, they do, e.g., if regulation in one country effectively eliminates the population's exposure to something.\n\n### Order nodes by time\n\nAs discussed earlier, we recommend ordering your variables by time, either left-to-right or top-to-bottom.\nThere are two reasons for this.\nFirst, time ordering is an integral part of your assumptions.\nAfter all, something happening before another thing is a requirement for it to be a cause.\nThinking this through carefully will clarify your DAG and the variables you need to address.\n\nSecond, after a certain level of complexity, it's easier to read a DAG when arranged by time because you have to think less about that dimension; it's inherent to the layout.\nThe time ordering algorithm in ggdag automates much of this for you, although, as we saw earlier, it's sometimes helpful to give it more information about the order.\n\nA related topic is feedback loops [@murray2022].\nOften, we think about two things that mutually cause each other as happening in a circle, like global warming and A/C use (A/C use increases global warming, which makes it hotter, which increases A/C use, and so on).\nIt's tempting to visualize that relationship like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n ac_use ~ global_temp,\n global_temp ~ ac_use,\n labels = c(ac_use = \"A/C use\", global_temp = \"Global\\ntemperature\")\n) |>\n ggdag(layout = \"circle\", edge_type = \"arc\", text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![A DAG representing the reciprocal relationship between A/C use and global temperature because of global warming. Feedback loops are useful mental shorthands to compactly describe variables that impact each other over time, but they are not true causal diagrams.](05-dags_files/figure-html/fig-feedback-loop-1.png){#fig-feedback-loop width=432}\n:::\n:::\n\n\nFrom a DAG perspective, this is a problem because of the *A* part of *DAG*: it's cyclic!\nImportantly, though, it's also not correct from a causal perspective.\nFeedback loops are a shorthand for what really happens, which is that the two variables mutually affect each other *over time*.\nCausality only goes forward in time, so it doesn't make sense to go back and forth like in @fig-feedback-loop.\n\nThe real DAG looks something like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n global_temp_2000 ~ ac_use_1990 + global_temp_1990,\n ac_use_2000 ~ ac_use_1990 + global_temp_1990,\n global_temp_2010 ~ ac_use_2000 + global_temp_2000,\n ac_use_2010 ~ ac_use_2000 + global_temp_2000,\n global_temp_2020 ~ ac_use_2010 + global_temp_2010,\n ac_use_2020 ~ ac_use_2010 + global_temp_2010,\n coords = time_ordered_coords(),\n labels = c(\n ac_use_1990 = \"A/C use\\n(1990)\",\n global_temp_1990 = \"Global\\ntemperature\\n(1990)\",\n ac_use_2000 = \"A/C use\\n(2000)\",\n global_temp_2000 = \"Global\\ntemperature\\n(2000)\",\n ac_use_2010 = \"A/C use\\n(2010)\",\n global_temp_2010 = \"Global\\ntemperature\\n(2010)\",\n ac_use_2020 = \"A/C use\\n(2020)\",\n global_temp_2020 = \"Global\\ntemperature\\n(2020)\"\n )\n) |>\n ggdag(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![A DAG showing the relationship between A/C use and global temperature over time. 
The true causal relationship in a feedback loop goes *forward*.](05-dags_files/figure-html/fig-feedforward-1.png){#fig-feedforward width=480}\n:::\n:::\n\n\nThe two variables, rather than being in a feed*back* loop, are actually in a feed*forward* loop: they co-evolve over time.\nHere, we only show four discrete moments in time (the decades from 1990 to 2020), but of course, we could get much finer depending on the question and data.\n\nAs with any DAG, the proper analysis approach depends on the question.\nThe effect of A/C use in 2000 on the global temperature in 2020 produces a different adjustment set than the effect of the global temperature in 2000 on A/C use in 2020.\nSimilarly, whether we also model this change over time or just those two time points depends on the question.\nOften, these feedforward relationships require you to address *time-varying* confounding, which we'll discuss in [Chapter -@sec-longitudinal].\n\n### Consider the whole data collection process\n\nAs @fig-podcast_dag3 showed us, it's essential to consider the *way* we collected data as much as the causal structure of the question.\nConsidering the whole data collection process is particularly important if you're working with \"found\" data---a data set not intentionally collected to answer the research question.\nWe are always inherently conditioning on the data we have vs. the data we don't have.\nIf other variables influenced the data collection process in the causal structure, you need to consider the impact.\nDo you need to control for additional variables?\nDo you need to change the effect you are trying to estimate?\nCan you answer the question at all?\n\n::: callout-tip\n## What about case-control studies?\n\nA standard study design in epidemiology is the case-control study.\nCase-control studies are beneficial when the outcome under study is rare or takes a very long time to happen (like many types of cancer).\nParticipants are selected into the study based on their outcome: once a person has an event, they are entered as a case and matched with a control who hasn't had the event.\nOften, they are matched on other factors as well.\n\nMatched case-control studies are selection biased by design [@mansournia2013].\nIn @fig-case-control, when we condition on selection into the study, we lose the ability to close all backdoor paths, even if we control for `confounder`.\nFrom the DAG, it would appear that the entire design is invalid!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n outcome ~ confounder + exposure,\n selection ~ outcome + confounder,\n exposure ~ confounder,\n exposure = \"exposure\",\n outcome = \"outcome\",\n coords = time_ordered_coords()\n) |>\n ggdag(edge_type = \"arc\", text_size = 2.2)\n```\n\n::: {.cell-output-display}\n![A DAG representing a matched case-control study. In such a study, selection is determined by outcome status and any matched confounders. Selection into the study is thus a collider. 
Since it is inherently stratified on who is actually in the study, such data are limited in the types of causal effects they can estimate.](05-dags_files/figure-html/fig-case-control-1.png){#fig-case-control width=432}\n:::\n:::\n\n\nLuckily, this isn't wholly true.\nCase-control studies are limited in the type of causal effects they can estimate (causal odds ratios, which under some circumstances approximate causal risk ratios).\nWith careful study design and sampling, the math works out such that these estimates are still valid.\nExactly how and why case-control studies work is beyond the scope of this book, but they are a remarkably clever design.\n:::\n\n### Include variables you don't have\n\nIt's critical that you include *all* variables important to the causal structure, not just the variables you have measured in your data.\nggdag can mark variables as unmeasured (\"latent\"); it will then return only usable adjustment sets, e.g., those without the unmeasured variables.\nOf course, the best thing to do is to use DAGs to help you understand what to measure in the first place, but there are many reasons why your data might be different.\nEven data intentionally collected for the research question might lack a variable that is only discovered to be a confounder after data collection.\n\nFor instance, if we have a DAG where `exposure` and `outcome` have a confounding pathway consisting of `confounder1` and `confounder2`, we can control for either to successfully debias the estimate:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n outcome ~ exposure + confounder1,\n exposure ~ confounder2,\n confounder2 ~ confounder1,\n exposure = \"exposure\",\n outcome = \"outcome\"\n) |>\n adjustmentSets()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ confounder1 }\n{ confounder2 }\n```\n\n\n:::\n:::\n\n\nThus, if just one is missing (`latent`), we're still OK:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n outcome ~ exposure + confounder1,\n exposure ~ confounder2,\n confounder2 ~ confounder1,\n exposure = \"exposure\",\n outcome = \"outcome\",\n latent = \"confounder1\"\n) |>\n adjustmentSets()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ confounder2 }\n```\n\n\n:::\n:::\n\n\nBut if both are missing, there are no valid adjustment sets.
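\n\nWe can confirm that directly by marking both confounders as latent; in this sketch, `adjustmentSets()` comes back empty:\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n outcome ~ exposure + confounder1,\n exposure ~ confounder2,\n confounder2 ~ confounder1,\n exposure = \"exposure\",\n outcome = \"outcome\",\n # neither confounder is measured\n latent = c(\"confounder1\", \"confounder2\")\n) |>\n adjustmentSets()\n#> (no output: there are no valid adjustment sets)\n```\n:::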
\n\nWhen you don't have a variable measured, you still have a few options.\nAs mentioned above, you may be able to identify alternate adjustment sets.\nIf the missing variable is required to close all backdoor paths completely, you can and should conduct a sensitivity analysis to understand the impact of not having it.\nThis is the subject of [Chapter -@sec-sensitivity].\n\nUnder some lucky circumstances, you can also use a *proxy* confounder [@miao2018].\nA proxy confounder is a variable closely related to the confounder such that controlling for it controls for some of the effects of the missing variable.\nConsider an expansion of the fundamental confounding relationship where `q` has a cause, `p`, as in @fig-proxy-confounder.\nTechnically, if we don't have `q`, we can't close the backdoor path, and our effect will be biased.\nPractically, though, if `p` is highly correlated with `q`, it can help reduce the confounding from `q`.\nYou can think of `p` as a mismeasured version of `q`; it will seldom wholly control for the bias via `q`, but it can help minimize it.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n y ~ x + q,\n x ~ q,\n q ~ p,\n coords = time_ordered_coords()\n) |>\n ggdag(edge_type = \"arc\")\n```\n\n::: {.cell-output-display}\n![A DAG with a confounder, `q`, and a proxy confounder, `p`. The true adjustment set is `q`. Since `p` causes `q`, it contains information about `q` and can reduce the bias if we don't have `q` measured.](05-dags_files/figure-html/fig-proxy-confounder-1.png){#fig-proxy-confounder width=432}\n:::\n:::\n\n\n### Saturate your DAG, then prune\n\nIn discussing @tbl-dag-properties, we mentioned *saturated* DAGs.\nThese are DAGs where all possible arrows are included based on the time ordering, e.g., every variable causes variables that come after it in time.\n\n*Not* including an arrow is a bigger assumption than including one.\nIn other words, your default should be to have an arrow from one variable to a future variable.\nThis default is counterintuitive to many people.\nHow can it be that we need to be so careful about assessing causal effects yet be so liberal in applying causal assumptions in the DAG?\nThe answer to this lies in the strength and prevalence of the cause.\nTechnically, the presence of an arrow means that *for at least a single observation*, the prior node causes the following node.\nThe arrow similarly says nothing about the strength of the relationship.\nSo, a minuscule causal effect on a single individual justifies the presence of an arrow.\nIn practice, such a case is probably not relevant.\nThere is *effectively* no arrow.\n\nThe more significant point, though, is that you should feel confident adding an arrow: the bar for justification is much lower than you might think.\nIn practice, it's helpful to 1) determine your time ordering, 2) saturate the DAG, and 3) prune out implausible arrows.\n\nLet's experiment by working through a saturated version of the podcast-exam DAG.\n\nFirst, the time ordering.\nPresumably, the student's sense of humor far predates the day of the exam.\nMood in the morning, too, predates listening to the podcast or taking the exam, as does preparation.\nThe saturated DAG given this ordering is:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A saturated version of `podcast_dag`: variables have all possible arrows going forward to other variables over time.](05-dags_files/figure-html/fig-podcast_dag_sat-1.png){#fig-podcast_dag_sat width=528}\n:::\n:::\n\n\nThere are a few new arrows here.\nHumor now causes the other two confounders, as well as the exam score.\nSome of them make sense.\nSense of humor probably affects mood for some people.\nWhat about preparedness?\nThis relationship seems a little less plausible.\nSimilarly, we know that a sense of humor does not affect exam scores in this case because the grading is blinded.\nLet's prune those two.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A pruned version of @fig-podcast_dag_sat: we've removed implausible arrows from the fully saturated DAG.](05-dags_files/figure-html/fig-podcast_dag_pruned-1.png){#fig-podcast_dag_pruned width=528}\n:::\n:::\n\n\nThis DAG seems more reasonable.\nSo, was our original DAG wrong?\nThat depends on several factors.\nNotably, both DAGs produce the same adjustment set: controlling for `mood` and `prepared` will give us an unbiased effect if either DAG is correct.\nEven if the new DAG were to produce a different adjustment set, whether the result is meaningfully different depends on the strength of the confounding.
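\n\nWe can check that with a quick sketch of the pruned DAG (our reconstruction of @fig-podcast_dag_pruned; the only new arrow we kept is `humor -> mood`):\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n podcast ~ mood + humor + prepared,\n mood ~ humor,\n exam ~ mood + prepared,\n exposure = \"podcast\",\n outcome = \"exam\"\n) |>\n adjustmentSets()\n#> { mood, prepared }\n```\n:::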
\n\n### Include instruments and precision variables\n\nTechnically, you do not need to include instrumental and precision variables in your DAG.\nThe adjustment sets will be the same with and without them.\nHowever, adding them is helpful for two reasons.\nFirstly, they demonstrate your assumptions about how these variables relate to the variables under study.\nAs discussed above, *not* including an arrow is a more significant assumption than including one, so it's valuable information about how you think the causal structure operates.\nSecondly, they impact your modeling decisions.\nYou should always include precision variables in your model to reduce the variability of your estimate, so marking them in the DAG helps you identify them.\nInstruments are also helpful to include because they may guide alternative or complementary modeling strategies, as we'll discuss in @sec-evidence.\n\n### Focus on the causal structure, then consider measurement bias\n\nAs we saw above, missingness and measurement error can be sources of bias.\nAs we'll see in [Chapter -@sec-missingness], we have several strategies to approach such a situation.\nYet, almost everything we measure is inaccurate to some degree.\nThe true DAG for the data at hand inherently conditions on the measured version of variables.\nIn that sense, your data are always subtly wrong, a sort of unreliable narrator.\nWhen should we include this information in the DAG?\nWe recommend first focusing on the causal structure of the DAG as if you had perfectly measured each variable [@hernan2021].\nThen, consider how mismeasurement and missingness might affect the realized data, particularly related to the exposure, outcome, and critical confounders.\nYou may prefer to present this as an alternative DAG to consider strategies for addressing the bias arising from those sources, e.g., imputation or sensitivity analyses.\nAfter all, the DAG in @fig-error_dag might make you think the question is unanswerable because we have no method to close all backdoor paths.\nAs with all open paths, that depends on the severity of the bias and our ability to reckon with it.\n\n### Pick adjustment sets most likely to be successful\n\nOne area where measurement error is an important consideration is when picking an adjustment set.\nIn theory, if a DAG is correct, any adjustment set will work to create an unbiased result.\nIn practice, variables have different levels of quality.\nPick the adjustment set most likely to succeed because it contains accurately measured variables.\nSimilarly, non-minimal adjustment sets are helpful to consider because, together, several variables with measurement error along a backdoor path may be enough to minimize the practical bias resulting from that path.\n\nWhat if you don't have certain critical variables measured and thus do not have a valid adjustment set?\nIn that case, you should pick the adjustment set with the best chance of minimizing the bias from other backdoor paths.\nAll is not lost if you don't have every confounder measured: get the highest quality estimate you can, then conduct a sensitivity analysis for the unmeasured variables to understand the impact.\n\n### Use robustness checks\n\nFinally, we recommend checking your DAG for robustness.\nUnder most conditions, you can never verify that your DAG is correct, but you can use the implications of your DAG to support it.\nThree types of robustness checks can be helpful depending on the circumstances.\n\n1. **Negative controls** [@Lipsitch2010]. These come in two flavors: negative exposure controls and negative outcome controls. The idea is to find something associated with one but not the other, e.g., the outcome but not the exposure, so there should be no effect. Since there should be no effect, you now have a measure of how well you control for *other* effects (e.g., the difference from null). Ideally, the confounders for the negative controls are similar to those for the research question.\n2. **DAG-data consistency** [@Textor2016]. Negative controls are an implication of your DAG. An extension of this idea is that there are *many* such implications. Because blocking a path removes statistical dependencies from that path, you can check those assumptions in several places in your DAG, as in the sketch following this list.\n3. **Alternate adjustment sets**. Adjustment sets should give roughly the same answer because, outside of random and measurement errors, they are all sets that block backdoor paths. If more than one adjustment set seems reasonable, you can use that as a sensitivity analysis by checking multiple models.
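\n\nFor instance, here is a sketch of the second check with dagitty, assuming the `podcast_dag` object and the simulated `sim_data` from earlier in the chapter are still in your session:\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dagitty)\n\n# every conditional independence the DAG implies\nimpliedConditionalIndependencies(podcast_dag)\n\n# test those implications against the data;\n# estimates far from 0 flag DAG-data inconsistencies\nlocalTests(podcast_dag, data = sim_data, type = \"cis\")\n```\n:::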
\n\nWe'll discuss these in detail in [Chapter -@sec-sensitivity].\nThe caveat here is that these should be complementary to your initial DAG, not a way of *replacing* it.\nIn fact, if you use more than one adjustment set during your analysis, you should report the results from all of them to avoid overfitting your results to your data.\n", + "markdown": "# Expressing causal questions as DAGs {#sec-dags}\n\n\n\n\n\n## Visualizing Causal Assumptions\n\n> Draw your assumptions before your conclusions --@hernan2021\n\nCausal diagrams are a tool to visualize your assumptions about the causal structure of the questions you're trying to answer.\nIn a randomized experiment, the causal structure is quite simple.\nWhile there may be many causes of an outcome, the only cause of the exposure is the randomization process itself (we hope!).\nIn many non-randomized settings, however, the structure of your question can be a complex web of causality.\nCausal diagrams help communicate what we think this structure looks like.\nIn addition to being open about what we think the causal structure is, causal diagrams have incredible mathematical properties that allow us to identify a way to estimate unbiased causal effects even with observational data.\n\nCausal diagrams are also increasingly common.\nData collected as a review of causal diagrams in applied health research papers show a drastic increase in use over time [@Tennant2021].\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Percentage of health research papers using causal diagrams over time.](05-dags_files/figure-html/fig-dag-usage-1.png){#fig-dag-usage width=672}\n:::\n:::\n\n\nThe type of causal diagrams we use are also called directed acyclic graphs (DAGs)[^1].\nThese graphs are directed because they include arrows going in a specific direction.\nThey're acyclic because they don't go in circles; a variable can't cause itself, for instance.\nDAGs are used for various problems, but we're specifically concerned with *causal* DAGs.\nThis class of DAGs is sometimes called Structural Causal Models (SCMs) because they are a model of the causal structure of a question [@hernan2021; @Pearl_Glymour_Jewell_2021].\n\n[^1]: An essential but rarely observed detail of DAGs is that dag is also an [affectionate Australian insult](https://en.wikipedia.org/wiki/Dag_(slang)) referring to the dung-caked fur of a sheep, a *daglock*.\n\nDAGs depict causal relationships between variables.\nVisually, the way they depict variables is as *edges* and *nodes*.\nEdges are the arrows going from one variable to another, sometimes called arcs or just arrows.\nNodes are the variables themselves, sometimes called vertices, points, or just variables.\nIn @fig-dag-basic, 
there are two nodes, `x` and `y`, and one edge going from `x` to `y`.\nHere, we are saying that `x` causes `y`.\n`y` \"listens\" to `x` [@Pearl_Glymour_Jewell_2021].\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A causal directed acyclic graph (DAG). DAGs depict causal relationships. In this DAG, the assumption is that `x` causes `y`.](05-dags_files/figure-html/fig-dag-basic-1.png){#fig-dag-basic width=288}\n:::\n:::\n\n\nIf we're interested in the causal effect of `x` on `y`, we're trying to estimate a numeric representation of that arrow.\nUsually, though, there are many other variables and arrows in the causal structure of a given question.\nA series of arrows is called a *path*.\nThere are three types of paths you'll see in DAGs: forks, chains, and colliders (sometimes called inverse forks).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three types of causal relationships: forks, chains, and colliders. The direction of the arrows and the relationships of interest dictate which type of path a series of variables is. Forks represent a mutual cause, chains represent direct causes, and colliders represent a mutual descendant.](05-dags_files/figure-html/fig-dag-path-types-1.png){#fig-dag-path-types width=672}\n:::\n:::\n\n\nForks represent a common cause of two variables.\nHere, we're saying that `q` causes both `x` and `y`, the traditional definition of a confounder.\nThey're called forks because the arrows from `x` to `y` are in different directions.\nChains, on the other hand, represent a series of arrows going in the same direction.\nHere, `q` is called a *mediator*: it is along the causal path from `x` to `y`.\nIn this diagram, the only path from `x` to `y` is mediated through `q`.\nFinally, a collider is a path where two arrowheads meet at a variable.\nBecause causality always goes forward in time, this naturally means that the collider variable is caused by two other variables.\nHere, we're saying that `x` and `y` both cause `q`.\n\n::: callout-tip\n## Are DAGs SEMs?\n\nIf you're familiar with structural equation models (SEMs), a modeling technique commonly used in psychology and other social science settings, you may notice some similarities between SEMs and DAGs.\nDAGs are a form of *non-parametric* SEM.\nSEMs estimate entire graphs using parametric assumptions.\nCausal DAGs, on the other hand, don't estimate anything; an arrow going from one variable to another says nothing about the strength or functional form of that relationship, only that we think it exists.\n:::\n\nOne of the significant benefits of DAGs is that they help us identify sources of bias and, often, provide clues on how to address them.\nHowever, talking about an unbiased effect estimate only makes sense when we have a specific causal question in mind.\nSince each arrow represents a cause, it's causality all the way down; no individual arrow is inherently problematic.\nHere, we're interested in the effect of `x` on `y`.\nThis question defines which paths we're interested in and which we're not.\n\nThese three types of paths have different implications for the statistical relationship between `x` and `y`.\nIf we only look at the correlation between the two variables under these assumptions:\n\n1. In the fork, `x` and `y` will be associated, despite there being no arrow from `x` to `y`.\n2. In the chain, `x` and `y` are related only through `q`.\n3. 
In the collider, `x` and `y` will *not* be related.\n\nPaths that transmit association are called *open paths*.\nPaths that do not transmit association are called *closed paths*.\nForks and chains are open, while colliders are closed.\n\nSo, should we adjust for `q`?\nThat depends on the nature of the path.\nForks are confounding paths.\nBecause `q` causes both `x` and `y`, `x` and `y` will have a spurious association.\nThey both contain information from `q`, their mutual cause.\nThat mutual causal relationship makes `x` and `y` associated statistically.\nAdjusting for `q` will *block* the bias from confounding and give us the true relationship between `x` and `y`.\n\n::: callout-tip\n## Adjustment\n\nWe can use a variety of techniques to account for a variable.\nWe use the term \"adjustment\" or \"controlling for\" to refer to any technique that removes the effect of variables we're not interested in.\n:::\n\n@fig-confounder-scatter depicts this effect visually.\nHere, `x` and `y` are continuous, and by definition of the DAG, they are unrelated.\n`q`, however, causes both.\nThe unadjusted effect is biased because it includes information about the open path from `x` to `y` via `q`.\nWithin levels of `q`, however, `x` and `y` are unrelated.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots of the relationship between `x` and `y`. With forks, the relationship is biased by `q`. When accounting for `q`, we see the true null relationship.](05-dags_files/figure-html/fig-confounder-scatter-1.png){#fig-confounder-scatter width=672}\n:::\n:::
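\n\nA quick simulation (ours, not the chapter's) shows the same thing numerically: the fork induces an association between `x` and `y`, and adjusting for `q` removes it.\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\nn <- 10000\nq <- rnorm(n)\n# x and y share the cause q but don't affect one another\nx <- q + rnorm(n)\ny <- q + rnorm(n)\n\n# biased: the coefficient on x is well away from 0\nlm(y ~ x) |> coef()\n\n# unbiased: adjusting for q brings it back to about 0\nlm(y ~ x + q) |> coef()\n```\n:::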
\n\nFor chains, whether or not we adjust for mediators depends on the research question.\nHere, adjusting for `q` would result in a null estimate of the effect of `x` on `y`.\nBecause the only effect of `x` on `y` is via `q`, no other effect remains.\nThe effect of `x` on `y` mediated by `q` is called the *indirect* effect, while the effect of `x` on `y` directly is called the *direct* effect.\nIf we're only interested in the direct effect, controlling for `q` might be what we want.\nIf we want to know the total effect (both direct and indirect), we shouldn't adjust for `q`.\nWe'll learn more about estimating these and other mediation effects in @sec-mediation.\n\n@fig-mediator-scatter shows this effect visually.\nThe unadjusted effect of `x` on `y` represents the total effect.\nSince the total effect is due entirely to the path mediated by `q`, when we adjust for `q`, no relationship remains.\nThis null effect is the direct effect.\nNeither of these effects is due to bias, but each answers a different research question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots of the relationship between `x` and `y`. With chains, whether and how we should account for `q` depends on the research question. Without doing so, we see the impact of the total effect of `x` on `y`, including the indirect effect via `q`. When accounting for `q`, we see the direct (null) effect of `x` on `y`.](05-dags_files/figure-html/fig-mediator-scatter-1.png){#fig-mediator-scatter width=672}\n:::\n:::\n\n\nColliders are different.\nIn the collider DAG of @fig-dag-path-types, `x` and `y` are *not* associated, but both cause `q`.\nAdjusting for `q` has the opposite effect from confounding: it *opens* a biasing pathway.\nSometimes, people draw in the path opened up by conditioning on a collider as an edge connecting `x` and `y`.\n\nVisually, we can see this happen when `x` and `y` are continuous and `q` is binary.\nIn @fig-collider-scatter, when we don't include `q`, we find no relationship between `x` and `y`.\nThat's the correct result.\nHowever, when we include `q`, we can detect information about both `x` and `y`, and they appear correlated: across levels of `x`, those with `q = 0` have lower levels of `y`.\nAssociation seemingly flows back in time.\nOf course, that can't happen from a causal perspective, so controlling for `q` is the wrong thing to do.\nWe end up with a biased effect of `x` on `y`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots of the relationship between `x` and `y`. The unadjusted relationship between the two is unbiased. When accounting for `q`, we open a colliding backdoor path and bias the relationship between `x` and `y`.](05-dags_files/figure-html/fig-collider-scatter-1.png){#fig-collider-scatter width=672}\n:::\n:::
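\n\nThe same kind of simulation sketch works here: `x` and `y` are independent until we adjust for the collider.\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\nn <- 10000\n# x and y are independent...\nx <- rnorm(n)\ny <- rnorm(n)\n# ...but both cause q\nq <- as.numeric(x + y + rnorm(n) > 0)\n\n# correct: the coefficient on x is about 0\nlm(y ~ x) |> coef()\n\n# biased: conditioning on the collider induces an association\nlm(y ~ x + q) |> coef()\n```\n:::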
\n\nHow can this be?\nSince `x` and `y` happen before `q`, `q` can't impact them.\nLet's turn the DAG on its side and consider @fig-collider-time.\nIf we break down the two time points, at time point 1, `q` hasn't happened yet, and `x` and `y` are unrelated.\nAt time point 2, `q` happens due to `x` and `y`.\n*But causality only goes forward in time*.\n`q` happening later can't change the fact that `x` and `y` happened independently in the past.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A collider relationship over two points in time. At time point one, there is no relationship between `x` and `y`. Both cause `q` by time point two, but this does not change what already happened at time point one.](05-dags_files/figure-html/fig-collider-time-1.png){#fig-collider-time width=672}\n:::\n:::\n\n\nCausality only goes forward.\nAssociation, however, is time-agnostic.\nIt's just an observation about the numerical relationships between variables.\nWhen we control for the future, we risk introducing bias.\nIt takes time to develop an intuition for this.\nConsider a case where `x` and `y` are the only causes of `q`, and all three variables are binary.\nWhen *either* `x` or `y` equals 1, then `q` happens.\nIf we know `q = 1` and `x = 0`, then logically it must be that `y = 1`.\nThus, knowing `q` (together with `x`) gives us information about `y`.\nThis example is extreme, but it shows how this type of bias, sometimes called *collider-stratification bias* or *selection bias*, occurs: conditioning on `q` provides statistical information about `x` and `y` and distorts their relationship [@Banack2023].\n\n::: callout-tip\n## Exchangeability revisited\n\nWe commonly refer to exchangeability as the assumption of no confounding.\nActually, this isn't quite right.\nIt's the assumption of no *open, non-causal* paths [@hernan2021].\nMany times, these are confounding pathways.\nHowever, conditioning on a collider can also open paths.\nEven though these aren't confounders, doing so creates non-exchangeability between the two groups: they are different in a way that matters to the exposure and outcome.\n\nOpen, non-causal paths are also called *backdoor paths*.\nWe'll use this terminology often because it captures the idea well: these are any open paths biasing the effect we're interested in estimating.\n:::\n\nCorrectly identifying the causal structure between the exposure and outcome thus helps us 1) communicate the assumptions we're making about the relationships between variables and 2) identify sources of bias.\nImportantly, in doing 2), we are also often able to identify ways to prevent bias based on the assumptions in 1).\nIn the simple case of the three DAGs in @fig-dag-path-types, we know whether or not to control for `q` depending on the nature of the causal structure.\nThe set or sets of variables we need to adjust for is called the *adjustment set*.\nDAGs can help us identify adjustment sets even in complex settings [@vanderzander2019].\n\n::: callout-tip\n## What about interaction?\n\nDAGs don't make a statement about interaction or effect measure modification, even though they are an important part of inference.\nTechnically, interaction is a matter of the functional form of the relationships in the DAG.\nMuch as we don't need to specify how we will model a variable in the DAG (e.g., with splines), we don't need to determine how variables statistically interact.\nThat's a matter for the modeling stage.\n\nThere are several ways we use interactions in causal inference.\nAt one extreme, they are simply a matter of functional form: interaction terms are included in models but marginalized over to get an overall causal effect.\nAt the other extreme, we're interested in *joint causal effects*, where the two interacting variables are both causal.\nIn between, we can use interaction terms to identify *heterogeneous causal effects*, which vary by a second variable that is not assumed to be causal.\nAs with many tools in causal inference, we use the same statistical technique in many ways to answer different questions.\nWe'll revisit this topic in detail in [Chapter -@sec-interaction].\n\nMany people have tried expressing 
interaction in DAGs using different types of arcs, nodes, and other annotations, but no approach has taken off as the preferred way [@weinberg2007; @Nilsson2021].\n:::\n\nLet's take a look at an example in R.\nWe'll learn to build DAGs, visualize them, and identify important information like adjustment sets.\n\n## DAGs in R\n\nFirst, consider a research question: Does listening to a comedy podcast the morning before an exam improve graduate students' test scores?\nWe can diagram this using the method described in @sec-diag (@fig-diagram-podcast).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A sentence diagram for the question: Does listening to a comedy podcast the morning before an exam improve graduate student test scores? The population is graduate students. The start time is morning, and the outcome time is after the exam.](../images/podcast-diagram.png){#fig-diagram-podcast width=2267}\n:::\n:::\n\n\nThe tool we'll use for making DAGs is ggdag.\nggdag is a package that connects ggplot2, the most powerful visualization tool in R, to dagitty, an R package with sophisticated algorithms for querying DAGs.\n\nTo create a DAG object, we'll use the `dagify()` function. `dagify()` returns a `dagitty` object that works with both the dagitty and ggdag packages.\nThe `dagify()` function takes formulas, separated by commas, that specify causes and effects, with the left element of the formula defining the effect and the right, all of the factors that cause it.\nThis is just like the type of formula we specify for most regression models in R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n effect1 ~ cause1 + cause2 + cause3,\n effect2 ~ cause1 + cause4,\n ...\n)\n```\n:::\n\n\nWhat are all of the factors that cause graduate students to listen to a podcast the morning before an exam?\nWhat are all of the factors that could cause a graduate student to do well on a test?\nLet's posit some here.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggdag)\ndagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ndag {\nexam\nhumor\nmood\npodcast\nprepared\nhumor -> podcast\nmood -> exam\nmood -> podcast\nprepared -> exam\nprepared -> podcast\n}\n```\n\n\n:::\n:::\n\n\nIn the code above, we assume that:\n\n- a graduate student's mood, sense of humor, and how prepared they feel for the exam could influence whether they listened to a podcast the morning of the test\n- their mood and how prepared they are also influence their exam score\n\nNotice we *do not* see `podcast` in the `exam` equation; this means that we assume that there is **no** causal relationship between `podcast` and the exam score.\n\nThere are some other useful arguments you'll often find yourself supplying to `dagify()`:\n\n- `exposure` and `outcome`: Telling ggdag the variables that are the exposure and outcome of your research question is required for many of the most valuable queries we can make of DAGs.\n- `latent`: This argument lets us tell ggdag that some variables in the DAG are unmeasured. `latent` helps identify valid adjustment sets with the data we actually have.\n- `coords`: Coordinates for the variables. You can choose between algorithmic or manual layouts, as discussed below. 
We'll use `time_ordered_coords()` here.\n- `labels`: A character vector of labels for the variables.\n\nLet's create a DAG object, `podcast_dag`, with some of these attributes, then visualize the DAG with `ggdag()`.\n`ggdag()` returns a ggplot object, so we can add additional layers to the plot, like themes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared,\n coords = time_ordered_coords(\n list(\n # time point 1\n c(\"prepared\", \"humor\", \"mood\"),\n # time point 2\n \"podcast\",\n # time point 3\n \"exam\"\n )\n ),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n humor = \"humor\",\n prepared = \"prepared\"\n )\n)\nggdag(podcast_dag, use_labels = \"label\", text = FALSE) +\n theme_dag()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: The `text` argument of `geom_dag()` no longer accepts\nlogicals as of ggdag 0.3.0.\nℹ Set `use_text = FALSE`. To use a variable other than\n node names, set `text = variable_name`\nℹ The deprecated feature was likely used in the ggdag\n package.\n Please report the issue at\n .\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: The `use_labels` argument of `geom_dag()` must be a\nlogical as of ggdag 0.3.0.\nℹ Set `use_labels = TRUE` and `label = label`\nℹ The deprecated feature was likely used in the ggdag\n package.\n Please report the issue at\n .\n```\n\n\n:::\n\n::: {.cell-output-display}\n![Proposed DAG to answer the question: Does listening to a comedy podcast the morning before an exam improve graduate students' test scores?](05-dags_files/figure-html/fig-dag-podcast-1.png){#fig-dag-podcast width=384}\n:::\n:::\n\n\n::: callout-note\nFor the rest of the chapter, we'll use `theme_dag()`, a ggplot theme from ggdag meant for DAGs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntheme_set(\n theme_dag() %+replace%\n # also add some additional styling\n theme(\n legend.position = \"bottom\",\n strip.text.x = element_text(margin = margin(2, 0, 2, 0, \"mm\"))\n )\n)\n```\n:::\n\n:::\n\n::: callout-tip\n## DAG coordinates\n\nYou don't need to specify coordinates to ggdag.\nIf you don't, it uses algorithms designed for automatic layouts.\nThere are many such algorithms, and they focus on different aspects of the layout, e.g., the shape, the space between the nodes, minimizing how many edges cross, etc.\nThese layout algorithms usually have a component of randomness, so it's good to use a seed if you want to get the same result.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# no coordinates specified\nset.seed(123)\npod_dag <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared\n)\n\n# automatically determine layouts\npod_dag |>\n ggdag(text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](05-dags_files/figure-html/unnamed-chunk-14-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nWe can also ask for a specific layout, e.g., the popular Sugiyama algorithm for DAGs [@sugiyama1981].\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npod_dag |>\n ggdag(layout = \"sugiyama\", text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](05-dags_files/figure-html/unnamed-chunk-15-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nFor causal DAGs, the time-ordered layout algorithm is often best, which we can specify with `time_ordered_coords()` or `layout = \"time_ordered\"`.\nWe'll discuss time ordering in greater detail 
below.\nEarlier, we explicitly told ggdag which variables were at which time points, but we don't need to.\nNotice, though, that the time ordering algorithm puts `podcast` and `exam` at the same time point since neither causes the other (which would require one to precede the other).\nWe know that's not the case: listening to the podcast happened before taking the exam.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npod_dag |>\n ggdag(layout = \"time_ordered\", text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](05-dags_files/figure-html/unnamed-chunk-16-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nYou can manually specify coordinates using a list or data frame and provide them to the `coords` argument of `dagify()`.\nAdditionally, because ggdag is based on dagitty, you can use `dagitty.net` to create and organize a DAG using a graphical interface, then export the result as dagitty code for ggdag to consume.\n
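\nFor instance, here's a minimal sketch with hand-picked (made-up) positions for a three-node DAG:\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n y ~ x + q,\n x ~ q,\n coords = list(\n x = c(q = 1, x = 2, y = 3),\n y = c(q = 1, x = 0, y = 0)\n )\n) |>\n ggdag()\n```\n:::\n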
\nAlgorithmic layouts are lovely for fast visualization of DAGs or particularly complex graphs.\nOnce you want to share your DAG, it's usually best to be more intentional about the layout, perhaps by specifying the coordinates manually.\n`time_ordered_coords()` is often the best of both worlds, and we'll use it for most DAGs in this book.\n:::\n\nWe've specified the DAG for this question and told ggdag what the exposure and outcome of interest are.\nAccording to the DAG, there is no direct causal relationship between listening to a podcast and exam scores.\nAre there any other open paths?\n`ggdag_paths()` takes a DAG and visualizes the open paths.\nIn @fig-paths-podcast, we see two open paths: `podcast <- mood -> exam` and `podcast <- prepared -> exam`. These are both forks---*confounding pathways*. Since there is no causal relationship between listening to a podcast and exam scores, the only open paths are *backdoor* paths: these two confounding pathways.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag |>\n # show the whole dag as a light gray \"shadow\"\n # rather than just the paths\n ggdag_paths(shadow = TRUE, text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![`ggdag_paths()` visualizes open paths in a DAG. There are two open paths in `podcast_dag`: the fork from `mood` and the fork from `prepared`.](05-dags_files/figure-html/fig-paths-podcast-1.png){#fig-paths-podcast width=672}\n:::\n:::\n\n\n::: callout-tip\n`dagify()` returns a `dagitty` object, but underneath the hood, ggdag converts `dagitty` objects to tidy DAGs, a structure that holds both the `dagitty` object and a `dataframe` about the DAG.\nThis is handy if you want to manipulate the DAG programmatically.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag_tidy <- podcast_dag |>\n tidy_dagitty()\n\npodcast_dag_tidy\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A DAG with 5 nodes and 5 edges\n#\n# Exposure: podcast\n# Outcome: exam\n#\n# A tibble: 7 × 9\n name x y direction to xend yend\n <chr> <dbl> <dbl> <fct> <chr> <dbl> <dbl>\n1 exam 3 0 <NA> <NA> NA NA\n2 humor 1 0 -> podcast 2 0\n3 mood 1 1 -> exam 3 0\n4 mood 1 1 -> podcast 2 0\n5 podcast 2 0 <NA> <NA> NA NA\n6 prepared 1 -1 -> exam 3 0\n7 prepared 1 -1 -> podcast 2 0\n# ℹ 2 more variables: circular <lgl>, label <chr>\n```\n\n\n:::\n:::\n\n\nMost of the quick plotting functions transform the `dagitty` object to a tidy DAG if it's not already, then manipulate the data in some capacity.\nFor instance, `dag_paths()` underlies `ggdag_paths()`; it returns a tidy DAG with data about the paths.\nYou can use several dplyr functions on these objects directly.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag_tidy |>\n dag_paths() |>\n filter(set == 2, path == \"open path\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A DAG with 3 nodes and 2 edges\n#\n# Exposure: podcast\n# Outcome: exam\n#\n# A tibble: 4 × 11\n set name x y direction to xend yend\n <chr> <chr> <dbl> <dbl> <fct> <chr> <dbl> <dbl>\n1 2 exam 3 0 <NA> <NA> NA NA\n2 2 podcast 2 0 <NA> <NA> NA NA\n3 2 prepar… 1 -1 -> exam 3 0\n4 2 prepar… 1 -1 -> podc… 2 0\n# ℹ 3 more variables: circular <lgl>, label <chr>,\n# path <chr>\n```\n\n\n:::\n:::\n\n\nTidy DAGs are not pure data frames, but you can retrieve either the `dataframe` or `dagitty` object to work with them directly using `pull_dag_data()` or `pull_dag()`.\n`pull_dag()` can be useful when you want to work with dagitty functions:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dagitty)\npodcast_dag_tidy |>\n pull_dag() |>\n paths()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$paths\n[1] \"podcast <- mood -> exam\" \n[2] \"podcast <- prepared -> exam\"\n\n$open\n[1] TRUE TRUE\n```\n\n\n:::\n:::\n\n:::\n\nBackdoor paths pollute the statistical association between `podcast` and `exam`, so we must account for them.\n`ggdag_adjustment_set()` visualizes any valid adjustment sets implied by the DAG.\n@fig-podcast-adustment-set shows adjusted variables as squares.\nAny arrows coming out of adjusted variables are removed from the DAG because the path is no longer open at that variable.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggdag_adjustment_set(\n podcast_dag,\n text = FALSE,\n use_labels = \"label\"\n)\n```\n\n::: {.cell-output-display}\n![A visualization of the minimal adjustment set for the podcast-exam DAG. 
If this DAG is correct, two variables are required to block the backdoor paths: `mood` and `prepared`.](05-dags_files/figure-html/fig-podcast-adustment-set-1.png){#fig-podcast-adustment-set fig-align='center' width=384}\n:::\n:::\n\n\n@fig-podcast-adustment-set shows the *minimal adjustment set*.\nBy default, ggdag returns the set(s) that can close all backdoor paths with the fewest number of variables possible.\nIn this DAG, that's just one set: `mood` and `prepared`.\nThis set makes sense because there are two backdoor paths, and the only other variables on them besides the exposure and outcome are these two variables.\nSo, at minimum, we must account for both to get a valid estimate.\n\n::: callout-tip\n`ggdag()` and friends usually use `tidy_dagitty()` and `dag_*()` or `node_*()` functions to change the underlying data frame.\nSimilarly, the quick plotting functions use ggdag's geoms to visualize the resulting DAG(s).\nIn other words, you can use the same data manipulation and visualization strategies that you use day-to-day directly with ggdag.\n\nHere's a condensed version of what `ggdag_adjustment_set()` is doing:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npodcast_dag_tidy |>\n # add adjustment sets to data\n dag_adjustment_sets() |>\n ggplot(aes(\n x = x,\n y = y,\n xend = xend,\n yend = yend,\n color = adjusted,\n shape = adjusted\n )) +\n # ggdag's custom geoms: add nodes, edges, and labels\n geom_dag_point() +\n # remove adjusted paths\n geom_dag_edges_link(data = \\(.df) filter(.df, adjusted != \"adjusted\")) +\n geom_dag_label_repel() +\n # you can use any ggplot function, too\n facet_wrap(~set) +\n scale_shape_manual(values = c(adjusted = 15, unadjusted = 19))\n```\n\n::: {.cell-output-display}\n![](05-dags_files/figure-html/unnamed-chunk-22-1.png){fig-align='center' width=432}\n:::\n:::\n\n:::\n\nMinimal adjustment sets are only one type of valid adjustment set [@vanderzander2019].\nSometimes, other combinations of variables can get us an unbiased effect estimate.\nTwo other options available in ggdag are full adjustment sets and canonical adjustment sets.\nFull adjustment sets are every combination of variables that result in a valid set.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggdag_adjustment_set(\n podcast_dag,\n text = FALSE,\n use_labels = \"label\",\n # get full adjustment sets\n type = \"all\"\n)\n```\n\n::: {.cell-output-display}\n![All valid adjustment sets for `podcast_dag`.](05-dags_files/figure-html/fig-adustment-set-all-1.png){#fig-adustment-set-all fig-align='center' width=624}\n:::\n:::\n\n\nIt turns out that we can also control for `humor`.\n\nCanonical adjustment sets are a bit more complex: they are all possible ancestors of the exposure and outcome minus any likely descendants.\nIn fully saturated DAGs (DAGs where every node causes anything that comes after it in time), the canonical adjustment set is the minimal adjustment set.\n\n::: callout-tip\nMost of the functions in ggdag use dagitty underneath the hood.\nIt's often helpful to call dagitty functions directly.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nadjustmentSets(podcast_dag, type = \"canonical\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ humor, mood, prepared }\n```\n\n\n:::\n:::\n\n:::\n\nUsing our proposed DAG, let's simulate some data to see how accounting for the minimal adjustment set might occur in practice.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(10)\nsim_data <- podcast_dag |>\n simulate_data()\n```\n:::\n\n::: 
{.cell}\n\n```{.r .cell-code}\nsim_data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 500 × 5\n exam humor mood podcast prepared\n <dbl> <dbl> <dbl> <dbl> <dbl>\n 1 -1.17 -0.275 0.00523 0.555 -0.224 \n 2 -1.19 -0.308 0.224 -0.594 -0.980 \n 3 0.613 -1.93 -0.624 -0.0392 -0.801 \n 4 0.0643 -2.88 -0.253 0.802 0.957 \n 5 -0.376 2.35 0.738 0.0828 0.843 \n 6 0.833 -1.24 0.899 1.05 0.217 \n 7 -0.451 1.40 -0.422 0.125 -0.819 \n 8 2.12 -0.114 -0.895 -0.569 0.000869\n 9 0.938 -0.205 -0.299 0.230 0.191 \n10 -0.207 -0.733 1.22 -0.433 -0.873 \n# ℹ 490 more rows\n```\n\n\n:::\n:::\n\n\nSince we have simulated this data, we know that this is a case where *standard methods will succeed* (see @sec-standard) and, therefore, we can estimate the causal effect using a basic linear regression model.\n@fig-dag-sim shows a forest plot of the simulated data based on our DAG.\nNotice that the model that only included the exposure resulted in a spurious effect (an estimate of -0.1 when we know the truth is 0).\nIn contrast, the model that adjusted for the two variables as suggested by `ggdag_adjustment_set()` is not spurious (much closer to 0).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Model that does not close backdoor paths\nlibrary(broom)\nunadjusted_model <- lm(exam ~ podcast, sim_data) |>\n tidy(conf.int = TRUE) |>\n filter(term == \"podcast\") |>\n mutate(formula = \"podcast\")\n\n## Model that closes backdoor paths\nadjusted_model <- lm(exam ~ podcast + mood + prepared, sim_data) |>\n tidy(conf.int = TRUE) |>\n filter(term == \"podcast\") |>\n mutate(formula = \"podcast + mood + prepared\")\n\nbind_rows(\n unadjusted_model,\n adjusted_model\n) |>\n ggplot(aes(x = estimate, y = formula, xmin = conf.low, xmax = conf.high)) +\n geom_vline(xintercept = 0, linewidth = 1, color = \"grey80\") +\n geom_pointrange(fatten = 3, size = 1) +\n theme_minimal(18) +\n labs(\n y = NULL,\n caption = \"correct effect size: 0\"\n )\n```\n\n::: {.cell-output-display}\n![Forest plot of simulated data based on the DAG described in @fig-dag-podcast.](05-dags_files/figure-html/fig-dag-sim-1.png){#fig-dag-sim width=672}\n:::\n:::\n\n\n## Structures of Causality\n\n### Advanced Confounding\n\nIn `podcast_dag`, `mood` and `prepared` were *direct* confounders: an arrow was going directly from them to `podcast` and `exam`.\nOften, backdoor paths are more complex.\nLet's consider such a case by adding two new variables: `alertness` and `skills_course`.\n`alertness` represents the feeling of alertness from a good mood, thus the arrow from `mood` to `alertness`.\n`skills_course` represents whether the student took a College Skills Course and learned time management techniques.\nNow, `skills_course` is what frees up the time both to listen to a podcast and to prepare for the exam.\n`mood` and `prepared` are no longer direct confounders: they are two variables along a more complex backdoor path.\nAdditionally, we've added an arrow going from `humor` to `mood`.\nLet's take a look at @fig-podcast_dag2.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag2 <- dagify(\n podcast ~ mood + humor + skills_course,\n alertness ~ mood,\n mood ~ humor,\n prepared ~ skills_course,\n exam ~ alertness + prepared,\n coords = time_ordered_coords(),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n alertness = \"alertness\",\n skills_course = \"college\\nskills course\",\n humor = \"humor\",\n prepared = \"prepared\"\n )\n)\n\nggdag(podcast_dag2, use_labels = \"label\", text = 
FALSE)\n```\n\n::: {.cell-output-display}\n![An expanded version of `podcast_dag` that includes two additional variables: `skills_course`, representing a College Skills Course, and `alertness`.](05-dags_files/figure-html/fig-podcast_dag2-1.png){#fig-podcast_dag2 width=480}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\nNow there are *three* backdoor paths we need to close: `podcast <- humor -> mood -> alertness -> exam`, `podcast <- mood -> alertness -> exam`, and `podcast <- skills_course -> prepared -> exam`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_paths(podcast_dag2, use_labels = \"label\", text = FALSE, shadow = TRUE)\n```\n\n::: {.cell-output-display}\n![Three open paths in `podcast_dag2`. Since there is no effect of `podcast` on `exam`, all three are backdoor paths that must be closed to get the correct effect.](05-dags_files/figure-html/fig-podcast_dag2-paths-1.png){#fig-podcast_dag2-paths width=1056}\n:::\n:::\n\n\nThere are four minimal adjustment sets to close all three paths (and eighteen full adjustment sets!).\nThe minimal adjustment sets are `alertness + prepared`, `alertness + skills_course`, `mood + prepared`, and `mood + skills_course`.\nWe can now block the open paths in several ways.\n`mood` and `prepared` still work, but we've got other options now.\nNotably, `prepared` and `alertness` could happen at the same time or even after `podcast`.\n`skills_course` and `mood` still happen before both `podcast` and `exam`, so the idea is still the same: the confounding pathway starts before the exposure and outcome.\n
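\nWe can also print the adjustment sets directly; if the DAG is specified as above, `adjustmentSets()` should list all four:\n\n::: {.cell}\n\n```{.r .cell-code}\nadjustmentSets(podcast_dag2)\n#> { alertness, prepared }\n#> { alertness, skills_course }\n#> { mood, prepared }\n#> { mood, skills_course }\n```\n:::\n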
\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_adjustment_set(podcast_dag2, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![Valid minimal adjustment sets that will close the backdoor paths in @fig-podcast_dag2-paths.](05-dags_files/figure-html/fig-podcast_dag2-set-1.png){#fig-podcast_dag2-set width=672}\n:::\n:::\n\n\nDeciding between these adjustment sets is a matter of judgment: if all data are perfectly measured, the DAG is correct, and we've modeled them correctly, then it doesn't matter which we use.\nEach adjustment set will result in an unbiased estimate.\nAll three of those assumptions are usually untrue to some degree.\nLet's consider the path via `skills_course` and `prepared`.\nIt may be that we are better able to assess whether or not someone took the College Skills Course than how prepared for the exam they are.\nIn that case, an adjustment set with `skills_course` is a better option.\nBut perhaps we better understand the relationship between preparedness and exam results.\nIf we have it measured, controlling for that might be better.\nWe could get the best of both worlds by including both variables: between the better measurement of `skills_course` and the better modeling of `prepared`, we might have a better chance of minimizing confounding from this path.\n\n### Selection Bias and Mediation\n\nSelection bias is another name for the type of bias that is induced by adjusting for a collider [@lu2022].\nIt's called \"selection bias\" because a common form of collider-induced bias is a variable inherently stratified upon by the design of the study---selection *into* the study.\nLet's consider a case based on the original `podcast_dag` but with one additional variable: whether or not the student showed up to the exam.\nNow, there is an indirect effect of `podcast` on `exam`: listening to a podcast influences whether or not students attend the exam.\nThe true result of `exam` is missing for those who didn't show up; by studying the group of people who *did* show up, we are inherently stratifying on this variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag3 <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared + showed_up,\n showed_up ~ podcast + mood + prepared,\n coords = time_ordered_coords(\n list(\n # time point 1\n c(\"prepared\", \"humor\", \"mood\"),\n # time point 2\n \"podcast\",\n \"showed_up\",\n # time point 3\n \"exam\"\n )\n ),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n humor = \"humor\",\n prepared = \"prepared\",\n showed_up = \"showed up\"\n )\n)\nggdag(podcast_dag3, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![Another variant of `podcast_dag`, this time including the inherent stratification on those who appear for the exam. There is still no direct effect of `podcast` on `exam`, but there is an indirect effect via `showed_up`.](05-dags_files/figure-html/fig-podcast_dag3-1.png){#fig-podcast_dag3 width=432}\n:::\n:::\n\n\nThe problem is that `showed_up` is both a collider and a mediator: stratifying on it induces a relationship between many of the variables in the DAG but blocks the indirect effect of `podcast` on `exam`.\nLuckily, the adjustment sets can handle the first problem; because `showed_up` happens *before* `exam`, we're less at risk of collider bias between the exposure and outcome.\nUnfortunately, we cannot calculate the total effect of `podcast` on `exam` because part of the effect is missing: the indirect effect is closed at `showed_up`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag3 |>\n adjust_for(\"showed_up\") |>\n ggdag_adjustment_set(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![The adjustment set for `podcast_dag3` given that the data are inherently conditioned on showing up to the exam. In this case, there is no way to recover an unbiased estimate of the total effect of `podcast` on `exam`.](05-dags_files/figure-html/fig-podcast_dag3-as-1.png){#fig-podcast_dag3-as width=432}\n:::\n:::\n\n\nSometimes, you can still estimate effects in this situation by changing the effect you wish to estimate.\nWe can't calculate the total effect because we are missing the indirect effect, but we can still calculate the direct effect of `podcast` on `exam`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag3 |>\n adjust_for(\"showed_up\") |>\n ggdag_adjustment_set(effect = \"direct\", text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![The adjustment set for `podcast_dag3` when targeting a different effect. 
There is one minimal adjustment set that we can use to estimate the direct effect of `podcast` on `exam`.](05-dags_files/figure-html/fig-podcast_dag3-direct-1.png){#fig-podcast_dag3-direct width=432}\n:::\n:::\n\n\n#### M-Bias and Butterfly Bias {#sec-m-bias}\n\nA particular case of selection bias that you'll often see people talk about is *M-bias*.\nIt's called M-bias because it looks like an M when arranged top to bottom.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nm_bias() |>\n ggdag()\n```\n\n::: {.cell-output-display}\n![A DAG representing M-Bias, a situation where a collider predates the exposure and outcome.](05-dags_files/figure-html/fig-m-bias-1.png){#fig-m-bias width=384}\n:::\n:::\n\n\n::: callout-tip\nggdag has several quick-DAGs for demonstrating basic causal structures, including `confounder_triangle()`, `collider_triangle()`, `m_bias()`, and `butterfly_bias()`.\n:::\n\nWhat's theoretically interesting about M-bias is that `m` is a collider but occurs before `x` and `y`.\nRemember that association is blocked at a collider, so there is no open path between `x` and `y`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaths(m_bias())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$paths\n[1] \"x <- a -> m <- b -> y\"\n\n$open\n[1] FALSE\n```\n\n\n:::\n:::\n\n\nLet's focus on the `mood` path of the podcast-exam DAG.\nWhat if we were wrong about mood, and the actual relationship was M-shaped?\nLet's say that, rather than causing `podcast` and `exam`, `mood` was itself caused by two mutual causes of `podcast` and `exam`, `u1` and `u2`, as in @fig-podcast_dag4.\nWe don't know what `u1` and `u2` are, and we don't have them measured.\nAs above, there are no open paths in this subset of the DAG.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag4 <- dagify(\n podcast ~ u1,\n exam ~ u2,\n mood ~ u1 + u2,\n coords = time_ordered_coords(list(\n c(\"u1\", \"u2\"),\n \"mood\",\n \"podcast\",\n \"exam\"\n )),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n u1 = \"unmeasured\",\n u2 = \"unmeasured\"\n ),\n # we don't have them measured\n latent = c(\"u1\", \"u2\")\n)\n\nggdag(podcast_dag4, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![A reconfiguration of @fig-dag-podcast where `mood` is a collider on an M-shaped path.](05-dags_files/figure-html/fig-podcast_dag4-1.png){#fig-podcast_dag4 width=528}\n:::\n:::\n\n\nThe problem arises when we think our original DAG is the right DAG: `mood` is in the adjustment set, so we control for it.\nBut this induces bias!\nIt opens up a path between `u1` and `u2`, thus creating a path from `podcast` to `exam`.\nIf we had either `u1` or `u2` measured, we could adjust for them to close this path, but we don't.\nThere is no way to close this open path.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag4 |>\n adjust_for(\"mood\") |>\n ggdag_adjustment_set(use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![The adjustment set where `mood` is a collider. 
If we control for `mood` and don't know about or have the unmeasured causes of `mood`, we have no means of closing the backdoor path opened by adjusting for a collider.](05-dags_files/figure-html/fig-podcast_dag4-as-1.png){#fig-podcast_dag4-as width=528}\n:::\n:::\n\n\nOf course, the best thing to do here is not to control for `mood` at all.\nSometimes, though, that is not an option.\nImagine if, instead of `mood`, this turned out to be the real structure for `showed_up`: since we inherently control for `showed_up`, and we don't have the unmeasured variables, our study results will always be biased.\nIt's essential to know whether we're in that situation so we can address it with a sensitivity analysis to gauge just how biased the effect would be.\n\nLet's consider a variation on M-bias where `mood` causes `podcast` and `exam`, and `u1` and `u2` are mutual causes of `mood` and the exposure and outcome.\nThis arrangement is sometimes called butterfly or bowtie bias, again because of its shape.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbutterfly_bias(x = \"podcast\", y = \"exam\", m = \"mood\", a = \"u1\", b = \"u2\") |>\n ggdag(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![In butterfly bias, `mood` is both a collider and a confounder. Controlling for the confounding induced by `mood` opens a new pathway because we've also conditioned on a collider. We can't properly close all backdoor paths without either `u1` or `u2`.](05-dags_files/figure-html/fig-butterfly_bias-1.png){#fig-butterfly_bias width=480}\n:::\n:::\n\n\nNow, we're in a challenging position: we need to control for `mood` because it's a confounder, but controlling for `mood` opens up the pathway from `u1` to `u2`.\nBecause we don't have either variable measured, we can't then close the path opened by conditioning on `mood`.\nWhat should we do?\nIt turns out that, when in doubt, controlling for `mood` is the better of the two options: confounding bias tends to be worse than collider bias, and M-shaped collider structures are sensitive to slight deviations (e.g., if this is not the exact structure, the bias is often not as bad) [@DingMiratrix2015].\n
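\nWe can check the claim about `u1` and `u2` directly with dagitty. A quick sketch, assuming dagitty is loaded: with the two unmeasured causes available, we'd expect valid adjustment sets that pair `mood` with one of them; marking them as latent should leave no valid set.\n\n```r\nlibrary(dagitty)\n\nbutterfly <- butterfly_bias(x = \"podcast\", y = \"exam\", m = \"mood\", a = \"u1\", b = \"u2\")\n\n# with u1 and u2 measured, we expect sets that pair mood with one of them\nadjustmentSets(butterfly, exposure = \"podcast\", outcome = \"exam\")\n\n# marking u1 and u2 as unmeasured should leave no valid adjustment set\nlatents(butterfly) <- c(\"u1\", \"u2\")\nadjustmentSets(butterfly, exposure = \"podcast\", outcome = \"exam\")\n```\n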
\nAnother common form of selection bias is from *loss to follow-up*: people drop out of a study in a way that is related to the exposure and outcome.\nWe'll come back to this topic in [Chapter -@sec-longitudinal].\n\n### Causes of the exposure, causes of the outcome\n\nLet's consider one other pair of important causal structures: causes of the exposure that are not causes of the outcome, and their opposites, causes of the outcome that are not causes of the exposure.\nLet's add a variable, `grader_mood`, to the original DAG.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag5 <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared + grader_mood,\n coords = time_ordered_coords(\n list(\n # time point 1\n c(\"prepared\", \"humor\", \"mood\"),\n # time point 2\n c(\"podcast\", \"grader_mood\"),\n # time point 3\n \"exam\"\n )\n ),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"student\\nmood\",\n humor = \"humor\",\n prepared = \"prepared\",\n grader_mood = \"grader\\nmood\"\n )\n)\nggdag(podcast_dag5, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![A DAG containing a cause of the exposure that is not a cause of the outcome (`humor`) and a cause of the outcome that is not a cause of the exposure (`grader_mood`).](05-dags_files/figure-html/fig-podcast_dag5-1.png){#fig-podcast_dag5 width=480}\n:::\n:::\n\n\nThere are now two variables that aren't related to *both* the exposure and the outcome: `humor`, which causes `podcast` but not `exam`, and `grader_mood`, which causes `exam` but not `podcast`.\nLet's start with `humor`.\n\nVariables that cause the exposure but not the outcome are also called *instrumental variables* (IVs).\nIVs are an unusual case: under certain conditions, controlling for them can make other types of bias worse.\nWhat's unique about them is that IVs can *also* be used in an entirely different approach to estimating an unbiased effect of the exposure on the outcome.\nIVs are commonly used this way in econometrics and are increasingly popular in other areas.\nIn short, IV analysis allows us to estimate the causal effect using a different set of assumptions than the approaches we've talked about thus far.\nSometimes, a problem intractable with propensity score methods can be addressed using IVs, and vice versa.\nWe'll talk more about IVs in @sec-iv-friends.\n\nSo, if you're *not* using IV methods, should you include an IV in a model meant to address confounding?\nIf you're unsure whether the variable is an IV or not, you should probably add it to your model: it's more likely to be a confounder than an IV, and, it turns out, the bias from adding an IV is usually small in practice.\nSo, as with adjusting for a potential M-structure variable, the risk of bias from confounding is worse [@Myers2011].\n\nNow, let's talk about the opposite of an IV: a cause of the outcome that is not a cause of the exposure.\nThese variables are sometimes called *competing exposures* (because they also cause the outcome) or *precision variables* (because, as we'll see, they increase the precision of causal estimates).\nWe'll call them precision variables because we're concerned about their relationship to the research question at hand, not to another research question where they are the exposures [@Brookhart2006].\n\nLike IVs, precision variables do not occur along paths from the exposure to the outcome.\nThus, including them is not necessary.\nUnlike IVs, though, including precision variables is beneficial.\nIncluding other causes of the outcome helps a statistical model capture some of its variation.\nThis doesn't impact the point estimate of the effect, but it does reduce the variance, resulting in smaller standard errors and narrower confidence intervals.\nThus, we recommend including them when possible.\n\nSo, even though we don't need to control for `grader_mood`, if we have it in the data set, we should.\nSimilarly, `humor` is not a good addition to the model unless we think it really might be a confounder; if it is a valid instrument, we might want to consider using IV methods to estimate the effect instead.\n
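\nTo build intuition, here's a small simulation of our own (the variable names echo the DAG, but the data and effect sizes are invented): the precision variable leaves the point estimate for the exposure essentially unchanged while shrinking its standard error.\n\n```r\nset.seed(2024)\nn <- 1000\npodcast <- rbinom(n, 1, 0.5)\ngrader_mood <- rnorm(n)\n# assume no true effect of podcast; grader_mood drives much of exam's variation\nexam <- 2 * grader_mood + rnorm(n)\n\n# the point estimate is comparable, but the standard error is smaller\n# once we include the precision variable\nsummary(lm(exam ~ podcast))$coefficients[\"podcast\", ]\nsummary(lm(exam ~ podcast + grader_mood))$coefficients[\"podcast\", ]\n```\n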
\n### Measurement Error and Missingness\n\nDAGs can also help us understand the bias arising from mismeasurements in the data, including the worst mismeasurement: not measuring a variable at all.\nWe'll cover these topics in [Chapter -@sec-missingness], but the basic idea is that by separating the actual value from the observed value, we can better understand how such biases may behave [@Hernán2009].\nHere's a basic example of a bias called *recall bias*.\nRecall bias is when the outcome influences a participant's memory of the exposure, so it's a particular problem in retrospective studies where the earlier exposure is not recorded until after the outcome happens.\nAn example of when this can occur is a case-control study of cancer.\nSomeone *with* cancer may be more motivated to ruminate on their past exposures than someone *without* cancer.\nSo, their memory of a given exposure may be more refined than that of someone without.\nBy conditioning on the observed version of the exposure, we open up many collider paths.\nUnfortunately, there is no way to close them all.\nIf this is the case, we must investigate how severe the bias would be in practice.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nerror_dag <- dagify(\n exposure_observed ~ exposure_real + exposure_error,\n outcome_observed ~ outcome_real + outcome_error,\n outcome_real ~ exposure_real,\n exposure_error ~ outcome_real,\n labels = c(\n exposure_real = \"Exposure\\n(truth)\",\n exposure_error = \"Measurement Error\\n(exposure)\",\n exposure_observed = \"Exposure\\n(observed)\",\n outcome_real = \"Outcome\\n(truth)\",\n outcome_error = \"Measurement Error\\n(outcome)\",\n outcome_observed = \"Outcome\\n(observed)\"\n ),\n exposure = \"exposure_real\",\n outcome = \"outcome_real\",\n coords = time_ordered_coords()\n)\n\nerror_dag |>\n ggdag(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![A DAG representing measurement error in observing the exposure and outcome. In this case, the outcome impacts the participant's memory of the exposure, also known as recall bias.](05-dags_files/figure-html/fig-error_dag-1.png){#fig-error_dag width=528}\n:::\n:::\n\n\n## Recommendations in building DAGs\n\nIn principle, using DAGs is easy: specify the causal relationships you think exist and then query the DAG for information like valid adjustment sets.\nIn practice, assembling DAGs takes considerable time and thought.\nNext to defining the research question itself, it's one of the most challenging steps in making causal inferences.\nVery little guidance exists on best practices in assembling DAGs.\n@Tennant2021 collected data on DAGs in applied health research to better understand how researchers used them.\n@tbl-dag-properties shows some information they collected: the median number of nodes and arcs in a DAG, their ratio, the saturation percent of the DAG, and how many were fully saturated.\nSaturating DAGs means adding all possible arrows going forward in time, e.g., in a fully saturated DAG, any given variable at time point 1 has arrows going to all variables in future time points, and so on.\nMost DAGs were only about half saturated, and very few were fully saturated.\n\nOnly about half of the papers using DAGs reported the adjustment set used.\nIn other words, researchers presented their assumptions about the research question but not the implications for the modeling stage, nor whether they used a valid adjustment set.\nSimilarly, the majority of studies did not report the estimand of interest.\n\n::: callout-note\nThe estimand is the target of interest in terms of what we're trying to estimate, as discussed briefly in [Chapter -@sec-whole-game].\nWe'll discuss estimands in detail in [Chapter -@sec-estimands].\n:::\n\n\n::: {#tbl-dag-properties .cell tbl-cap='A table of DAG properties in applied health research. Number of nodes and arcs are the median number of variables and arrows in the analyzed DAGs, while the Node to Arc ratio is their ratio. Saturation proportion is the proportion of all possible arrows going forward in time to other included variables. Fully saturated DAGs are those that include all such arrows. 
The researchers also analyzed whether studies reported their estimands and adjustment sets.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
  <thead>
    <tr><th>Characteristic</th><th>N = 144<sup>1</sup></th></tr>
  </thead>
  <tbody>
    <tr><td colspan="2"><strong>DAG properties</strong></td></tr>
    <tr><td>Number of Nodes</td><td>12 (9, 16)</td></tr>
    <tr><td>Number of Arcs</td><td>29 (19, 41)</td></tr>
    <tr><td>Node to Arc Ratio</td><td>2.30 (1.78, 3.00)</td></tr>
    <tr><td>Saturation Proportion</td><td>0.46 (0.31, 0.67)</td></tr>
    <tr><td>Fully Saturated</td><td></td></tr>
    <tr><td>&emsp;Yes</td><td>4 (3%)</td></tr>
    <tr><td>&emsp;No</td><td>140 (97%)</td></tr>
    <tr><td colspan="2"><strong>Reporting</strong></td></tr>
    <tr><td>Reported Estimand</td><td></td></tr>
    <tr><td>&emsp;Yes</td><td>40 (28%)</td></tr>
    <tr><td>&emsp;No</td><td>104 (72%)</td></tr>
    <tr><td>Reported Adjustment Set</td><td></td></tr>
    <tr><td>&emsp;Yes</td><td>80 (56%)</td></tr>
    <tr><td>&emsp;No</td><td>64 (44%)</td></tr>
  </tbody>
  <tfoot>
    <tr><td colspan="2"><sup>1</sup> Median (IQR); n (%)</td></tr>
  </tfoot>
</table>
\n```\n\n:::\n:::\n\n\nIn this section, we'll offer some advice from @Tennant2021 and our own experience assembling DAGs.\n\n### Iterate early and often {#sec-dags-iterate}\n\nOne of the best things you can do for the quality of your results is to make the DAG before you conduct the study, ideally before you even collect the data.\nIf you're already working with your data, at minimum, build your DAG before doing data analysis.\nThis advice is similar in spirit to pre-registered analysis plans: declaring your assumptions ahead of time can help clarify what you need to do, reduce the risk of overfitting (e.g., determining confounders incorrectly from the data), and give you time to get feedback on your DAG.\n\nThis last benefit is significant: you should ideally democratize your DAG.\nShare it early and often with others who are experts on the data, domain, and models.\nIt's natural to create a DAG, present it to your colleagues, and realize you have missed something important.\nSometimes, you will only agree on some details of the structure.\nThat's a good thing: you now know where there is uncertainty in your DAG.\nYou can then examine the results from multiple plausible DAGs or address the uncertainty with sensitivity analyses.\n\nIf you have more than one candidate DAG, check their adjustment sets, as in the sketch below.\nIf two DAGs have overlapping adjustment sets, focus on those sets; then, you can move forward in a way that satisfies the plausible assumptions you have.\n
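\nOne way to do this is to list each candidate DAG's valid adjustment sets and take their intersection. A sketch with two hypothetical DAGs, `dag_a` and `dag_b` (invented here; they differ by a single arrow):\n\n```r\nlibrary(dagitty)\n\n# two hypothetical candidate DAGs that differ by one arrow\ndag_a <- dagify(y ~ x + q, x ~ q, q ~ p, exposure = \"x\", outcome = \"y\")\ndag_b <- dagify(y ~ x + q + p, x ~ q, q ~ p, exposure = \"x\", outcome = \"y\")\n\n# represent each valid set as a string so the sets are easy to compare\nset_labels <- function(dag) {\n  sapply(adjustmentSets(dag, type = \"all\"), function(s) paste(sort(s), collapse = \", \"))\n}\n\n# adjustment sets that are valid under both candidate structures\nintersect(set_labels(dag_a), set_labels(dag_b))\n```\n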
\n### Consider your question\n\nAs we saw in @fig-podcast_dag3, some questions can be challenging to answer with certain data, while others are more approachable.\nYou should consider precisely what it is you want to estimate.\nDefining your target estimate is an important topic and the subject of [Chapter -@sec-estimands].\n\nAnother important detail about how your DAG relates to your question is the population and time.\nMany causal structures are not static over time and space.\nConsider lung cancer: the distribution of causes of lung cancer was considerably different before the spread of smoking.\nIn medieval Japan, centuries before tobacco arrived from the Americas, the causal structure for lung cancer would have been substantially different from what it is in Japan today, both in terms of tobacco use and other factors (the age distribution of the population, etc.).\n\nThe same is true for confounders.\nEven if something *can* cause the exposure and outcome, if the prevalence of that thing is zero in the population you're analyzing, it's irrelevant to the causal question.\nIt may also be that, in some populations, it doesn't affect one of the two.\nThe reverse is also true: something might be unique to the target population.\nThe use of tobacco in North America several centuries ago was unique among the world population, even though ceremonial tobacco use was quite different from modern recreational use.\nMany changes won't happen as dramatically as across centuries, but sometimes, they do, e.g., if regulation in one country effectively eliminates the population's exposure to something.\n\n### Order nodes by time\n\nAs discussed earlier, we recommend ordering your variables by time, either left-to-right or top-to-bottom.\nThere are two reasons for this.\nFirst, time ordering is an integral part of your assumptions.\nAfter all, something happening before another thing is a requirement for it to be a cause.\nThinking this through carefully will clarify your DAG and the variables you need to address.\n\nSecond, after a certain level of complexity, it's easier to read a DAG when arranged by time because you have to think less about that dimension; it's inherent to the layout.\nThe time ordering algorithm in ggdag automates much of this for you, although, as we saw earlier, it's sometimes helpful to give it more information about the order.\n\nA related topic is feedback loops [@murray2022].\nOften, we think about two things that mutually cause each other as happening in a circle, like global warming and A/C use (A/C use increases global warming, which makes it hotter, which increases A/C use, and so on).\nIt's tempting to visualize that relationship like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n ac_use ~ global_temp,\n global_temp ~ ac_use,\n labels = c(ac_use = \"A/C use\", global_temp = \"Global\\ntemperature\")\n) |>\n ggdag(layout = \"circle\", edge_type = \"arc\", text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![A DAG representing the reciprocal relationship between A/C use and global temperature because of global warming. Feedback loops are useful mental shorthands to compactly describe variables that affect each other over time, but they are not true causal diagrams.](05-dags_files/figure-html/fig-feedback-loop-1.png){#fig-feedback-loop width=432}\n:::\n:::\n\n\nFrom a DAG perspective, this is a problem because of the *A* part of *DAG*: it's cyclic!\nImportantly, though, it's also not correct from a causal perspective.\nFeedback loops are a shorthand for what really happens, which is that the two variables mutually affect each other *over time*.\nCausality only goes forward in time, so it doesn't make sense to go back and forth like in @fig-feedback-loop.\n\nThe real DAG looks something like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n global_temp_2000 ~ ac_use_1990 + global_temp_1990,\n ac_use_2000 ~ ac_use_1990 + global_temp_1990,\n global_temp_2010 ~ ac_use_2000 + global_temp_2000,\n ac_use_2010 ~ ac_use_2000 + global_temp_2000,\n global_temp_2020 ~ ac_use_2010 + global_temp_2010,\n ac_use_2020 ~ ac_use_2010 + global_temp_2010,\n coords = time_ordered_coords(),\n labels = c(\n ac_use_1990 = \"A/C use\\n(1990)\",\n global_temp_1990 = \"Global\\ntemperature\\n(1990)\",\n ac_use_2000 = \"A/C use\\n(2000)\",\n global_temp_2000 = \"Global\\ntemperature\\n(2000)\",\n ac_use_2010 = \"A/C use\\n(2010)\",\n global_temp_2010 = \"Global\\ntemperature\\n(2010)\",\n ac_use_2020 = \"A/C use\\n(2020)\",\n global_temp_2020 = \"Global\\ntemperature\\n(2020)\"\n )\n) |>\n ggdag(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![A DAG showing the relationship between A/C use and global temperature over time. 
The true causal relationship in a feedback loop goes *forward*.](05-dags_files/figure-html/fig-feedforward-1.png){#fig-feedforward width=480}\n:::\n:::\n\n\nThe two variables, rather than being in a feed*back* loop, are actually in a feed*forward* loop: they co-evolve over time.\nHere, we only show four discrete moments in time (the decades from 1990 to 2020), but of course, we could get much finer depending on the question and data.\n\nAs with any DAG, the proper analysis approach depends on the question.\nThe effect of A/C use in 2000 on the global temperature in 2020 produces a different adjustment set than the effect of the global temperature in 2000 on A/C use in 2020.\nSimilarly, whether we also model this change over time or just those two time points depends on the question.\nOften, these feedforward relationships require you to address *time-varying* confounding, which we'll discuss in [Chapter -@sec-longitudinal].\n\n### Consider the whole data collection process\n\nAs @fig-podcast_dag3 showed us, it's essential to consider the *way* we collected data as much as the causal structure of the question.\nConsidering the whole data collection process is particularly important if you're working with \"found\" data---a data set not intentionally collected to answer the research question.\nWe are always inherently conditioning on the data we have vs. the data we don't have.\nIf other variables in the causal structure influenced the data collection process, you need to consider the impact.\nDo you need to control for additional variables?\nDo you need to change the effect you are trying to estimate?\nCan you answer the question at all?\n\n::: callout-tip\n## What about case-control studies?\n\nA standard study design in epidemiology is the case-control study.\nCase-control studies are beneficial when the outcome under study is rare or takes a very long time to happen (like many types of cancer).\nParticipants are selected into the study based on their outcome: once a person has an event, they are entered as a case and matched with a control who hasn't had the event.\nOften, they are matched on other factors as well.\n\nMatched case-control studies are selection biased by design [@mansournia2013].\nIn @fig-case-control, when we condition on selection into the study, we lose the ability to close all backdoor paths, even if we control for `confounder`.\nFrom the DAG, it would appear that the entire design is invalid!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n outcome ~ confounder + exposure,\n selection ~ outcome + confounder,\n exposure ~ confounder,\n exposure = \"exposure\",\n outcome = \"outcome\",\n coords = time_ordered_coords()\n) |>\n ggdag(edge_type = \"arc\", text_size = 2.2)\n```\n\n::: {.cell-output-display}\n![A DAG representing a matched case-control study. In such a study, selection is determined by outcome status and any matched confounders. Selection into the study is thus a collider. 
Since it is inherently stratified on who is actually in the study, such data are limited in the types of causal effects they can estimate.](05-dags_files/figure-html/fig-case-control-1.png){#fig-case-control width=432}\n:::\n:::\n\n\nLuckily, this isn't wholly true.\nCase-control studies are limited in the type of causal effects they can estimate (causal odds ratios, which under some circumstances approximate causal risk ratios).\nWith careful study design and sampling, the math works out such that these estimates are still valid.\nExactly how and why case-control studies work is beyond the scope of this book, but they are a remarkably clever design.\n:::\n\n### Include variables you don't have\n\nIt's critical that you include *all* variables important to the causal structure, not just the variables you have measured in your data.\nggdag can mark variables as unmeasured (\"latent\"); it will then return only usable adjustment sets, e.g., those without the unmeasured variables.\nOf course, the best thing to do is to use DAGs to help you understand what to measure in the first place, but there are many reasons why your data might be different.\nEven data intentionally collected for the research question might lack a variable that is only discovered to be a confounder after data collection.\n\nFor instance, if we have a DAG where `exposure` and `outcome` have a confounding pathway consisting of `confounder1` and `confounder2`, we can control for either to successfully debias the estimate:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n outcome ~ exposure + confounder1,\n exposure ~ confounder2,\n confounder2 ~ confounder1,\n exposure = \"exposure\",\n outcome = \"outcome\"\n) |>\n adjustmentSets()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ confounder1 }\n{ confounder2 }\n```\n\n\n:::\n:::\n\n\nThus, if just one is missing (`latent`), we are still OK:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n outcome ~ exposure + confounder1,\n exposure ~ confounder2,\n confounder2 ~ confounder1,\n exposure = \"exposure\",\n outcome = \"outcome\",\n latent = \"confounder1\"\n) |>\n adjustmentSets()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ confounder2 }\n```\n\n\n:::\n:::\n\n\nBut if both are missing, there are no valid adjustment sets.\n\nWhen you don't have a variable measured, you still have a few options.\nAs mentioned above, you may be able to identify alternate adjustment sets.\nIf the missing variable is required to close all backdoor paths completely, you can and should conduct a sensitivity analysis to understand the impact of not having it.\nThis is the subject of [Chapter -@sec-sensitivity].\n\nUnder some lucky circumstances, you can also use a *proxy* confounder [@miao2018].\nA proxy confounder is a variable closely related to the confounder such that controlling for it controls for some of the effect of the missing variable.\nConsider an expansion of the fundamental confounding relationship where `q` has a cause, `p`, as in @fig-proxy-confounder.\nTechnically, if we don't have `q`, we can't close the backdoor path, and our effect will be biased.\nPractically, though, if `p` is highly correlated with `q`, it can serve as a method to reduce the confounding from `q`.\nYou can think of `p` as a mismeasured version of `q`; it will seldom wholly control for the bias via `q`, but it can help minimize it.\n
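\nHere's a toy simulation of that idea (ours; the effect sizes are invented). Adjusting for the proxy `p` doesn't fully remove the confounding from `q`, but it gets us closer to the true effect:\n\n```r\nset.seed(123)\nn <- 10000\np <- rnorm(n)\nq <- p + rnorm(n, sd = 0.5) # p is highly correlated with q\nx <- q + rnorm(n)\ny <- x + q + rnorm(n) # the true effect of x on y is 1\n\ncoef(lm(y ~ x))[[\"x\"]] # unadjusted: biased away from 1\ncoef(lm(y ~ x + p))[[\"x\"]] # proxy confounder: bias reduced\ncoef(lm(y ~ x + q))[[\"x\"]] # actual confounder: unbiased\n```\n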
\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n y ~ x + q,\n x ~ q,\n q ~ p,\n coords = time_ordered_coords()\n) |>\n ggdag(edge_type = \"arc\")\n```\n\n::: {.cell-output-display}\n![A DAG with a confounder, `q`, and a proxy confounder, `p`. The true adjustment set is `q`. Since `p` causes `q`, it contains information about `q` and can reduce the bias if we don't have `q` measured.](05-dags_files/figure-html/fig-proxy-confounder-1.png){#fig-proxy-confounder width=432}\n:::\n:::\n\n\n### Saturate your DAG, then prune\n\nIn discussing @tbl-dag-properties, we mentioned *saturated* DAGs.\nThese are DAGs where all possible arrows are included based on the time ordering, e.g., every variable causes the variables that come after it in time.\n\n*Not* including an arrow is a bigger assumption than including one.\nIn other words, your default should be to have an arrow from one variable to a future variable.\nThis default is counterintuitive to many people.\nHow can it be that we need to be so careful about assessing causal effects yet be so liberal in applying causal assumptions in the DAG?\nThe answer lies in the strength and prevalence of the cause.\nTechnically, the presence of an arrow means that *for at least a single observation*, the prior node causes the following node.\nThe arrow similarly says nothing about the strength of the relationship.\nSo, a minuscule causal effect on a single individual justifies the presence of an arrow.\nIn practice, such a case is probably not relevant.\nThere is *effectively* no arrow.\n\nThe more significant point, though, is that you should feel confident adding an arrow: the bar for justification is much lower than you might think.\nIn practice, it's helpful to 1) determine your time ordering, 2) saturate the DAG, and 3) prune out implausible arrows.\n\nLet's experiment by working through a saturated version of the podcast-exam DAG.\n\nFirst, the time ordering.\nPresumably, the student's sense of humor far predates the day of the exam.\nMood in the morning, too, predates listening to the podcast or the exam score, as does preparation.\nThe saturated DAG given this ordering is:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A saturated version of `podcast_dag`: variables have all possible arrows going forward to other variables over time.](05-dags_files/figure-html/fig-podcast_dag_sat-1.png){#fig-podcast_dag_sat width=528}\n:::\n:::\n\n\nThere are a few new arrows here.\nHumor now causes the other two confounders, as well as exam score.\nSome of these relationships make sense.\nSense of humor probably affects mood for some people.\nWhat about preparedness?\nThis relationship seems a little less plausible.\nSimilarly, we know that a sense of humor does not affect exam scores in this case because the grading is blinded.\nLet's prune those two arrows.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A pruned version of @fig-podcast_dag_sat: we've removed implausible arrows from the fully saturated DAG.](05-dags_files/figure-html/fig-podcast_dag_pruned-1.png){#fig-podcast_dag_pruned width=528}\n:::\n:::\n\n\nThis DAG seems more reasonable.\nSo, was our original DAG wrong?\nThat depends on several factors.\nNotably, both DAGs produce the same adjustment set: controlling for `mood` and `prepared` will give us an unbiased effect if either DAG is correct.\nEven if the new DAG were to produce a different adjustment set, whether the result is meaningfully different depends on the strength of the confounding.\n\n### Include instruments and precision variables\n\nTechnically, you do not need to include instrumental and precision variables in your DAG.\nThe adjustment sets will be the same with and without them.\nHowever, adding them is helpful for two reasons.\n
\nFirst, they make explicit your assumptions about how those variables relate to the variables under study.\nAs discussed above, *not* including an arrow is a more significant assumption than including one, so it's valuable information about how you think the causal structure operates.\nSecond, it affects your modeling decisions: you should always include precision variables in your model to reduce the variability of your estimate, so marking them in the DAG helps you identify them.\nInstruments are also helpful to see because they may guide alternative or complementary modeling strategies, as we'll discuss in @sec-evidence.\n\n### Focus on the causal structure, then consider measurement bias\n\nAs we saw above, missingness and measurement error can be a source of bias.\nAs we'll see in [Chapter -@sec-missingness], we have several strategies to approach such a situation.\nYet, almost everything we measure is inaccurate to some degree.\nThe true DAG for the data at hand inherently conditions on the measured version of variables.\nIn that sense, your data are always subtly wrong, a sort of unreliable narrator.\nWhen should we include this information in the DAG?\nWe recommend first focusing on the causal structure of the DAG as if you had perfectly measured each variable [@hernan2021].\nThen, consider how mismeasurement and missingness might affect the realized data, particularly as related to the exposure, outcome, and critical confounders.\nYou may prefer to present this as an alternative DAG to consider strategies for addressing the bias arising from those sources, e.g., imputation or sensitivity analyses.\nAfter all, the DAG in @fig-error_dag makes it look like the question is unanswerable because we have no method to close all backdoor paths.\nAs with all open paths, that depends on the severity of the bias and our ability to reckon with it.\n\n### Pick adjustment sets most likely to be successful\n\nOne area where measurement error is an important consideration is when picking an adjustment set.\nIn theory, if a DAG is correct, any valid adjustment set will produce an unbiased result.\nIn practice, variables have different levels of quality.\nPick an adjustment set most likely to succeed because its variables are accurately measured.\nSimilarly, non-minimal adjustment sets are helpful to consider because, together, several variables with measurement error along a backdoor path may be enough to minimize the practical bias resulting from that path.\n\nWhat if you don't have certain critical variables measured and thus do not have a valid adjustment set?\nIn that case, you should pick the adjustment set with the best chance of minimizing the bias from other backdoor paths.\nAll is not lost if you don't have every confounder measured: get the highest quality estimate you can, then conduct a sensitivity analysis about the unmeasured variables to understand the impact.\n\n### Use robustness checks\n\nFinally, we recommend checking your DAG for robustness.\nUnder most conditions, you can never verify that your DAG is correct, but you can use the implications of your DAG to support it.\nThree types of robustness checks can be helpful depending on the circumstances.\n\n1. **Negative controls** [@Lipsitch2010]. These come in two flavors: negative exposure controls and negative outcome controls. The idea is to find something associated with one but not the other, e.g., the outcome but not the exposure, so there should be no effect. Since there should be no effect, you now have a measure of how well you've controlled for *other* effects (e.g., the difference from null). Ideally, the confounders for the negative controls are similar to those of the research question.\n2. **DAG-data consistency** [@Textor2016]. Negative controls are an implication of your DAG. An extension of this idea is that there are *many* such implications. Because blocking a path removes statistical dependencies from that path, you can check those assumptions in several places in your DAG (see the sketch after this list).\n3. **Alternate adjustment sets**. Adjustment sets should give roughly the same answer because, outside of random and measurement errors, they are all sets that block backdoor paths. If more than one adjustment set seems reasonable, you can use that as a sensitivity analysis by checking multiple models.\n\nWe'll discuss these in detail in [Chapter -@sec-sensitivity].\nThe caveat here is that these checks should be complementary to your initial DAG, not a way of *replacing* it.\nIn fact, if you use more than one adjustment set during your analysis, you should report the results from all of them to avoid overfitting your results to your data.
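\nFor DAG-data consistency checks, dagitty can enumerate the conditional independencies your DAG implies and test them against your data. A minimal sketch, assuming a DAG called `dag` and a data frame `dag_data` containing its variables (both hypothetical names):\n\n```r\nlibrary(dagitty)\n\n# the conditional independencies the DAG implies\nimpliedConditionalIndependencies(dag)\n\n# test those implications against the data; estimates far from zero\n# suggest the data are inconsistent with the DAG\nlocalTests(dag, data = dag_data, type = \"cis\")\n```\n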
\n", "supporting": [ "05-dags_files" ], diff --git a/_freeze/chapters/05-dags/figure-html/fig-adustment-set-all-1.png b/_freeze/chapters/05-dags/figure-html/fig-adustment-set-all-1.png index cf4d1f74..94b08dfa 100644 Binary files a/_freeze/chapters/05-dags/figure-html/fig-adustment-set-all-1.png and b/_freeze/chapters/05-dags/figure-html/fig-adustment-set-all-1.png differ diff --git a/_freeze/chapters/05-dags/figure-html/fig-paths-podcast-1.png b/_freeze/chapters/05-dags/figure-html/fig-paths-podcast-1.png index 4700754c..8fbe988c 100644 Binary files a/_freeze/chapters/05-dags/figure-html/fig-paths-podcast-1.png and b/_freeze/chapters/05-dags/figure-html/fig-paths-podcast-1.png differ diff --git a/_freeze/chapters/05-dags/figure-html/fig-podcast-adustment-set-1.png b/_freeze/chapters/05-dags/figure-html/fig-podcast-adustment-set-1.png index 9e2b2cf4..81d43ddf 100644 Binary files a/_freeze/chapters/05-dags/figure-html/fig-podcast-adustment-set-1.png and b/_freeze/chapters/05-dags/figure-html/fig-podcast-adustment-set-1.png differ diff --git a/_freeze/chapters/05-dags/figure-html/fig-podcast_dag2-paths-1.png b/_freeze/chapters/05-dags/figure-html/fig-podcast_dag2-paths-1.png index 0254cf07..63eddb50 100644 Binary files a/_freeze/chapters/05-dags/figure-html/fig-podcast_dag2-paths-1.png and b/_freeze/chapters/05-dags/figure-html/fig-podcast_dag2-paths-1.png differ diff --git a/_freeze/chapters/05-dags/figure-html/fig-podcast_dag2-set-1.png b/_freeze/chapters/05-dags/figure-html/fig-podcast_dag2-set-1.png index 07f5a2b7..0c44bec4 100644 Binary files a/_freeze/chapters/05-dags/figure-html/fig-podcast_dag2-set-1.png and b/_freeze/chapters/05-dags/figure-html/fig-podcast_dag2-set-1.png differ diff --git a/_freeze/chapters/05-dags/figure-html/fig-podcast_dag3-as-1.png b/_freeze/chapters/05-dags/figure-html/fig-podcast_dag3-as-1.png index d2ada31a..5670a873 100644 Binary files a/_freeze/chapters/05-dags/figure-html/fig-podcast_dag3-as-1.png and b/_freeze/chapters/05-dags/figure-html/fig-podcast_dag3-as-1.png differ diff --git a/_freeze/chapters/05-dags/figure-html/fig-podcast_dag3-direct-1.png b/_freeze/chapters/05-dags/figure-html/fig-podcast_dag3-direct-1.png index b3290dc1..1f015093 100644 Binary files a/_freeze/chapters/05-dags/figure-html/fig-podcast_dag3-direct-1.png and b/_freeze/chapters/05-dags/figure-html/fig-podcast_dag3-direct-1.png differ diff 
--git a/_freeze/chapters/05-dags/figure-html/fig-podcast_dag4-as-1.png b/_freeze/chapters/05-dags/figure-html/fig-podcast_dag4-as-1.png index 15290ce7..aa8569d0 100644 Binary files a/_freeze/chapters/05-dags/figure-html/fig-podcast_dag4-as-1.png and b/_freeze/chapters/05-dags/figure-html/fig-podcast_dag4-as-1.png differ diff --git a/_freeze/chapters/05-dags/figure-html/unnamed-chunk-14-1.png b/_freeze/chapters/05-dags/figure-html/unnamed-chunk-14-1.png index 24bc6610..d7bc04dc 100644 Binary files a/_freeze/chapters/05-dags/figure-html/unnamed-chunk-14-1.png and b/_freeze/chapters/05-dags/figure-html/unnamed-chunk-14-1.png differ diff --git a/_freeze/chapters/06-not-just-a-stats-problem/execute-results/html.json b/_freeze/chapters/06-not-just-a-stats-problem/execute-results/html.json index b810fd31..b406984d 100644 --- a/_freeze/chapters/06-not-just-a-stats-problem/execute-results/html.json +++ b/_freeze/chapters/06-not-just-a-stats-problem/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "c348c5e891d5e67c7c0c1a4ee0d70664", + "hash": "76cbcd542e0b61fd4c0df43dce0d5f8c", "result": { - "markdown": "# Causal inference is not (just) a statistical problem {#sec-quartets}\n\n\n\n\n\n## The Causal Quartet\n\nWe now have the tools to look at something we've alluded to thus far in the book: causal inference is not (just) a statistical problem.\nOf course, we use statistics to answer causal questions.\nIt's necessary to answer most questions, even if the statistics are basic (as they often are in randomized designs).\nHowever, statistics alone do not allow us to address all of the assumptions of causal inference.\n\nIn 1973, Francis Anscombe introduced a set of four datasets called *Anscombe's Quartet*.\nThese data illustrated an important lesson: summary statistics alone cannot help you understand data; you must also visualize your data.\nIn the plots in @fig-anscombe, each data set has remarkably similar summary statistics, including means and correlations that are nearly identical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(quartets)\n\nanscombe_quartet |> \n ggplot(aes(x, y)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![Anscombe's Quartet, a set of four datasets with nearly identical summary statistics. Anscombe's point was that one must visualize the data to understand it.](06-not-just-a-stats-problem_files/figure-html/fig-anscombe-1.png){#fig-anscombe width=672}\n:::\n:::\n\n\nThe Datasaurus Dozen is a modern take on Anscombe's Quartet.\nThe mean, standard deviation, and correlation are nearly identical in each dataset, but the visualizations are very different.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(datasauRus)\n\n# roughly the same correlation in each dataset\ndatasaurus_dozen |> \n group_by(dataset) |> \n summarize(cor = round(cor(x, y), 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 13 × 2\n dataset cor\n \n 1 away -0.06\n 2 bullseye -0.07\n 3 circle -0.07\n 4 dino -0.06\n 5 dots -0.06\n 6 h_lines -0.06\n 7 high_lines -0.07\n 8 slant_down -0.07\n 9 slant_up -0.07\n10 star -0.06\n11 v_lines -0.07\n12 wide_lines -0.07\n13 x_shape -0.07\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndatasaurus_dozen |> \n ggplot(aes(x, y)) + \n geom_point() + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Datasaurus Dozen, a set of datasets with nearly identical summary statistics. The Datasaurus Dozen is a modern version of Anscombe's Quartet. 
It's actually a baker's dozen, but who's counting?](06-not-just-a-stats-problem_files/figure-html/fig-datasaurus-1.png){#fig-datasaurus width=672}\n:::\n:::\n\n\nIn causal inference, however, even visualization is insufficient to untangle causal effects.\nAs we saw with the DAGs in @sec-dags, background knowledge is required to infer causation from correlation [@onthei1999].\n\nInspired by Anscombe's quartet, the *causal quartet* has many of the same properties as Anscombe's quartet and the Datasaurus Dozen: the numerical summaries of the variables in the dataset are the same [@dagostinomcgowan2023].\nUnlike those data, the datasets in the causal quartet also *look* the same as each other.\nThe difference is the causal structure that generated each dataset.\n@fig-causal_quartet_hidden shows four datasets where the observational relationship between `exposure` and `outcome` is virtually identical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |> \n # hide the dataset names\n mutate(dataset = as.integer(factor(dataset))) |> \n group_by(dataset) |>\n mutate(exposure = scale(exposure), outcome = scale(outcome)) |> \n ungroup() |> \n ggplot(aes(exposure, outcome)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Causal Quartet, four data sets with nearly identical summary statistics and visualizations. The causal structure of each dataset is different, and data alone cannot tell us which is which.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet_hidden-1.png){#fig-causal_quartet_hidden width=672}\n:::\n:::\n\n\nThe question for each dataset is whether to adjust for a third variable, `covariate`.\nIs `covariate` a confounder?\nA mediator?\nA collider?\nWe can't use the data alone to figure this problem out.\nIn @tbl-quartet_lm, it's not clear which effect is correct.\nLikewise, the correlation between `exposure` and `covariate` is no help: they're all the same!\n\n\n::: {#tbl-quartet_lm .cell tbl-cap='The causal quartet, with the estimated effect of `exposure` on `outcome` with and without adjustment for `covariate`. The unadjusted estimate is identical for all four datasets, as is the correlation between `exposure` and `covariate`. The adjusted estimate varies. Without background knowledge, it\\'s not clear which is right.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
  <thead>
    <tr><th>Dataset</th><th>Not adjusting for covariate</th><th>Adjusting for covariate</th><th>Correlation of exposure and covariate</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>1.00</td><td>0.55</td><td>0.70</td></tr>
    <tr><td>2</td><td>1.00</td><td>0.50</td><td>0.70</td></tr>
    <tr><td>3</td><td>1.00</td><td>0.00</td><td>0.70</td></tr>
    <tr><td>4</td><td>1.00</td><td>0.88</td><td>0.70</td></tr>
  </tbody>
</table>
\n```\n\n:::\n:::\n\n\n::: callout-warning\n## The ten percent rule\n\nThe ten percent rule is a common technique in epidemiology and other fields to determine whether a variable is a confounder.\nIt says that you should include a variable in your model if doing so changes the effect estimate by more than ten percent.\nThe problem is, it doesn't work.\n*Every* example in the causal quartet shows a change of more than ten percent.\nAs we know, this leads to the wrong answer in some of the datasets.\nEven the reverse technique, *excluding* a variable when the change is *less* than ten percent, can cause trouble because many minor confounding effects can add up to more considerable bias.\n
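\nHere's one way to compute that change yourself; the percent changes in @tbl-quartet_ten_percent come from fits like these:\n\n```r\nlibrary(dplyr)\nlibrary(purrr)\n\ncausal_quartet |>\n  group_split(dataset) |>\n  map_dbl(function(d) {\n    unadjusted <- coef(lm(outcome ~ exposure, data = d))[[\"exposure\"]]\n    adjusted <- coef(lm(outcome ~ exposure + covariate, data = d))[[\"exposure\"]]\n    abs((unadjusted - adjusted) / unadjusted) * 100\n  })\n```\n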
\n\n::: {#tbl-quartet_ten_percent .cell tbl-cap='The percent change in the coefficient for `exposure` when including `covariate` in the model.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
  <thead>
    <tr><th>Dataset</th><th>Percent change</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>44.6%</td></tr>
    <tr><td>2</td><td>49.7%</td></tr>
    <tr><td>3</td><td>99.8%</td></tr>
    <tr><td>4</td><td>12.5%</td></tr>
  </tbody>
</table>
\n```\n\n:::\n:::\n\n:::\n\nWhile the visual relationship between `covariate` and `exposure` is not identical between datasets, all have the same correlation.\nIn @fig-causal_quartet_covariate, the standardized relationship between the two is identical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |> \n # hide the dataset names\n mutate(dataset = as.integer(factor(dataset))) |> \n group_by(dataset) |> \n summarize(cor = round(cor(covariate, exposure), 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 4 × 2\n dataset cor\n <int> <dbl>\n1 1 0.7\n2 2 0.7\n3 3 0.7\n4 4 0.7\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |> \n # hide the dataset names\n mutate(dataset = as.integer(factor(dataset))) |> \n group_by(dataset) |>\n mutate(covariate = scale(covariate), exposure = scale(exposure)) |> \n ungroup() |> \n ggplot(aes(covariate, exposure)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset) \n```\n\n::: {.cell-output-display}\n![The correlation is the same in each dataset, but the visual relationship is not. However, the differences in the plots are not enough information to determine whether `covariate` is a confounder, mediator, or collider.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet_covariate-1.png){#fig-causal_quartet_covariate width=672}\n:::\n:::\n\n\n::: {.callout-tip}\n## Why did we standardize the coefficients?\n\nStandardizing numeric variables to have a mean of 0 and standard deviation of 1, as implemented in `scale()`, is a common technique in statistics. It's useful for a variety of reasons, but we chose to scale the variables here to emphasize the identical correlation between `covariate` and `exposure` in each dataset. If we didn't scale the variables, the correlation would be the same, but the plots would look different because their standard deviations are different. The beta coefficient in an OLS model is calculated with information about the covariance and the standard deviation of the variable, so scaling makes the coefficient identical to the Pearson correlation.\n\n@fig-causal_quartet_covariate_unscaled shows the unscaled relationship between `covariate` and `exposure`. Now, we see some differences: dataset 4 seems to have more variance in `covariate`, but that's not actionable information. In fact, it's a mathematical artifact of the data generating process.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |> \n # hide the dataset names\n mutate(dataset = as.integer(factor(dataset))) |> \n ggplot(aes(covariate, exposure)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![@fig-causal_quartet_covariate, unscaled](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet_covariate_unscaled-1.png){#fig-causal_quartet_covariate_unscaled width=672}\n:::\n:::\n\n:::\n\nLet's reveal the dataset labels, which represent the causal structure that generated each dataset.\nIn @fig-causal_quartet, `covariate` plays a different role in each dataset.\nIn 1 and 4, it's a collider (we *shouldn't* adjust for it).\nIn 2, it's a confounder (we *should* adjust for it).\nIn 3, it's a mediator (it depends on the research question).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |> \n ggplot(aes(exposure, outcome)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Causal Quartet, revealed. 
The first and last datasets are types of collider bias; we should *not* control for `covariate`. In the second dataset, `covariate` is a confounder, and we *should* control for it. In the third dataset, `covariate` is a mediator, and we should control for it if we want the direct effect, but not if we want the total effect.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet-1.png){#fig-causal_quartet width=672}\n:::\n:::\n\n\nWhat can we do if the data can't distinguish these causal structures?\nThe best answer is to have a good sense of the data-generating mechanism.\nIn @fig-quartet-dag, we show the DAG for each dataset.\nOnce we compile a DAG for each dataset, we only need to query the DAG for the correct adjustment set, assuming the DAG is right.\n\n\n::: {#fig-quartet-dag .cell layout-ncol=\"2\"}\n::: {.cell-output-display}\n![The DAG for dataset 1, where `covariate` (c) is a collider. We should *not* adjust for `covariate`, which is a descendant of `exposure` (e) and `outcome` (o).](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-1.png){#fig-quartet-dag-1 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 2, where `covariate` (c) is a confounder. `covariate` is a mutual cause of `exposure` (e) and `outcome` (o), representing a backdoor path, so we *must* adjust for it to get the right answer.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-2.png){#fig-quartet-dag-2 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 3, where `covariate` (c) is a mediator. `covariate` is a descendant of `exposure` (e) and a cause of `outcome` (o). The path through `covariate` is the indirect path, and the arrow from `exposure` to `outcome` is the direct path. We should adjust for `covariate` if we want the direct effect, but not if we want the total effect.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-3.png){#fig-quartet-dag-3 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 4, where `covariate` (c) is a collider via M-Bias. Although `covariate` happens before both `outcome` (o) and `exposure` (e), it's still a collider. We should *not* adjust for `covariate`, particularly since we can't control for the bias via `u1` and `u2`, which are unmeasured.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-4.png){#fig-quartet-dag-4 width=288}\n:::\n\nThe DAGs for the Causal Quartet.\n:::\n\n\nThe data generating mechanism[^06-not-just-a-stats-problem-1] in the DAGs matches what generated the datasets, so we can use the DAGs to determine the correct effect: unadjusted in datasets 1 and 4 and adjusted in dataset 2.\nFor dataset 3, it depends on which mediation effect we want: adjusted for the direct effect and unadjusted for the total effect.\n\n[^06-not-just-a-stats-problem-1]: See @dagostinomcgowan2023 for the models that generated the datasets.\n\n\n::: {#tbl-quartets_true_effects .cell tbl-cap='The data generating mechanism and true causal effects in each dataset. Sometimes, the unadjusted effect is the same, and sometimes it is not, depending on the mechanism and question.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
  <thead>
    <tr><th>Data generating mechanism</th><th>Correct causal model</th><th>Correct causal effect</th></tr>
  </thead>
  <tbody>
    <tr><td>(1) Collider</td><td>outcome ~ exposure</td><td>1</td></tr>
    <tr><td>(2) Confounder</td><td>outcome ~ exposure; covariate</td><td>0.5</td></tr>
    <tr><td>(3) Mediator</td><td>Direct effect: outcome ~ exposure; covariate, Total Effect: outcome ~ exposure</td><td>Direct effect: 0, Total effect: 1</td></tr>
    <tr><td>(4) M-Bias</td><td>outcome ~ exposure</td><td>1</td></tr>
  </tbody>
</table>
\n```\n\n:::\n:::\n\n\n## Time as a heuristic for causal structure\n\nHopefully, we have convinced you of the usefulness of DAGs.\nHowever, constructing correct DAGs is a challenging endeavor.\nIn the causal quartet, we knew the DAGs because we generated the data.\nIn real life, we need background knowledge to assemble a candidate causal structure.\nFor some questions, such background knowledge is not available.\nFor others, we may worry about the complexity of the causal structure, particularly when variables mutually evolve with each other, as in @fig-feedback-loop.\n\nOne heuristic is particularly useful when a DAG is incomplete or uncertain: time.\nBecause causality is temporal, a cause must precede an effect.\nMany, but not all, problems in deciding whether we should adjust for a confounder are solved by simply putting the variables in order by time.\nTime order is also one of the most critical assumptions you can visualize in a DAG, so it's an excellent place to start, regardless of the completeness of the DAG.\n\nConsider @fig-quartets-time-ordered-1, a time-ordered version of the collider DAG where the covariate is measured at both baseline and follow-up.\nThe original DAG actually represents the *second* measurement, where the covariate is a descendant of both the outcome and the exposure.\nIf, however, we control for the same covariate as measured at the start of the study (@fig-quartets-time-ordered-2), it cannot be a descendant of the outcome at follow-up because the outcome has yet to happen.\nThus, when you are missing background knowledge as to the causal structure of the covariate, you can use time ordering as a defensive measure to avoid bias.\nOnly control for variables that precede the outcome.\n\n\n::: {#fig-quartets-time-ordered .cell layout-ncol=\"2\"}\n::: {.cell-output-display}\n![In a time-ordered version of the collider DAG, controlling for the covariate at follow-up induces bias.](06-not-just-a-stats-problem_files/figure-html/fig-quartets-time-ordered-1.png){#fig-quartets-time-ordered-1 width=384}\n:::\n\n::: {.cell-output-display}\n![Conversely, controlling for the covariate as measured at baseline does not induce bias because it is not a descendant of the outcome.](06-not-just-a-stats-problem_files/figure-html/fig-quartets-time-ordered-2.png){#fig-quartets-time-ordered-2 width=384}\n:::\n\nA time-ordered version of the collider DAG where each variable is measured twice. 
Controlling for `covariate` at follow-up induces collider bias, but controlling for `covariate` at baseline does not.\n:::\n\n\n::: callout-warning\n## Don't adjust for the future\n\nThe time-ordering heuristic relies on a simple rule: don't adjust for the future.\n:::\n\nThe quartets package's `causal_quartet_time` has time-ordered measurements of each variable for the four datasets.\nEach has a `*_baseline` and `*_followup` measurement.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet_time\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 400 × 12\n covariate_baseline exposure_baseline\n <dbl> <dbl>\n 1 -0.0963 -1.43 \n 2 -1.11 0.0593 \n 3 0.647 0.370 \n 4 0.755 0.00471\n 5 1.19 0.340 \n 6 -0.588 -3.61 \n 7 -1.13 1.44 \n 8 0.689 1.02 \n 9 -1.49 -2.43 \n10 -2.78 -1.26 \n# ℹ 390 more rows\n# ℹ 10 more variables: outcome_baseline <dbl>,\n# covariate_followup <dbl>, exposure_followup <dbl>,\n# outcome_followup <dbl>, exposure_mid <dbl>,\n# covariate_mid <dbl>, outcome_mid <dbl>, u1 <dbl>,\n# u2 <dbl>, dataset <chr>\n```\n\n\n:::\n:::\n\n\nUsing the formula `outcome_followup ~ exposure_baseline + covariate_baseline` works for three out of four datasets.\nEven though `covariate_baseline` is only in the adjustment set for the second dataset, it's not a collider in two of the other datasets, so it's not a problem.\n
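\nHere's a sketch of the fits behind @tbl-quartet_time_adjusted, applying the same time-ordered formula to each dataset:\n\n```r\nlibrary(dplyr)\nlibrary(purrr)\n\ncausal_quartet_time |>\n  group_split(dataset) |>\n  map_dbl(function(d) {\n    coef(\n      lm(outcome_followup ~ exposure_baseline + covariate_baseline, data = d)\n    )[[\"exposure_baseline\"]]\n  })\n```\n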
\n\n::: {#tbl-quartet_time_adjusted .cell tbl-cap='The adjusted effect of `exposure_baseline` on `outcome_followup` in each dataset. The effect adjusted for `covariate_baseline` is correct for three out of four datasets.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
  <thead>
    <tr><th>Dataset</th><th>Adjusted effect</th><th>Truth</th></tr>
  </thead>
  <tbody>
    <tr><td>(1) Collider</td><td>1.00</td><td>1.00</td></tr>
    <tr><td>(2) Confounder</td><td>0.50</td><td>0.50</td></tr>
    <tr><td>(3) Mediator</td><td>1.00</td><td>1.00</td></tr>
    <tr><td>(4) M-Bias</td><td>0.88</td><td>1.00</td></tr>
  </tbody>
</table>
\n```\n\n:::\n:::\n\n\nWhere it fails is in dataset 4, the M-bias example.\nIn this case, `covariate_baseline` is still a collider because the collision occurs before both the exposure and the outcome.\nAs we discussed in @sec-m-bias, however, if you are in doubt about whether something is genuinely M-bias, it is better to adjust for it than not.\nConfounding bias tends to be worse, and meaningful M-bias is probably rare in real life.\nAs the actual causal structure deviates from perfect M-bias, the severity of the bias tends to decrease.\nSo, if it is clearly M-bias, don't adjust for the variable.\nIf it's not clear, adjust for it.\n\n::: callout-tip\nRemember as well that it is possible to block bias induced by adjusting for a collider in certain circumstances because collider bias is just another open path.\nIf we had `u1` and `u2`, we could control for `covariate` while blocking potential collider bias.\nIn other words, sometimes, when we open a path, we can close it again.\n\n\n::: {.cell}\n\n:::\n\n:::\n\n## Causal and Predictive Models, Revisited {#sec-causal-pred-revisit}\n\n### Prediction metrics\n\nPredictive measurements also fail to distinguish between the four datasets.\nIn @tbl-quartet_time_predictive, we show the difference in a couple of standard predictive metrics when we add `covariate` to the model.\nIn each dataset, `covariate` adds information to the model because it contains associational information about the outcome[^06-not-just-a-stats-problem-2].\nThe RMSE goes down, indicating a better fit, and the R^2^ goes up, showing more variance explained.\nThe coefficients for `covariate` represent the information about `outcome` it contains; they don't tell us from where in the causal structure that information originates.\nCorrelation isn't causation, and neither is prediction.\nIn the case of the collider dataset, `covariate` is not even a helpful prediction tool because you wouldn't have it at the time of prediction, given that it happens after the exposure and outcome.\n\n[^06-not-just-a-stats-problem-2]: For M-bias, including `covariate` in the model is helpful to the extent that it has information about `u2`, one of the causes of the outcome.\n In this case, the data generating mechanism was such that `covariate` contains more information from `u1` than `u2`, so it doesn't add as much predictive value.\n Random noise represents most of what `u2` doesn't account for.\n
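\nA sketch of how you might compute these differences yourself (our code, not necessarily the table's exact implementation):\n\n```r\nlibrary(dplyr)\nlibrary(purrr)\n\ncausal_quartet |>\n  group_split(dataset) |>\n  map(function(d) {\n    without <- lm(outcome ~ exposure, data = d)\n    with_cov <- lm(outcome ~ exposure + covariate, data = d)\n    tibble(\n      rmse_diff = sqrt(mean(residuals(with_cov)^2)) - sqrt(mean(residuals(without)^2)),\n      r2_diff = summary(with_cov)$r.squared - summary(without)$r.squared\n    )\n  }) |>\n  bind_rows()\n```\n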
<table>
<thead>
<tr><th>Dataset</th><th>RMSE</th><th>R<sup>2</sup></th></tr>
</thead>
<tbody>
<tr><td>(1) Collider</td><td>−0.14</td><td>0.12</td></tr>
<tr><td>(2) Confounder</td><td>−0.20</td><td>0.14</td></tr>
<tr><td>(3) Mediator</td><td>−0.48</td><td>0.37</td></tr>
<tr><td>(4) M-Bias</td><td>−0.01</td><td>0.01</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n\n### The Table Two Fallacy[^06-not-just-a-stats-problem-3]\n\n[^06-not-just-a-stats-problem-3]: If you recall, the Table Two Fallacy is named after the tendency in health research journals to have a complete set of model coefficients in the second table of an article.\n See @Westreich2013 for a detailed discussion of the Table Two Fallacy.\n\nRelatedly, model coefficients for variables *other* than those of the causes we're interested in can be difficult to interpret.\nIn a model with `outcome ~ exposure + covariate`, it's tempting to present the coefficient of `covariate` as well as `exposure`.\nThe problem, as discussed in @sec-pred-or-explain, is that the causal structure for the effect of `covariate` on `outcome` may differ from that of `exposure` on `outcome`.\nLet's consider a variation of the quartet DAGs with other variables.\n\nFirst, let's start with the confounder DAG.\nIn @fig-quartet_confounder, we see that `covariate` is a confounder.\nIf this DAG represents the complete causal structure for `outcome`, the model `outcome ~ exposure + covariate` will give an unbiased estimate of the effect on `outcome` for `exposure`, assuming we've met the other assumptions of the modeling process.\nThe adjustment set for `covariate`'s effect on `outcome` is empty, and `exposure` is not a collider, so controlling for it does not induce bias[^06-not-just-a-stats-problem-4].\nBut look again.\n`exposure` is a mediator for `covariate`'s effect on `outcome`; some of the total effect is mediated through `exposure`, while there is also a direct effect of `covariate` on `outcome`. **Both estimates are unbiased, but they are different *types* of estimates**. The effect of `exposure` on `outcome` is the *total effect* of that relationship, while the effect of `covariate` on `outcome` is the *direct effect*.\n\n[^06-not-just-a-stats-problem-4]: Additionally, OLS produces a *collapsible* effect.\n Other effects, like the odds and hazard ratios, are *non-collapsible*, meaning including unrelated variables in the model *can* change the effect estimate.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The DAG for dataset 2, where `covariate` is a confounder. If you look closely, you'll realize that, from the perspective of the effect of `covariate` on the `outcome`, `exposure` is a *mediator*.](06-not-just-a-stats-problem_files/figure-html/fig-quartet_confounder-1.png){#fig-quartet_confounder width=288}\n:::\n:::\n\n\nWhat if we add `q`, a mutual cause of `covariate` and `outcome`?\nIn @fig-quartet_confounder_q, the adjustment sets are still different.\nThe adjustment set for `outcome ~ exposure` is still the same: `{covariate}`.\nThe `outcome ~ covariate` adjustment set is `{q}`.\nIn other words, `q` is a confounder for `covariate`'s effect on `outcome`.\nThe model `outcome ~ exposure + covariate` will produce the correct effect for `exposure` but not for the direct effect of `covariate`.\nNow, we have a situation where `covariate` not only answers a different type of question than `exposure` but is also biased by the absence of `q`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A modification of the DAG for dataset 2, where `covariate` is a confounder. 
Now, the relationship between `covariate` and `outcome` is confounded by `q`, a variable not necessary to calculate the unbiased effect of `exposure` on `outcome`.](06-not-just-a-stats-problem_files/figure-html/fig-quartet_confounder_q-1.png){#fig-quartet_confounder_q width=336}\n:::\n:::\n\n\nSpecifying a single causal model is deeply challenging.\nHaving a single model answer multiple causal questions is exponentially more difficult.\nIf attempting to do so, apply the same scrutiny to both[^06-not-just-a-stats-problem-5] questions.\nIs it possible to have a single adjustment set that answers both questions?\nIf not, specify two models or forego one of the questions.\nIf so, you need to ensure that the estimates answer the correct question.\nWe'll also discuss *joint* causal effects in @sec-interaction.\n\n[^06-not-just-a-stats-problem-5]: Practitioners of *casual* inference will interpret *many* effects from a single model in this way, but we consider this an act of bravado.\n\nUnfortunately, algorithms for detecting adjustment sets for multiple exposures and effect types are not well-developed, so you may need to rely on your knowledge of the causal structure in determining the intersection of the adjustment sets.\n", + "markdown": "# Causal inference is not (just) a statistical problem {#sec-quartets}\n\n\n\n\n\n## The Causal Quartet\n\nWe now have the tools to look at something we've alluded to thus far in the book: causal inference is not (just) a statistical problem.\nOf course, we use statistics to answer causal questions.\nIt's necessary to answer most questions, even if the statistics are basic (as they often are in randomized designs).\nHowever, statistics alone do not allow us to address all of the assumptions of causal inference.\n\nIn 1973, Francis Anscombe introduced a set of four datasets called *Anscombe's Quartet*.\nThese data illustrated an important lesson: summary statistics alone cannot help you understand data; you must also visualize your data.\nIn the plots in @fig-anscombe, each data set has remarkably similar summary statistics, including means and correlations that are nearly identical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(quartets)\n\nanscombe_quartet |> \n ggplot(aes(x, y)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![Anscombe's Quartet, a set of four datasets with nearly identical summary statistics. 
Anscombe's point was that one must visualize the data to understand it.](06-not-just-a-stats-problem_files/figure-html/fig-anscombe-1.png){#fig-anscombe width=672}\n:::\n:::\n\n\nThe Datasaurus Dozen is a modern take on Anscombe's Quartet.\nThe mean, standard deviation, and correlation are nearly identical in each dataset, but the visualizations are very different.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(datasauRus)\n\n# roughly the same correlation in each dataset\ndatasaurus_dozen |>\n  group_by(dataset) |>\n  summarize(cor = round(cor(x, y), 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 13 × 2\n   dataset        cor\n   <chr>        <dbl>\n 1 away         -0.06\n 2 bullseye     -0.07\n 3 circle       -0.07\n 4 dino         -0.06\n 5 dots         -0.06\n 6 h_lines      -0.06\n 7 high_lines   -0.07\n 8 slant_down   -0.07\n 9 slant_up     -0.07\n10 star         -0.06\n11 v_lines      -0.07\n12 wide_lines   -0.07\n13 x_shape      -0.07\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndatasaurus_dozen |>\n  ggplot(aes(x, y)) +\n  geom_point() +\n  facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Datasaurus Dozen, a set of datasets with nearly identical summary statistics. The Datasaurus Dozen is a modern version of Anscombe's Quartet. It's actually a baker's dozen, but who's counting?](06-not-just-a-stats-problem_files/figure-html/fig-datasaurus-1.png){#fig-datasaurus width=672}\n:::\n:::\n\n\nIn causal inference, however, even visualization is insufficient to untangle causal effects.\nAs we visualized in DAGs in @sec-dags, background knowledge is required to infer causation from correlation [@onthei1999].\n\nInspired by Anscombe's quartet, the *causal quartet* has many of the same properties as Anscombe's quartet and the Datasaurus Dozen: the numerical summaries of the variables in each dataset are the same [@dagostinomcgowan2023].\nUnlike those datasets, however, the causal quartet datasets also *look* the same as each other.\nThe difference is the causal structure that generated each dataset.\n@fig-causal_quartet_hidden shows four datasets where the observational relationship between `exposure` and `outcome` is virtually identical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |>\n  # hide the dataset names\n  mutate(dataset = as.integer(factor(dataset))) |>\n  group_by(dataset) |>\n  mutate(exposure = scale(exposure), outcome = scale(outcome)) |>\n  ungroup() |>\n  ggplot(aes(exposure, outcome)) +\n  geom_point() +\n  geom_smooth(method = \"lm\", se = FALSE) +\n  facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Causal Quartet, four datasets with nearly identical summary statistics and visualizations. The causal structure of each dataset is different, and data alone cannot tell us which is which.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet_hidden-1.png){#fig-causal_quartet_hidden width=672}\n:::\n:::\n\n\nThe question for each dataset is whether to adjust for a third variable, `covariate`.\nIs `covariate` a confounder?\nA mediator?\nA collider?\nWe can't use data to figure this problem out.\nIn @tbl-quartet_lm, it's not clear which effect is correct.\nLikewise, the correlation between `exposure` and `covariate` is no help: they're all the same!\n\n\n::: {#tbl-quartet_lm .cell tbl-cap='The causal quartet, with the estimated effect of `exposure` on `outcome` with and without adjustment for `covariate`. The unadjusted estimate is identical for all four datasets, as is the correlation between `exposure` and `covariate`. The adjusted estimate varies. 
Without background knowledge, it\\'s not clear which is right.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>Dataset</th><th>Not adjusting for covariate</th><th>Adjusting for covariate</th><th>Correlation of exposure and covariate</th></tr>
</thead>
<tbody>
<tr><td>1</td><td>1.00</td><td>0.55</td><td>0.70</td></tr>
<tr><td>2</td><td>1.00</td><td>0.50</td><td>0.70</td></tr>
<tr><td>3</td><td>1.00</td><td>0.00</td><td>0.70</td></tr>
<tr><td>4</td><td>1.00</td><td>0.88</td><td>0.70</td></tr>
</tbody>
</table>\n```\n\n:::\n:::
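
To make these numbers concrete, here is a minimal sketch of how estimates like those in @tbl-quartet_lm can be computed. This is illustrative code, not necessarily what generated the table, and the `quartet_fits` name is ours: we fit `lm()` with and without `covariate` within each dataset and keep the coefficient for `exposure`.

::: {.cell}

```{.r .cell-code}
library(dplyr)
library(quartets)

# fit both models within each dataset and keep the `exposure` coefficient
quartet_fits <- causal_quartet |>
  nest_by(dataset) |>
  summarize(
    # the unadjusted estimate: outcome ~ exposure
    unadjusted = coef(lm(outcome ~ exposure, data = data))[\"exposure\"],
    # the adjusted estimate: outcome ~ exposure + covariate
    adjusted = coef(
      lm(outcome ~ exposure + covariate, data = data)
    )[\"exposure\"],
    .groups = \"drop\"
  )

quartet_fits
```
:::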
\n\n\n::: callout-warning\n## The ten percent rule\n\nThe ten percent rule is a common technique in epidemiology and other fields to determine whether a variable is a confounder.\nIt says that you should include a variable in your model if including it changes the effect estimate by more than ten percent.\nThe problem is, it doesn't work.\n*Every* example in the causal quartet changes the effect estimate by more than ten percent.\nAs we know, this leads to the wrong answer in some of the datasets.\nEven the reverse technique, *excluding* a variable when the change is *less* than ten percent, can cause trouble because many minor confounding effects can add up to considerable bias.
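
Reusing the `quartet_fits` sketch from above, the rule's arithmetic is just the absolute percent change in the `exposure` coefficient:

::: {.cell}

```{.r .cell-code}
# |adjusted - unadjusted| / |unadjusted|, here as a proportion
quartet_fits |>
  mutate(
    percent_change = abs((adjusted - unadjusted) / unadjusted)
  )
```
:::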
\n\n\n::: {#tbl-quartet_ten_percent .cell tbl-cap='The percent change in the coefficient for `exposure` when including `covariate` in the model.'}\n::: {.cell-output-display}\n\n```{=html}\n<table>
<thead>
<tr><th>Dataset</th><th>Percent change</th></tr>
</thead>
<tbody>
<tr><td>1</td><td>44.6%</td></tr>
<tr><td>2</td><td>49.7%</td></tr>
<tr><td>3</td><td>99.8%</td></tr>
<tr><td>4</td><td>12.5%</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n:::\n\nWhile the visual relationship between `covariate` and `exposure` is not identical between datasets, all have the same correlation.\nIn @fig-causal_quartet_covariate, the standardized relationship between the two is identical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |>\n  # hide the dataset names\n  mutate(dataset = as.integer(factor(dataset))) |>\n  group_by(dataset) |>\n  summarize(cor = round(cor(covariate, exposure), 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 4 × 2\n  dataset   cor\n    <int> <dbl>\n1       1   0.7\n2       2   0.7\n3       3   0.7\n4       4   0.7\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |>\n  # hide the dataset names\n  mutate(dataset = as.integer(factor(dataset))) |>\n  group_by(dataset) |>\n  mutate(covariate = scale(covariate), exposure = scale(exposure)) |>\n  ungroup() |>\n  ggplot(aes(covariate, exposure)) +\n  geom_point() +\n  geom_smooth(method = \"lm\", se = FALSE) +\n  facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The correlation is the same in each dataset, but the visual relationship is not. However, the differences in the plots are not enough information to determine whether `covariate` is a confounder, mediator, or collider.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet_covariate-1.png){#fig-causal_quartet_covariate width=672}\n:::\n:::\n\n\n::: {.callout-tip}\n## Why did we standardize the coefficients?\n\nStandardizing numeric variables to have a mean of 0 and standard deviation of 1, as implemented in `scale()`, is a common technique in statistics. It's useful for a variety of reasons, but we chose to scale the variables here to emphasize the identical correlation between `covariate` and `exposure` in each dataset. If we didn't scale the variables, the correlation would be the same, but the plots would look different because their standard deviations are different. The beta coefficient in an OLS model is calculated from the covariance and the standard deviations of the variables, so scaling them makes the coefficient identical to the Pearson correlation.\n\n@fig-causal_quartet_covariate_unscaled shows the unscaled relationship between `covariate` and `exposure`. Now, we see some differences: dataset 4 seems to have more variance in `covariate`, but that's not actionable information. In fact, it's a mathematical artifact of the data generating process.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |>\n  # hide the dataset names\n  mutate(dataset = as.integer(factor(dataset))) |>\n  ggplot(aes(covariate, exposure)) +\n  geom_point() +\n  geom_smooth(method = \"lm\", se = FALSE) +\n  facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![@fig-causal_quartet_covariate, unscaled](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet_covariate_unscaled-1.png){#fig-causal_quartet_covariate_unscaled width=672}\n:::\n:::\n\n:::\n\nLet's reveal the labels of the datasets, representing the causal structure of each dataset.\nIn @fig-causal_quartet, `covariate` plays a different role in each dataset.\nIn 1 and 4, it's a collider (we *shouldn't* adjust for it).\nIn 2, it's a confounder (we *should* adjust for it).\nIn 3, it's a mediator (it depends on the research question).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |>\n  ggplot(aes(exposure, outcome)) +\n  geom_point() +\n  geom_smooth(method = \"lm\", se = FALSE) +\n  facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Causal Quartet, revealed. 
The first and last datasets are types of collider bias; we should *not* control for `covariate`. In the second dataset, `covariate` is a confounder, and we *should* control for it. In the third dataset, `covariate` is a mediator, and we should control for it if we want the direct effect, but not if we want the total effect.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet-1.png){#fig-causal_quartet width=672}\n:::\n:::\n\n\nWhat can we do if the data can't distinguish these causal structures?\nThe best answer is to have a good sense of the data-generating mechanism.\nIn @fig-quartet-dag, we show the DAG for each dataset.\nOnce we compile a DAG for each dataset, we only need to query the DAG for the correct adjustment set, assuming the DAG is right.\n\n\n::: {#fig-quartet-dag .cell layout-ncol=\"2\"}\n::: {.cell-output-display}\n![The DAG for dataset 1, where `covariate` (c) is a collider. We should *not* adjust for `covariate`, which is a descendant of `exposure` (e) and `outcome` (o).](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-1.png){#fig-quartet-dag-1 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 2, where `covariate` (c) is a confounder. `covariate` is a mutual cause of `exposure` (e) and `outcome` (o), representing a backdoor path, so we *must* adjust for it to get the right answer.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-2.png){#fig-quartet-dag-2 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 3, where `covariate` (c) is a mediator. `covariate` is a descendant of `exposure` (e) and a cause of `outcome` (o). The path through `covariate` is the indirect path, and the path through `exposure` is the direct path. We should adjust for `covariate` if we want the direct effect, but not if we want the total effect.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-3.png){#fig-quartet-dag-3 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 4, where `covariate` (c) is a collider via M-Bias. Although `covariate` happens before both `outcome` (o) and `exposure` (e), it's still a collider. We should *not* adjust for `covariate`, particularly since we can't control for the bias via `u1` and `u2`, which are unmeasured.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-4.png){#fig-quartet-dag-4 width=288}\n:::\n\nThe DAGs for the Causal Quartet.\n:::\n\n\nThe data generating mechanism[^06-not-just-a-stats-problem-1] in the DAGs matches what generated the datasets, so we can use the DAGs to determine the correct effect: unadjusted in datasets 1 and 4 and adjusted in dataset 2.\nFor dataset 3, it depends on which mediation effect we want: adjusted for the direct effect and unadjusted for the total effect.\n\n[^06-not-just-a-stats-problem-1]: See @dagostinomcgowan2023 for the models that generated the datasets.
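
Querying a DAG for an adjustment set is also something we can do in code rather than by eye. As a sketch (the node names here are ours), dataset 2's structure can be written down with the `{ggdag}` package's `dagify()` and queried with `{dagitty}`:

::: {.cell}

```{.r .cell-code}
library(ggdag)

# dataset 2: `covariate` causes both `exposure` and `outcome`
confounder_dag <- dagify(
  outcome ~ exposure + covariate,
  exposure ~ covariate,
  exposure = \"exposure\",
  outcome = \"outcome\"
)

# for this DAG, the minimal adjustment set is { covariate }
dagitty::adjustmentSets(confounder_dag)
```
:::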
\n\n\n::: {#tbl-quartets_true_effects .cell tbl-cap='The data generating mechanism and true causal effects in each dataset. Sometimes, the unadjusted effect is the same, and sometimes it is not, depending on the mechanism and question.'}\n::: {.cell-output-display}\n\n```{=html}\n<table>
<thead>
<tr><th>Data generating mechanism</th><th>Correct causal model</th><th>Correct causal effect</th></tr>
</thead>
<tbody>
<tr><td>(1) Collider</td><td>outcome ~ exposure</td><td>1</td></tr>
<tr><td>(2) Confounder</td><td>outcome ~ exposure; covariate</td><td>0.5</td></tr>
<tr><td>(3) Mediator</td><td>Direct effect: outcome ~ exposure; covariate, Total effect: outcome ~ exposure</td><td>Direct effect: 0, Total effect: 1</td></tr>
<tr><td>(4) M-Bias</td><td>outcome ~ exposure</td><td>1</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n\n## Time as a heuristic for causal structure\n\nHopefully, we have convinced you of the usefulness of DAGs.\nHowever, constructing correct DAGs is a challenging endeavor.\nIn the causal quartet, we knew the DAGs because we generated the data.\nWe need background knowledge to assemble a candidate causal structure in real life.\nFor some questions, such background knowledge is not available.\nFor others, we may worry about the complexity of the causal structure, particularly when variables mutually evolve with each other, as in @fig-feedback-loop.\n\nOne heuristic is particularly useful when a DAG is incomplete or uncertain: time.\nBecause causality is temporal, a cause must precede an effect.\nMany, but not all, problems in deciding if we should adjust for a confounder are solved by simply putting the variables in order by time.\nTime order is also one of the most critical assumptions you can visualize in a DAG, so it's an excellent place to start, regardless of the completeness of the DAG.\n\nConsider @fig-quartets-time-ordered-1, a time-ordered version of the collider DAG where the covariate is measured at both baseline and follow-up.\nThe original DAG actually represents the *second* measurement, where the covariate is a descendant of both the outcome and exposure.\nIf, however, we control for the same covariate as measured at the start of the study (@fig-quartets-time-ordered-2), it cannot be a descendant of the outcome at follow-up because it has yet to happen.\nThus, when you are missing background knowledge as to the causal structure of the covariate, you can use time-ordering as a defensive measure to avoid bias.\nOnly control for variables that precede the outcome.\n\n\n::: {#fig-quartets-time-ordered .cell layout-ncol=\"2\"}\n::: {.cell-output-display}\n![In a time-ordered version of the collider DAG, controlling for the covariate at follow-up induces bias.](06-not-just-a-stats-problem_files/figure-html/fig-quartets-time-ordered-1.png){#fig-quartets-time-ordered-1 width=384}\n:::\n\n::: {.cell-output-display}\n![Conversely, controlling for the covariate as measured at baseline does not induce bias because it is not a descendant of the outcome.](06-not-just-a-stats-problem_files/figure-html/fig-quartets-time-ordered-2.png){#fig-quartets-time-ordered-2 width=384}\n:::\n\nA time-ordered version of the collider DAG where each variable is measured twice. 
`covariate` at follow-up is a collider, but `covariate` at baseline is not.\n:::\n\n\n::: callout-warning\n## Don't adjust for the future\n\nThe time-ordering heuristic relies on a simple rule: don't adjust for the future.\n:::\n\nThe {quartets} package's `causal_quartet_time` dataset has time-ordered measurements of each variable for the four datasets.\nEach has a `*_baseline` and `*_followup` measurement.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet_time\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 400 × 12\n   covariate_baseline exposure_baseline\n                <dbl>             <dbl>\n 1            -0.0963          -1.43   \n 2            -1.11             0.0593 \n 3             0.647            0.370  \n 4             0.755            0.00471\n 5             1.19             0.340  \n 6            -0.588           -3.61   \n 7            -1.13             1.44   \n 8             0.689            1.02   \n 9            -1.49            -2.43   \n10            -2.78            -1.26   \n# ℹ 390 more rows\n# ℹ 10 more variables: outcome_baseline <dbl>,\n#   covariate_followup <dbl>, exposure_followup <dbl>,\n#   outcome_followup <dbl>, exposure_mid <dbl>,\n#   covariate_mid <dbl>, outcome_mid <dbl>, u1 <dbl>,\n#   u2 <dbl>, dataset <chr>\n```\n\n\n:::\n:::\n\n\nUsing the formula `outcome_followup ~ exposure_baseline + covariate_baseline` works for three out of four datasets.\nEven though `covariate_baseline` is only in the adjustment set for the second dataset, it's not a collider in two of the other datasets, so it's not a problem.
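
As a sketch of what that looks like in code (illustrative, not necessarily what produced @tbl-quartet_time_adjusted), we can fit the time-ordered model within each dataset and keep the coefficient for `exposure_baseline`:

::: {.cell}

```{.r .cell-code}
library(dplyr)
library(quartets)

# fit the time-ordered model in each dataset
causal_quartet_time |>
  nest_by(dataset) |>
  summarize(
    adjusted_effect = coef(
      lm(
        outcome_followup ~ exposure_baseline + covariate_baseline,
        data = data
      )
    )[\"exposure_baseline\"],
    .groups = \"drop\"
  )
```
:::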
\n\n\n::: {#tbl-quartet_time_adjusted .cell tbl-cap='The adjusted effect of `exposure_baseline` on `outcome_followup` in each dataset. The effect adjusted for `covariate_baseline` is correct for three out of four datasets.'}\n::: {.cell-output-display}\n\n```{=html}\n<table>
<thead>
<tr><th>Dataset</th><th>Adjusted effect</th><th>Truth</th></tr>
</thead>
<tbody>
<tr><td>(1) Collider</td><td>1.00</td><td>1.00</td></tr>
<tr><td>(2) Confounder</td><td>0.50</td><td>0.50</td></tr>
<tr><td>(3) Mediator</td><td>1.00</td><td>1.00</td></tr>
<tr><td>(4) M-Bias</td><td>0.88</td><td>1.00</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n\nWhere it fails is in dataset 4, the M-bias example.\nIn this case, `covariate_baseline` is still a collider because the collision occurs before both the exposure and outcome.\nAs we discussed in @sec-m-bias, however, if you are in doubt whether something is genuinely M-bias, it is better to adjust for it than not.\nConfounding bias tends to be worse, and meaningful M-bias is probably rare in real life.\nAs the actual causal structure deviates from perfect M-bias, the severity of the bias tends to decrease.\nSo, if it is clearly M-bias, don't adjust for the variable.\nIf it's not clear, adjust for it.\n\n::: callout-tip\nRemember as well that it is possible to block bias induced by adjusting for a collider in certain circumstances because collider bias is just another open path.\nIf we had `u1` and `u2`, we could control for `covariate` while blocking potential collider bias.\nIn other words, sometimes, when we open a path, we can close it again.\n\n\n::: {.cell}\n\n:::\n\n:::\n\n## Causal and Predictive Models, Revisited {#sec-causal-pred-revisit}\n\n### Prediction metrics\n\nPredictive metrics also fail to distinguish between the four datasets.\nIn @tbl-quartet_time_predictive, we show the difference in a couple of standard predictive metrics when we add `covariate` to the model.\nIn each dataset, `covariate` adds information to the model because it contains associational information about the outcome[^06-not-just-a-stats-problem-2].\nThe RMSE goes down, indicating a better fit, and the R^2^ goes up, showing more variance explained.\nThe coefficients for `covariate` represent the information about `outcome` it contains; they don't tell us from where in the causal structure that information originates.\nCorrelation isn't causation, and neither is prediction.\nIn the case of the collider dataset, it's not even a helpful prediction tool because you wouldn't have `covariate` at the time of prediction, given that it happens after the exposure and outcome.\n\n[^06-not-just-a-stats-problem-2]: For M-bias, including `covariate` in the model is helpful to the extent that it has information about `u2`, one of the causes of the outcome.\n In this case, the data generating mechanism was such that `covariate` contains more information from `u1` than `u2`, so it doesn't add as much predictive value.\n Random noise represents most of what `u2` doesn't account for.
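
Metrics like these are straightforward to compute from in-sample fits. A minimal sketch, using `causal_quartet` for illustration (the `rmse()` helper is ours, and the table's exact values may come from different code):

::: {.cell}

```{.r .cell-code}
library(dplyr)
library(quartets)

# in-sample root mean squared error of a fitted model
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

causal_quartet |>
  nest_by(dataset) |>
  summarize(
    # change in RMSE when `covariate` is added (negative = better fit)
    rmse_change = rmse(lm(outcome ~ exposure + covariate, data = data)) -
      rmse(lm(outcome ~ exposure, data = data)),
    # change in R^2 (positive = more variance explained)
    r2_change =
      summary(lm(outcome ~ exposure + covariate, data = data))$r.squared -
      summary(lm(outcome ~ exposure, data = data))$r.squared,
    .groups = \"drop\"
  )
```
:::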
\n\n\n::: {#tbl-quartet_time_predictive .cell tbl-cap='The difference in predictive metrics on `outcome` in each dataset with and without `covariate`. In each dataset, `covariate` adds information to the model, but this offers little guidance regarding the proper causal model.'}\n::: {.cell-output-display}\n\n```{=html}\n<table>
<thead>
<tr><th>Dataset</th><th>RMSE</th><th>R<sup>2</sup></th></tr>
</thead>
<tbody>
<tr><td>(1) Collider</td><td>−0.14</td><td>0.12</td></tr>
<tr><td>(2) Confounder</td><td>−0.20</td><td>0.14</td></tr>
<tr><td>(3) Mediator</td><td>−0.48</td><td>0.37</td></tr>
<tr><td>(4) M-Bias</td><td>−0.01</td><td>0.01</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n\n### The Table Two Fallacy[^06-not-just-a-stats-problem-3]\n\n[^06-not-just-a-stats-problem-3]: If you recall, the Table Two Fallacy is named after the tendency in health research journals to have a complete set of model coefficients in the second table of an article.\n See @Westreich2013 for a detailed discussion of the Table Two Fallacy.\n\nRelatedly, model coefficients for variables *other* than those of the causes we're interested in can be difficult to interpret.\nIn a model with `outcome ~ exposure + covariate`, it's tempting to present the coefficient of `covariate` as well as `exposure`.\nThe problem, as discussed in @sec-pred-or-explain, is that the causal structure for the effect of `covariate` on `outcome` may differ from that of `exposure` on `outcome`.\nLet's consider a variation of the quartet DAGs with other variables.\n\nFirst, let's start with the confounder DAG.\nIn @fig-quartet_confounder, we see that `covariate` is a confounder.\nIf this DAG represents the complete causal structure for `outcome`, the model `outcome ~ exposure + covariate` will give an unbiased estimate of the effect on `outcome` for `exposure`, assuming we've met the other assumptions of the modeling process.\nThe adjustment set for `covariate`'s effect on `outcome` is empty, and `exposure` is not a collider, so controlling for it does not induce bias[^06-not-just-a-stats-problem-4].\nBut look again.\n`exposure` is a mediator for `covariate`'s effect on `outcome`; some of the total effect is mediated through `exposure`, while there is also a direct effect of `covariate` on `outcome`. **Both estimates are unbiased, but they are different *types* of estimates**. The effect of `exposure` on `outcome` is the *total effect* of that relationship, while the effect of `covariate` on `outcome` is the *direct effect*.\n\n[^06-not-just-a-stats-problem-4]: Additionally, OLS produces a *collapsible* effect.\n Other effects, like the odds and hazard ratios, are *non-collapsible*, meaning including unrelated variables in the model *can* change the effect estimate.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The DAG for dataset 2, where `covariate` is a confounder. If you look closely, you'll realize that, from the perspective of the effect of `covariate` on the `outcome`, `exposure` is a *mediator*.](06-not-just-a-stats-problem_files/figure-html/fig-quartet_confounder-1.png){#fig-quartet_confounder width=288}\n:::\n:::\n\n\nWhat if we add `q`, a mutual cause of `covariate` and `outcome`?\nIn @fig-quartet_confounder_q, the adjustment sets are still different.\nThe adjustment set for `outcome ~ exposure` is still the same: `{covariate}`.\nThe `outcome ~ covariate` adjustment set is `{q}`.\nIn other words, `q` is a confounder for `covariate`'s effect on `outcome`.\nThe model `outcome ~ exposure + covariate` will produce the correct effect for `exposure` but not for the direct effect of `covariate`.\nNow, we have a situation where `covariate` not only answers a different type of question than `exposure` but is also biased by the absence of `q`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A modification of the DAG for dataset 2, where `covariate` is a confounder. 
Now, the relationship between `covariate` and `outcome` is confounded by `q`, a variable not necessary to calculate the unbiased effect of `exposure` on `outcome`.](06-not-just-a-stats-problem_files/figure-html/fig-quartet_confounder_q-1.png){#fig-quartet_confounder_q width=336}\n:::\n:::\n\n\nSpecifying a single causal model is deeply challenging.\nHaving a single model answer multiple causal questions is exponentially more difficult.\nIf attempting to do so, apply the same scrutiny to both[^06-not-just-a-stats-problem-5] questions.\nIs it possible to have a single adjustment set that answers both questions?\nIf not, specify two models or forego one of the questions.\nIf so, you need to ensure that the estimates answer the correct question.\nWe'll also discuss *joint* causal effects in @sec-interaction.\n\n[^06-not-just-a-stats-problem-5]: Practitioners of *casual* inference will interpret *many* effects from a single model in this way, but we consider this an act of bravado.\n\nUnfortunately, algorithms for detecting adjustment sets for multiple exposures and effect types are not well-developed, so you may need to rely on your knowledge of the causal structure in determining the intersection of the adjustment sets.\n", "supporting": [ "06-not-just-a-stats-problem_files" ], diff --git a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-1.png b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-1.png index 8a50ada5..77039e56 100644 Binary files a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-1.png and b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-1.png differ diff --git a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-2.png b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-2.png index ac9d7013..7746ebae 100644 Binary files a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-2.png and b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-2.png differ diff --git a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-3.png b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-3.png index 70bba7e9..f779f8b2 100644 Binary files a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-3.png and b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-3.png differ diff --git a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-4.png b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-4.png index 0a25aa28..d9bcd3ea 100644 Binary files a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-4.png and b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet-dag-4.png differ diff --git a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet_confounder-1.png b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet_confounder-1.png index 71dadf5e..b7f8da1e 100644 Binary files a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet_confounder-1.png and b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartet_confounder-1.png differ diff --git a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartets-time-ordered-1.png b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartets-time-ordered-1.png index 22945f31..4a9939ea 100644 Binary files 
a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartets-time-ordered-1.png and b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartets-time-ordered-1.png differ diff --git a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartets-time-ordered-2.png b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartets-time-ordered-2.png index 9fd1cf8a..6078ca96 100644 Binary files a/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartets-time-ordered-2.png and b/_freeze/chapters/06-not-just-a-stats-problem/figure-html/fig-quartets-time-ordered-2.png differ diff --git a/_freeze/chapters/07-prep-data/execute-results/html.json b/_freeze/chapters/07-prep-data/execute-results/html.json index 72ad0d35..d5bbe6ea 100644 --- a/_freeze/chapters/07-prep-data/execute-results/html.json +++ b/_freeze/chapters/07-prep-data/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "e3c60b0eddf12af85ad025f1e4a019b1", + "hash": "976d0b109b0ac3177487eaca8cad63ba", "result": { - "markdown": "# Preparing data to answer causal questions {#sec-data-causal}\n\n\n\n\n\n## Introduction to the data {#sec-data}\n\nThroughout this book we will be using data obtained from [Touring Plans](https://touringplans.com).\nTouring Plans is a company that helps folks plan their trips to Disney and Universal theme parks.\nOne of their goals is to accurately predict attraction wait times at these theme parks by leveraging data and statistical modeling.\nThe `{touringplans}` R package includes several datasets containing information about Disney theme park attractions.\nA summary of the attractions included in the package can be found by running the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(touringplans)\nattractions_metadata\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 14 × 8\n dataset_name name short_name park land opened_on \n \n 1 alien_sauce… Alie… Alien Sau… Disn… Toy … 2018-06-30\n 2 dinosaur DINO… DINOSAUR Disn… Dino… 1998-04-22\n 3 expedition_… Expe… Expeditio… Disn… Asia 2006-04-07\n 4 flight_of_p… Avat… Flight of… Disn… Pand… 2017-05-27\n 5 kilimanjaro… Kili… Kilimanja… Disn… Afri… 1998-04-22\n 6 navi_river Na'v… Na'vi Riv… Disn… Pand… 2017-05-27\n 7 pirates_of_… Pira… Pirates o… Magi… Adve… 1973-12-17\n 8 rock_n_roll… Rock… Rock Coas… Disn… Suns… 1999-07-29\n 9 seven_dwarf… Seve… 7 Dwarfs … Magi… Fant… 2014-05-28\n10 slinky_dog Slin… Slinky Dog Disn… Toy … 2018-06-30\n11 soarin Soar… Soarin' Epcot Worl… 2005-05-05\n12 spaceship_e… Spac… Spaceship… Epcot Worl… 1982-10-01\n13 splash_moun… Spla… Splash Mo… Magi… Fron… 1992-07-17\n14 toy_story_m… Toy … Toy Story… Disn… Toy … 2008-05-31\n# ℹ 2 more variables: duration ,\n# average_wait_per_hundred \n```\n\n\n:::\n:::\n\n\nAdditionally, this package contains a dataset with raw metadata about the parks, with observations recorded daily.\nThis metadata includes information like the Walt Disney World ticket season on the particular day (was it high season -- think Christmas -- or low season -- think right when school started), what the historic temperatures were in the park on that day, and whether there was a special event, such as \"extra magic hours\" in the park on that day (did the park open early to guests staying in the Walt Disney World resorts?).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparks_metadata_raw\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2,079 × 181\n date wdw_ticket_season dayofweek dayofyear\n \n 1 2015-01-01 5 0\n 2 2015-01-02 6 1\n 3 
2015-01-03 7 2\n 4 2015-01-04 1 3\n 5 2015-01-05 2 4\n 6 2015-01-06 3 5\n 7 2015-01-07 4 6\n 8 2015-01-08 5 7\n 9 2015-01-09 6 8\n10 2015-01-10 7 9\n# ℹ 2,069 more rows\n# ℹ 177 more variables: weekofyear ,\n# monthofyear , year , season ,\n# holidaypx , holidaym , holidayn ,\n# holiday , wdwticketseason ,\n# wdwracen , wdweventn , wdwevent ,\n# wdwrace , wdwseason , …\n```\n\n\n:::\n:::\n\n\nSuppose the causal question of interest is:\n\n**Is there a relationship between whether there were \"Extra Magic Hours\" in the morning at Magic Kingdom and the average wait time for an attraction called the \"Seven Dwarfs Mine Train\" the same day between 9am and 10am in 2018?**\n\nLet's begin by diagramming this causal question (@fig-seven-diag).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Diagram of the causal question \"Is there a relationship between whether there were \"Extra Magic Hours\" in the morning at Magic Kingdom and the average wait time for an attraction called the \"Seven Dwarfs Mine Train\" the same day between 9am and 10am in 2018?\"](07-prep-data_files/figure-html/fig-seven-diag-1.png){#fig-seven-diag width=672}\n:::\n:::\n\n\nHistorically, guests who stayed in a Walt Disney World resort hotel could access the park during \"Extra Magic Hours,\" during which the park was closed to all other guests.\nThese extra hours could be in the morning or evening.\nThe Seven Dwarfs Mine Train is a ride at Walt Disney World's Magic Kingdom.\nMagic Kingdom may or may not be selected each day to have these \"Extra Magic Hours.\" We are interested in examining the relationship between whether there were \"Extra Magic Hours\" in the morning and the average wait time for the Seven Dwarfs Mine Train on the same day between 9 am and 10 am.\nBelow is a proposed DAG for this question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Proposed DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. Here we are saying that we believe 1) Extra Magic Hours impacts average wait time and 2) both Extra Magic Hours and average wait time are determined by the time the park closes, historic high temperatures, and ticket season.](07-prep-data_files/figure-html/fig-dag-magic-1.png){#fig-dag-magic width=672}\n:::\n:::\n\n\nSince we are not in charge of Walt Disney World's operations, we cannot randomize dates to have (or not) \"Extra Magic Hours\"; therefore, we need to rely on previously collected observational data and do our best to emulate the *target trial* that we would have created, should it have been possible.\nHere, our observations are *days*.\nLooking at the diagram above, we can map each element of the causal question to elements of our target trial protocol:\n\n- **Eligibility criteria**: Days must be from 2018\n- **Exposure definition**: Magic Kingdom had \"Extra Magic Hours\" in the morning\n- **Assignment procedures**: Observed -- if the historic data suggest there were \"Extra Magic Hours\" in the morning on a particular day, that day is classified as \"exposed\"; otherwise, it is \"unexposed\"\n- **Follow-up period**: From park open to 10am\n- **Outcome definition**: The average posted wait time between 9am and 10am\n- **Causal contrast of interest**: Average treatment effect (we will discuss this in @sec-estimands)\n- **Analysis plan**: We use inverse probability weighting after fitting a propensity score model to estimate the average treatment effect of the exposure on the outcome of interest. 
We will adjust for variables as determined by our DAG (@fig-dag-magic)\n\n## Data wrangling and recipes\n\nMost of our data manipulation tools come from the `{dplyr}` package (@tbl-dplyr).\nWe will also use `{lubridate}` to help us manipulate dates.\n\n| Target trial protocol element | {dplyr} functions |\n|-------------------------------|---------------------------------------------|\n| Eligibility criteria | `filter()` |\n| Exposure definition | `mutate()` |\n| Assignment procedures | `mutate()` |\n| Follow-up period | `mutate()` `pivot_longer()` `pivot_wider()` |\n| Outcome definition | `mutate()` |\n| Analysis plan | `select()` `mutate()` |\n\n: Mapping target trial protocol elements to commonly used `{dplyr}` functions {#tbl-dplyr}\n\nTo answer this question, we are going to need to manipulate both the `seven_dwarfs_train` dataset as well as the `parks_metadata_raw` dataset.\nLet's start with the `seven_dwarfs_train` data set.\nThe Seven Dwarfs Mine Train ride is an attraction at Walt Disney World's Magic Kingdom.\nThe `seven_dwarfs_train` dataset in the {touringplans} package contains information about the date a particular wait time was recorded (`park_date`), the time of the wait time (`wait_datetime`), the actual wait time (`wait_minutes_actual`), and the posted wait time (`wait_minutes_posted`).\nLet's take a look at this dataset.\nThe {skimr} package is great for getting a quick glimpse at a new dataset.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(skimr)\nskim(seven_dwarfs_train)\n```\n\n::: {.cell-output-display}\n\nTable: Data summary\n\n| | |\n|:------------------------|:------------------|\n|Name |seven_dwarfs_train |\n|Number of rows |321631 |\n|Number of columns |4 |\n|_______________________ | |\n|Column type frequency: | |\n|Date |1 |\n|numeric |2 |\n|POSIXct |1 |\n|________________________ | |\n|Group variables |None |\n\n\n**Variable type: Date**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:----------|:----------|:----------|--------:|\n|park_date | 0| 1|2015-01-01 |2021-12-28 |2018-04-07 | 2334|\n\n\n**Variable type: numeric**\n\n|skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist |\n|:-------------------|---------:|-------------:|-----:|-------:|------:|---:|---:|---:|----:|:-----|\n|wait_minutes_actual | 313996| 0.02| 23.99| 1064.06| -92918| 21| 31| 46| 217|▁▁▁▁▇ |\n|wait_minutes_posted | 30697| 0.90| 76.96| 33.99| 0| 50| 70| 95| 300|▆▇▁▁▁ |\n\n\n**Variable type: POSIXct**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:-------------------|:-------------------|:-------------------|--------:|\n|wait_datetime | 0| 1|2015-01-01 07:51:12 |2021-12-28 22:57:34 |2018-04-07 23:14:06 | 321586|\n\n\n:::\n:::\n\n\nExamining the output above, we learn that this dataset contains four columns and 321,631 rows.\nWe also learn that the dates span from 2015 to 2021.\nWe can also examine the distribution of each of the variables to detect any potential anomalies.\nNotice anything strange?\nLook at the `p0` (that is the minimum value) for `wait_minutes_actual`.\nIt is `-92918`!\nWe are not using this variable for this analysis, but we will for future analyses, so this is good to keep in mind.\n\nWe need this dataset to calculate our *outcome*.\nRecall from above that our outcome is defined as the average posted wait time between 9am and 10am.\nAdditionally, recall our eligibility criteria states that we need to 
restrict our analysis to days in 2018.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\nlibrary(lubridate)\nseven_dwarfs_train_2018 <- seven_dwarfs_train |>\n filter(year(park_date) == 2018) |> # eligibility criteria\n mutate(hour = hour(wait_datetime)) |> # get hour from wait\n group_by(park_date, hour) |> # group by date\n summarise(\n wait_minutes_posted_avg = mean(wait_minutes_posted, na.rm = TRUE),\n .groups = \"drop\"\n ) |> # get average wait time\n mutate(\n wait_minutes_posted_avg =\n case_when(\n is.nan(wait_minutes_posted_avg) ~ NA,\n TRUE ~ wait_minutes_posted_avg\n )\n ) |> # if it is NAN make it NA\n filter(hour == 9) # only keep the average wait time between 9 and 10\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 362 × 3\n park_date hour wait_minutes_posted_avg\n \n 1 2018-01-01 9 60 \n 2 2018-01-02 9 60 \n 3 2018-01-03 9 60 \n 4 2018-01-04 9 68.9\n 5 2018-01-05 9 70.6\n 6 2018-01-06 9 33.3\n 7 2018-01-07 9 46.4\n 8 2018-01-08 9 69.5\n 9 2018-01-09 9 64.3\n10 2018-01-10 9 74.3\n# ℹ 352 more rows\n```\n\n\n:::\n:::\n\n\nNow that we have our outcome settled, we need to get our exposure variable, as well as any other park-specific variables about the day in question that may be used as variables that we adjust for.\nExamining @fig-dag-magic, we see that we need data for three proposed confounders: the ticket season, the time the park closed, and the historic high temperature.\nThese are in the `parks_metadata_raw` dataset.\nThis data will require extra cleaning, since the names are in the original format.\n\n::: callout-tip\nWe like to have our variable names follow a clean convention -- one way to do this is to follow Emily Riederer's \"Column Names as Contracts\" format [@Riederer_2020].\nThe basic idea is to predefine a set of words, phrases, or stubs with clear meanings to index information, and use these consistently when naming variables.\nFor example, in these data, variables that are specific to a particular wait time are prepended with the term `wait` (e.g. `wait_datetime` and `wait_minutes_actual`), variables that are specific to the park on a particular day, acquired from parks metadata, are prepended with the term `park` (e.g. 
`park_date` or `park_temperature_high`).\n:::\n\nLet's first decide what variables we will need.\nIn practice, this decision may involve an iterative process.\nFor example, after drawing our DAG or after conducting diagnostic, we may determine that we need more variables than what we originally cleaned.\nLet's start by skimming this dataframe.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nskim(parks_metadata_raw)\n```\n\n::: {.cell-output-display}\n\nTable: Data summary\n\n| | |\n|:------------------------|:------------------|\n|Name |parks_metadata_raw |\n|Number of rows |2079 |\n|Number of columns |181 |\n|_______________________ | |\n|Column type frequency: | |\n|character |42 |\n|Date |1 |\n|difftime |46 |\n|logical |6 |\n|numeric |86 |\n|________________________ | |\n|Group variables |None |\n\n\n**Variable type: character**\n\n|skim_variable | n_missing| complete_rate| min| max| empty| n_unique| whitespace|\n|:---------------------|---------:|-------------:|---:|---:|-----:|--------:|----------:|\n|wdw_ticket_season | 861| 0.59| 4| 7| 0| 3| 0|\n|season | 253| 0.88| 4| 29| 0| 17| 0|\n|holidayn | 1865| 0.10| 3| 7| 0| 43| 0|\n|wdwticketseason | 761| 0.63| 4| 7| 0| 3| 0|\n|wdwracen | 1992| 0.04| 4| 6| 0| 5| 0|\n|wdweventn | 1832| 0.12| 3| 12| 0| 8| 0|\n|wdwseason | 87| 0.96| 4| 29| 0| 17| 0|\n|mkeventn | 1546| 0.26| 3| 11| 0| 10| 0|\n|epeventn | 868| 0.58| 4| 5| 0| 4| 0|\n|hseventn | 1877| 0.10| 4| 7| 0| 5| 0|\n|akeventn | 2010| 0.03| 4| 4| 0| 2| 0|\n|holidayj | 2037| 0.02| 5| 15| 0| 8| 0|\n|insession | 105| 0.95| 2| 3| 0| 95| 0|\n|insession_enrollment | 105| 0.95| 2| 4| 0| 100| 0|\n|insession_wdw | 105| 0.95| 2| 4| 0| 94| 0|\n|insession_dlr | 105| 0.95| 2| 4| 0| 94| 0|\n|insession_sqrt_wdw | 105| 0.95| 2| 4| 0| 97| 0|\n|insession_sqrt_dlr | 105| 0.95| 2| 4| 0| 97| 0|\n|insession_california | 105| 0.95| 2| 4| 0| 89| 0|\n|insession_dc | 105| 0.95| 2| 4| 0| 86| 0|\n|insession_central_fl | 105| 0.95| 2| 4| 0| 71| 0|\n|insession_drive1_fl | 105| 0.95| 2| 4| 0| 85| 0|\n|insession_drive2_fl | 105| 0.95| 2| 4| 0| 95| 0|\n|insession_drive_ca | 105| 0.95| 2| 4| 0| 91| 0|\n|insession_florida | 105| 0.95| 2| 4| 0| 86| 0|\n|insession_mardi_gras | 105| 0.95| 2| 4| 0| 82| 0|\n|insession_midwest | 105| 0.95| 2| 4| 0| 75| 0|\n|insession_ny_nj | 105| 0.95| 2| 4| 0| 8| 0|\n|insession_ny_nj_pa | 105| 0.95| 2| 4| 0| 19| 0|\n|insession_new_england | 105| 0.95| 2| 4| 0| 45| 0|\n|insession_new_jersey | 105| 0.95| 2| 4| 0| 2| 0|\n|insession_nothwest | 105| 0.95| 2| 4| 0| 17| 0|\n|insession_planes | 105| 0.95| 2| 4| 0| 81| 0|\n|insession_socal | 105| 0.95| 2| 4| 0| 80| 0|\n|insession_southwest | 105| 0.95| 2| 4| 0| 86| 0|\n|mkprddn | 183| 0.91| 33| 41| 0| 2| 0|\n|mkprdnn | 1358| 0.35| 29| 38| 0| 2| 0|\n|mkfiren | 134| 0.94| 18| 65| 0| 8| 0|\n|epfiren | 126| 0.94| 13| 35| 0| 2| 0|\n|hsfiren | 485| 0.77| 24| 66| 0| 6| 0|\n|hsshwnn | 164| 0.92| 10| 28| 0| 2| 0|\n|akshwnn | 883| 0.58| 15| 33| 0| 2| 0|\n\n\n**Variable type: Date**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:----------|:----------|:----------|--------:|\n|date | 0| 1|2015-01-01 |2021-08-31 |2017-11-05 | 2079|\n\n\n**Variable type: difftime**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:----------|:-----------|:--------|--------:|\n|mkopen | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkclose | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|mkemhopen | 0| 1.00|21600 secs |32400 secs |09:00:00 | 
5|\n|mkemhclose | 0| 1.00|54000 secs |107940 secs |23:00:00 | 14|\n|mkopenyest | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkcloseyest | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|mkopentom | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkclosetom | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|epopen | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epclose | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|epemhopen | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epemhclose | 0| 1.00|61200 secs |90000 secs |21:00:00 | 12|\n|epopenyest | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epcloseyest | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|epopentom | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epclosetom | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|hsopen | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hsclose | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|hsemhopen | 0| 1.00|21600 secs |36000 secs |09:00:00 | 7|\n|hsemhclose | 0| 1.00|50400 secs |93600 secs |21:00:00 | 18|\n|hsopenyest | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hscloseyest | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|hsopentom | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hsclosetom | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|akopen | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akclose | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|akemhopen | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akemhclose | 0| 1.00|50400 secs |90000 secs |20:00:00 | 17|\n|akopenyest | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akcloseyest | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|akopentom | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akclosetom | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|mkprddt1 | 183| 0.91|39600 secs |61200 secs |15:00:00 | 5|\n|mkprddt2 | 1851| 0.11|50400 secs |73800 secs |15:30:00 | 5|\n|mkprdnt1 | 1358| 0.35|68400 secs |82800 secs |21:00:00 | 11|\n|mkprdnt2 | 1480| 0.29|0 secs |84600 secs |23:00:00 | 8|\n|mkfiret1 | 134| 0.94|66600 secs |80100 secs |21:15:00 | 12|\n|mkfiret2 | 2069| 0.00|85800 secs |85800 secs |23:50:00 | 1|\n|epfiret1 | 126| 0.94|64800 secs |81000 secs |21:00:00 | 6|\n|epfiret2 | 2074| 0.00|85200 secs |85200 secs |23:40:00 | 1|\n|hsfiret1 | 485| 0.77|0 secs |82800 secs |21:00:00 | 17|\n|hsfiret2 | 2045| 0.02|0 secs |81000 secs |21:00:00 | 5|\n|hsshwnt1 | 164| 0.92|65100 secs |79200 secs |20:30:00 | 10|\n|hsshwnt2 | 1369| 0.34|72000 secs |82800 secs |21:30:00 | 11|\n|akshwnt1 | 883| 0.58|65700 secs |76500 secs |20:30:00 | 13|\n|akshwnt2 | 1149| 0.45|70200 secs |81000 secs |21:45:00 | 13|\n\n\n**Variable type: logical**\n\n|skim_variable | n_missing| complete_rate| mean|count |\n|:-------------|---------:|-------------:|----:|:-----|\n|hsprddt1 | 2079| 0| NaN|: |\n|hsprddn | 2079| 0| NaN|: |\n|akprddt1 | 2079| 0| NaN|: |\n|akprddt2 | 2079| 0| NaN|: |\n|akprddn | 2079| 0| NaN|: |\n|akfiren | 2079| 0| NaN|: |\n\n\n**Variable type: numeric**\n\n|skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist |\n|:------------------|---------:|-------------:|-----------:|----------:|-----------:|-----------:|-----------:|-----------:|-----------:|:-----|\n|dayofweek | 0| 1| 4.00| 2.00| 1.00| 2.00| 4.00| 6.00| 7.00|▇▃▃▃▇ |\n|dayofyear | 0| 1| 181.84| 106.34| 0.00| 89.00| 184.00| 273.00| 365.00|▇▇▇▇▇ |\n|weekofyear | 0| 1| 26.09| 15.20| 0.00| 13.00| 26.00| 39.00| 53.00|▇▇▇▇▇ |\n|monthofyear | 0| 1| 6.51| 3.48| 1.00| 3.00| 7.00| 10.00| 12.00|▇▅▆▅▇ |\n|year | 0| 1| 2017.41| 1.74| 2015.00| 
2016.00| 2017.00| 2019.00| 2021.00|▇▃▃▃▃ |\n|holidaypx | 0| 1| 7.85| 6.89| 0.00| 3.00| 6.00| 10.00| 33.00|▇▅▂▁▁ |\n|holidaym | 0| 1| 0.54| 1.35| 0.00| 0.00| 0.00| 0.00| 5.00|▇▁▁▁▁ |\n|holiday | 0| 1| 0.10| 0.30| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwevent | 0| 1| 0.12| 0.32| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwrace | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwmaxtemp | 5| 1| 82.80| 8.53| 51.11| 78.29| 84.50| 89.54| 97.72|▁▁▃▇▆ |\n|wdwmintemp | 6| 1| 65.50| 10.18| 27.48| 59.03| 68.35| 74.10| 81.28|▁▂▃▆▇ |\n|wdwmeantemp | 6| 1| 74.15| 9.06| 39.75| 68.76| 76.37| 81.61| 87.76|▁▂▃▆▇ |\n|mkevent | 0| 1| 0.26| 0.44| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|epevent | 0| 1| 0.58| 0.49| 0.00| 0.00| 1.00| 1.00| 1.00|▆▁▁▁▇ |\n|hsevent | 0| 1| 0.10| 0.30| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akevent | 0| 1| 0.03| 0.18| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkemhmorn | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemhmyest | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemhmtom | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemheve | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkhoursemh | 0| 1| 13.64| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkhoursemhyest | 0| 1| 13.65| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkhoursemhtom | 0| 1| 13.64| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkemheyest | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkemhetom | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmorn | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmyest | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmtom | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemheve | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemheyest | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhetom | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|ephoursemh | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|ephoursemhyest | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|ephoursemhtom | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|hsemhmorn | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemhmyest | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemhmtom | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemheve | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hsemheyest | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hsemhetom | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hshoursemh | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|hshoursemhyest | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|hshoursemhtom | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|akemhmorn | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemhmyest | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemhmtom | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemheve | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akemheyest | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akemhetom | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akhoursemh | 0| 1| 11.77| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|akhoursemhyest | 0| 1| 11.77| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|akhoursemhtom | 0| 1| 11.76| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|mkhours | 0| 1| 13.26| 2.01| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ 
|\n|mkhoursyest | 0| 1| 13.26| 2.01| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ |\n|mkhourstom | 0| 1| 13.26| 2.00| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ |\n|ephours | 0| 1| 12.02| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|ephoursyest | 0| 1| 12.03| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|ephourstom | 0| 1| 12.02| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|hshours | 0| 1| 11.92| 1.19| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|hshoursyest | 0| 1| 11.93| 1.20| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|hshourstom | 0| 1| 11.92| 1.19| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|akhours | 0| 1| 11.46| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|akhoursyest | 0| 1| 11.47| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|akhourstom | 0| 1| 11.46| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|weather_wdwhigh | 0| 1| 82.35| 7.86| 70.20| 74.60| 82.80| 90.60| 92.30|▅▃▂▂▇ |\n|weather_wdwlow | 0| 1| 64.10| 9.26| 49.20| 55.80| 63.60| 74.00| 76.10|▅▅▃▂▇ |\n|weather_wdwprecip | 0| 1| 0.15| 0.08| 0.03| 0.08| 0.12| 0.23| 0.35|▇▆▃▅▁ |\n|capacitylost_mk | 0| 1| 422110.61| 36458.81| 352865.00| 385812.00| 433857.00| 456055.00| 473553.00|▅▂▁▇▆ |\n|capacitylost_ep | 0| 1| 366897.04| 24019.96| 325168.00| 338367.00| 380763.00| 381963.00| 394662.00|▅▁▁▂▇ |\n|capacitylost_hs | 0| 1| 287485.76| 33198.89| 203780.00| 279573.00| 301871.00| 311870.00| 321869.00|▂▁▁▁▇ |\n|capacitylost_ak | 0| 1| 228193.83| 14967.82| 210779.00| 220778.00| 223178.00| 232777.00| 273873.00|▇▅▁▁▂ |\n|capacitylostwgt_mk | 0| 1| 41374025.71| 3621097.96| 34661635.00| 37641738.00| 42585643.00| 44577245.00| 46545047.00|▅▂▂▇▆ |\n|capacitylostwgt_ep | 0| 1| 35344939.61| 2201138.89| 31692832.00| 32692333.00| 36666337.00| 36668737.00| 37678138.00|▅▁▁▁▇ |\n|capacitylostwgt_hs | 0| 1| 27528647.53| 3049291.37| 19812520.00| 26772627.00| 28771129.00| 29761030.00| 30750931.00|▂▁▁▁▇ |\n|capacitylostwgt_ak | 0| 1| 22386447.82| 1398263.45| 20790321.00| 21780222.00| 21799422.00| 22780123.00| 26739827.00|▇▅▁▁▂ |\n|mkprdday | 0| 1| 1.07| 0.59| 0.00| 1.00| 1.00| 1.00| 3.00|▁▇▁▁▁ |\n|mkprdngt | 0| 1| 0.64| 0.90| 0.00| 0.00| 0.00| 2.00| 3.00|▇▁▁▃▁ |\n|mkfirewk | 0| 1| 0.94| 0.26| 0.00| 1.00| 1.00| 1.00| 2.00|▁▁▇▁▁ |\n|epfirewk | 0| 1| 0.94| 0.25| 0.00| 1.00| 1.00| 1.00| 3.00|▁▇▁▁▁ |\n|hsprdday | 0| 1| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|▁▁▇▁▁ |\n|hsfirewk | 0| 1| 0.78| 0.45| 0.00| 1.00| 1.00| 1.00| 2.00|▂▁▇▁▁ |\n|hsshwngt | 0| 1| 1.28| 0.62| 0.00| 1.00| 1.00| 2.00| 3.00|▁▇▁▅▁ |\n|hsfirewks | 0| 1| 1.00| 0.00| 1.00| 1.00| 1.00| 1.00| 1.00|▁▁▇▁▁ |\n|akprdday | 0| 1| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|▁▁▇▁▁ |\n|akshwngt | 0| 1| 1.04| 0.97| 0.00| 0.00| 1.00| 2.00| 3.00|▇▂▁▇▁ |\n\n\n:::\n:::\n\n\nThis dataset contains many more variables than the one we worked with previously.\nFor this analysis, we are going to select `date` (the observation date), `wdw_ticket_season` (the ticket season for the observation), `wdwmaxtemp` (the maximum temperature), `mkclose` (the time Magic Kingdom closed), `mkemhmorn` (whether Magic Kingdom had an \"Extra Magic Hour\" in the morning).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparks_metadata_clean <- parks_metadata_raw |>\n ## based on our analysis plan, we will select the following variables\n select(date, wdw_ticket_season, wdwmaxtemp, mkclose, mkemhmorn) |>\n ## based on eligibility criteria, limit to 2018\n filter(year(date) == 2018) |>\n ## rename variables\n rename(\n park_date = date,\n park_ticket_season = wdw_ticket_season,\n park_temperature_high = wdwmaxtemp,\n park_close = mkclose,\n 
park_extra_magic_morning = mkemhmorn\n )\n```\n:::\n\n\n## Working with multiple data sources\n\nFrequently we find ourselves merging data from multiple sources when attempting to answer causal questions in order to ensure that all of the necessary factors are accounted for.\nThe way we can combine datasets is via *joins* -- joining two or more datasets based on a set or sets of common variables.\nWe can think of three main types of *joins*: left, right, and inner.\nA *left* join combines data from two datasets based on a common variable and includes all records from the *left* dataset along with matching records from the *right* dataset (in `{dplyr}`, `left_join()`), while a *right* join includes all records from the *right* dataset and their corresponding matches from the *left* dataset (in `{dplyr}` `right_join()`); an inner join, on the other hand, includes only the records with matching values in *both* datasets, excluding non-matching records (in `{dplyr}` `inner_join()`.\nFor this analysis, we need to use a left join to pull in the cleaned parks metadata.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n left_join(parks_metadata_clean, by = \"park_date\")\n```\n:::\n\n\n## Recognizing missing data\n\nIt is important to recognize whether we have any missing data in our variables.\nThe `{visdat}` package is great for getting a quick sense of whether we have any missing data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(visdat)\nvis_miss(seven_dwarfs_train_2018)\n```\n\n::: {.cell-output-display}\n![](07-prep-data_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\nIt looks like we only have a few observations (2%) missing our outcome of interest.\nThis is not too bad.\nFor this first analysis we will ignore the missing values.\nWe can explicitly drop them using the `drop_na()` function from `{dplyr}`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n drop_na()\n```\n:::\n\n\n## Exploring and visualizing data and assumptions\n\nThe *positivity* assumption requires that within each level and combination of the study variables used to achieve exchangeability, there are exposed and unexposed subjects (@sec-assump).\nWe can explore this by visualizing the distribution of each of our proposed confounders stratified by the exposure.\n\n### Single variable checks for positivity violations\n\n@fig-close shows the distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning.\nThere is not clear evidence of a lack of positivity here as both exposure levels span the majority of the covariate space.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(\n seven_dwarfs_train_2018,\n aes(\n x = factor(park_close),\n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning)\n )\n) +\n geom_bar(position = position_dodge2(width = 0.9, preserve = \"single\")) +\n labs(\n fill = \"Extra Magic Morning\",\n x = \"Time of Park Close\"\n )\n```\n\n::: {.cell-output-display}\n![Distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning](07-prep-data_files/figure-html/fig-close-1.png){#fig-close width=672}\n:::\n:::\n\n\nTo examine the distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning we can use a mirrored histogram.\nWe'll use the {halfmoon} package's `geom_mirror_histogram()` to create one.\nExamining @fig-temp, it does look 
like there are very few days in the exposed group with maximum temperatures less than 60 degrees, while not necessarily a positivity violation it is worth keeping an eye on, particularly because the dataset is not very large, so this could make it difficult to estimate an average exposure effect across this whole space.\nIf we found this to be particularly difficult, we could posit changing our causal question to instead restrict the analysis to warmer days.\nThis of course would also restrict which days we could draw conclusions about for the future.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(halfmoon)\nggplot(\n seven_dwarfs_train_2018,\n aes(\n x = park_temperature_high,\n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning)\n )\n) +\n geom_mirror_histogram(bins = 20) +\n labs(\n fill = \"Extra Magic Morning\",\n x = \"Historic maximum temperature (degrees F)\"\n )\n```\n\n::: {.cell-output-display}\n![Distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning](07-prep-data_files/figure-html/fig-temp-1.png){#fig-temp width=672}\n:::\n:::\n\n\nFinally, let's look at the distribution of ticket season by whether there were extra magic hours in the morning.\nExamining @fig-ticket, we do not see any positivity violations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(\n seven_dwarfs_train_2018,\n aes(\n x = park_ticket_season,\n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning)\n )\n) +\n geom_bar(position = \"dodge\") +\n labs(\n fill = \"Extra Magic Morning\",\n x = \"Magic Kingdom Ticket Season\"\n )\n```\n\n::: {.cell-output-display}\n![Distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning](07-prep-data_files/figure-html/fig-ticket-1.png){#fig-ticket width=672}\n:::\n:::\n\n\n### Multiple variable checks for positivity violations\n\nWe have confirmed that for each of the three confounders, we do not see strong evidence of positivity violations.\nBecause we have so few variables here, we can examine this a bit more closely.\nLet's start by discretizing the `park_temperature_high` variable a bit (we will cut it into tertiles).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 |>\n ## cut park_temperature_high into tertiles\n mutate(park_temperature_high_bin = cut(park_temperature_high, breaks = 3)) |>\n ## bin park close time\n mutate(park_close_bin = case_when(\n hour(park_close) < 19 & hour(park_close) > 12 ~ \"(1) early\",\n hour(park_close) >= 19 & hour(park_close) < 24 ~ \"(2) standard\",\n hour(park_close) >= 24 | hour(park_close) < 12 ~ \"(3) late\"\n )) |>\n group_by(park_close_bin, park_temperature_high_bin, park_ticket_season) |>\n ## calculate the proportion exposed in each bin\n summarise(prop_exposed = mean(park_extra_magic_morning), .groups = \"drop\") |>\n ggplot(aes(x = park_close_bin, y = park_temperature_high_bin, fill = prop_exposed)) +\n geom_tile() +\n scale_fill_gradient2(midpoint = 0.5) +\n facet_wrap(~park_ticket_season) +\n labs(\n y = \"Historic Maximum Temperature (F)\",\n x = \"Magic Kingdom Park Close Time\",\n fill = \"Proportion of Days Exposed\"\n )\n```\n\n::: {.cell-output-display}\n![Check for positivity violations across three confounders: historic high temperature, park close time, and ticket season.](07-prep-data_files/figure-html/fig-positivity-1.png){#fig-positivity width=864}\n:::\n:::\n\n\nInteresting!\n@fig-positivity shows an interesting 
potential violation.\nIt looks like 100% of days with lower temperatures (historic highs between 51 and 65 degrees) that are in the peak ticket season have extra magic hours in the morning.\nThis actually makes sense if we think a bit about this data set.\nThe only days with cold temperatures in Florida that would also be considered a \"peak\" time to visit Walt Disney World would be over Christmas / New Years.\nDuring this time there historically were always extra magic hours.\n\nWe are going to proceed with the analysis, but we will keep these observations in mind.\n\n## Presenting descriptive statistics\n\nLet's examine a table of the variables of interest in this data frame.\nTo do so, we are going to use the `tbl_summary()` function from the `{gtsummary}` package.\n(We'll also use the `{labelled}` package to clean up the variable names for the table.)\n\n\n::: {#tbl-unweighted-gtsummary .cell tbl-cap='A descriptive table of Extra Magic Morning in the touringplans dataset. This table shows the distributions of these variables in the observed population.'}\n\n```{.r .cell-code}\nlibrary(gtsummary)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n#BlackLivesMatter\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(labelled)\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n set_variable_labels(\n park_ticket_season = \"Ticket Season\",\n park_close = \"Close Time\",\n park_temperature_high = \"Historic High Temperature\"\n )\n\ntbl_summary(\n seven_dwarfs_train_2018,\n by = park_extra_magic_morning,\n include = c(park_ticket_season, park_close, park_temperature_high)\n) |>\n # add an overall column to the table\n add_overall(last = TRUE)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n|Characteristic | 0, N = 294^1^ | 1, N = 60^1^ | Overall, N = 354^1^ |\n|:--------------------------|:------------|:-----------|:------------------|\n|Ticket Season | | | |\n|    peak | 60 (20%) | 18 (30%) | 78 (22%) |\n|    regular | 158 (54%) | 35 (58%) | 193 (55%) |\n|    value | 76 (26%) | 7 (12%) | 83 (23%) |\n|Close Time | | | |\n|    16:30:00 | 1 (0.3%) | 0 (0%) | 1 (0.3%) |\n|    18:00:00 | 37 (13%) | 18 (30%) | 55 (16%) |\n|    20:00:00 | 18 (6.1%) | 2 (3.3%) | 20 (5.6%) |\n|    21:00:00 | 28 (9.5%) | 0 (0%) | 28 (7.9%) |\n|    22:00:00 | 91 (31%) | 11 (18%) | 102 (29%) |\n|    23:00:00 | 78 (27%) | 11 (18%) | 89 (25%) |\n|    24:00:00 | 40 (14%) | 17 (28%) | 57 (16%) |\n|    25:00:00 | 1 (0.3%) | 1 (1.7%) | 2 (0.6%) |\n|Historic High Temperature | 84 (78, 89) | 83 (76, 87) | 84 (78, 89) |\n\n^1^ n (%); Median (IQR)\n
\n```\n\n:::\n:::\n", + "markdown": "# Preparing data to answer causal questions {#sec-data-causal}\n\n\n\n\n\n## Introduction to the data {#sec-data}\n\nThroughout this book we will be using data obtained from [Touring Plans](https://touringplans.com).\nTouring Plans is a company that helps folks plan their trips to Disney and Universal theme parks.\nOne of their goals is to accurately predict attraction wait times at these theme parks by leveraging data and statistical modeling.\nThe `{touringplans}` R package includes several datasets containing information about Disney theme park attractions.\nA summary of the attractions included in the package can be found by running the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(touringplans)\nattractions_metadata\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 14 × 8\n dataset_name name short_name park land opened_on \n \n 1 alien_sauce… Alie… Alien Sau… Disn… Toy … 2018-06-30\n 2 dinosaur DINO… DINOSAUR Disn… Dino… 1998-04-22\n 3 expedition_… Expe… Expeditio… Disn… Asia 2006-04-07\n 4 flight_of_p… Avat… Flight of… Disn… Pand… 2017-05-27\n 5 kilimanjaro… Kili… Kilimanja… Disn… Afri… 1998-04-22\n 6 navi_river Na'v… Na'vi Riv… Disn… Pand… 2017-05-27\n 7 pirates_of_… Pira… Pirates o… Magi… Adve… 1973-12-17\n 8 rock_n_roll… Rock… Rock Coas… Disn… Suns… 1999-07-29\n 9 seven_dwarf… Seve… 7 Dwarfs … Magi… Fant… 2014-05-28\n10 slinky_dog Slin… Slinky Dog Disn… Toy … 2018-06-30\n11 soarin Soar… Soarin' Epcot Worl… 2005-05-05\n12 spaceship_e… Spac… Spaceship… Epcot Worl… 1982-10-01\n13 splash_moun… Spla… Splash Mo… Magi… Fron… 1992-07-17\n14 toy_story_m… Toy … Toy Story… Disn… Toy … 2008-05-31\n# ℹ 2 more variables: duration ,\n# average_wait_per_hundred \n```\n\n\n:::\n:::\n\n\nAdditionally, this package contains a dataset with raw metadata about the parks, with observations recorded daily.\nThis metadata includes information like the Walt Disney World ticket season on the particular day (was it high season -- think Christmas -- or low season -- think right when school started), what the historic temperatures were in the park on that day, and whether there was a special event, such as \"extra magic hours\" in the park on that day (did the park open early to guests staying in the Walt Disney World resorts?).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparks_metadata_raw\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2,079 × 181\n date wdw_ticket_season dayofweek dayofyear\n \n 1 2015-01-01 5 0\n 2 2015-01-02 6 1\n 3 2015-01-03 7 2\n 4 2015-01-04 1 3\n 5 2015-01-05 2 4\n 6 2015-01-06 3 5\n 7 2015-01-07 4 6\n 8 2015-01-08 5 7\n 9 2015-01-09 6 8\n10 2015-01-10 7 9\n# ℹ 2,069 more rows\n# ℹ 177 more variables: weekofyear ,\n# monthofyear , year , season ,\n# holidaypx , holidaym , holidayn ,\n# holiday , wdwticketseason ,\n# wdwracen , wdweventn , wdwevent ,\n# wdwrace , wdwseason , …\n```\n\n\n:::\n:::\n\n\nSuppose the causal question of interest is:\n\n**Is there a relationship between whether there were \"Extra Magic Hours\" in the morning at Magic Kingdom and the average wait time for an attraction called the \"Seven Dwarfs Mine Train\" the same day between 9am and 10am in 2018?**\n\nLet's begin by diagramming this causal question (@fig-seven-diag).\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_segment(aes(x = 0.5, xend = 2.5, y = 0.95, yend = 0.95)): All aesthetics have length 1, but the data has 6 rows.\nℹ Please consider using `annotate()` or provide this\n layer with 
data containing a single row.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_segment(aes(x = 1.5, xend = 1.5, y = 0.95, yend = 1.1)): All aesthetics have length 1, but the data has 6 rows.\nℹ Please consider using `annotate()` or provide this\n layer with data containing a single row.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_segment(aes(x = 0.5, xend = 1, y = 0.95, yend = 0.65)): All aesthetics have length 1, but the data has 6 rows.\nℹ Please consider using `annotate()` or provide this\n layer with data containing a single row.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_segment(aes(x = 1, xend = 1.5, y = 0.65, yend = 0.65)): All aesthetics have length 1, but the data has 6 rows.\nℹ Please consider using `annotate()` or provide this\n layer with data containing a single row.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_segment(aes(x = 1.55, xend = 2.05, y = 0.95, yend = 0.65)): All aesthetics have length 1, but the data has 6 rows.\nℹ Please consider using `annotate()` or provide this\n layer with data containing a single row.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_segment(aes(x = 2.05, xend = 2.55, y = 0.65, yend = 0.65)): All aesthetics have length 1, but the data has 6 rows.\nℹ Please consider using `annotate()` or provide this\n layer with data containing a single row.\n```\n\n\n:::\n\n::: {.cell-output-display}\n![Diagram of the causal question \"Is there a relationship between whether there were \"Extra Magic Hours\" in the morning at Magic Kingdom and the average wait time for an attraction called the \"Seven Dwarfs Mine Train\" the same day between 9am and 10am in 2018?\"](07-prep-data_files/figure-html/fig-seven-diag-1.png){#fig-seven-diag width=672}\n:::\n:::\n\n\nHistorically, guests who stayed in a Walt Disney World resort hotel could access the park during \"Extra Magic Hours,\" during which the park was closed to all other guests.\nThese extra hours could be in the morning or evening.\nThe Seven Dwarfs Mine Train is a ride at Walt Disney World's Magic Kingdom.\nMagic Kingdom may or may not be selected each day to have these \"Extra Magic Hours.\" We are interested in examining the relationship between whether there were \"Extra Magic Hours\" in the morning and the average wait time for the Seven Dwarfs Mine Train on the same day between 9 am and 10 am.\nBelow is a proposed DAG for this question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Proposed DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. 
Here we are saying that we believe 1) Extra Magic Hours impacts average wait time and 2) both Extra Magic Hours and average wait time are determined by the time the park closes, historic high temperatures, and ticket season.](07-prep-data_files/figure-html/fig-dag-magic-1.png){#fig-dag-magic width=672}\n:::\n:::\n\n\nSince we are not in charge of Walt Disney World's operations, we cannot randomize dates to have (or not) \"Extra Magic Hours\", therefore, we need to rely on previously collected observational data and do our best to emulate the *target trial* that we would have created, should it have been possible.\nHere, our observations are *days*.\nLooking at the diagram above, we can map each element of the causal question to elements of our target trial protocol:\n\n- **Eligibility criteria**: Days must be from 2018\n- **Exposure definition**: Magic kingdom had \"Extra Magic Hours\" in the morning\n- **Assignment procedures**: Observed -- if the historic data suggests there were \"Extra Magic Hours\" in the morning on a particular day, that day is classified as \"exposed\" otherwise it is \"unexposed\"\n- **Follow-up period**: From park open to 10am.\n- **Outcome definition**: The average posted wait time between 9am and 10am\n- **Causal contrast of interest**: Average treatment effect (we will discuss this in @sec-estimands)\n- **Analysis plan**: We use inverse probability weighting after fitting a propensity score model to estimate the average treatment effect of the exposure on the outcome of interest. We will adjust for variables as determined by our DAG (@fig-dag-magic)\n\n## Data wrangling and recipes\n\nMost of our data manipulation tools come from the `{dplyr}` package (@tbl-dplyr).\nWe will also use `{lubridate}` to help us manipulate dates.\n\n| Target trial protocol element | {dplyr} functions |\n|-------------------------------|---------------------------------------------|\n| Eligibility criteria | `filter()` |\n| Exposure definition | `mutate()` |\n| Assignment procedures | `mutate()` |\n| Follow-up period | `mutate()` `pivot_longer()` `pivot_wider()` |\n| Outcome definition | `mutate()` |\n| Analysis plan | `select()` `mutate()` |\n\n: Mapping target trial protocol elements to commonly used `{dplyr}` functions {#tbl-dplyr}\n\nTo answer this question, we are going to need to manipulate both the `seven_dwarfs_train` dataset as well as the `parks_metadata_raw` dataset.\nLet's start with the `seven_dwarfs_train` data set.\nThe Seven Dwarfs Mine Train ride is an attraction at Walt Disney World's Magic Kingdom.\nThe `seven_dwarfs_train` dataset in the {touringplans} package contains information about the date a particular wait time was recorded (`park_date`), the time of the wait time (`wait_datetime`), the actual wait time (`wait_minutes_actual`), and the posted wait time (`wait_minutes_posted`).\nLet's take a look at this dataset.\nThe {skimr} package is great for getting a quick glimpse at a new dataset.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(skimr)\nskim(seven_dwarfs_train)\n```\n\n::: {.cell-output-display}\n\nTable: Data summary\n\n| | |\n|:------------------------|:------------------|\n|Name |seven_dwarfs_train |\n|Number of rows |321631 |\n|Number of columns |4 |\n|_______________________ | |\n|Column type frequency: | |\n|Date |1 |\n|numeric |2 |\n|POSIXct |1 |\n|________________________ | |\n|Group variables |None |\n\n\n**Variable type: Date**\n\n|skim_variable | n_missing| complete_rate|min |max |median | 
n_unique|\n|:-------------|---------:|-------------:|:----------|:----------|:----------|--------:|\n|park_date | 0| 1|2015-01-01 |2021-12-28 |2018-04-07 | 2334|\n\n\n**Variable type: numeric**\n\n|skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist |\n|:-------------------|---------:|-------------:|-----:|-------:|------:|---:|---:|---:|----:|:-----|\n|wait_minutes_actual | 313996| 0.02| 23.99| 1064.06| -92918| 21| 31| 46| 217|▁▁▁▁▇ |\n|wait_minutes_posted | 30697| 0.90| 76.96| 33.99| 0| 50| 70| 95| 300|▆▇▁▁▁ |\n\n\n**Variable type: POSIXct**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:-------------------|:-------------------|:-------------------|--------:|\n|wait_datetime | 0| 1|2015-01-01 07:51:12 |2021-12-28 22:57:34 |2018-04-07 23:14:06 | 321586|\n\n\n:::\n:::\n\n\nExamining the output above, we learn that this dataset contains four columns and 321,631 rows.\nWe also learn that the dates span from 2015 to 2021.\nWe can also examine the distribution of each of the variables to detect any potential anomalies.\nNotice anything strange?\nLook at the `p0` (that is the minimum value) for `wait_minutes_actual`.\nIt is `-92918`!\nWe are not using this variable for this analysis, but we will for future analyses, so this is good to keep in mind.\n\nWe need this dataset to calculate our *outcome*.\nRecall from above that our outcome is defined as the average posted wait time between 9am and 10am.\nAdditionally, recall our eligibility criteria states that we need to restrict our analysis to days in 2018.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\nlibrary(lubridate)\nseven_dwarfs_train_2018 <- seven_dwarfs_train |>\n filter(year(park_date) == 2018) |> # eligibility criteria\n mutate(hour = hour(wait_datetime)) |> # get hour from wait\n group_by(park_date, hour) |> # group by date\n summarise(\n wait_minutes_posted_avg = mean(wait_minutes_posted, na.rm = TRUE),\n .groups = \"drop\"\n ) |> # get average wait time\n mutate(\n wait_minutes_posted_avg =\n case_when(\n is.nan(wait_minutes_posted_avg) ~ NA,\n TRUE ~ wait_minutes_posted_avg\n )\n ) |> # if it is NAN make it NA\n filter(hour == 9) # only keep the average wait time between 9 and 10\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 362 × 3\n park_date hour wait_minutes_posted_avg\n \n 1 2018-01-01 9 60 \n 2 2018-01-02 9 60 \n 3 2018-01-03 9 60 \n 4 2018-01-04 9 68.9\n 5 2018-01-05 9 70.6\n 6 2018-01-06 9 33.3\n 7 2018-01-07 9 46.4\n 8 2018-01-08 9 69.5\n 9 2018-01-09 9 64.3\n10 2018-01-10 9 74.3\n# ℹ 352 more rows\n```\n\n\n:::\n:::\n\n\nNow that we have our outcome settled, we need to get our exposure variable, as well as any other park-specific variables about the day in question that may be used as variables that we adjust for.\nExamining @fig-dag-magic, we see that we need data for three proposed confounders: the ticket season, the time the park closed, and the historic high temperature.\nThese are in the `parks_metadata_raw` dataset.\nThis data will require extra cleaning, since the names are in the original format.\n\n::: callout-tip\nWe like to have our variable names follow a clean convention -- one way to do this is to follow Emily Riederer's \"Column Names as Contracts\" format [@Riederer_2020].\nThe basic idea is to predefine a set of words, phrases, or stubs with clear meanings to index information, and use these consistently 
when naming variables.\nFor example, in these data, variables that are specific to a particular wait time are prepended with the term `wait` (e.g. `wait_datetime` and `wait_minutes_actual`), variables that are specific to the park on a particular day, acquired from parks metadata, are prepended with the term `park` (e.g. `park_date` or `park_temperature_high`).\n:::\n\nLet's first decide what variables we will need.\nIn practice, this decision may involve an iterative process.\nFor example, after drawing our DAG or after conducting diagnostic, we may determine that we need more variables than what we originally cleaned.\nLet's start by skimming this dataframe.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nskim(parks_metadata_raw)\n```\n\n::: {.cell-output-display}\n\nTable: Data summary\n\n| | |\n|:------------------------|:------------------|\n|Name |parks_metadata_raw |\n|Number of rows |2079 |\n|Number of columns |181 |\n|_______________________ | |\n|Column type frequency: | |\n|character |42 |\n|Date |1 |\n|difftime |46 |\n|logical |6 |\n|numeric |86 |\n|________________________ | |\n|Group variables |None |\n\n\n**Variable type: character**\n\n|skim_variable | n_missing| complete_rate| min| max| empty| n_unique| whitespace|\n|:---------------------|---------:|-------------:|---:|---:|-----:|--------:|----------:|\n|wdw_ticket_season | 861| 0.59| 4| 7| 0| 3| 0|\n|season | 253| 0.88| 4| 29| 0| 17| 0|\n|holidayn | 1865| 0.10| 3| 7| 0| 43| 0|\n|wdwticketseason | 761| 0.63| 4| 7| 0| 3| 0|\n|wdwracen | 1992| 0.04| 4| 6| 0| 5| 0|\n|wdweventn | 1832| 0.12| 3| 12| 0| 8| 0|\n|wdwseason | 87| 0.96| 4| 29| 0| 17| 0|\n|mkeventn | 1546| 0.26| 3| 11| 0| 10| 0|\n|epeventn | 868| 0.58| 4| 5| 0| 4| 0|\n|hseventn | 1877| 0.10| 4| 7| 0| 5| 0|\n|akeventn | 2010| 0.03| 4| 4| 0| 2| 0|\n|holidayj | 2037| 0.02| 5| 15| 0| 8| 0|\n|insession | 105| 0.95| 2| 3| 0| 95| 0|\n|insession_enrollment | 105| 0.95| 2| 4| 0| 100| 0|\n|insession_wdw | 105| 0.95| 2| 4| 0| 94| 0|\n|insession_dlr | 105| 0.95| 2| 4| 0| 94| 0|\n|insession_sqrt_wdw | 105| 0.95| 2| 4| 0| 97| 0|\n|insession_sqrt_dlr | 105| 0.95| 2| 4| 0| 97| 0|\n|insession_california | 105| 0.95| 2| 4| 0| 89| 0|\n|insession_dc | 105| 0.95| 2| 4| 0| 86| 0|\n|insession_central_fl | 105| 0.95| 2| 4| 0| 71| 0|\n|insession_drive1_fl | 105| 0.95| 2| 4| 0| 85| 0|\n|insession_drive2_fl | 105| 0.95| 2| 4| 0| 95| 0|\n|insession_drive_ca | 105| 0.95| 2| 4| 0| 91| 0|\n|insession_florida | 105| 0.95| 2| 4| 0| 86| 0|\n|insession_mardi_gras | 105| 0.95| 2| 4| 0| 82| 0|\n|insession_midwest | 105| 0.95| 2| 4| 0| 75| 0|\n|insession_ny_nj | 105| 0.95| 2| 4| 0| 8| 0|\n|insession_ny_nj_pa | 105| 0.95| 2| 4| 0| 19| 0|\n|insession_new_england | 105| 0.95| 2| 4| 0| 45| 0|\n|insession_new_jersey | 105| 0.95| 2| 4| 0| 2| 0|\n|insession_nothwest | 105| 0.95| 2| 4| 0| 17| 0|\n|insession_planes | 105| 0.95| 2| 4| 0| 81| 0|\n|insession_socal | 105| 0.95| 2| 4| 0| 80| 0|\n|insession_southwest | 105| 0.95| 2| 4| 0| 86| 0|\n|mkprddn | 183| 0.91| 33| 41| 0| 2| 0|\n|mkprdnn | 1358| 0.35| 29| 38| 0| 2| 0|\n|mkfiren | 134| 0.94| 18| 65| 0| 8| 0|\n|epfiren | 126| 0.94| 13| 35| 0| 2| 0|\n|hsfiren | 485| 0.77| 24| 66| 0| 6| 0|\n|hsshwnn | 164| 0.92| 10| 28| 0| 2| 0|\n|akshwnn | 883| 0.58| 15| 33| 0| 2| 0|\n\n\n**Variable type: Date**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:----------|:----------|:----------|--------:|\n|date | 0| 1|2015-01-01 |2021-08-31 |2017-11-05 | 2079|\n\n\n**Variable type: difftime**\n\n|skim_variable | 
n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:----------|:-----------|:--------|--------:|\n|mkopen | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkclose | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|mkemhopen | 0| 1.00|21600 secs |32400 secs |09:00:00 | 5|\n|mkemhclose | 0| 1.00|54000 secs |107940 secs |23:00:00 | 14|\n|mkopenyest | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkcloseyest | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|mkopentom | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkclosetom | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|epopen | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epclose | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|epemhopen | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epemhclose | 0| 1.00|61200 secs |90000 secs |21:00:00 | 12|\n|epopenyest | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epcloseyest | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|epopentom | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epclosetom | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|hsopen | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hsclose | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|hsemhopen | 0| 1.00|21600 secs |36000 secs |09:00:00 | 7|\n|hsemhclose | 0| 1.00|50400 secs |93600 secs |21:00:00 | 18|\n|hsopenyest | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hscloseyest | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|hsopentom | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hsclosetom | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|akopen | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akclose | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|akemhopen | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akemhclose | 0| 1.00|50400 secs |90000 secs |20:00:00 | 17|\n|akopenyest | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akcloseyest | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|akopentom | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akclosetom | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|mkprddt1 | 183| 0.91|39600 secs |61200 secs |15:00:00 | 5|\n|mkprddt2 | 1851| 0.11|50400 secs |73800 secs |15:30:00 | 5|\n|mkprdnt1 | 1358| 0.35|68400 secs |82800 secs |21:00:00 | 11|\n|mkprdnt2 | 1480| 0.29|0 secs |84600 secs |23:00:00 | 8|\n|mkfiret1 | 134| 0.94|66600 secs |80100 secs |21:15:00 | 12|\n|mkfiret2 | 2069| 0.00|85800 secs |85800 secs |23:50:00 | 1|\n|epfiret1 | 126| 0.94|64800 secs |81000 secs |21:00:00 | 6|\n|epfiret2 | 2074| 0.00|85200 secs |85200 secs |23:40:00 | 1|\n|hsfiret1 | 485| 0.77|0 secs |82800 secs |21:00:00 | 17|\n|hsfiret2 | 2045| 0.02|0 secs |81000 secs |21:00:00 | 5|\n|hsshwnt1 | 164| 0.92|65100 secs |79200 secs |20:30:00 | 10|\n|hsshwnt2 | 1369| 0.34|72000 secs |82800 secs |21:30:00 | 11|\n|akshwnt1 | 883| 0.58|65700 secs |76500 secs |20:30:00 | 13|\n|akshwnt2 | 1149| 0.45|70200 secs |81000 secs |21:45:00 | 13|\n\n\n**Variable type: logical**\n\n|skim_variable | n_missing| complete_rate| mean|count |\n|:-------------|---------:|-------------:|----:|:-----|\n|hsprddt1 | 2079| 0| NaN|: |\n|hsprddn | 2079| 0| NaN|: |\n|akprddt1 | 2079| 0| NaN|: |\n|akprddt2 | 2079| 0| NaN|: |\n|akprddn | 2079| 0| NaN|: |\n|akfiren | 2079| 0| NaN|: |\n\n\n**Variable type: numeric**\n\n|skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist |\n|:------------------|---------:|-------------:|-----------:|----------:|-----------:|-----------:|-----------:|-----------:|-----------:|:-----|\n|dayofweek | 0| 1| 4.00| 
2.00| 1.00| 2.00| 4.00| 6.00| 7.00|▇▃▃▃▇ |\n|dayofyear | 0| 1| 181.84| 106.34| 0.00| 89.00| 184.00| 273.00| 365.00|▇▇▇▇▇ |\n|weekofyear | 0| 1| 26.09| 15.20| 0.00| 13.00| 26.00| 39.00| 53.00|▇▇▇▇▇ |\n|monthofyear | 0| 1| 6.51| 3.48| 1.00| 3.00| 7.00| 10.00| 12.00|▇▅▆▅▇ |\n|year | 0| 1| 2017.41| 1.74| 2015.00| 2016.00| 2017.00| 2019.00| 2021.00|▇▃▃▃▃ |\n|holidaypx | 0| 1| 7.85| 6.89| 0.00| 3.00| 6.00| 10.00| 33.00|▇▅▂▁▁ |\n|holidaym | 0| 1| 0.54| 1.35| 0.00| 0.00| 0.00| 0.00| 5.00|▇▁▁▁▁ |\n|holiday | 0| 1| 0.10| 0.30| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwevent | 0| 1| 0.12| 0.32| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwrace | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwmaxtemp | 5| 1| 82.80| 8.53| 51.11| 78.29| 84.50| 89.54| 97.72|▁▁▃▇▆ |\n|wdwmintemp | 6| 1| 65.50| 10.18| 27.48| 59.03| 68.35| 74.10| 81.28|▁▂▃▆▇ |\n|wdwmeantemp | 6| 1| 74.15| 9.06| 39.75| 68.76| 76.37| 81.61| 87.76|▁▂▃▆▇ |\n|mkevent | 0| 1| 0.26| 0.44| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|epevent | 0| 1| 0.58| 0.49| 0.00| 0.00| 1.00| 1.00| 1.00|▆▁▁▁▇ |\n|hsevent | 0| 1| 0.10| 0.30| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akevent | 0| 1| 0.03| 0.18| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkemhmorn | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemhmyest | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemhmtom | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemheve | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkhoursemh | 0| 1| 13.64| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkhoursemhyest | 0| 1| 13.65| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkhoursemhtom | 0| 1| 13.64| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkemheyest | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkemhetom | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmorn | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmyest | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmtom | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemheve | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemheyest | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhetom | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|ephoursemh | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|ephoursemhyest | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|ephoursemhtom | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|hsemhmorn | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemhmyest | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemhmtom | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemheve | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hsemheyest | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hsemhetom | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hshoursemh | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|hshoursemhyest | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|hshoursemhtom | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|akemhmorn | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemhmyest | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemhmtom | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemheve | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akemheyest | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akemhetom | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ 
|\n|akhoursemh | 0| 1| 11.77| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|akhoursemhyest | 0| 1| 11.77| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|akhoursemhtom | 0| 1| 11.76| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|mkhours | 0| 1| 13.26| 2.01| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ |\n|mkhoursyest | 0| 1| 13.26| 2.01| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ |\n|mkhourstom | 0| 1| 13.26| 2.00| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ |\n|ephours | 0| 1| 12.02| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|ephoursyest | 0| 1| 12.03| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|ephourstom | 0| 1| 12.02| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|hshours | 0| 1| 11.92| 1.19| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|hshoursyest | 0| 1| 11.93| 1.20| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|hshourstom | 0| 1| 11.92| 1.19| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|akhours | 0| 1| 11.46| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|akhoursyest | 0| 1| 11.47| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|akhourstom | 0| 1| 11.46| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|weather_wdwhigh | 0| 1| 82.35| 7.86| 70.20| 74.60| 82.80| 90.60| 92.30|▅▃▂▂▇ |\n|weather_wdwlow | 0| 1| 64.10| 9.26| 49.20| 55.80| 63.60| 74.00| 76.10|▅▅▃▂▇ |\n|weather_wdwprecip | 0| 1| 0.15| 0.08| 0.03| 0.08| 0.12| 0.23| 0.35|▇▆▃▅▁ |\n|capacitylost_mk | 0| 1| 422110.61| 36458.81| 352865.00| 385812.00| 433857.00| 456055.00| 473553.00|▅▂▁▇▆ |\n|capacitylost_ep | 0| 1| 366897.04| 24019.96| 325168.00| 338367.00| 380763.00| 381963.00| 394662.00|▅▁▁▂▇ |\n|capacitylost_hs | 0| 1| 287485.76| 33198.89| 203780.00| 279573.00| 301871.00| 311870.00| 321869.00|▂▁▁▁▇ |\n|capacitylost_ak | 0| 1| 228193.83| 14967.82| 210779.00| 220778.00| 223178.00| 232777.00| 273873.00|▇▅▁▁▂ |\n|capacitylostwgt_mk | 0| 1| 41374025.71| 3621097.96| 34661635.00| 37641738.00| 42585643.00| 44577245.00| 46545047.00|▅▂▂▇▆ |\n|capacitylostwgt_ep | 0| 1| 35344939.61| 2201138.89| 31692832.00| 32692333.00| 36666337.00| 36668737.00| 37678138.00|▅▁▁▁▇ |\n|capacitylostwgt_hs | 0| 1| 27528647.53| 3049291.37| 19812520.00| 26772627.00| 28771129.00| 29761030.00| 30750931.00|▂▁▁▁▇ |\n|capacitylostwgt_ak | 0| 1| 22386447.82| 1398263.45| 20790321.00| 21780222.00| 21799422.00| 22780123.00| 26739827.00|▇▅▁▁▂ |\n|mkprdday | 0| 1| 1.07| 0.59| 0.00| 1.00| 1.00| 1.00| 3.00|▁▇▁▁▁ |\n|mkprdngt | 0| 1| 0.64| 0.90| 0.00| 0.00| 0.00| 2.00| 3.00|▇▁▁▃▁ |\n|mkfirewk | 0| 1| 0.94| 0.26| 0.00| 1.00| 1.00| 1.00| 2.00|▁▁▇▁▁ |\n|epfirewk | 0| 1| 0.94| 0.25| 0.00| 1.00| 1.00| 1.00| 3.00|▁▇▁▁▁ |\n|hsprdday | 0| 1| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|▁▁▇▁▁ |\n|hsfirewk | 0| 1| 0.78| 0.45| 0.00| 1.00| 1.00| 1.00| 2.00|▂▁▇▁▁ |\n|hsshwngt | 0| 1| 1.28| 0.62| 0.00| 1.00| 1.00| 2.00| 3.00|▁▇▁▅▁ |\n|hsfirewks | 0| 1| 1.00| 0.00| 1.00| 1.00| 1.00| 1.00| 1.00|▁▁▇▁▁ |\n|akprdday | 0| 1| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|▁▁▇▁▁ |\n|akshwngt | 0| 1| 1.04| 0.97| 0.00| 0.00| 1.00| 2.00| 3.00|▇▂▁▇▁ |\n\n\n:::\n:::\n\n\nThis dataset contains many more variables than the one we worked with previously.\nFor this analysis, we are going to select `date` (the observation date), `wdw_ticket_season` (the ticket season for the observation), `wdwmaxtemp` (the maximum temperature), `mkclose` (the time Magic Kingdom closed), `mkemhmorn` (whether Magic Kingdom had an \"Extra Magic Hour\" in the morning).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparks_metadata_clean <- parks_metadata_raw |>\n ## based on our analysis plan, we will select the following variables\n 
select(date, wdw_ticket_season, wdwmaxtemp, mkclose, mkemhmorn) |>\n ## based on eligibility criteria, limit to 2018\n filter(year(date) == 2018) |>\n ## rename variables\n rename(\n park_date = date,\n park_ticket_season = wdw_ticket_season,\n park_temperature_high = wdwmaxtemp,\n park_close = mkclose,\n park_extra_magic_morning = mkemhmorn\n )\n```\n:::\n\n\n## Working with multiple data sources\n\nWhen answering causal questions, we frequently find ourselves merging data from multiple sources to ensure that all of the necessary factors are accounted for.\nThe way we can combine datasets is via *joins* -- joining two or more datasets based on a set or sets of common variables.\nWe can think of three main types of *joins*: left, right, and inner.\nA *left* join combines data from two datasets based on a common variable and includes all records from the *left* dataset along with matching records from the *right* dataset (in `{dplyr}`, `left_join()`).\nA *right* join includes all records from the *right* dataset and their corresponding matches from the *left* dataset (in `{dplyr}`, `right_join()`).\nAn *inner* join, on the other hand, includes only the records with matching values in *both* datasets, excluding non-matching records (in `{dplyr}`, `inner_join()`).\nFor this analysis, we need to use a left join to pull in the cleaned parks metadata.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n left_join(parks_metadata_clean, by = \"park_date\")\n```\n:::
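\n\n\nTo make the three join types concrete, here is a minimal sketch on made-up toy data (the tibbles `df_left` and `df_right` are hypothetical, invented only for illustration):\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\n## two tiny tables that share the key column `id`\ndf_left <- tibble(id = 1:3, x = c(\"a\", \"b\", \"c\"))\ndf_right <- tibble(id = 2:4, y = c(\"d\", \"e\", \"f\"))\n\nleft_join(df_left, df_right, by = \"id\") # ids 1-3; `y` is NA for id 1\nright_join(df_left, df_right, by = \"id\") # ids 2-4; `x` is NA for id 4\ninner_join(df_left, df_right, by = \"id\") # ids 2-3 only\n```\n:::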
\n\n\n## Recognizing missing data\n\nIt is important to recognize whether we have any missing data in our variables.\nThe `{visdat}` package is great for getting a quick sense of whether we have any missing data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(visdat)\nvis_miss(seven_dwarfs_train_2018)\n```\n\n::: {.cell-output-display}\n![](07-prep-data_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\nIt looks like we only have a few observations (2%) missing our outcome of interest.\nThis is not too bad.\nFor this first analysis, we will ignore the missing values.\nWe can explicitly drop them using the `drop_na()` function from `{dplyr}`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n drop_na()\n```\n:::\n\n\n## Exploring and visualizing data and assumptions\n\nThe *positivity* assumption requires that within each level and combination of the study variables used to achieve exchangeability, there are exposed and unexposed subjects (@sec-assump).\nWe can explore this by visualizing the distribution of each of our proposed confounders stratified by the exposure.\n\n### Single variable checks for positivity violations\n\n@fig-close shows the distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning.\nThere is not clear evidence of a lack of positivity here, as both exposure levels span the majority of the covariate space.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(\n seven_dwarfs_train_2018,\n aes(\n x = factor(park_close),\n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning)\n )\n) +\n geom_bar(position = position_dodge2(width = 0.9, preserve = \"single\")) +\n labs(\n fill = \"Extra Magic Morning\",\n x = \"Time of Park Close\"\n )\n```\n\n::: {.cell-output-display}\n![Distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning](07-prep-data_files/figure-html/fig-close-1.png){#fig-close width=672}\n:::\n:::\n\n\nTo examine the distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning, we can use a mirrored histogram.\nWe'll use the {halfmoon} package's `geom_mirror_histogram()` to create one.\nExamining @fig-temp, it does look like there are very few days in the exposed group with maximum temperatures less than 60 degrees.\nWhile this is not necessarily a positivity violation, it is worth keeping an eye on, particularly because the dataset is not very large; this sparsity could make it difficult to estimate an average exposure effect across this whole covariate space.\nIf we found this to be particularly difficult, we could consider changing our causal question to instead restrict the analysis to warmer days.\nThis, of course, would also restrict which days we could draw conclusions about in the future.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(halfmoon)\nggplot(\n seven_dwarfs_train_2018,\n aes(\n x = park_temperature_high,\n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning)\n )\n) +\n geom_mirror_histogram(bins = 20) +\n labs(\n fill = \"Extra Magic Morning\",\n x = \"Historic maximum temperature (degrees F)\"\n )\n```\n\n::: {.cell-output-display}\n![Distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning](07-prep-data_files/figure-html/fig-temp-1.png){#fig-temp width=672}\n:::\n:::\n\n\nFinally, let's look at the distribution of ticket season by whether there were extra magic hours in the morning.\nExamining @fig-ticket, we do not see any positivity violations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(\n seven_dwarfs_train_2018,\n aes(\n x = park_ticket_season,\n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning)\n )\n) +\n geom_bar(position = \"dodge\") +\n labs(\n fill = \"Extra Magic Morning\",\n x = \"Magic Kingdom Ticket Season\"\n )\n```\n\n::: {.cell-output-display}\n![Distribution of ticket season at Magic Kingdom by whether the date had extra magic hours in the morning](07-prep-data_files/figure-html/fig-ticket-1.png){#fig-ticket width=672}\n:::\n:::\n\n\n### Multiple variable checks for positivity violations\n\nWe have confirmed that for each of the three confounders, we do not see strong evidence of positivity violations.\nBecause we have so few variables here, we can examine this a bit more closely.\nLet's start by discretizing the `park_temperature_high` variable a bit (we will cut it into tertiles).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 |>\n ## cut park_temperature_high into tertiles\n mutate(park_temperature_high_bin = cut(park_temperature_high, breaks = 3)) |>\n ## bin park close time\n mutate(park_close_bin = case_when(\n hour(park_close) < 19 & hour(park_close) > 12 ~ \"(1) early\",\n hour(park_close) >= 19 & hour(park_close) < 24 ~ \"(2) standard\",\n hour(park_close) >= 24 | hour(park_close) < 12 ~ \"(3) late\"\n )) |>\n group_by(park_close_bin, park_temperature_high_bin, park_ticket_season) |>\n ## calculate the proportion exposed in each bin\n summarise(prop_exposed = mean(park_extra_magic_morning), .groups = \"drop\") |>\n ggplot(aes(x = park_close_bin, y = park_temperature_high_bin, fill = prop_exposed)) +\n geom_tile() +\n scale_fill_gradient2(midpoint = 0.5) +\n facet_wrap(~park_ticket_season) +\n labs(\n y = \"Historic Maximum Temperature (F)\",\n x = \"Magic Kingdom Park 
Close Time\",\n fill = \"Proportion of Days Exposed\"\n )\n```\n\n::: {.cell-output-display}\n![Check for positivity violations across three confounders: historic high temperature, park close time, and ticket season.](07-prep-data_files/figure-html/fig-positivity-1.png){#fig-positivity width=864}\n:::\n:::\n\n\nInteresting!\n@fig-positivity shows an interesting potential violation.\nIt looks like 100% of days with lower temperatures (historic highs between 51 and 65 degrees) that are in the peak ticket season have extra magic hours in the morning.\nThis actually makes sense if we think a bit about this data set.\nThe only days with cold temperatures in Florida that would also be considered a \"peak\" time to visit Walt Disney World would be over Christmas / New Years.\nDuring this time there historically were always extra magic hours.\n\nWe are going to proceed with the analysis, but we will keep these observations in mind.\n\n## Presenting descriptive statistics\n\nLet's examine a table of the variables of interest in this data frame.\nTo do so, we are going to use the `tbl_summary()` function from the `{gtsummary}` package.\n(We'll also use the `{labelled}` package to clean up the variable names for the table.)\n\n\n::: {#tbl-unweighted-gtsummary .cell tbl-cap='A descriptive table of Extra Magic Morning in the touringplans dataset. This table shows the distributions of these variables in the observed population.'}\n\n```{.r .cell-code}\nlibrary(gtsummary)\nlibrary(labelled)\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n set_variable_labels(\n park_ticket_season = \"Ticket Season\",\n park_close = \"Close Time\",\n park_temperature_high = \"Historic High Temperature\"\n )\n\ntbl_summary(\n seven_dwarfs_train_2018,\n by = park_extra_magic_morning,\n include = c(park_ticket_season, park_close, park_temperature_high)\n) |>\n # add an overall column to the table\n add_overall(last = TRUE)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n|Characteristic | 0, N = 294^1^ | 1, N = 60^1^ | Overall, N = 354^1^ |\n|:--------------------------|:------------|:-----------|:------------------|\n|Ticket Season | | | |\n|    peak | 60 (20%) | 18 (30%) | 78 (22%) |\n|    regular | 158 (54%) | 35 (58%) | 193 (55%) |\n|    value | 76 (26%) | 7 (12%) | 83 (23%) |\n|Close Time | | | |\n|    16:30:00 | 1 (0.3%) | 0 (0%) | 1 (0.3%) |\n|    18:00:00 | 37 (13%) | 18 (30%) | 55 (16%) |\n|    20:00:00 | 18 (6.1%) | 2 (3.3%) | 20 (5.6%) |\n|    21:00:00 | 28 (9.5%) | 0 (0%) | 28 (7.9%) |\n|    22:00:00 | 91 (31%) | 11 (18%) | 102 (29%) |\n|    23:00:00 | 78 (27%) | 11 (18%) | 89 (25%) |\n|    24:00:00 | 40 (14%) | 17 (28%) | 57 (16%) |\n|    25:00:00 | 1 (0.3%) | 1 (1.7%) | 2 (0.6%) |\n|Historic High Temperature | 84 (78, 89) | 83 (76, 87) | 84 (78, 89) |\n\n^1^ n (%); Median (IQR)\n
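\nAs a quick cross-check on the counts in @tbl-unweighted-gtsummary, we can tabulate the exposure directly (a minimal sketch using `{dplyr}`, which we already have loaded):\n\n::: {.cell}\n\n```{.r .cell-code}\n## counts by exposure and ticket season; totals should match the table above\nseven_dwarfs_train_2018 |>\n count(park_extra_magic_morning, park_ticket_season)\n```\n:::\n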
\n```\n\n:::\n:::\n", "supporting": [ "07-prep-data_files" ], diff --git a/_freeze/chapters/07-prep-data/figure-html/fig-dag-magic-1.png b/_freeze/chapters/07-prep-data/figure-html/fig-dag-magic-1.png index 77ff716f..341b2c64 100644 Binary files a/_freeze/chapters/07-prep-data/figure-html/fig-dag-magic-1.png and b/_freeze/chapters/07-prep-data/figure-html/fig-dag-magic-1.png differ diff --git a/_freeze/chapters/13-continuous-exposures/execute-results/html.json b/_freeze/chapters/13-continuous-exposures/execute-results/html.json index 3adce0d9..cbd50980 100644 --- a/_freeze/chapters/13-continuous-exposures/execute-results/html.json +++ b/_freeze/chapters/13-continuous-exposures/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "85b7f8099c100dc21f366009a80f48a2", + "hash": "81e37f1980eaa7f3c475b2ae087df6df", "result": { - "markdown": "# Continuous exposures\n\n\n\n\n\n## Calculating propensity scores for continuous exposures\n\nPropensity scores generalize to many other types of exposures, including continuous exposures.\nAt its heart, the workflow is the same: we fit a model where the exposure is the outcome and then use that model to weight a second outcome model.\nFor continuous exposures, linear regression is the simplest way to create propensities.\nInstead of probabilities, we use the cumulative density function.\nThen, we use this density to weight the outcome model.\n\nLet's take a look at an example.\nIn the `touringplans` data set, we have information about the posted waiting times for rides.\nWe also have a limited amount of data on the observed actual times.\nThe question we will consider is this: Do posted wait times for the Seven Dwarfs Mine Train at 8 am affect actual wait times at 9 am?\nHere's our DAG:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Proposed DAG for the relationship between posted wait in the morning at a particular park and the average wait time between 5 pm and 6 pm.](13-continuous-exposures_files/figure-html/fig-dag-avg-wait-1.png){#fig-dag-avg-wait width=672}\n:::\n:::\n\n\nIn @fig-dag-avg-wait, we're assuming that our primary confounders are when the park closes, historic high temperatures, whether or not the ride has extra magic morning hours, and the ticket season.\nThis is the only minimal adjustment set in the DAG, as well.\nThe confounders precede the exposure and outcome, and (by definition) the exposure precedes the outcome.\nThe average posted wait time is, in theory, a manipulable exposure because the park could post a time different from what they expect.\nThe adjustment set\n\nThe model is similar to the binary exposure case, but we need to use linear regression, as the posted time is a continuous variable.\nSince we're not using probabilities, we'll calculate denominators for our weights from a normal density.\nWe then calculate the denominator using the `dnorm()` function, which calculates the normal density for the `exposure`, using `.fitted` as the mean and `mean(.sigma)` as the SD.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm(\n exposure ~ confounder_1 + confounder_2,\n data = df\n) |>\n augment(data = df) |>\n mutate(\n denominator = dnorm(exposure, .fitted, mean(.sigma, na.rm = TRUE))\n )\n```\n:::\n\n\n## Diagnostics and stabilization\n\nContinuous exposure weights, however, are very sensitive to modeling choices.\nOne problem, in particular, is the existence of extreme weights, an issue that can also affect other types of exposures.\nWhen some observations have extreme weights, the propensities are *destabilized,* which results in 
wider confidence intervals.\nWe can stabilize them using the marginal distribution of the exposure.\nA common way to calculate the marginal distribution for propensity scores is to use a regression model with no predictors.\n\n::: callout-caution\nExtreme weights destabilize estimates, resulting in wider confidence intervals.\nExtreme weights can be an issue for any time of weight (including those for binary and other types of exposures) that is not bounded.\nBounded weights like the ATO (which are bounded to 0 and 1) do not have this problem, however---one of their many benefits.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# for continuous exposures\nlm(\n exposure ~ 1,\n data = df\n) |>\n augment(data = df) |>\n transmute(\n numerator = dnorm(exposure, .fitted, mean(.sigma, na.rm = TRUE))\n )\n\n# for binary exposures\nglm(\n exposure ~ 1,\n data = df,\n family = binomial()\n) |>\n augment(type.predict = \"response\", data = df) |>\n select(numerator = .fitted)\n```\n:::\n\n\nThen, rather than inverting them, we calculate the weights as `numerator / denominator`.\nLet's try it out on our posted wait times example.\nFirst, let's wrangle our data to address our question: do posted wait times at 8 affect actual weight times at 9?\nWe'll join the baseline data (all covariates and posted wait time at 8) with the outcome (average actual time).\nWe also have a lot of missingness for `wait_minutes_actual_avg`, so we'll drop unobserved values for now.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(touringplans)\neight <- seven_dwarfs_train_2018 |>\n filter(wait_hour == 8) |>\n select(-wait_minutes_actual_avg)\n\nnine <- seven_dwarfs_train_2018 |>\n filter(wait_hour == 9) |>\n select(park_date, wait_minutes_actual_avg)\n\nwait_times <- eight |>\n left_join(nine, by = \"park_date\") |>\n drop_na(wait_minutes_actual_avg)\n```\n:::\n\n\nFirst, let's calculate our denominator model.\nWe'll fit a model using `lm()` for `wait_minutes_posted_avg` with our covariates, then use the fitted predictions of `wait_minutes_posted_avg` (`.fitted`) to calculate the density using `dnorm()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\ndenominator_model <- lm(\n wait_minutes_posted_avg ~\n park_close + park_extra_magic_morning + park_temperature_high + park_ticket_season,\n data = wait_times\n)\n\ndenominators <- denominator_model |>\n augment(data = wait_times) |>\n mutate(\n denominator = dnorm(\n wait_minutes_posted_avg,\n .fitted,\n mean(.sigma, na.rm = TRUE)\n )\n ) |>\n select(park_date, denominator, .fitted)\n```\n:::\n\n\nWhen we only use the inverted values of `denominator`, we end up with several extreme weights:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndenominators |>\n mutate(wts = 1 / denominator) |>\n ggplot(aes(wts)) +\n geom_histogram(fill = \"#E69F00\", color = \"white\", bins = 50) +\n scale_x_log10(name = \"weights\")\n```\n\n::: {.cell-output-display}\n![A histogram of the inverse probability weights for posted waiting time. 
Weights for continuous exposures are prone to extreme values, which can unstabilize estimates and variance.](13-continuous-exposures_files/figure-html/fig-hist-sd-unstable-1.png){#fig-hist-sd-unstable width=672}\n:::\n:::\n\n\nIn @fig-hist-sd-unstable, we see several weights over 100 and one over 10,000; these extreme weights will put undue stress on specific points, complicating the results we will estimate.\n\nLet's now fit the marginal density to use for stabilized weights:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnumerator_model <- lm(\n wait_minutes_posted_avg ~ 1,\n data = wait_times\n)\n\nnumerators <- numerator_model |>\n augment(data = wait_times) |>\n mutate(\n numerator = dnorm(\n wait_minutes_posted_avg,\n .fitted,\n mean(.sigma, na.rm = TRUE)\n )\n ) |>\n select(park_date, numerator)\n```\n:::\n\n\nWe also need to join the fitted values back to our original data set by date, then calculate the stabilized weights (`swts`) using `numerator / denominator`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwait_times_wts <- wait_times |>\n left_join(numerators, by = \"park_date\") |>\n left_join(denominators, by = \"park_date\") |>\n mutate(swts = numerator / denominator)\n```\n:::\n\n\nThe stabilized weights are much less extreme.\nStabilized weights should have a mean close to 1 (in this example, it is `round(mean(wait_times_wts$swts), digits = 2)`); when that is the case, then the pseudo-population (that is, the equivalent number of observations after weighting) is equal to the original sample size.\nIf the mean is far from 1, we may have issues with model misspecification or positivity violations [@hernan2021].\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(wait_times_wts, aes(swts)) +\n geom_histogram(fill = \"#E69F00\", color = \"white\", bins = 50) +\n scale_x_log10(name = \"weights\")\n```\n\n::: {.cell-output-display}\n![A histogram of the stabilized inverse probability weights for posted waiting time. These weights are much more reasonable and will allow the outcome model to behave better.](13-continuous-exposures_files/figure-html/fig-hist-sd-stable-1.png){#fig-hist-sd-stable width=672}\n:::\n:::\n\n\nWhen we compare the exposure---average posted wait times---to the standardized weights, we still have one exceptionally high weight.\nIs this a problem, or is this a valid data point?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(wait_times_wts, aes(wait_minutes_posted_avg, swts)) +\n geom_point(size = 3, color = \"grey80\", alpha = 0.7) +\n geom_point(\n data = function(x) filter(x, swts > 10),\n color = \"firebrick\",\n size = 3\n ) +\n geom_text(\n data = function(x) filter(x, swts > 10),\n aes(label = park_date),\n size = 5,\n hjust = 0,\n nudge_x = -15.5,\n color = \"firebrick\"\n ) +\n scale_y_log10() +\n labs(x = \"Average Posted Wait\", y = \"Stabilized Weights\")\n```\n\n::: {.cell-output-display}\n![A scatter of the stabilized inverse probability weights for posted waiting time vs. posted waiting times. Days with more values of `wait_minutes_posted_avg` farther from the mean appear to be downweighted, with a few exceptions. 
The most unusual weight is for June 23, 2018.](13-continuous-exposures_files/figure-html/fig-stabilized-wts-scatter-1.png){#fig-stabilized-wts-scatter width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nwait_times_wts |>\n filter(swts > 10) |>\n select(\n park_date,\n wait_minutes_posted_avg,\n .fitted,\n park_close,\n park_extra_magic_morning,\n park_temperature_high,\n park_ticket_season\n ) |>\n knitr::kable()\n```\n\n::: {.cell-output-display}\n\n\n|park_date | wait_minutes_posted_avg| .fitted|park_close | park_extra_magic_morning| park_temperature_high|park_ticket_season |\n|:----------|-----------------------:|-------:|:----------|------------------------:|---------------------:|:------------------|\n|2018-06-23 | 81| 28.1|24:00:00 | 0| 91.36|regular |\n\n\n:::\n:::\n\n\nOur model predicted a much lower posted wait time than observed, so this date was upweighted.\nWe don't know why the posted time was so high (the actual time was much lower), but we did find an artist rendering from that date of [Pluto digging for Seven Dwarfs Mine Train treasure](https://disneyparks.disney.go.com/blog/2018/06/disney-doodle-pluto-sniffs-out-fun-at-seven-dwarfs-mine-train/).\n\n## Fitting the outcome model for continuous exposures\n",
+    "markdown": "# Continuous exposures {#sec-continuous-exposures}\n\n\n\n\n\n## Calculating propensity scores for continuous exposures\n\nPropensity scores generalize to many other types of exposures, including continuous exposures.\nAt its heart, the workflow is the same: we fit a model where the exposure is the outcome and then use that model to weight a second outcome model.\nFor continuous exposures, linear regression is the simplest way to create propensities.\nInstead of probabilities, we use values of the probability density function.\nThen, we use this density to weight the outcome model.\n\nLet's take a look at an example.\nIn the `touringplans` data set, we have information about the posted waiting times for rides.\nWe also have a limited amount of data on the observed actual times.\nThe question we will consider is this: Do posted wait times for the Seven Dwarfs Mine Train at 8 am affect actual wait times at 9 am?\nHere's our DAG:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Proposed DAG for the relationship between posted wait in the morning at a particular park and the average actual wait time between 9 am and 10 am.](13-continuous-exposures_files/figure-html/fig-dag-avg-wait-1.png){#fig-dag-avg-wait width=672}\n:::\n:::\n\n\nIn @fig-dag-avg-wait, we're assuming that our primary confounders are when the park closes, historic high temperatures, whether or not the ride has extra magic morning hours, and the ticket season.\nThis is the only minimal adjustment set in the DAG, as well.\nThe confounders precede the exposure and outcome, and (by definition) the exposure precedes the outcome.\nThe average posted wait time is, in theory, a manipulable exposure because the park could post a time different from what they expect.\n\nThe model is similar to the binary exposure case, but we need to use linear regression, as the posted time is a continuous variable.\nSince we're not using probabilities, we'll calculate denominators for our weights from a normal density.\nWe then calculate the denominator using the `dnorm()` function, which calculates the normal density for the `exposure`, using `.fitted` as the mean and `mean(.sigma)` as the SD.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm(\n exposure ~ confounder_1 + confounder_2,\n data = df\n) |>\n augment(data = df) |>\n mutate(\n denominator = dnorm(exposure, .fitted, mean(.sigma, na.rm = TRUE))\n )\n```\n:::
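\n\n\nTo sketch where this density ends up: inverted, it becomes a weight for the second-stage outcome model, just as in the binary case. Below is a minimal, hypothetical sketch reusing the placeholder names above (`df`, `exposure`, and the confounders); `outcome` is an invented stand-in column, not a variable from our data:\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\nlibrary(dplyr)\n\n## fit the exposure model and compute the normal density (the denominator)\nwith_weights <- lm(\n exposure ~ confounder_1 + confounder_2,\n data = df\n) |>\n augment(data = df) |>\n mutate(\n denominator = dnorm(exposure, .fitted, mean(.sigma, na.rm = TRUE)),\n ## invert the density to form the (unstabilized) weight\n wt = 1 / denominator\n )\n\n## use the weights in the second-stage outcome model\nlm(outcome ~ exposure, data = with_weights, weights = wt)\n```\n:::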
\n\n## Diagnostics and stabilization\n\nContinuous exposure weights, however, are very sensitive to modeling choices.\nOne problem, in particular, is the existence of extreme weights, an issue that can also affect other types of exposures.\nWhen some observations have extreme weights, the estimates are *destabilized*, which results in wider confidence intervals.\nWe can stabilize them using the marginal distribution of the exposure.\nA common way to calculate the marginal distribution for propensity scores is to use a regression model with no predictors.\n\n::: callout-caution\nExtreme weights destabilize estimates, resulting in wider confidence intervals.\nExtreme weights can be an issue for any type of weight (including those for binary and other types of exposures) that is not bounded.\nBounded weights like the ATO (which are bounded between 0 and 1) do not have this problem, however---one of their many benefits.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# for continuous exposures\nlm(\n exposure ~ 1,\n data = df\n) |>\n augment(data = df) |>\n transmute(\n numerator = dnorm(exposure, .fitted, mean(.sigma, na.rm = TRUE))\n )\n\n# for binary exposures\nglm(\n exposure ~ 1,\n data = df,\n family = binomial()\n) |>\n augment(type.predict = \"response\", data = df) |>\n select(numerator = .fitted)\n```\n:::\n\n\nThen, rather than inverting them, we calculate the weights as `numerator / denominator`.\nLet's try it out on our posted wait times example.\nFirst, let's wrangle our data to address our question: do posted wait times at 8 am affect actual wait times at 9 am?\nWe'll join the baseline data (all covariates and posted wait time at 8 am) with the outcome (average actual time).\nWe also have a lot of missingness for `wait_minutes_actual_avg`, so we'll drop unobserved values for now.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(touringplans)\neight <- seven_dwarfs_train_2018 |>\n filter(wait_hour == 8) |>\n select(-wait_minutes_actual_avg)\n\nnine <- seven_dwarfs_train_2018 |>\n filter(wait_hour == 9) |>\n select(park_date, wait_minutes_actual_avg)\n\nwait_times <- eight |>\n left_join(nine, by = \"park_date\") |>\n drop_na(wait_minutes_actual_avg)\n```\n:::\n\n\nNext, let's fit our denominator model.\nWe'll fit a model using `lm()` for `wait_minutes_posted_avg` with our covariates, then use the fitted predictions of `wait_minutes_posted_avg` (`.fitted`) to calculate the density using `dnorm()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\ndenominator_model <- lm(\n wait_minutes_posted_avg ~\n park_close + park_extra_magic_morning + park_temperature_high + park_ticket_season,\n data = wait_times\n)\n\ndenominators <- denominator_model |>\n augment(data = wait_times) |>\n mutate(\n denominator = dnorm(\n wait_minutes_posted_avg,\n .fitted,\n mean(.sigma, na.rm = TRUE)\n )\n ) |>\n select(park_date, denominator, .fitted)\n```\n:::\n\n\nWhen we only use the inverted values of `denominator`, we end up with several extreme weights:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndenominators |>\n mutate(wts = 1 / denominator) |>\n ggplot(aes(wts)) +\n geom_histogram(fill = \"#E69F00\", color = \"white\", bins = 50) +\n scale_x_log10(name = \"weights\")\n```\n\n::: {.cell-output-display}\n![A histogram of the inverse probability weights for posted waiting time. 
Weights for continuous exposures are prone to extreme values, which can destabilize estimates and inflate their variance.](13-continuous-exposures_files/figure-html/fig-hist-sd-unstable-1.png){#fig-hist-sd-unstable width=672}\n:::\n:::\n\n\nIn @fig-hist-sd-unstable, we see several weights over 100 and one over 10,000; these extreme weights give a few observations undue influence, complicating the estimates we will calculate.\n\nLet's now fit the marginal density to use for stabilized weights:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnumerator_model <- lm(\n wait_minutes_posted_avg ~ 1,\n data = wait_times\n)\n\nnumerators <- numerator_model |>\n augment(data = wait_times) |>\n mutate(\n numerator = dnorm(\n wait_minutes_posted_avg,\n .fitted,\n mean(.sigma, na.rm = TRUE)\n )\n ) |>\n select(park_date, numerator)\n```\n:::\n\n\nWe also need to join the fitted values back to our original data set by date, then calculate the stabilized weights (`swts`) using `numerator / denominator`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwait_times_wts <- wait_times |>\n left_join(numerators, by = \"park_date\") |>\n left_join(denominators, by = \"park_date\") |>\n mutate(swts = numerator / denominator)\n```\n:::\n\n\nThe stabilized weights are much less extreme.\nStabilized weights should have a mean close to 1; when that is the case, the pseudo-population (that is, the equivalent number of observations after weighting) is equal to the original sample size.\nIf the mean is far from 1, we may have issues with model misspecification or positivity violations [@hernan2021].\n
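\nWe can check ours directly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# a quick numeric check of the stabilized weights: the mean should be\n# close to 1, and the range far less extreme than the unstabilized weights\nwait_times_wts |>\n summarize(\n mean_swts = mean(swts),\n min_swts = min(swts),\n max_swts = max(swts)\n )\n```\n:::\n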
\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(wait_times_wts, aes(swts)) +\n geom_histogram(fill = \"#E69F00\", color = \"white\", bins = 50) +\n scale_x_log10(name = \"weights\")\n```\n\n::: {.cell-output-display}\n![A histogram of the stabilized inverse probability weights for posted waiting time. These weights are much more reasonable and will allow the outcome model to behave better.](13-continuous-exposures_files/figure-html/fig-hist-sd-stable-1.png){#fig-hist-sd-stable width=672}\n:::\n:::\n\n\nWhen we compare the exposure---average posted wait times---to the stabilized weights, we still have one exceptionally high weight.\nIs this a problem, or is this a valid data point?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(wait_times_wts, aes(wait_minutes_posted_avg, swts)) +\n geom_point(size = 3, color = \"grey80\", alpha = 0.7) +\n geom_point(\n data = function(x) filter(x, swts > 10),\n color = \"firebrick\",\n size = 3\n ) +\n geom_text(\n data = function(x) filter(x, swts > 10),\n aes(label = park_date),\n size = 5,\n hjust = 0,\n nudge_x = -15.5,\n color = \"firebrick\"\n ) +\n scale_y_log10() +\n labs(x = \"Average Posted Wait\", y = \"Stabilized Weights\")\n```\n\n::: {.cell-output-display}\n![A scatter of the stabilized inverse probability weights for posted waiting time vs. posted waiting times. Days with values of `wait_minutes_posted_avg` farther from the mean appear to be downweighted, with a few exceptions. The most unusual weight is for June 23, 2018.](13-continuous-exposures_files/figure-html/fig-stabilized-wts-scatter-1.png){#fig-stabilized-wts-scatter width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nwait_times_wts |>\n filter(swts > 10) |>\n select(\n park_date,\n wait_minutes_posted_avg,\n .fitted,\n park_close,\n park_extra_magic_morning,\n park_temperature_high,\n park_ticket_season\n ) |>\n knitr::kable()\n```\n\n::: {.cell-output-display}\n\n\n|park_date | wait_minutes_posted_avg| .fitted|park_close | park_extra_magic_morning| park_temperature_high|park_ticket_season |\n|:----------|-----------------------:|-------:|:----------|------------------------:|---------------------:|:------------------|\n|2018-06-23 | 81| 28.1|24:00:00 | 0| 91.36|regular |\n\n\n:::\n:::\n\n\nOur model predicted a much lower posted wait time than observed, so this date was upweighted.\nWe don't know why the posted time was so high (the actual time was much lower), but we did find an artist rendering from that date of [Pluto digging for Seven Dwarfs Mine Train treasure](https://disneyparks.disney.go.com/blog/2018/06/disney-doodle-pluto-sniffs-out-fun-at-seven-dwarfs-mine-train/).\n\n## Fitting the outcome model for continuous exposures\n", "supporting": [ "13-continuous-exposures_files" ], diff --git a/_freeze/chapters/13-continuous-exposures/figure-html/fig-dag-avg-wait-1.png b/_freeze/chapters/13-continuous-exposures/figure-html/fig-dag-avg-wait-1.png index 4a4f8865..f3978a4d 100644 Binary files a/_freeze/chapters/13-continuous-exposures/figure-html/fig-dag-avg-wait-1.png and b/_freeze/chapters/13-continuous-exposures/figure-html/fig-dag-avg-wait-1.png differ diff --git a/_freeze/chapters/15-g-comp/execute-results/html.json b/_freeze/chapters/15-g-comp/execute-results/html.json index 813d5854..63c55d07 100644 --- a/_freeze/chapters/15-g-comp/execute-results/html.json +++ b/_freeze/chapters/15-g-comp/execute-results/html.json @@ -1,8 +1,10 @@ { - "hash": "598568e77bf613e84a9b4fc54d8c853e", + "hash": "876c105eda910e33da19826666514f99", "result": { - "markdown": "# G-computation {#sec-g-comp}\n\n\n\n\n\n## The Parametric G-Formula\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrnorm(5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.1111789 0.2518857 0.0004088 -0.5343476\n[5] -1.0941076\n```\n\n\n:::\n:::\n\n\n## Calculating estimates with G-Computation\n\n## The Natural Course\n", - "supporting": [], + "markdown": "# G-computation {#sec-g-comp}\n\n\n\n\n\n## The Parametric G-Formula\n\nLet's pause to recap a typical goal of the causal analyses we've seen in this book so far: to estimate what would happen if *everyone* in the study were exposed versus what would happen if *no one* were exposed.\nTo do this, we've used weighting techniques that create confounder-balanced pseudopopulations which, in turn, give rise to unbiased causal effect estimates in marginal outcome models.\nOne alternative approach to weighting is called the parametric G-formula, which is generally executed through the following four steps:\n\n1. Draw the appropriate time-ordered DAG (as described in @sec-dags).\n\n2. For each point in time after baseline, decide on a parametric model that predicts each variable's value based on previously measured variables in the DAG. \nThese are often linear models for continuous variables or logistic regressions for binary variables.\n\n3. 
Starting with a sample from the observed distribution of data at baseline, generate values for all subsequent variables according to the models in step 2 (i.e. conduct a *Monte Carlo simulation*). Do this with one key modification: for each exposure regime you are interested in comparing (e.g. everyone exposed versus everyone unexposed), assign the exposure variables accordingly (that is, don't let the simulation assign values for exposure variables).\n\n4. Compute the causal contrast of interest based on the simulated outcome in each exposure group.\n\n::: callout-tip\n## Monte Carlo simulations\n\nMonte Carlo simulations are computational approaches that generate a sample of outcomes for random processes.\nOne example would be to calculate the probability of rolling \"snake eyes\" (two ones) on a single roll of two six-sided dice. \nWe could certainly calculate this probability mathematically ($\frac{1}{6}\times\frac{1}{6}=\frac{1}{36}\approx 2.8\%$), though it can be just as quick to write a Monte Carlo simulation of the process (1,000,000 rolls shown below).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn <- 1000000\ntibble(\n roll_1 = sample(1:6, n, replace = TRUE),\n roll_2 = sample(1:6, n, replace = TRUE),\n) |> \n # proportion of rolls where the two dice total 2 (both show a one)\n reframe(roll_1 + roll_2 == 2) |> \n pull() |> \n sum() / n\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.02787\n```\n\n\n:::\n:::\n\nMonte Carlo simulations are extremely useful for estimating outcomes of complex processes for which closed-form mathematical solutions are not easy to determine. \nIndeed, that's why Monte Carlo simulations are so useful for the real-world causal mechanisms described in this book!\n:::\n\n## Revisiting the magic morning hours example\n\nRecall in @sec-outcome-model that we estimated the impact of extra magic morning hours on the average posted wait time for the Seven Dwarfs ride between 9 and 10 am.\nTo do so, we fit a propensity score model for the exposure (`park_extra_magic_morning`) with the confounders `park_ticket_season`, `park_close`, and `park_temperature_high`.\nIn turn, these propensity scores were converted to regression weights for the outcome model, which concluded that the expected impact of having extra magic hours on the average posted wait time between 9 and 10 am is 6.2 minutes.\n\nWe will now reproduce this analysis, instead adopting the g-formula approach. Proceeding through the four steps outlined above, we begin by revisiting the time-ordered DAG relevant to this question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Proposed DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. Here we are saying that we believe 1) Extra Magic Hours impacts average wait time and 2) both Extra Magic Hours and average wait time are determined by the time the park closes, historic high temperatures, and ticket season.](15-g-comp_files/figure-html/fig-dag-magic-hours-wait-take-2-1.png){#fig-dag-magic-hours-wait-take-2 width=672}\n:::\n:::\n\n\nThe second step is to specify a parametric model for each non-baseline variable that is based upon previously measured variables in the DAG. \nThis particular example is simple, since we only have two variables that are affected by previous features (`park_extra_magic_morning` and `wait_minutes_posted_avg`). 
\nLet's suppose that adequate models for these two variables are the simple logistic and linear models that follow.\nOf note, we're not yet going to use the model for the exposure (`park_extra_magic_morning`), but we're including the step here because it will be an important part of the patterns you will see in the next section (@sec-dynamic).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load packages and data\nlibrary(broom)\nlibrary(touringplans)\n\nseven_dwarfs_9 <- seven_dwarfs_train_2018 |>\n filter(wait_hour == 9)\n\n# A logistic regression for park_extra_magic_morning\nfit_extra_magic <- glm(\n park_extra_magic_morning ~ \n park_ticket_season + park_close + park_temperature_high,\n data = seven_dwarfs_9,\n family = \"binomial\"\n)\n\n# A linear model for wait_minutes_posted_avg\nfit_wait_minutes <- lm(\n wait_minutes_posted_avg ~ \n park_extra_magic_morning + park_ticket_season + park_close +\n park_temperature_high,\n data = seven_dwarfs_9\n)\n```\n:::\n\n\nNext, we need to draw a large sample from the distribution of baseline characteristics.\nDeciding how large this sample should be is typically a question of available computational resources; larger sample sizes can minimize the risk of precision loss via simulation error [@keil2014].\nIn the present case, we'll use sampling with replacement to generate a data frame of size 10,000.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# It's important to set seeds for reproducibility in Monte Carlo runs\nset.seed(8675309)\n\ndf_sim_baseline <- seven_dwarfs_9 |> \n select(park_ticket_season, park_close, park_temperature_high) |> \n sample_n(10000, replace = TRUE)\n```\n:::\n\n\nWith this population in hand, we can now simulate what would happen at each subsequent time point according to the parametric models we just defined. \nRemember that, in step 3, an important caveat is that for the variable upon which we wish to intervene (in this case, `park_extra_magic_morning`) we don't let the model determine the values; rather, we set them. \nSpecifically, we'll set the first 5,000 rows to `park_extra_magic_morning = 1` and the remaining 5,000 to `park_extra_magic_morning = 0`. \nOther simulations (in this case, the only remaining variable, `wait_minutes_posted_avg`) proceed as expected.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Set the exposure groups for the causal contrast we wish to estimate\ndf_sim_time_1 <- df_sim_baseline |> \n mutate(park_extra_magic_morning = c(rep(1, 5000), rep(0, 5000)))\n\n# Simulate the outcome according to the parametric model in step 2\ndf_outcome <- fit_wait_minutes |> \n augment(newdata = df_sim_time_1) |> \n rename(wait_minutes_posted_avg = .fitted)\n```\n:::\n\n\nAll that is left to do is compute the causal contrast we wish to estimate.\nHere, that contrast is the difference between expected wait minutes on extra magic mornings versus mornings without the extra magic program. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_outcome |> \n group_by(park_extra_magic_morning) |> \n summarize(wait_minutes = mean(wait_minutes_posted_avg))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n park_extra_magic_morning wait_minutes\n <dbl> <dbl>\n1 0 68.1\n2 1 74.3\n```\n\n\n:::\n:::\n\n\nWe see that the difference, $74.3-68.1=6.2$, is the same as the estimate of 6.2 minutes we got when we used IP weighting.\n
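\nIf we'd rather compute that contrast directly than by eye, a small pivot gets us there; this is the same pattern the `compute_stats()` function will use later in the chapter. A sketch using the `df_outcome` object from above:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_outcome |> \n group_by(park_extra_magic_morning) |> \n summarize(wait_minutes = mean(wait_minutes_posted_avg)) |> \n # spread the two exposure groups into columns named emm_0 and emm_1\n pivot_wider(\n names_from = park_extra_magic_morning,\n values_from = wait_minutes,\n names_prefix = \"emm_\"\n ) |> \n mutate(difference = emm_1 - emm_0)\n```\n:::\n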
\n\n## The g-formula for continuous exposures\n\nAs previously mentioned, a key strength of the g-formula is its capacity to handle continuous exposures, a situation in which IP weighting can give rise to unstable estimates. \nHere, we briefly repeat the example from @sec-continuous-exposures to show how this is done.\nTo extend the pattern, we will wrap this execution of the technique in a bootstrap to show how confidence intervals are computed.\n\nRecall, our causal question of interest is \"Do posted wait times for the Seven Dwarfs Mine Train at 8 am affect actual wait times at 9 am?\"\nThe time-ordered DAG for this question (step 1) is:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Proposed DAG for the relationship between the posted wait time at 8 am at a particular park and the actual wait time at 9 am.](15-g-comp_files/figure-html/fig-dag-avg-wait-2-1.png){#fig-dag-avg-wait-2 width=672}\n:::\n:::\n\n\nFor step 2, we need to specify parametric models for the non-baseline variables in our DAG (i.e. any variables that have arrows into them).\nIn this case, we need such models for `park_extra_magic_morning`, `wait_minutes_posted_avg`, and `wait_minutes_actual_avg`; we'll assume that the logistic and linear models below are appropriate.\nOne extension to our previous implementation is that we're going to embed each step of the process in a function, since this will allow us to bootstrap the entire pipeline and obtain confidence intervals.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(splines)\n\nfit_models <- function(.data) {\n \n # A logistic regression for park_extra_magic_morning\n fit_extra_magic <- glm(\n park_extra_magic_morning ~ \n park_ticket_season + park_close + park_temperature_high,\n data = .data,\n family = \"binomial\"\n )\n \n # A linear model for wait_minutes_posted_avg\n fit_wait_minutes_posted <- lm(\n wait_minutes_posted_avg ~ \n park_extra_magic_morning + park_ticket_season + park_close +\n park_temperature_high,\n data = .data\n )\n \n # A linear model for wait_minutes_actual_avg\n # Let's go ahead and add a spline for further flexibility.\n # Be aware that you can add many options here (interactions, etc.),\n # but you may get warnings and/or models that fail to converge\n # if you don't have enough data.\n fit_wait_minutes_actual <- lm(\n wait_minutes_actual_avg ~ \n ns(wait_minutes_posted_avg, df = 3) + \n park_extra_magic_morning +\n park_ticket_season + park_close +\n park_temperature_high,\n data = .data\n )\n \n # return a list that we can pipe into our simulation step (up next)\n return(\n list(\n .data = .data,\n fit_extra_magic = fit_extra_magic,\n fit_wait_minutes_posted = fit_wait_minutes_posted,\n fit_wait_minutes_actual = fit_wait_minutes_actual\n )\n )\n \n}\n```\n:::\n\n\nNext, we write a function that completes step 3: from a random sample from the distribution of baseline variables, generate values for all subsequent variables (except the intervention variable) according to the models we defined. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# The arguments to simulate_process are as follows:\n# fit_obj is a list returned from our fit_models function\n# contrast gives exposure (default 60) and control group (default 30) settings\n# n_sample is the size of the baseline resample of .data\nsimulate_process <- function(\n fit_obj, \n contrast = c(60, 30),\n n_sample = 10000\n) {\n \n # Draw a random sample of baseline variables\n df_baseline <- fit_obj |> \n pluck(\".data\") |> \n select(park_ticket_season, park_close, park_temperature_high) |> \n sample_n(n_sample, replace = TRUE)\n \n # Simulate park_extra_magic_morning\n df_sim_time_1 <- fit_obj |> \n pluck(\"fit_extra_magic\") |> \n augment(newdata = df_baseline, type.predict = \"response\") |> \n # .fitted is the probability that park_extra_magic_morning is 1,\n # so let's use that to generate a 0/1 outcome\n mutate(\n park_extra_magic_morning = rbinom(n(), 1, .fitted)\n )\n \n # Assign wait_minutes_posted_avg, since it's the intervention\n df_sim_time_2 <- df_sim_time_1 |> \n mutate(\n wait_minutes_posted_avg = \n c(rep(contrast[1], n_sample / 2), rep(contrast[2], n_sample / 2))\n )\n \n # Simulate the outcome \n df_outcome <- fit_obj |> \n pluck(\"fit_wait_minutes_actual\") |> \n augment(newdata = df_sim_time_2) |> \n rename(wait_minutes_actual_avg = .fitted)\n \n # return a list that we can pipe into the contrast estimation step (up next)\n return(\n list(\n df_outcome = df_outcome,\n contrast = contrast\n )\n )\n}\n```\n:::\n\n\nFinally, in step 4, we compute the summary statistics and causal contrast of interest using our simulated data. \n\n\n::: {.cell}\n\n```{.r .cell-code}\n# sim_obj is a list created by our simulate_process() function\ncompute_stats <- function(sim_obj) {\n \n exposure_val <- sim_obj |> \n pluck(\"contrast\", 1)\n \n control_val <- sim_obj |> \n pluck(\"contrast\", 2)\n \n sim_obj |> \n pluck(\"df_outcome\") |> \n group_by(wait_minutes_posted_avg) |> \n summarize(avg_wait_actual = mean(wait_minutes_actual_avg)) |> \n pivot_wider(\n names_from = wait_minutes_posted_avg, \n values_from = avg_wait_actual,\n names_prefix = \"x_\"\n ) |> \n # the x_60 and x_30 names below come from the default contrast of\n # c(60, 30); exposure_val and control_val hold those settings if you\n # need to adapt this step for a different contrast\n summarize(\n x_60, x_30, x_60 - x_30\n )\n \n}\n```\n:::\n\n\nNow, let's put it all together to get a single point estimate. 
\nOnce we've seen that in action, we'll bootstrap for a confidence interval.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Wrangle the data to reflect the causal question we are asking\neight <- seven_dwarfs_train_2018 |>\n filter(wait_hour == 8) |>\n select(-wait_minutes_actual_avg)\n\nnine <- seven_dwarfs_train_2018 |>\n filter(wait_hour == 9) |>\n select(park_date, wait_minutes_actual_avg)\n\nwait_times <- eight |>\n left_join(nine, by = \"park_date\") |>\n drop_na(wait_minutes_actual_avg)\n\n# get a single point estimate to make sure things work as we planned\nwait_times |> \n fit_models() |> \n simulate_process() |> \n compute_stats() |> \n # rsample wants results labelled this way\n pivot_longer(\n names_to = \"term\", \n values_to = \"estimate\", \n cols = everything()\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 2\n term estimate\n <chr> <dbl>\n1 x_60 29.9\n2 x_30 40.6\n3 x_60 - x_30 -10.7\n```\n\n\n:::\n\n```{.r .cell-code}\n# compute bootstrap confidence intervals\nlibrary(rsample)\n\nboots <- bootstraps(wait_times, times = 1000, apparent = TRUE) |> \n mutate(\n models = map(\n splits, \n \\(.x) analysis(.x) |> \n fit_models() |> \n simulate_process() |> \n compute_stats() |> \n pivot_longer(\n names_to = \"term\", \n values_to = \"estimate\", \n cols = everything()\n )\n )\n )\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: There was 1 warning in `mutate()`.\nℹ In argument: `models = map(...)`.\nCaused by warning:\n! glm.fit: fitted probabilities numerically 0 or 1 occurred\n```\n\n\n:::\n\n```{.r .cell-code}\nresults <- int_pctl(boots, models)\nresults\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 6\n term .lower .estimate .upper .alpha .method \n <chr> <dbl> <dbl> <dbl> <dbl> <chr> \n1 x_30 33.1 40.6 48.8 0.05 percentile\n2 x_60 14.0 30.5 47.6 0.05 percentile\n3 x_60 - x_30 -30.8 -10.0 11.7 0.05 percentile\n```\n\n\n:::\n:::\n\n\nIn summary, our results are interpreted as follows: setting the posted wait time at 8 am to 60 minutes results in an actual wait time at 9 am of 31 minutes, while setting the posted wait time to 30 minutes gives a longer actual wait time of 41 minutes. Put another way, increasing the posted wait time at 8 am from 30 minutes to 60 minutes results in an actual wait time at 9 am that is about 10 minutes shorter.\n\nNote that one of our models threw a warning regarding perfect discrimination (`fitted probabilities numerically 0 or 1 occurred`); this can happen when we don't have a large sample size and one of our models is overspecified due to complexity.\nIn this exercise, the flexibility added by the spline in the regression for `wait_minutes_actual_avg` is what caused the issue.\nOne remedy when this happens is to simplify the offending model; for example, if we swap the spline for a simple linear term in `wait_minutes_posted_avg`, the warning is resolved.\n
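\nFor concreteness, here's what that simplification might look like inside `fit_models()` (a sketch; only this model's formula changes, and `.data` is the function's argument):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# inside fit_models(): swap the spline for a linear term in\n# wait_minutes_posted_avg to reduce the model's flexibility\nfit_wait_minutes_actual <- lm(\n wait_minutes_actual_avg ~ \n wait_minutes_posted_avg + \n park_extra_magic_morning +\n park_ticket_season + park_close +\n park_temperature_high,\n data = .data\n)\n```\n:::\n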
\nWe've left the warning here to highlight a common challenge that needs resolution when working with the parametric g-formula on small- to mid-sized data sets.\n\n## Dynamic treatment regimes with the g-formula {#sec-dynamic}\n\n\n## The Natural Course\n", "supporting": [ "15-g-comp_files" ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/chapters/15-g-comp/figure-html/fig-dag-avg-wait-2-1.png b/_freeze/chapters/15-g-comp/figure-html/fig-dag-avg-wait-2-1.png new file mode 100644 index 00000000..f3978a4d Binary files /dev/null and b/_freeze/chapters/15-g-comp/figure-html/fig-dag-avg-wait-2-1.png differ diff --git a/_freeze/chapters/15-g-comp/figure-html/fig-dag-magic-hours-wait-take-2-1.png b/_freeze/chapters/15-g-comp/figure-html/fig-dag-magic-hours-wait-take-2-1.png new file mode 100644 index 00000000..341b2c64 Binary files /dev/null and b/_freeze/chapters/15-g-comp/figure-html/fig-dag-magic-hours-wait-take-2-1.png differ diff --git a/_freeze/chapters/21-sensitivity/execute-results/html.json b/_freeze/chapters/21-sensitivity/execute-results/html.json index c5460b7f..6502cc24 100644 --- a/_freeze/chapters/21-sensitivity/execute-results/html.json +++ b/_freeze/chapters/21-sensitivity/execute-results/html.json @@ -1,8 +1,10 @@ { - "hash": "36e848491cb5cef73186bb49bc886756", + "hash": "4d7bb6bfd4714efa1e7b6dbdb7bd124b", "result": { - "markdown": "# Sensitivity analysis {#sec-sensitivity}\n\n## Quantitative bias analyses\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrnorm(5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.16395 -0.06369 -0.38315 -1.17857 -0.41509\n```\n\n\n:::\n:::\n\n\n## Tipping point analyses\n", - "supporting": [], + "markdown": "# Sensitivity analysis {#sec-sensitivity}\n\n\n\n\n\n\n\nBecause many of the assumptions of causal inference are unverifiable, it's reasonable to be concerned about the validity of your results.\nIn this chapter, we'll provide some ways to probe our assumptions and results for strengths and weaknesses.\nWe'll take two main approaches: probing the logical implications of the causal question and related DAGs, and using mathematical techniques to quantify how different our results would be under other circumstances, such as in the presence of unmeasured confounding.\nThese approaches are known as *sensitivity analyses*: How sensitive is our result to conditions other than those laid out in our assumptions and analysis?\n\n## Checking DAGs for robustness\n\nLet's start with where we began the modeling process: creating causal diagrams.\nBecause DAGs encode the assumptions on which we base our analysis, they are natural points of critique for both others and ourselves.\n\n### Alternate adjustment sets\n\nThe same mathematical underpinnings of DAGs that allow us to query them for things like adjustment sets also allow us to query other implications of DAGs.\nOne of the simplest is that if your DAG is correct and your data are well-measured, any valid adjustment set should result in an unbiased estimate of the causal effect.\nLet's consider the DAG we introduced in @fig-dag-magic.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The original proposed DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. 
As before, we are saying that we believe 1) Extra Magic Hours impacts average wait time and 2) both Extra Magic Hours and average wait time are determined by the time the park closes, historic high temperatures, and ticket season.](21-sensitivity_files/figure-html/fig-dag-magic-orig-1.png){#fig-dag-magic-orig width=672}\n:::\n:::\n\n\nIn @fig-dag-magic-orig, there's only one adjustment set because all three confounders represent independent backdoor paths.\nLet's say, though, that we had used @fig-dag-magic-missing instead, which is missing arrows from the park close time and historical temperature to whether there was an Extra Magic Morning.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An alternative DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. This DAG has no arrows from park close time and historical temperature to Extra Magic Hours.](21-sensitivity_files/figure-html/fig-dag-magic-missing-1.png){#fig-dag-magic-missing width=672}\n:::\n:::\n\n\nNow there are four potential adjustment sets: `park_ticket_season`, `park_close + park_ticket_season`, `park_temperature_high + park_ticket_season`, or `park_close + park_temperature_high + park_ticket_season`.\n
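\nWe can enumerate these sets programmatically with dagitty. A sketch, assuming the DAG from @fig-dag-magic-missing is stored in an object we'll call `emm_wait_dag_missing` (a hypothetical name; the chunk that builds it isn't shown):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dagitty)\n\n# list every valid adjustment set for this DAG, not just the minimal one\nadjustmentSets(emm_wait_dag_missing, type = \"all\")\n```\n:::\n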
\n\n@tbl-alt-sets presents the IPW estimates for each adjustment set.\nThe effects are quite different.\nSome slight variation in the estimates is expected since they are estimated using different variables that may not be measured perfectly; if this DAG were right, however, we would expect them to be much more closely aligned than this.\nIn particular, there seems to be a 3-minute difference between the models with and without park close time.\nThe difference in these results implies that there is something off about the causal structure we specified.\n\n\n::: {#tbl-alt-sets .cell tbl-cap='A table of ATE estimates from the IPW estimator. Each estimate was calculated for one of the valid adjustment sets for the DAG. The estimates are sorted by effect size. If the DAG is right and all the data are well measured, different adjustment sets should give roughly the same answer.'}\n::: {.cell-output-display}\n\n|Adjustment Set | ATE|\n|:-----------------------------------------------|-----:|\n|Close time, ticket season | 6.579|\n|Historic temperature, close time, ticket season | 6.199|\n|Ticket season | 4.114|\n|Historic temperature, ticket season | 3.627|\n
\n\n:::\n:::\n\n\n### Negative controls\n\nAlternate adjustment sets are a way of probing the logical implications of your DAG: if it's correct, there could be many ways to account correctly for the open backdoor paths.\nThe reverse is also true: the causal structure of your research question also implies relationships that should be *null*.\nOne way that researchers take advantage of this implication is through *negative controls*.\nA negative control is either an exposure (negative exposure control) or outcome (negative outcome control) similar to your question in as many ways as possible, except that there *shouldn't* be a causal effect.\n@Lipsitch2010 describe negative controls for observational research.\nIn their article, they reference standard controls in bench science. In a lab experiment, any of these actions should lead to a null effect:\n\n1. Leave out an essential ingredient.\n2. Inactivate the hypothesized active ingredient.\n3. Check for an effect that would be impossible by the hypothesized mechanism.\n\nThere's nothing unique to lab work here; these scientists merely probe the logical implications of their understanding and hypotheses. \nTo find a good negative control, you usually need to extend your DAG to include more of the causal structure surrounding your question.\nLet's look at some examples.\n\n#### Negative exposures\n\nFirst, we'll look at a negative exposure control.\nIf Extra Magic Mornings really cause an increase in wait time, it stands to reason that this effect is time-limited.\nIn other words, there should be some period after which the effect of an Extra Magic Morning dissipates.\nLet's call today *i* and the previous day *i - n*, where *n* is the number of days before the outcome that the negative exposure control occurs.\nFirst, let's explore `n = 63`, e.g., whether or not there was an Extra Magic Morning nine weeks ago.\nThat is a pretty reasonable starting point: it's unlikely that the effect on wait time would still be present 63 days later.\nThis analysis is an example of leaving out an essential ingredient: we waited too long for this to be a realistic cause.\nAny remaining effect is likely due to residual confounding.\n\nLet's look at a DAG to visualize this situation.\nIn @fig-dag-day-i, we've added an identical layer to our original one: now there are two Extra Magic Mornings, one for day `i` and one for day `i - 63`.\nSimilarly, there are two versions of the confounders for each day.\nOne crucial detail in this DAG is that we're assuming that there *is* an effect of day `i - 63`'s Extra Magic Morning on day `i`'s; whether or not there is an Extra Magic Morning one day likely affects whether or not it happens on another day.\nThe decision about where to place them across the year is not random.\nIf this is true, we *would* expect an effect: the indirect effect via day `i`'s Extra Magic Morning status.\nTo get a valid negative control, we need to *inactivate this effect*, which we can do statistically by controlling for day `i`'s Extra Magic Morning status.\nSo, given the DAG, our adjustment set is any combination of the confounders (as long as we have at least one version of each) and day `i`'s Extra Magic Morning (suppressing the indirect effect).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An expansion of the causal structure presented in @fig-dag-magic. In this DAG, the exposure is instead whether or not there were Extra Magic Hours 63 days before the day's wait time we are examining. 
Because of the long period, there should be no effect. Similarly, the DAG also has earlier confounders related to day `i - 63`.](21-sensitivity_files/figure-html/fig-dag-day-i-1.png){#fig-dag-day-i width=672}\n:::\n:::\n\n\nSince the exposure is on day `i - 63`, we prefer to control for the confounders related to that day, so we'll use the `i - 63` versions.\nWe'll use `lag()` from dplyr to get those variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_days_lag <- 63\ndistinct_emm <- seven_dwarfs_train_2018 |>\n filter(wait_hour == 9) |>\n arrange(park_date) |>\n transmute(\n park_date,\n prev_park_extra_magic_morning = lag(park_extra_magic_morning, n = n_days_lag),\n prev_park_temperature_high = lag(park_temperature_high, n = n_days_lag),\n prev_park_close = lag(park_close, n = n_days_lag),\n prev_park_ticket_season = lag(park_ticket_season, n = n_days_lag)\n )\n\nseven_dwarfs_train_2018_lag <- seven_dwarfs_train_2018 |>\n filter(wait_hour == 9) |>\n left_join(distinct_emm, by = \"park_date\") |>\n drop_na(prev_park_extra_magic_morning)\n```\n:::\n\n::: {.cell}\n\n:::\n
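\nThe estimation code isn't shown in the text; here's a sketch of one way to do it with the `wt_ate()` pattern from earlier chapters (an assumption on our part, so the hidden chunk may differ): the propensity model uses the day `i - 63` confounders, and day `i`'s Extra Magic Morning enters the outcome model to suppress the indirect effect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(propensity)\n\n# weights for the day i - 63 exposure, using that day's confounders\nlag_wts <- glm(\n prev_park_extra_magic_morning ~ prev_park_close +\n prev_park_temperature_high + prev_park_ticket_season,\n data = seven_dwarfs_train_2018_lag,\n family = binomial()\n) |>\n augment(type.predict = \"response\", data = seven_dwarfs_train_2018_lag) |>\n mutate(wts = wt_ate(.fitted, prev_park_extra_magic_morning))\n\n# day i's Extra Magic Morning is included to block the indirect path\nlm(\n wait_minutes_posted_avg ~ prev_park_extra_magic_morning +\n park_extra_magic_morning,\n data = lag_wts,\n weights = wts\n) |>\n tidy()\n```\n:::\n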
\n\nWhen we use these data for the IPW effect, we get -0.93 minutes, much closer to null than we found on day `i`.\nLet's take a look at the effect over time.\nWhile there might be a lingering effect of Extra Magic Mornings for a little while (say, the span of an average trip to Disney World), it should quickly approach null.\nHowever, in @fig-sens-i-63, we see that, while it eventually approaches null, there is quite a bit of lingering effect.\nIf these results are accurate, they imply that we have some residual confounding in our estimate.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A scatterplot with a smoothed regression of the relationship between wait times on day `i` and whether there were Extra Magic Hours on day `i - n`, where `n` represents the number of days previous to day `i`. We expect this relationship to rapidly approach the null, but the effect hovers above null for quite some time. This lingering effect implies we have some residual confounding present.\n](21-sensitivity_files/figure-html/fig-sens-i-63-1.png){#fig-sens-i-63 width=672}\n:::\n:::\n\n\n#### Negative outcomes\n\nNow, let's examine an example of a negative control outcome: the wait time for a ride at Universal Studios.\nUniversal Studios is also in Orlando, so the causes of wait times there are likely comparable to those at Disney World on the same day.\nOf course, whether or not there are Extra Magic Mornings at Disney shouldn't affect the wait times at Universal on the same day: they are separate parks, and most people don't visit both within an hour of one another.\nThis negative control is an example of an effect implausible by the hypothesized mechanism.\n\nWe don't have Universal's ride data, so let's simulate what would happen with and without residual confounding.\nWe'll generate wait times based on the historical temperature, park close time, and ticket season (the latter two are technically specific to Disney, but we expect a strong correlation with the Universal versions).\nBecause this is a negative outcome, it is not related to whether or not there were Extra Magic Morning hours at Disney.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_sim <- seven_dwarfs_train_2018 |>\n mutate(\n # we scale each variable and add a bit of random noise\n # to simulate reasonable Universal wait times\n wait_time_universal =\n park_temperature_high / 150 +\n as.numeric(park_close) / 1500 +\n as.integer(factor(park_ticket_season)) / 1000 +\n rnorm(n(), 5, 5)\n )\n```\n:::\n\n::: {.cell}\n\n:::\n\n\nIf we calculate the IPW effect of `park_extra_magic_morning` on `wait_time_universal`, we get 0.21 minutes, a roughly null effect, as expected.\nBut what if we missed an unmeasured confounder, `u`, which caused Extra Magic Mornings and wait times at both Disney and Universal?\nLet's simulate that scenario by augmenting the data further.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_sim2 <- seven_dwarfs_train_2018 |>\n mutate(\n u = rnorm(n(), mean = 10, sd = 3),\n wait_minutes_posted_avg = wait_minutes_posted_avg + u,\n park_extra_magic_morning = ifelse(\n u > 10,\n # draw per-row so each high-`u` day is reassigned independently\n rbinom(n(), 1, .1),\n park_extra_magic_morning\n ),\n wait_time_universal =\n park_temperature_high / 150 +\n as.numeric(park_close) / 1500 +\n as.integer(factor(park_ticket_season)) / 1000 +\n u +\n rnorm(n(), 5, 5)\n )\n```\n:::\n\n::: {.cell}\n\n:::\n\n\nNow, the effect for both Disney and Universal wait times is different.\nIf we had seen 0.67 minutes for the effect for Disney, we wouldn't necessarily know that we had a confounded result.\nHowever, since we know the wait times at Universal should be unrelated, it's suspicious that the result, -2.32 minutes, is not null.\nThat is evidence that we have unmeasured confounding.\n\n### DAG-data consistency\n\nNegative controls use the logical implications of the causal structure you assume.\nWe can extend that idea to the entire DAG.\nIf the DAG is correct, there are many implications for statistically determining how different variables in the DAG should and should not be related to each other.\nLike negative controls, we can check if variables that *should* be independent *are* independent in the data.\nSometimes, the way that DAGs imply independence between variables is *conditional* on other variables.\nThus, this technique is sometimes called checking *implied conditional independencies* [@Textor2016].\nLet's query our original DAG to find out what it says about the relationships among its variables.\n
\n\n::: {.cell}\n\n```{.r .cell-code}\nquery_conditional_independence(emm_wait_dag) |>\n unnest(conditioned_on)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 4\n set a b conditioned_on\n \n1 1 park_close park_temp… \n2 2 park_close park_tick… \n3 3 park_temperature_high park_tick… \n```\n\n\n:::\n:::\n\n\nIn this DAG, three relationships should be null: 1) `park_close` and `park_temperature_high`, 2) `park_close` and `park_ticket_season`, and 3) `park_temperature_high` and `park_ticket_season`.\nNone of these relationships needs conditioning on another variable to achieve independence; in other words, they should be unconditionally independent.\nWe can use simple techniques like correlation and regression, as well as other statistical tests, to see if nullness holds for these relationships.\n
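\nFor instance, here's a quick check of the first implied independence with a simple correlation (`park_close` needs converting from a time to a number first):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# if the DAG is right, this correlation should be close to 0\nseven_dwarfs_train_2018 |>\n filter(wait_hour == 9) |>\n summarize(\n corr = cor(\n as.numeric(park_close),\n park_temperature_high,\n use = \"complete.obs\"\n )\n )\n```\n:::\n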
\n\nConditional independencies quickly grow in number in complex DAGs, and so dagitty implements a way to automate checks for DAG-data consistency given these implied nulls.\ndagitty checks if the residuals of a given conditional relationship are correlated, and it can model those residuals automatically in several ways.\nWe'll tell dagitty to calculate the residuals using non-linear models with `type = \"cis.loess\"`.\nSince we're working with correlations, the results should be around 0 if our DAG is right.\nAs we see in @fig-conditional-ind, though, one relationship doesn't hold.\nThere is a correlation between the park's close time and ticket season.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest_conditional_independence(\n emm_wait_dag,\n data = seven_dwarfs_train_2018 |>\n filter(wait_hour == 9) |>\n mutate(\n across(where(is.character), factor),\n park_close = as.numeric(park_close),\n ) |>\n as.data.frame(),\n type = \"cis.loess\",\n # use 200 bootstrapped samples to calculate CIs\n R = 200\n) |>\n ggdag_conditional_independence()\n```\n\n::: {.cell-output-display}\n![A plot of the estimates and 95% confidence intervals of the correlations between the residuals resulting from a regression of variables in the DAG that should have no relationship. While two relationships appear null, park close time and ticket season seem to be correlated, suggesting we have misspecified the DAG. One source of this misspecification may be missing arrows between the variables. Notably, the adjustment sets are identical with and without this arrow.](21-sensitivity_files/figure-html/fig-conditional-ind-1.png){#fig-conditional-ind width=672}\n:::\n:::\n\n\nWhy might we be seeing a relationship when there isn't supposed to be one?\nA simple explanation is chance: just like in any statistical inference, we need to be cautious about over-extrapolating what we see in our limited sample.\nSince we have data for every day in 2018, we could probably rule that out.\nAnother reason is that we're missing direct arrows from one variable to the other, e.g. from historic temperature to park close time.\nAdding additional arrows is reasonable: park close time and ticket season closely track the weather.\nThat's a little bit of evidence that we're missing an arrow.\n\nAt this point, we need to be cautious about overfitting the DAG to the data.\nDAG-data consistency tests *cannot* prove your DAG right or wrong, and as we saw in @sec-quartets, statistical techniques alone cannot determine the causal structure of a problem.\nSo why use these tests?\nAs with negative controls, they provide a way to probe your assumptions.\nWhile we can never be sure about them, we *do* have information in the data.\nFinding that an implied conditional independence holds is a little more evidence supporting your assumptions.\nThere's a fine line here, so we recommend being transparent about these types of checks: if you make changes based on the results of these tests, you should report your original DAG, too.\nNotably, in this case, adding direct arrows between all three of these pairs results in an identical adjustment set.\n\nLet's look at an example that is more likely to be misspecified, where we remove the arrows from park close time and ticket season to Extra Magic Morning.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nemm_wait_dag2 <- dagify(\n wait_minutes_posted_avg ~ park_extra_magic_morning + park_close +\n park_ticket_season + park_temperature_high,\n park_extra_magic_morning ~ park_temperature_high,\n coords = coord_dag,\n labels = labels,\n exposure = \"park_extra_magic_morning\",\n outcome = \"wait_minutes_posted_avg\"\n)\n\nquery_conditional_independence(emm_wait_dag2) |>\n unnest(conditioned_on)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 4\n set a b conditioned_on\n \n1 1 park_close park_e… \n2 2 park_close park_t… \n3 3 park_close park_t… \n4 4 park_extra_magic_morning park_t… \n5 5 park_temperature_high park_t… \n```\n\n\n:::\n:::\n\n\nThis alternative DAG introduces two new relationships that should be independent.\nIn @fig-conditional-ind-misspec, we see an additional association between ticket season and Extra Magic Morning.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest_conditional_independence(\n emm_wait_dag2,\n data = seven_dwarfs_train_2018 |>\n filter(wait_hour == 9) |>\n mutate(\n across(where(is.character), factor),\n park_close = as.numeric(park_close),\n ) |>\n as.data.frame(),\n type = \"cis.loess\",\n R = 200\n) |>\n ggdag_conditional_independence()\n```\n\n::: {.cell-output-display}\n![A plot of the estimates and 95% confidence intervals of the correlations between the residuals resulting from a regression of variables in the DAG that should have no relationship. While three relationships appear null, both park close time and Extra Magic Morning seem to be correlated with ticket season, suggesting we have misspecified the DAG. 
One source of this misspecification may be missing arrows between the variables.](21-sensitivity_files/figure-html/fig-conditional-ind-misspec-1.png){#fig-conditional-ind-misspec width=672}\n:::\n:::\n\n\nSo, is this DAG wrong?\nBased on our understanding of the problem, it seems likely that's the case, but interpreting DAG-data consistency tests has a hiccup: different DAGs can have the same set of conditional independencies.\nIn the case of our DAG, one other DAG can generate the same implied conditional independencies (@fig-equiv-dag).\nThese are called *equivalent* DAGs because their implications are the same.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_equivalent_dags(emm_wait_dag2)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Equivalent DAGs for the likely misspecified version of @fig-dag-magic. These two DAGs produce the same set of implied conditional independencies. The difference between them is only the direction of the arrow between historic high temperature and Extra Magic Hours.](21-sensitivity_files/figure-html/fig-equiv-dag-1.png){#fig-equiv-dag width=864}\n:::\n:::\n\n\nEquivalent DAGs are generated by *reversing* arrows.\nThe subset of DAGs with reversible arrows that generate the same implications is called an *equivalence class*.\nWhile this distinction is technical, it lets us condense the visualization to a single DAG where the *reversible* edges are drawn as straight lines without arrowheads.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_equivalent_class(emm_wait_dag2, use_text = FALSE, use_labels = TRUE)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![An alternative way of visualizing @fig-equiv-dag where all the equivalent DAGs are condensed to a single version where the *reversible* edges are denoted with edges without arrows.](21-sensitivity_files/figure-html/fig-equiv-class-1.png){#fig-equiv-class width=480}\n:::\n:::\n\n\nSo, what do we do with this information?\nSince many DAGs can produce the same set of conditional independencies, one strategy is to find all the adjustment sets that would be valid for every equivalent DAG.\ndagitty makes this straightforward by calling `equivalenceClass()` and `adjustmentSets()`, but in this case, there are *no* overlapping adjustment sets.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dagitty)\n# determine valid sets for all equivalent DAGs\n
equivalenceClass(emm_wait_dag2) |>\n adjustmentSets(type = \"all\")\n```\n:::\n\n\nWe can see that by looking at the individual equivalent DAGs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndags <- equivalentDAGs(emm_wait_dag2)\n\n# no overlapping sets\ndags[[1]] |> adjustmentSets(type = \"all\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ park_temperature_high }\n{ park_close, park_temperature_high }\n{ park_temperature_high, park_ticket_season }\n{ park_close, park_temperature_high,\n park_ticket_season }\n```\n\n\n:::\n\n```{.r .cell-code}\ndags[[2]] |> adjustmentSets(type = \"all\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n {}\n{ park_close }\n{ park_ticket_season }\n{ park_close, park_ticket_season }\n```\n\n\n:::\n:::\n\n\nThe good news is that, in this case, one of the equivalent DAGs doesn't make logical sense: the reversible edge is from historical weather to Extra Magic Morning, but that is impossible for both time-ordering reasons (historical temperature occurs in the past) and for logical ones (Disney may be powerful, but to our knowledge, they can't yet control the weather).\nEven though we're using more data in these types of checks, we still need to consider the logical and time-ordered plausibility of possible scenarios.\n\n### Alternate DAGs\n\n\n\nAs we mentioned in @sec-dags-iterate, you should specify your DAG ahead of time with ample feedback from other experts.\nLet's now take the opposite approach to the last example: what if we used the original DAG but received feedback after the analysis that we should add more variables?\nConsider the expanded DAG in @fig-dag-extra-days.\nWe've added two new confounders: whether it's a weekend or a holiday.\nThis analysis differs from when we checked alternate adjustment sets in the same DAG; in that case, we checked the DAG's logical consistency.\nIn this case, we're considering a different causal structure.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An expansion of @fig-dag-magic, which now includes two new variables on their own backdoor paths: whether or not it's a holiday and/or a weekend.\n](21-sensitivity_files/figure-html/fig-dag-extra-days-1.png){#fig-dag-extra-days width=672}\n:::\n:::\n\n\nWe can calculate these features from `park_date` using the timeDate package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(timeDate)\n\nholidays <- c(\n \"USChristmasDay\",\n \"USColumbusDay\",\n \"USIndependenceDay\",\n \"USLaborDay\",\n \"USLincolnsBirthday\",\n \"USMemorialDay\",\n \"USMLKingsBirthday\",\n \"USNewYearsDay\",\n \"USPresidentsDay\",\n \"USThanksgivingDay\",\n \"USVeteransDay\",\n \"USWashingtonsBirthday\"\n) |>\n holiday(2018, Holiday = _) |>\n as.Date()\n\nseven_dwarfs_with_days <- seven_dwarfs_train_2018 |>\n mutate(\n is_holiday = park_date %in% holidays,\n is_weekend = isWeekend(park_date)\n ) |>\n filter(wait_hour == 9)\n```\n:::\n\n\nBoth Extra Magic Morning hours and posted wait times are associated with whether it's a holiday or weekend.\n\n\n::: {#tbl-days .cell tbl-cap='The descriptive associations between the two new variables, holiday and weekend, and the exposure and outcome. The average posted waiting time differs on both holidays and weekends, as do the occurrences of Extra Magic Hours. While we can\'t determine a confounding relationship from descriptive statistics alone, this adds to the evidence that these are confounders.\n'}\n::: {.cell-output-display}\n\n
|Characteristic | Weekend: FALSE, N = 253^1^ | Weekend: TRUE, N = 101^1^ | Holiday: FALSE, N = 343^1^ | Holiday: TRUE, N = 11^1^|\n|:-------------------|:--------------------------:|:-------------------------:|:--------------------------:|:------------------------:|\n|Posted Wait Time | 67 (60, 77) | 65 (56, 75) | 66 (58, 77) | 76 (63, 77)|\n|Extra Magic Morning | 55 (22%) | 5 (5.0%) | 59 (17%) | 1 (9.1%)|\n\n^1^ Median (IQR); n (%)\n\n:::\n:::\n\n::: {.cell}\n\n:::\n
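\nThe refit itself follows the same IPW recipe as before, now with the two new confounders in the propensity score model. A sketch (the hidden chunk that produced the estimate may differ in details):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(propensity)\n\nwts_with_days <- glm(\n park_extra_magic_morning ~ park_ticket_season + park_close +\n park_temperature_high + is_weekend + is_holiday,\n data = seven_dwarfs_with_days,\n family = binomial()\n) |>\n augment(type.predict = \"response\", data = seven_dwarfs_with_days) |>\n mutate(wts = wt_ate(.fitted, park_extra_magic_morning))\n\nlm(\n wait_minutes_posted_avg ~ park_extra_magic_morning,\n data = wts_with_days,\n weights = wts\n) |>\n tidy()\n```\n:::\n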
\n\nWhen we refit the IPW estimator, we get 7.58 minutes, slightly larger than the estimate without the two new confounders.\nBecause it was a deviation from the analysis plan, you should likely report both effects.\nThat said, this new DAG is probably more correct than the original one.\nFrom a decision point of view, though, the difference is slight in absolute terms (about a minute), and the effect is in the same direction as the original estimate.\nIn other words, in terms of how we might act on this information, the result is not terribly sensitive to the change.\n\nOne other point here: sometimes, people present the results of using increasingly complicated adjustment sets.\nThis comes from the tradition of comparing complex models to parsimonious ones.\nThat type of comparison is a sensitivity analysis in its own right, but it should be principled: rather than fitting simple models for simplicity's sake, you should compare *competing* adjustment sets or conditions.\nFor instance, you may feel like these two DAGs are equally plausible or want to examine if adding other variables better captures the baseline crowd flow at the Magic Kingdom.\n\n## Quantitative bias analyses\n\nThus far, we've probed some of the assumptions we've made about the causal structure of the question.\nWe can take this further using quantitative bias analysis, which uses mathematical assumptions to see how results would change under different conditions.\n\n### Tipping point analyses\n\n### Other types of QBA\n", "supporting": [ "21-sensitivity_files" ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/chapters/21-sensitivity/figure-html/fig-conditional-ind-1.png b/_freeze/chapters/21-sensitivity/figure-html/fig-conditional-ind-1.png new file mode 100644 index 00000000..817756eb Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/fig-conditional-ind-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/fig-conditional-ind-misspec-1.png b/_freeze/chapters/21-sensitivity/figure-html/fig-conditional-ind-misspec-1.png new file mode 100644 index 00000000..a3512b0b Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/fig-conditional-ind-misspec-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/fig-dag-day-i-1.png b/_freeze/chapters/21-sensitivity/figure-html/fig-dag-day-i-1.png new file mode 100644 index 00000000..1b3fc0f1 Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/fig-dag-day-i-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/fig-dag-extra-days-1.png b/_freeze/chapters/21-sensitivity/figure-html/fig-dag-extra-days-1.png new file mode 100644 index 00000000..f7461e6b Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/fig-dag-extra-days-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/fig-dag-magic-missing-1.png b/_freeze/chapters/21-sensitivity/figure-html/fig-dag-magic-missing-1.png new file mode 100644 index 00000000..1db42cc1 Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/fig-dag-magic-missing-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/fig-dag-magic-orig-1.png b/_freeze/chapters/21-sensitivity/figure-html/fig-dag-magic-orig-1.png new file mode 100644 index 00000000..01939612 Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/fig-dag-magic-orig-1.png differ diff --git 
a/_freeze/chapters/21-sensitivity/figure-html/fig-equiv-class-1.png b/_freeze/chapters/21-sensitivity/figure-html/fig-equiv-class-1.png new file mode 100644 index 00000000..3eeafcc2 Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/fig-equiv-class-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/fig-equiv-dag-1.png b/_freeze/chapters/21-sensitivity/figure-html/fig-equiv-dag-1.png new file mode 100644 index 00000000..7e42a11d Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/fig-equiv-dag-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/fig-sens-i-200-1.png b/_freeze/chapters/21-sensitivity/figure-html/fig-sens-i-200-1.png new file mode 100644 index 00000000..946202dc Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/fig-sens-i-200-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/fig-sens-i-63-1.png b/_freeze/chapters/21-sensitivity/figure-html/fig-sens-i-63-1.png new file mode 100644 index 00000000..f037555b Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/fig-sens-i-63-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-10-1.png b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-10-1.png new file mode 100644 index 00000000..946202dc Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-10-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-17-1.png b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-17-1.png new file mode 100644 index 00000000..3a24b4e5 Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-17-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-18-1.png b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-18-1.png new file mode 100644 index 00000000..b01f82f5 Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-18-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-19-1.png b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-19-1.png new file mode 100644 index 00000000..603f07ed Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-19-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-20-1.png b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-20-1.png new file mode 100644 index 00000000..e68108ff Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-20-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-3-1.png b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-3-1.png new file mode 100644 index 00000000..01939612 Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-3-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-4-1.png b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-4-1.png new file mode 100644 index 00000000..1db42cc1 Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-4-1.png differ diff --git a/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-8-1.png b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-8-1.png new file mode 100644 index 00000000..a564246e Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-8-1.png differ diff --git 
a/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-9-1.png b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-9-1.png
new file mode 100644
index 00000000..f037555b
Binary files /dev/null and b/_freeze/chapters/21-sensitivity/figure-html/unnamed-chunk-9-1.png differ
diff --git a/chapters/05-dags.qmd b/chapters/05-dags.qmd
index 50f331e0..1e79185c 100644
--- a/chapters/05-dags.qmd
+++ b/chapters/05-dags.qmd
@@ -1216,7 +1216,7 @@ dag_data_used |>
 In this section, we'll offer some advice from @Tennant2021 and our own experience assembling DAGs.
 
-### Iterate early and often
+### Iterate early and often {#sec-dags-iterate}
 
 One of the best things you can do for the quality of your results is to make the DAG before you conduct the study, ideally before you even collect the data.
 If you're already working with your data, at minimum, build your DAG before doing data analysis.
diff --git a/chapters/21-sensitivity.qmd b/chapters/21-sensitivity.qmd
index f8a8ceee..d610f919 100644
--- a/chapters/21-sensitivity.qmd
+++ b/chapters/21-sensitivity.qmd
@@ -1,11 +1,775 @@
-# Sensitivity analysis {#sec-sensitivity}
-
-## Quantitative bias analyses
+# Sensitivity analysis {#sec-sensitivity}
 
 {{< include 00-setup.qmd >}}
 
 ```{r}
-rnorm(5)
+#| include: false
+library(ggdag)
+library(touringplans)
+library(ggokabeito)
+library(broom)
+library(propensity)
+library(gt)
+```
+
+Because many of the assumptions of causal inference are unverifiable, it's reasonable to be concerned about the validity of your results.
+In this chapter, we'll provide some ways to probe our assumptions and results for strengths and weaknesses.
+We'll explore two main approaches: probing the logical implications of the causal question and its DAG, and using mathematical techniques to quantify how different our results would be under other circumstances, such as in the presence of unmeasured confounding.
+These approaches are known as *sensitivity analyses*: How sensitive is our result to conditions other than those laid out in our assumptions and analysis?
+
+## Checking DAGs for robustness
+
+Let's start where we began the modeling process: creating causal diagrams.
+Because DAGs encode the assumptions on which we base our analysis, they are natural points of critique, both for others and for ourselves.
+
+### Alternate adjustment sets and alternate DAGs
+
+The same mathematical underpinnings that allow us to query DAGs for adjustment sets also allow us to query their other implications.
+One of the simplest implications is that, if your DAG is correct and your data are well measured, any valid adjustment set should result in an unbiased estimate of the causal effect.
+Let's consider the DAG we introduced in @fig-dag-magic.
+
+```{r}
+#| label: fig-dag-magic-orig
+#| echo: false
+#| fig-cap: >
+#| The original proposed DAG for the relationship between Extra Magic Hours
+#| in the morning at a particular park and the average wait
+#| time between 9 am and 10 am.
+#| As before, we are saying that we believe 1) Extra Magic Hours impacts
+#| average wait time and 2) both Extra Magic Hours and average wait time
+#| are determined by the time the park closes, historic high temperatures,
+#| and ticket season.
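+# this chunk's code is hidden in the rendered chapter (echo: false);
+# it lays out node coordinates and labels, builds the DAG with dagify(),
+# and plots it with ggdag's ggplot2 geoms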
+coord_dag <- list( + x = c(park_ticket_season = 0, park_close = 0, park_temperature_high = -1, park_extra_magic_morning = 1, wait_minutes_posted_avg = 2), + y = c(park_ticket_season = -1, park_close = 1, park_temperature_high = 0, park_extra_magic_morning = 0, wait_minutes_posted_avg = 0) +) + +labels <- c( + park_extra_magic_morning = "Extra Magic\nMorning", + wait_minutes_posted_avg = "Average\nwait", + park_ticket_season = "Ticket\nSeason", + park_temperature_high = "Historic high\ntemperature", + park_close = "Time park\nclosed" +) + +emm_wait_dag <- dagify( + wait_minutes_posted_avg ~ park_extra_magic_morning + park_close + park_ticket_season + park_temperature_high, + park_extra_magic_morning ~ park_temperature_high + park_close + park_ticket_season, + coords = coord_dag, + labels = labels, + exposure = "park_extra_magic_morning", + outcome = "wait_minutes_posted_avg" +) + +curvatures <- rep(0, 7) +curvatures[5] <- .3 + +emm_wait_dag |> + tidy_dagitty() |> + node_status() |> + ggplot( + aes(x, y, xend = xend, yend = yend, color = status) + ) + + geom_dag_edges_arc(curvature = curvatures, edge_color = "grey80") + + geom_dag_point() + + geom_dag_text_repel(aes(label = label), size = 3.8, seed = 1630, color = "#494949") + + scale_color_okabe_ito(na.value = "grey90") + + theme_dag() + + theme(legend.position = "none") + + coord_cartesian(clip = "off") + + scale_x_continuous( + limits = c(-1.25, 2.25), + breaks = c(-1, 0, 1, 2) + ) +``` + +In @fig-dag-magic-orig, there's only one adjustment set because all three confounders represent independent backdoor paths. +Let's say, though, that we had used @fig-dag-magic-missing instead, which is missing arrows from the park close time and historical temperature to whether there was an Extra Magic Morning. + +```{r} +#| label: fig-dag-magic-missing +#| echo: false +#| fig-cap: > +#| An alternative DAG for the relationship between Extra Magic Hours +#| in the morning at a particular park and the average wait +#| time between 9 am and 10 am. +#| This DAG has no arrows from park close time and historical temperature to Extra Magic Hours. 
+emm_wait_dag_missing <- dagify(
+  wait_minutes_posted_avg ~ park_extra_magic_morning + park_close + park_ticket_season + park_temperature_high,
+  park_extra_magic_morning ~ park_ticket_season,
+  coords = coord_dag,
+  labels = labels,
+  exposure = "park_extra_magic_morning",
+  outcome = "wait_minutes_posted_avg"
+)
+
+# produces below:
+# park_ticket_season, park_close + park_ticket_season, park_temperature_high + park_ticket_season, or park_close + park_temperature_high + park_ticket_season
+adj_sets <- unclass(dagitty::adjustmentSets(emm_wait_dag_missing, type = "all")) |>
+  map_chr(\(.x) glue::glue('{unlist(glue::glue_collapse(.x, sep = " + "))}')) |>
+  glue::glue_collapse(sep = ", ", last = ", or ")
+
+curvatures <- rep(0, 5)
+curvatures[3] <- .3
+
+emm_wait_dag_missing |>
+  tidy_dagitty() |>
+  node_status() |>
+  ggplot(
+    aes(x, y, xend = xend, yend = yend, color = status)
+  ) +
+  geom_dag_edges_arc(curvature = curvatures, edge_color = "grey80") +
+  geom_dag_point() +
+  geom_dag_text_repel(aes(label = label), size = 3.8, seed = 1630, color = "#494949") +
+  scale_color_okabe_ito(na.value = "grey90") +
+  theme_dag() +
+  theme(legend.position = "none") +
+  coord_cartesian(clip = "off") +
+  scale_x_continuous(
+    limits = c(-1.25, 2.25),
+    breaks = c(-1, 0, 1, 2)
+  )
+```
+
+Now there are `r length(dagitty::adjustmentSets(emm_wait_dag_missing, type = "all"))` potential adjustment sets: `park_ticket_season`, `park_close + park_ticket_season`, `park_temperature_high + park_ticket_season`, or `park_close + park_temperature_high + park_ticket_season`.
+@tbl-alt-sets presents the IPW estimates for each adjustment set.
+The effects are quite different.
+Some slight variation in the estimates is expected since they are estimated using different variables that may not be measured perfectly; if this DAG were right, however, we would expect them to be much more closely aligned than this.
+In particular, there seems to be a 3-minute difference between the models with and without park close time.
+The difference in these results implies that there is something off about the causal structure we specified.
+
+```{r}
+#| label: tbl-alt-sets
+#| tbl-cap: "A table of ATE estimates from the IPW estimator. Each estimate was calculated using one of the valid adjustment sets for the DAG, and the estimates are sorted by effect size. If the DAG is right and the data are well measured, different adjustment sets should give roughly the same answer."
+#| echo: false
+seven_dwarfs <- touringplans::seven_dwarfs_train_2018 |>
+  filter(wait_hour == 9)
+
+# we'll use `.data` and `.trt` later
+fit_ipw_effect <- function(.fmla, .data = seven_dwarfs, .trt = "park_extra_magic_morning", .outcome_fmla = wait_minutes_posted_avg ~ park_extra_magic_morning) {
+  # convert the exposure name to a symbol for tidy evaluation
+  .trt_var <- rlang::ensym(.trt)
+
+  # fit propensity score model
+  propensity_model <- glm(
+    .fmla,
+    data = .data,
+    family = binomial()
+  )
+
+  # calculate ATE weights
+  .df <- propensity_model |>
+    augment(type.predict = "response", data = .data) |>
+    mutate(w_ate = wt_ate(.fitted, !!.trt_var, exposure_type = "binary"))
+
+  # fit the weighted outcome model and pull the exposure coefficient
+  lm(.outcome_fmla, data = .df, weights = w_ate) |>
+    tidy() |>
+    filter(term == .trt) |>
+    pull(estimate)
+}
+
+effects <- list(
+  park_extra_magic_morning ~ park_ticket_season,
+  park_extra_magic_morning ~ park_close + park_ticket_season,
+  park_extra_magic_morning ~ park_temperature_high + park_ticket_season,
+  park_extra_magic_morning ~ park_temperature_high +
+    park_close + park_ticket_season
+) |>
+  map_dbl(fit_ipw_effect)
+
+tibble(
+  `Adjustment Set` = c(
+    "Ticket season",
+    "Close time, ticket season",
+    "Historic temperature, ticket season",
+    "Historic temperature, close time, ticket season"
+  ),
+  ATE = effects
+) |>
+  arrange(desc(ATE)) |>
+  gt()
+```
+
+### Negative controls
+
+Alternate adjustment sets are a way of probing the logical implications of your DAG: if it's correct, there may be several valid ways to account for the open backdoor paths.
+The reverse is also true: the causal structure of your research question also implies relationships that should be *null*.
+One way that researchers take advantage of this implication is through *negative controls*.
+A negative control is either an exposure (negative exposure control) or outcome (negative outcome control) similar to your question in as many ways as possible, except that there *shouldn't* be a causal effect.
+@Lipsitch2010 describe negative controls for observational research.
+In their article, they reference standard controls in bench science.
+In a lab experiment, any of these actions should lead to a null effect:
+
+1. Leave out an essential ingredient.
+2. Inactivate the hypothesized active ingredient.
+3. Check for an effect that would be impossible given the hypothesized mechanism.
+
+There's nothing unique to lab work here; these scientists merely probe the logical implications of their understanding and hypotheses.
+To find a good negative control, you usually need to extend your DAG to include more of the causal structure surrounding your question.
+Let's look at some examples.
+
+#### Negative exposures
+
+First, we'll look at a negative exposure control.
+If Extra Magic Mornings really cause an increase in wait time, it stands to reason that this effect is time-limited.
+In other words, there should be some period after which the effect of an Extra Magic Morning dissipates.
+Let's call today *i* and an earlier day *i - n*, where *n* is the number of days before the outcome that the negative exposure control occurs.
+Let's start with `n = 63`, i.e., whether or not there was an Extra Magic Morning nine weeks ago.
+That is a pretty reasonable starting point: it's unlikely that the effect on wait time would still be present 63 days later.
+This analysis is an example of leaving out an essential ingredient: we waited too long for this to be a realistic cause.
+Any remaining effect is likely due to residual confounding.
+
+Let's look at a DAG to visualize this situation.
+In @fig-dag-day-i, we've added an identical layer to our original DAG: now there are two Extra Magic Mornings, one for day `i` and one for day `i - 63`.
+Similarly, there are two versions of each confounder, one for each day.
+One crucial detail in this DAG is that we're assuming that there *is* an effect of day `i - 63`'s Extra Magic Morning on day `i`'s: whether or not there is an Extra Magic Morning one day likely affects whether or not it happens on another day.
+The decision about where to place them across the year is not random.
+If this is true, we *would* expect an effect: the indirect effect via day `i`'s Extra Magic Morning status.
+To get a valid negative control, we need to *inactivate this effect*, which we can do statistically by controlling for day `i`'s Extra Magic Morning status.
+So, given the DAG, our adjustment set is any combination of the confounders (as long as we have at least one version of each) plus day `i`'s Extra Magic Morning (suppressing the indirect effect).
+
+```{r}
+#| label: fig-dag-day-i
+#| echo: false
+#| fig-cap: >
+#| An expansion of the causal structure presented in @fig-dag-magic.
+#| In this DAG, the exposure is instead whether or not there were Extra Magic Hours
+#| 63 days before the day whose wait time we are examining.
+#| Because of the long period, there should be no effect.
+#| The DAG also includes the earlier versions of the confounders, those related to day `i - 63`.
+labels <- c(
+  x63 = "Extra Magic\nMorning (i-63)",
+  x = "Extra Magic\nMorning (i)",
+  y = "Average\nwait",
+  season = "Ticket\nSeason",
+  weather = "Historic\nhigh\ntemperature",
+  close = "Time park\nclosed (i)",
+  season63 = "Ticket Season\n(i-63)",
+  weather63 = "Historic\nhigh\ntemperature\n(i-63)",
+  close63 = "Time park\nclosed (i-63)"
+)
+
+dagify(
+  y ~ x + close + season + weather,
+  x ~ weather + close + season + x63,
+  x63 ~ weather63 + close63 + season63,
+  weather ~ weather63,
+  close ~ close63,
+  season ~ season63,
+  coords = time_ordered_coords(),
+  labels = labels,
+  exposure = "x63",
+  outcome = "y"
+) |>
+  tidy_dagitty() |>
+  node_status() |>
+  ggplot(
+    aes(x, y, xend = xend, yend = yend, color = status)
+  ) +
+  geom_dag_edges_link(edge_color = "grey80") +
+  geom_dag_point() +
+  geom_dag_text_repel(aes(label = label), size = 3.8, color = "#494949") +
+  scale_color_okabe_ito(na.value = "grey90") +
+  theme_dag() +
+  theme(legend.position = "none") +
+  coord_cartesian(clip = "off")
+```
+
+Since the exposure is on day `i - 63`, we prefer to control for the confounders related to that day, so we'll use the `i - 63` versions.
+We'll use `lag()` from dplyr to get those variables.
+
+```{r}
+#| eval: false
+n_days_lag <- 63
+distinct_emm <- seven_dwarfs_train_2018 |>
+  filter(wait_hour == 9) |>
+  arrange(park_date) |>
+  transmute(
+    park_date,
+    prev_park_extra_magic_morning = lag(park_extra_magic_morning, n = n_days_lag),
+    prev_park_temperature_high = lag(park_temperature_high, n = n_days_lag),
+    prev_park_close = lag(park_close, n = n_days_lag),
+    prev_park_ticket_season = lag(park_ticket_season, n = n_days_lag)
+  )
+
+seven_dwarfs_train_2018_lag <- seven_dwarfs_train_2018 |>
+  filter(wait_hour == 9) |>
+  left_join(distinct_emm, by = "park_date") |>
+  drop_na(prev_park_extra_magic_morning)
+```
+
+```{r}
+#| echo: false
+calculate_coef <- function(n_days_lag) {
+  distinct_emm <- seven_dwarfs_train_2018 |>
+    filter(wait_hour == 9) |>
+    arrange(park_date) |>
+    transmute(
+      park_date,
+      prev_park_extra_magic_morning = lag(park_extra_magic_morning, n = n_days_lag),
+      prev_park_temperature_high = lag(park_temperature_high, n = n_days_lag),
+      prev_park_close = lag(park_close, n = n_days_lag),
+      prev_park_ticket_season = lag(park_ticket_season, n = n_days_lag)
+    )
+
+  seven_dwarfs_train_2018_lag <- seven_dwarfs_train_2018 |>
+    filter(wait_hour == 9) |>
+    left_join(distinct_emm, by = "park_date") |>
+    drop_na(prev_park_extra_magic_morning)
+
+  fit_ipw_effect(
+    prev_park_extra_magic_morning ~ prev_park_temperature_high + prev_park_close + prev_park_ticket_season,
+    .data = seven_dwarfs_train_2018_lag,
+    .trt = "prev_park_extra_magic_morning",
+    .outcome_fmla = wait_minutes_posted_avg ~ prev_park_extra_magic_morning + park_extra_magic_morning
+  )
+}
+
+result63 <- calculate_coef(63) |>
+  round(2)
+```
+
+When we use these data with the IPW estimator, we get `r result63` minutes, much closer to null than the estimate we got for day `i`.
+Let's take a look at the effect over time.
+While there might be a lingering effect of Extra Magic Mornings for a little while (say, the span of an average trip to Disney World), it should quickly approach null.
+However, in @fig-sens-i-63, we see that, while the effect eventually approaches null, it hovers above null for quite some time.
+If these results are accurate, they imply that we have some residual confounding in our effect.
+
+```{r}
+#| label: fig-sens-i-63
+#| fig-cap: >
+#| A scatterplot with a smoothed regression of the relationship between wait times on day `i` and whether there were Extra Magic Hours on day `i - n`, where `n` is the number of days before day `i`. We expect this relationship to rapidly approach the null, but the effect hovers above null for quite some time. This lingering effect implies we have some residual confounding present.
+#| echo: false
+#| warning: false
+#| message: false
+coefs <- purrr::map_dbl(1:63, calculate_coef)
+
+ggplot(data.frame(coefs = coefs, x = 1:63), aes(x = x, y = coefs)) +
+  geom_hline(yintercept = 0) +
+  geom_point() +
+  geom_smooth(se = FALSE) +
+  labs(y = "difference in wait times (minutes)\n on day (i) for EMM on day (i - n)", x = "day (i - n)")
+```
+
+#### Negative outcomes
+
+Now, let's examine an example of a negative control outcome: the wait time for a ride at Universal Studios.
+Universal Studios is also in Orlando, so the causes of wait times there are likely comparable to those at Disney World on the same day.
+Of course, whether or not there are Extra Magic Mornings at Disney shouldn't affect the wait times at Universal on the same day: they are separate parks, and most people don't visit both within an hour of one another.
+This negative control is an example of an effect that is implausible given the hypothesized mechanism.
+
+We don't have Universal's ride data, so let's simulate what would happen with and without residual confounding.
+We'll generate wait times based on the historical temperature, park close time, and ticket season (the latter two are technically specific to Disney, but we expect a strong correlation with the Universal versions).
+Because this is a negative outcome, it is not related to whether or not there were Extra Magic Morning hours at Disney.
+
+```{r}
+seven_dwarfs_sim <- seven_dwarfs_train_2018 |>
+  mutate(
+    # we scale each variable and add a bit of random noise
+    # to simulate reasonable Universal wait times
+    wait_time_universal =
+      park_temperature_high / 150 +
+      as.numeric(park_close) / 1500 +
+      as.integer(factor(park_ticket_season)) / 1000 +
+      rnorm(n(), 5, 5)
+  )
+```
+
+```{r}
+#| echo: false
+wait_universal <- seven_dwarfs_sim |>
+  fit_ipw_effect(
+    park_extra_magic_morning ~ park_temperature_high +
+      park_close + park_ticket_season,
+    .data = _,
+    .outcome_fmla = wait_time_universal ~ park_extra_magic_morning
+  ) |>
+  round(2)
+```
+
+If we calculate the IPW effect of `park_extra_magic_morning` on `wait_time_universal`, we get `r wait_universal` minutes, a roughly null effect, as expected.
+But what if we missed an unmeasured confounder, `u`, which causes Extra Magic Mornings and wait times at both Disney and Universal?
+Let's simulate that scenario by augmenting the data further.
+
+```{r}
+seven_dwarfs_sim2 <- seven_dwarfs_train_2018 |>
+  mutate(
+    # u is an unmeasured confounder of the exposure and both outcomes
+    u = rnorm(n(), mean = 10, sd = 3),
+    wait_minutes_posted_avg = wait_minutes_posted_avg + u,
+    park_extra_magic_morning = ifelse(
+      u > 10,
+      rbinom(1, 1, .1),
+      park_extra_magic_morning
+    ),
+    wait_time_universal =
+      park_temperature_high / 150 +
+      as.numeric(park_close) / 1500 +
+      as.integer(factor(park_ticket_season)) / 1000 +
+      u +
+      rnorm(n(), 5, 5)
+  )
+```
+
+```{r}
+#| echo: false
+disney <- seven_dwarfs_sim2 |>
+  fit_ipw_effect(
+    park_extra_magic_morning ~ park_temperature_high +
+      park_close + park_ticket_season,
+    .data = _
+  ) |>
+  round(2)
+
+universal <- seven_dwarfs_sim2 |>
+  fit_ipw_effect(
+    park_extra_magic_morning ~ park_temperature_high +
+      park_close + park_ticket_season,
+    .data = _,
+    .outcome_fmla = wait_time_universal ~ park_extra_magic_morning
+  ) |>
+  round(2)
+```
+
+Now, the effects for both Disney and Universal wait times are different.
+If we had seen `r disney` minutes as the effect for Disney, we wouldn't necessarily know that we had a confounded result.
+However, since we know the wait times at Universal should be unrelated, it's suspicious that the result for Universal, `r universal` minutes, is not null.
+That is evidence of unmeasured confounding.
+
+### DAG-data consistency
+
+Negative controls use the logical implications of the causal structure you assume.
+We can extend that idea to the entire DAG.
+If the DAG is correct, it has many testable statistical implications: some variables in the DAG should be related to each other, and others should not.
+Like negative controls, we can check whether variables that *should* be independent *are* independent in the data.
+Sometimes, a DAG implies that two variables are independent only *conditional* on other variables.
+Thus, this technique is sometimes called checking *implied conditional independencies* [@Textor2016].
+Let's query our original DAG to find out what it says about the relationships among the variables.
+
+```{r}
+query_conditional_independence(emm_wait_dag) |>
+  unnest(conditioned_on)
+```
+
+In this DAG, three relationships should be null: 1) `park_close` and `park_temperature_high`, 2) `park_close` and `park_ticket_season`, and 3) `park_temperature_high` and `park_ticket_season`.
+None of these pairs requires conditioning on another variable to achieve independence; in other words, they should be unconditionally independent.
+We can use simple techniques like correlation and regression, as well as other statistical tests, to check whether these relationships are indeed null in the data.
+Conditional independencies quickly grow in number in complex DAGs, so dagitty implements a way to automate checks of DAG-data consistency given these implied nulls.
+dagitty checks whether the residuals of a given conditional relationship are correlated, and the residuals can be modeled automatically in several ways.
+We'll tell dagitty to calculate the residuals using non-linear models with `type = "cis.loess"`.
+Since we're working with correlations, the results should be around 0 if our DAG is right.
+As we see in @fig-conditional-ind, though, one relationship doesn't hold: there is a correlation between the park's close time and ticket season.
+
+```{r}
+#| label: fig-conditional-ind
+#| fig-cap: >
+#| A plot of the estimates and 95% confidence intervals of the correlations between the residuals from regressions of DAG variables that should have no relationship. While two relationships appear null, park close time and ticket season seem to be correlated, suggesting we have misspecified the DAG. One source of this misspecification may be missing arrows between the variables. Notably, the adjustment sets are identical with and without this arrow.
+test_conditional_independence(
+  emm_wait_dag,
+  data = seven_dwarfs_train_2018 |>
+    filter(wait_hour == 9) |>
+    mutate(
+      across(where(is.character), factor),
+      park_close = as.numeric(park_close),
+    ) |>
+    as.data.frame(),
+  type = "cis.loess",
+  # use 200 bootstrapped samples to calculate CIs
+  R = 200
+) |>
+  ggdag_conditional_independence()
+```
+
+Why might we be seeing a relationship when there isn't supposed to be one?
+A simple explanation is chance: just like in any statistical inference, we need to be cautious about over-extrapolating from our limited sample.
+Since we have data for every day in 2018, though, we can probably rule that out.
+Another reason is that we're missing direct arrows from one variable to another, e.g., from historic temperature to park close time.
+Adding such arrows is reasonable: park close time and ticket season closely track the weather.
+That's a little bit of evidence that we're missing an arrow.
+
+At this point, we need to be cautious about overfitting the DAG to the data.
+DAG-data consistency tests *cannot* prove your DAG right or wrong, and as we saw in @sec-quartets, statistical techniques alone cannot determine the causal structure of a problem.
+So why use these tests?
+As with negative controls, they provide a way to probe your assumptions.
+While we can never be sure about those assumptions, we *do* have information in the data.
+Finding that a conditional independence holds is a little more evidence in support of your assumptions.
+There's a fine line here, so we recommend being transparent about these types of checks: if you make changes based on the results of these tests, you should report your original DAG, too.
+Notably, in this case, adding direct arrows among all three of these variables results in an identical adjustment set.
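+Here's a quick sketch confirming that (a minimal version of the DAG, without coordinates or labels; the directions of the new arrows among the confounders are our assumption):
+
+```{r}
+# add arrows among the confounders to reflect the correlations we detected,
+# then confirm that the minimal adjustment set is unchanged
+emm_wait_dag_extra <- dagify(
+  wait_minutes_posted_avg ~ park_extra_magic_morning + park_close +
+    park_ticket_season + park_temperature_high,
+  park_extra_magic_morning ~ park_temperature_high + park_close + park_ticket_season,
+  park_close ~ park_temperature_high + park_ticket_season,
+  park_ticket_season ~ park_temperature_high,
+  exposure = "park_extra_magic_morning",
+  outcome = "wait_minutes_posted_avg"
+)
+
+dagitty::adjustmentSets(emm_wait_dag)
+dagitty::adjustmentSets(emm_wait_dag_extra)
+```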
+
+Let's look at an example that is more likely to be misspecified, where we remove the arrows from park close time and ticket season to Extra Magic Morning.
+
+```{r}
+#| echo: false
+labels <- c(
+  park_extra_magic_morning = "Extra Magic\nMorning",
+  wait_minutes_posted_avg = "Average\nwait",
+  park_ticket_season = "Ticket\nSeason",
+  park_temperature_high = "Historic high\ntemperature",
+  park_close = "Time park\nclosed"
+)
+```
+
+```{r}
+emm_wait_dag2 <- dagify(
+  wait_minutes_posted_avg ~ park_extra_magic_morning + park_close +
+    park_ticket_season + park_temperature_high,
+  park_extra_magic_morning ~ park_temperature_high,
+  coords = coord_dag,
+  labels = labels,
+  exposure = "park_extra_magic_morning",
+  outcome = "wait_minutes_posted_avg"
+)
+
+query_conditional_independence(emm_wait_dag2) |>
+  unnest(conditioned_on)
+```
+
+This alternative DAG implies two new pairs of variables that should be independent.
+In @fig-conditional-ind-misspec, we see an additional association between ticket season and Extra Magic Morning.
+
+```{r}
+#| label: fig-conditional-ind-misspec
+#| fig-cap: >
+#| A plot of the estimates and 95% confidence intervals of the correlations between the residuals from regressions of DAG variables that should have no relationship. In addition to park close time and ticket season, ticket season and Extra Magic Morning also appear correlated, suggesting we have misspecified the DAG. One source of this misspecification may be missing arrows between the variables.
+test_conditional_independence(
+  emm_wait_dag2,
+  data = seven_dwarfs_train_2018 |>
+    filter(wait_hour == 9) |>
+    mutate(
+      across(where(is.character), factor),
+      park_close = as.numeric(park_close),
+    ) |>
+    as.data.frame(),
+  type = "cis.loess",
+  R = 200
+) |>
+  ggdag_conditional_independence()
+```
+
+So, is this DAG wrong?
+Based on our understanding of the problem, it seems likely that's the case, but interpreting DAG-data consistency tests has a hiccup: different DAGs can have the same set of implied conditional independencies.
+In the case of our DAG, one other DAG can generate the same implied conditional independencies (@fig-equiv-dag).
+These are called *equivalent* DAGs because their implications are the same.
+
+```{r}
+#| eval: false
+ggdag_equivalent_dags(emm_wait_dag2)
+```
+
+```{r}
+#| label: fig-equiv-dag
+#| echo: false
+#| fig-width: 9
+#| fig-cap: >
+#| Equivalent DAGs for the likely misspecified version of @fig-dag-magic.
+#| These two DAGs produce the same set of implied conditional independencies.
+#| The difference between them is only the direction of the arrow between
+#| historic high temperature and Extra Magic Hours.
+curvatures <- rep(0, 10)
+curvatures[c(4, 9)] <- .25
+
+ggdag_equivalent_dags(emm_wait_dag2, use_edges = FALSE, use_text = FALSE) +
+  geom_dag_edges_arc(data = function(x) distinct(x), curvature = curvatures, edge_color = "grey80") +
+  geom_dag_edges_link(data = function(x) filter(x, (name == "park_extra_magic_morning" & to == "park_temperature_high") | (name == "park_temperature_high" & to == "park_extra_magic_morning")), edge_color = "black") +
+  geom_dag_text_repel(aes(label = label), data = function(x) filter(x, label %in% c("Extra Magic\nMorning", "Historic high\ntemperature")), box.padding = 15, seed = 12, color = "#494949") +
+  theme_dag()
+```
+
+Equivalent DAGs are generated by *reversing* arrows.
+The subset of DAGs with reversible arrows that generate the same implications is called an *equivalence class*.
+Though this is a technical detail, it allows us to condense the visualization to a single DAG in which the *reversible* edges are drawn as lines without arrowheads.
+
+```{r}
+#| eval: false
+ggdag_equivalent_class(emm_wait_dag2, use_text = FALSE, use_labels = TRUE)
+```
+
+```{r}
+#| label: fig-equiv-class
+#| echo: false
+#| fig-width: 5
+#| fig-cap: >
+#| An alternative way of visualizing @fig-equiv-dag in which all the equivalent
+#| DAGs are condensed to a single version whose *reversible* edges are drawn
+#| without arrowheads.
+curvatures <- rep(0, 4)
+curvatures[3] <- .25
+
+emm_wait_dag2 |>
+  node_equivalent_class() |>
+  ggdag(use_edges = FALSE, use_text = FALSE) +
+  # note: ggdag spells this column "reversable"
+  geom_dag_edges_arc(data = function(x) filter(x, !reversable), curvature = curvatures, edge_color = "grey90") +
+  geom_dag_edges_link(data = function(x) filter(x, reversable), arrow = NULL) +
+  geom_dag_text_repel(aes(label = label), data = function(x) filter(x, label %in% c("Extra Magic\nMorning", "Historic high\ntemperature")), box.padding = 16, seed = 12, size = 5, color = "#494949") +
+  theme_dag()
+```
+
+So, what do we do with this information?
+Since many DAGs can produce the same set of conditional independencies, one strategy is to find all the adjustment sets that would be valid for every DAG in the equivalence class.
+dagitty makes this straightforward with `equivalenceClass()` and `adjustmentSets()`, but in this case, there are *no* overlapping adjustment sets.
+
+```{r}
+library(dagitty)
+# determine valid sets for all equiv. DAGs
+equivalenceClass(emm_wait_dag2) |>
+  adjustmentSets(type = "all")
 ```
-## Tipping point analyses
+We can see that by looking at the individual equivalent DAGs.
+
+```{r}
+dags <- equivalentDAGs(emm_wait_dag2)
+
+# no overlapping sets
+dags[[1]] |> adjustmentSets(type = "all")
+dags[[2]] |> adjustmentSets(type = "all")
+```
+
+The good news is that, in this case, one of the equivalent DAGs doesn't make logical sense: it has an arrow from Extra Magic Morning to historic high temperature, which is impossible both for time-ordering reasons (the historical temperature occurs in the past) and for logical ones (Disney may be powerful, but to our knowledge, they can't yet control the weather).
+Even though these checks bring more of the data to bear, we still need to consider the logical and time-ordered plausibility of the possible scenarios.
+
+### Alternate DAGs
+
+As we mentioned in @sec-dags-iterate, you should specify your DAG ahead of time with ample feedback from other experts.
+Let's now take the opposite approach to the last example: what if we used the original DAG but received feedback after the analysis that we should add more variables?
+Consider the expanded DAG in @fig-dag-extra-days.
+We've added two new confounders: whether it's a weekend and whether it's a holiday.
+This analysis differs from checking alternate adjustment sets in the same DAG; there, we checked the DAG's logical consistency.
+Here, we're considering a different causal structure.
+
+```{r}
+#| label: fig-dag-extra-days
+#| fig-cap: >
+#| An expansion of @fig-dag-magic, which now includes two new variables on their own backdoor paths: whether it's a holiday and whether it's a weekend.
+#| echo: false + +labels <- c( + park_extra_magic_morning = "Extra Magic\nMorning", + wait_minutes_posted_avg = "Average\nwait", + park_ticket_season = "Ticket\nSeason", + park_temperature_high = "Historic high\ntemperature", + park_close = "Time park\nclosed", + is_weekend = "Weekend", + is_holiday = "Holiday" +) + +emm_wait_dag3 <- dagify( + wait_minutes_posted_avg ~ park_extra_magic_morning + park_close + park_ticket_season + park_temperature_high + is_weekend + is_holiday, + park_extra_magic_morning ~ park_temperature_high + park_close + park_ticket_season + is_weekend + is_holiday, + park_close ~ is_weekend + is_holiday, + coords = time_ordered_coords(), + labels = labels, + exposure = "park_extra_magic_morning", + outcome = "wait_minutes_posted_avg" +) + +curvatures <- rep(0, 13) +curvatures[11] <- .25 + +emm_wait_dag3 |> + tidy_dagitty() |> + node_status() |> + ggplot( + aes(x, y, xend = xend, yend = yend, color = status) + ) + + geom_dag_edges_arc(curvature = curvatures, edge_color = "grey80") + + geom_dag_point() + + geom_dag_text_repel(aes(label = label), size = 3.8, seed = 16301, color = "#494949") + + scale_color_okabe_ito(na.value = "grey90") + + theme_dag() + + theme(legend.position = "none") + + coord_cartesian(clip = "off") +``` + +We can calculate these features from `park_date` using the timeDate package. + +```{r} +library(timeDate) + +holidays <- c( + "USChristmasDay", + "USColumbusDay", + "USIndependenceDay", + "USLaborDay", + "USLincolnsBirthday", + "USMemorialDay", + "USMLKingsBirthday", + "USNewYearsDay", + "USPresidentsDay", + "USThanksgivingDay", + "USVeteransDay", + "USWashingtonsBirthday" +) |> + holiday(2018, Holiday = _) |> + as.Date() + +seven_dwarfs_with_days <- seven_dwarfs_train_2018 |> + mutate( + is_holiday = park_date %in% holidays, + is_weekend = isWeekend(park_date) + ) |> + filter(wait_hour == 9) +``` + +Both Extra Magic Morning hours and posted wait times are associated with whether it's a holiday or weekend. + +```{r} +#| label: tbl-days +#| tbl-cap: > +#| The descriptive associations between the two new variables, holiday and weekend, and the exposure and outcome. The average posted waiting time differs on both holidays and weekends, as do the occurrences of Extra Magic Hours. While we can't determine a confounding relationship from descriptive statistics alone, this adds to the evidence that these are confounders. +#| echo: false +tbl_labels <- list( + is_weekend ~ "Weekend", + is_holiday ~ "Holiday", + park_extra_magic_morning ~ "Extra Magic Morning", + wait_minutes_posted_avg ~ "Posted Wait Time" +) + +tbl_data_days <- seven_dwarfs_with_days |> + select(wait_minutes_posted_avg, park_extra_magic_morning, is_weekend, is_holiday) + +tbl1 <- gtsummary::tbl_summary( + tbl_data_days, + by = is_weekend, + label = tbl_labels[-2], + include = -is_holiday +) + +tbl2 <- gtsummary::tbl_summary( + tbl_data_days, + by = is_holiday, + label = tbl_labels[-1], + include = -is_weekend +) + +gtsummary::tbl_merge(list(tbl1, tbl2), c("Weekend", "Holiday")) +``` + +```{r} +#| echo: false +ipw_results_with_days <- fit_ipw_effect( + park_extra_magic_morning ~ park_temperature_high + + park_close + park_ticket_season + is_weekend + is_holiday, + .data = seven_dwarfs_with_days +) |> round(2) +``` + +When we refit the IPW estimator, we get `r ipw_results_with_days` minutes, slightly bigger than we got without the two new confounders. +Because it was a deviation from the analysis plan, you should likely report both effects. 
+That said, this new DAG is probably more correct than the original one.
+From a decision point of view, though, the difference is slight in absolute terms (about a minute), and the effect is in the same direction as the original estimate.
+In other words, as far as how we might act on the information, the result is not terribly sensitive to this change.
+
+One other point: people sometimes present results from increasingly complicated adjustment sets.
+This practice comes from the tradition of comparing complex models to parsimonious ones.
+That type of comparison is a sensitivity analysis in its own right, but it should be principled: rather than fitting simple models for simplicity's sake, you should compare *competing* adjustment sets or conditions.
+For instance, you may feel that these two DAGs are equally plausible, or you may want to examine whether adding other variables better captures the baseline crowd flow at the Magic Kingdom.
+
+## Quantitative bias analyses
+
+Thus far, we've probed some of the assumptions we've made about the causal structure of the question.
+We can take this further using quantitative bias analysis (QBA), which uses mathematical assumptions to see how our results would change under different conditions.
+
+### Tipping point analyses
+
+### Other types of QBA
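+
+One widely used QBA is the E-value of VanderWeele and Ding: the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need to have with both the exposure and the outcome to fully explain away an observed effect.
+Here's a minimal sketch of the calculation (the formula is the standard E-value formula for a risk ratio; the risk ratio below is a made-up value for illustration):
+
+```{r}
+# E-value for a point estimate on the risk ratio scale:
+# E = RR + sqrt(RR * (RR - 1))
+e_value <- function(rr) {
+  # for protective effects (RR < 1), take the inverse first
+  rr <- ifelse(rr < 1, 1 / rr, rr)
+  rr + sqrt(rr * (rr - 1))
+}
+
+# a hypothetical risk ratio of 1.9 would require an unmeasured confounder
+# associated with both exposure and outcome by risk ratios of about 3.2
+# each to explain it away
+e_value(1.9)
+```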