Merge branch 'main' into dag_sens
malcolmbarrett authored Jul 1, 2024
2 parents 2ea1370 + a4edc6c commit 9243492
Showing 7 changed files with 55 additions and 204 deletions.
4 changes: 2 additions & 2 deletions _freeze/chapters/03-counterfactuals/execute-results/html.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions _freeze/index/execute-results/html.json
@@ -1,7 +1,7 @@
{
"hash": "b381f8bdbbfe539249501c824bd96cfe",
"hash": "b13605468b718b90b4a5eec92fa59ff7",
"result": {
"markdown": "# Preface {.unnumbered}\n\nWelcome to *Causal Inference in R*.\nAnswering causal questions is critical for scientific and business purposes, but techniques like randomized clinical trials and A/B testing are not always practical or successful.\nThe tools in this book will allow readers to better make causal inferences with observational data with the R programming language.\nBy its end, we hope to help you:\n\n1. Ask better causal questions.\n2. Understand the assumptions needed for causal inference\n3. Identify the target population for which you want to make inferences\n4. Fit causal models and check their problems\n5. Conduct sensitivity analyses where the techniques we use might be imperfect\n\nThis book is for both academic researchers and data scientists.\nAlthough the questions may differ between these settings, many techniques are the same: causal inference is as helpful for asking questions about cancer as it is about clicks.\nWe use a mix of examples from medicine, economics, and tech to demonstrate that you need a clear causal question and a willingness to be transparent about your assumptions.\n\nYou'll learn a lot in this book, but ironically, you won't learn much about conducting randomized trials, one of the best tools for causal inferences.\nRandomized trials, and their cousins, A/B tests (standard in the tech world), are compelling because they alleviate many of the assumptions we need to make for valid inferences.\nThey are also sufficiently complex in design to merit their own learning resources.\nInstead, we'll focus on observational data where we don't usually benefit from randomization.\nIf you're interested in randomization techniques, don't put away this resource just yet: many causal inference techniques designed for observational data improve randomized analyses, too.\n\nWe're making a few assumptions about you as a reader:\n\n1. 
You're familiar with the [tidyverse](https://www.tidyverse.org/) ecosystem of R packages and their general philosophy. For instance, we use a lot of dplyr and ggplot2 in this book, but we won't explain their basic grammar. To learn more about starting with the tidyverse, we recommend [R for Data Science](https://r4ds.hadley.nz/).\n2. You're familiar with basic statistical modeling in R. For instance, we'll fit many models with `lm()` and `glm()`, but we won't discuss how they work. If you want to learn more about R's powerful modeling functions, we recommend reading [\"A Review of R Modeling Fundamentals\"](https://www.tmwr.org/base-r.html) in [Tidy Modeling with R](https://www.tmwr.org).\n3. We also assume you have familiarity with other R basics, such as [writing functions](https://r4ds.hadley.nz/functions.html). [R for Data Science](https://r4ds.hadley.nz/) is also a good resource for these topics. (For a deeper dive into the R programming language, we recommend [Advanced R](https://adv-r.hadley.nz/index.html), although we don't assume you have mastered its material for this book).\n\nWe'll also use tools from the tidymodels ecosystem, a set of R packages for modeling related to the tidyverse.\nWe don't assume you have used them before.\ntidymodels also focuses on predictive modeling, so many of its tools aren't appropriate for this book.\nNevertheless, if you are interested in this topic, we recommend [Tidy Modeling with R](https://www.tmwr.org).\n\nThere are also several other excellent books on causal inference.\nThis book is different in its focus on R, but it's still helpful to see this area from other perspectives.\nA few books you might like:\n\n- [*Causal Inference: What If?*](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/)\n- [*Causal Inference: The Mixtape*](https://mixtape.scunning.com/)\n- [*The Effect*](https://theeffectbook.net/)\n\nThe first book is focused on epidemiology.\nThe latter two are focused on econometrics.\nWe also 
recommend *The Book of Why* @pearl2018why for more on causal diagrams.\n\n## Conventions\n\n### Modern R Features\n\nWe use two modern R features in R 4.1.0 and above in this book.\nThe first is the native pipe, `|>`.\nThis R feature is similar to the tidyverse's `%>%`, with which you may be more familiar.\nIn typical cases, the two work interchangeably.\nOne notable difference is that `|>` uses the `_` symbol to direct the pipe's results, e.g., `.df |> lm(y ~ x, data = _)`.\nSee [this Tidyverse Blog post](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/) for more on this topic.\n\nAnother modern R feature we use is the native lambda, a way of writing short functions that looks like `\\(.x) do_something(.x)`.\nIt is similar to purrr's `~` lambda notation.\nIt's also helpful to realize the native lambda is identical to `function(.x) do_something(.x)`, where `\\` is shorthand for `function`.\nSee [R for Data Science's chapter on iteration](https://r4ds.hadley.nz/iteration.html) for more on this topic.\n\n## Theming\n\nThe plots in this book use a consistent theme that we don't include in every code chunk, meaning if you run the code for a visualization, you might get a slightly different-looking result.\nWe set the following defaults related to ggplot2:\n\n<!-- TODO: make sure these are up to date -->\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(\n # set default colors in ggplot2 to colorblind-friendly\n # Okabe-Ito and Viridis palettes\n ggplot2.discrete.colour = ggokabeito::palette_okabe_ito(),\n ggplot2.discrete.fill = ggokabeito::palette_okabe_ito(),\n ggplot2.continuous.colour = \"viridis\",\n ggplot2.continuous.fill = \"viridis\",\n # set theme font and size\n book.base_family = \"sans\",\n book.base_size = 14\n)\n\nlibrary(ggplot2)\n\n# set default theme\ntheme_set(\n theme_minimal(\n base_size = getOption(\"book.base_size\"),\n base_family = getOption(\"book.base_family\")\n ) %+replace%\n theme(\n panel.grid.minor = element_blank(),\n 
legend.position = \"bottom\"\n )\n)\n```\n:::\n\n\nWe also mask a few functions from ggdag that we like to customize:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntheme_dag <- function() {\n ggdag::theme_dag(base_family = getOption(\"book.base_family\"))\n}\n\ngeom_dag_label_repel <- function(..., seed = 10) {\n ggdag::geom_dag_label_repel(\n aes(x, y, label = label),\n box.padding = 3.5,\n inherit.aes = FALSE,\n max.overlaps = Inf,\n family = getOption(\"book.base_family\"),\n seed = seed,\n label.size = NA,\n label.padding = 0.1,\n size = getOption(\"book.base_size\") / 3,\n ...\n )\n}\n```\n:::\n",
"markdown": "# Preface {.unnumbered}\n\nWelcome to *Causal Inference in R*.\nAnswering causal questions is critical for scientific and business purposes, but techniques like randomized clinical trials and A/B testing are not always practical or successful.\nThe tools in this book will allow readers to better make causal inferences with observational data with the R programming language.\nBy its end, we hope to help you:\n\n1. Ask better causal questions.\n2. Understand the assumptions needed for causal inference\n3. Identify the target population for which you want to make inferences\n4. Fit causal models and check their problems\n5. Conduct sensitivity analyses where the techniques we use might be imperfect\n\nThis book is for both academic researchers and data scientists.\nAlthough the questions may differ between these settings, many techniques are the same: causal inference is as helpful for asking questions about cancer as it is about clicks.\nWe use a mix of examples from medicine, economics, and tech to demonstrate that you need a clear causal question and a willingness to be transparent about your assumptions.\n\nYou'll learn a lot in this book, but ironically, you won't learn much about conducting randomized trials, one of the best tools for causal inferences.\nRandomized trials, and their cousins, A/B tests (standard in the tech world), are compelling because they alleviate many of the assumptions we need to make for valid inferences.\nThey are also sufficiently complex in design to merit their own learning resources.\nInstead, we'll focus on observational data where we don't usually benefit from randomization.\nIf you're interested in randomization techniques, don't put away this resource just yet: many causal inference techniques designed for observational data improve randomized analyses, too.\n\nWe're making a few assumptions about you as a reader:\n\n1. 
You're familiar with the [tidyverse](https://www.tidyverse.org/) ecosystem of R packages and their general philosophy. For instance, we use a lot of dplyr and ggplot2 in this book, but we won't explain their basic grammar. To learn more about starting with the tidyverse, we recommend [*R for Data Science*](https://r4ds.hadley.nz/).\n2. You're familiar with basic statistical modeling in R. For instance, we'll fit many models with `lm()` and `glm()`, but we won't discuss how they work. If you want to learn more about R's powerful modeling functions, we recommend reading [\"A Review of R Modeling Fundamentals\"](https://www.tmwr.org/base-r.html) in [*Tidy Modeling with R*](https://www.tmwr.org).\n3. We also assume you have familiarity with other R basics, such as [writing functions](https://r4ds.hadley.nz/functions.html). [*R for Data Science*](https://r4ds.hadley.nz/) is also a good resource for these topics. (For a deeper dive into the R programming language, we recommend [*Advanced R*](https://adv-r.hadley.nz/index.html), although we don't assume you have mastered its material for this book).\n\nWe'll also use tools from the tidymodels ecosystem, a set of R packages for modeling related to the tidyverse.\nWe don't assume you have used them before.\ntidymodels also focuses on predictive modeling, so many of its tools aren't appropriate for this book.\nNevertheless, if you are interested in this topic, we recommend [*Tidy Modeling with R*](https://www.tmwr.org).\n\nThere are also several other excellent books on causal inference.\nThis book is different in its focus on R, but it's still helpful to see this area from other perspectives.\nA few books you might like:\n\n- [*Causal Inference: What If?*](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/)\n- [*Causal Inference: The Mixtape*](https://mixtape.scunning.com/)\n- [*The Effect*](https://theeffectbook.net/)\n\nThe first book is focused on epidemiology.\nThe latter two are focused on 
econometrics.\nWe also recommend *The Book of Why* @pearl2018why for more on causal diagrams.\n\n## Conventions\n\n### Modern R Features\n\nWe use two modern R features in R 4.1.0 and above in this book.\nThe first is the native pipe, `|>`.\nThis R feature is similar to the tidyverse's `%>%`, with which you may be more familiar.\nIn typical cases, the two work interchangeably.\nOne notable difference is that `|>` uses the `_` symbol to direct the pipe's results, e.g., `.df |> lm(y ~ x, data = _)`.\nSee [this Tidyverse Blog post](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/) for more on this topic.\n\nAnother modern R feature we use is the native lambda, a way of writing short functions that looks like `\\(.x) do_something(.x)`.\nIt is similar to purrr's `~` lambda notation.\nIt's also helpful to realize the native lambda is identical to `function(.x) do_something(.x)`, where `\\` is shorthand for `function`.\nSee [R for Data Science's chapter on iteration](https://r4ds.hadley.nz/iteration.html) for more on this topic.\n\n## Theming\n\nThe plots in this book use a consistent theme that we don't include in every code chunk, meaning if you run the code for a visualization, you might get a slightly different-looking result.\nWe set the following defaults related to ggplot2:\n\n<!-- TODO: make sure these are up to date -->\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(\n # set default colors in ggplot2 to colorblind-friendly\n # Okabe-Ito and Viridis palettes\n ggplot2.discrete.colour = ggokabeito::palette_okabe_ito(),\n ggplot2.discrete.fill = ggokabeito::palette_okabe_ito(),\n ggplot2.continuous.colour = \"viridis\",\n ggplot2.continuous.fill = \"viridis\",\n # set theme font and size\n book.base_family = \"sans\",\n book.base_size = 14\n)\n\nlibrary(ggplot2)\n\n# set default theme\ntheme_set(\n theme_minimal(\n base_size = getOption(\"book.base_size\"),\n base_family = getOption(\"book.base_family\")\n ) %+replace%\n theme(\n panel.grid.minor = 
element_blank(),\n legend.position = \"bottom\"\n )\n)\n```\n:::\n\n\nWe also mask a few functions from ggdag that we like to customize:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntheme_dag <- function() {\n ggdag::theme_dag(base_family = getOption(\"book.base_family\"))\n}\n\ngeom_dag_label_repel <- function(..., seed = 10) {\n ggdag::geom_dag_label_repel(\n aes(x, y, label = label),\n box.padding = 3.5,\n inherit.aes = FALSE,\n max.overlaps = Inf,\n family = getOption(\"book.base_family\"),\n seed = seed,\n label.size = NA,\n label.padding = 0.1,\n size = getOption(\"book.base_size\") / 3,\n ...\n )\n}\n```\n:::\n",
"supporting": [
"index_files"
],
2 changes: 1 addition & 1 deletion chapters/01-casual-to-causal.qmd
@@ -162,7 +162,7 @@ ggplot(
Here are some other great examples of descriptive analyses.

- Deforestation around the world. Our World in Data [@owidforestsanddeforestation] is a data journalism organization that produces thoughtful, usually descriptive reports on various topics. In this report, they present data visualizations of both absolute change in forest coverage (forest transitions) and relative change (deforestation or reforestation), using basic statistics and forestry theory to present helpful information about the state of forests over time.
- The prevalence of chlamydial and gonococcal infections [@Miller2004]. Measuring the prevalence of disease (how many people currently have a disease, usually expressed as a rate per number of people) is helpful for basic public health (resources, prevention, education) and scientific understanding. In this study, the authors conducted a complex survey meant to be representative of all high schools in the United States (the target population); they used survey weights to address a variety of factors related to their question, then calculated prevalence rates and other statistics. As we'll see, weights are helpful in causal inference for the same reason: targeting a particular population. That said, not all weighting techniques are causal in nature, and they were not here.
- The prevalence of chlamydial and gonococcal infections [@Miller2004]. Measuring the prevalence of disease (how many people currently have a disease, usually expressed as a ratio per number of people) is helpful for basic public health (resources, prevention, education) and scientific understanding. In this study, the authors conducted a complex survey meant to be representative of all high schools in the United States (the target population); they used survey weights to address a variety of factors related to their question, then calculated prevalence ratios and other statistics. As we'll see, weights are helpful in causal inference for the same reason: targeting a particular population. That said, not all weighting techniques are causal in nature, and they were not here.
- Estimating race- and ethnicity-specific hysterectomy inequalities [@Gartner2020]. Descriptive techniques also help us understand disparities in areas like economics and epidemiology. In this study, the authors asked: Does the risk of hysterectomy differ by racial or ethnic background? Although the analysis is stratified by a key variable, it's still descriptive. Another interesting aspect of this paper is the authors' work ensuring the research answered questions about the right target population. Their analysis combined several data sources to better estimate the true population prevalence (instead of the prevalence among those in hospitals, as commonly presented). They also accounted for the existing prevalence of hysterectomy, e.g., they calculated the incidence (new case) rate only among those who could actually have a hysterectomy (i.e., those who hadn't had one yet).
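The survey-weighted prevalence mentioned above is, at its core, a weighted average: each respondent counts in proportion to how many people in the target population they represent. A minimal sketch of the idea (hypothetical weights and disease indicators, not data from the studies cited):

```python
def weighted_prevalence(y, w):
    """Survey-weighted prevalence.

    y: 1 if the respondent has the disease, 0 otherwise
    w: survey weight (people in the target population each respondent represents)
    """
    return sum(yi * wi for yi, wi in zip(y, w)) / sum(w)

# Hypothetical sample: 2 cases among 4 respondents, but the cases carry
# small weights, so the population prevalence is well below the raw 50%.
y = [1, 1, 0, 0]
w = [100, 200, 500, 1200]
print(weighted_prevalence(y, w))  # 300 / 2000 = 0.15
```

The same reweighting logic is what lets causal inference methods target a particular population, even though, as the bullet notes, not every use of weights is causal.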

#### Validity
Expand Down
12 changes: 6 additions & 6 deletions chapters/03-counterfactuals.qmd
@@ -227,7 +227,7 @@ data_observed <- data |>
# change the exposure to randomized, generate from a binomial distribution
# with a probability 0.5 for being in either group
exposure = case_when(
rbinom(10, 1, 0.5) == 1 ~ "chocolate",
rbinom(n(), 1, 0.5) == 1 ~ "chocolate",
TRUE ~ "vanilla"
),
observed_outcome = case_when(
@@ -314,8 +314,8 @@ set.seed(11)
data_observed <- data |>
mutate(
exposure_unobserved = case_when(
rbinom(10, 1, 0.25) == 1 ~ "chocolate (spoiled)",
rbinom(10, 1, 0.25) == 1 ~ "chocolate",
rbinom(n(), 1, 0.25) == 1 ~ "chocolate (spoiled)",
rbinom(n(), 1, 0.25) == 1 ~ "chocolate",
TRUE ~ "vanilla"
),
observed_outcome = case_when(
@@ -355,7 +355,7 @@ set.seed(11)
data_observed <- data |>
mutate(
exposure = case_when(
rbinom(10, 1, 0.5) == 1 ~ "chocolate",
rbinom(n(), 1, 0.5) == 1 ~ "chocolate",
TRUE ~ "vanilla"
),
exposure_partner =
@@ -452,9 +452,9 @@ data_observed <- data |>
prefer_chocolate = y_chocolate > y_vanilla,
exposure = case_when(
# people who like chocolate more chose that 80% of the time
prefer_chocolate ~ ifelse(rbinom(10, 1, 0.8), "chocolate", "vanilla"),
prefer_chocolate ~ ifelse(rbinom(n(), 1, 0.8), "chocolate", "vanilla"),
# people who like vanilla more chose that 80% of the time
!prefer_chocolate ~ ifelse(rbinom(10, 1, 0.8), "vanilla", "chocolate")
!prefer_chocolate ~ ifelse(rbinom(n(), 1, 0.8), "vanilla", "chocolate")
),
observed_outcome = case_when(
exposure == "chocolate" ~ y_chocolate,
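The fix repeated throughout this file swaps a hard-coded `rbinom(10, ...)` for `rbinom(n(), ...)`: inside dplyr's `mutate()`, `n()` returns the number of rows in the current data, so the simulated assignments always match the data's length instead of silently recycling a length-10 vector. A rough Python/pandas analogue of the pitfall (hypothetical `df`, not the book's code), where pandas refuses the mismatched assignment outright:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
df = pd.DataFrame({"id": range(12)})  # 12 rows, not 10

# Hard-coded size: 10 draws against 12 rows raises an error here
# (in R, the length-10 vector would be recycled, a subtler bug)
try:
    df["exposure"] = np.where(
        rng.binomial(1, 0.5, size=10) == 1, "chocolate", "vanilla"
    )
except ValueError as err:
    print("length mismatch:", err)

# Sizing by the data itself always matches, like n() in mutate()
df["exposure"] = np.where(
    rng.binomial(1, 0.5, size=len(df)) == 1, "chocolate", "vanilla"
)
print(len(df["exposure"]))  # 12
```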
2 changes: 1 addition & 1 deletion chapters/06-not-just-a-stats-problem.qmd
@@ -620,7 +620,7 @@ But look again.
`exposure` is a mediator for `covariate`'s effect on `outcome`; some of the total effect is mediated through `exposure`, while there is also a direct effect of `covariate` on `outcome`. **Both estimates are unbiased, but they are different *types* of estimates**. The effect of `exposure` on `outcome` is the *total effect* of that relationship, while the effect of `covariate` on `outcome` is the *direct effect*.
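The distinction can be seen in a small simulation: when a covariate affects the outcome both directly and through the exposure, regressing the outcome on the covariate alone recovers its total effect, while also adjusting for the exposure isolates its direct effect. A sketch under assumed linear structural equations (coefficients chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Structural model: covariate -> exposure -> outcome, plus a direct path
covariate = rng.normal(size=n)
exposure = 0.5 * covariate + rng.normal(size=n)
outcome = 1.0 * exposure + 0.3 * covariate + rng.normal(size=n)

def ols(y, *xs):
    """Least-squares coefficients of y on xs (intercept dropped)."""
    X = np.column_stack([np.ones_like(y), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

total = ols(outcome, covariate)[0]             # ~ 1.0 * 0.5 + 0.3 = 0.8
direct = ols(outcome, exposure, covariate)[1]  # ~ 0.3
print(total, direct)
```

Both regressions are unbiased for *something*; the point is that they answer different causal questions about the covariate.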

[^06-not-just-a-stats-problem-4]: Additionally, OLS produces a *collapsible* effect.
Other effects, like the odds and hazard ratios, are *non-collapsible*, meaning including unrelated variables in the model *can* change the effect estimate.
Other effects, like the odds and hazard ratios, are *non-collapsible*, meaning you may need to include non-confounding variables that cause the outcome in the model to estimate the effect of interest accurately.
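Non-collapsibility is easiest to see with exact probabilities: under a logistic outcome model with a covariate Z that causes the outcome but is independent of the exposure X (so Z is *not* a confounder), the marginal odds ratio still differs from the conditional one. A sketch with assumed coefficients, not from any fitted model:

```python
import math

def expit(x):
    return 1 / (1 + math.exp(-x))

def odds(p):
    return p / (1 - p)

# Logistic model: logit P(Y=1 | X, Z) = b1*X + b2*Z,
# with Z ~ Bernoulli(0.5) independent of the exposure X (no confounding)
b1, b2 = 1.0, 2.0
conditional_or = math.exp(b1)  # odds ratio within each stratum of Z

# Marginal P(Y=1 | X) averages over Z before forming the odds ratio
p_y_given_x = {
    x: 0.5 * expit(b1 * x) + 0.5 * expit(b1 * x + b2) for x in (0, 1)
}
marginal_or = odds(p_y_given_x[1]) / odds(p_y_given_x[0])

print(conditional_or, marginal_or)  # the two differ, with no confounding at all
```

Here the marginal odds ratio is attenuated toward 1 relative to the conditional one, which is why adding a strong outcome cause to a logistic model changes the exposure coefficient even without confounding.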

```{r}
#| label: fig-quartet_confounder