r-causal · malcolmbarrett · Jun 20, 2024
diff --git a/.github/workflows/quarto.yaml b/.github/workflows/quarto.yaml
@@ -31,8 +31,6 @@ jobs:
 
       - name: Install Google Fonts
         run: |
-          brew tap homebrew/cask
-          brew tap homebrew/cask-fonts
           brew install font-open-sans
 
       - name: Query dependencies

diff --git a/_freeze/chapters/03-counterfactuals/execute-results/html.json b/_freeze/chapters/03-counterfactuals/execute-results/html.json
diff --git a/_freeze/index/execute-results/html.json b/_freeze/index/execute-results/html.json
@@ -1,8 +1,10 @@
 {
-  "hash": "ef3157b3e2d1d97cd016993a6b9fcb07",
+  "hash": "b13605468b718b90b4a5eec92fa59ff7",
   "result": {
-    "markdown": "# Preface {.unnumbered}\n\nWelcome to *Causal Inference in R*.\nAnswering causal questions is critical for scientific and business purposes, but techniques like randomized clinical trials and A/B testing are not always practical or successful.\nThe tools in this book will allow readers better make causal inferences with  observational data with the R programming language.\nBy its end, we hope to help you:\n\n1.  Ask better causal questions.\n2.  Understand the assumptions needed for causal inference\n3.  Identify the target population for which you want to make inferences\n4.  Fit causal models and check their problems\n5.  Conduct sensitivity analyses where the techniques we use might be imperfect\n\nThis book is for both academic researchers and data scientists.\nAlthough the questions may differ between these settings, many techniques are the same: causal inference is as helpful for asking questions about cancer as it is about clicks.\nWe use a mix of examples from medicine, economics, and tech to demonstrate that you need a clear causal question and a willingness to be transparent about your assumptions.\n\nYou'll learn a lot in this book, but ironically, you won't learn much about conducting randomized trials, one of the best tools for causal inferences.\nRandomized trials, and their cousins, A/B tests (standard in the tech world), are compelling because they alleviate many of the assumptions we need to make for valid inferences.\nThey are also sufficiently complex in design to merit their own learning resources.\nInstead, we'll focus on observational data where we don't usually benefit from randomization.\nIf you're interested in randomization techniques, don't put away this resource just yet: many causal inference techniques designed for observational data improve randomized analyses, too.\n\nWe're making a few assumptions about you as a reader:\n\n1.  
You're familiar with the [tidyverse](https://www.tidyverse.org/) ecosystem of R packages and their general philosophy. For instance, we use a lot of dplyr and ggplot2 in this book, but we won't explain their basic grammar. To learn more about starting with the tidyverse, we recommend [R for Data Science](https://r4ds.hadley.nz/).\n2.  You're familiar with basic statistical modeling in R. For instance, we'll fit many models with `lm()` and `glm()`, but we won't discuss how they work. If you want to learn more about R's powerful modeling functions, we recommend reading [\"A Review of R Modeling Fundamentals\"](https://www.tmwr.org/base-r.html) in [Tidy Modeling with R](https://www.tmwr.org).\n3.  We also assume you have familiarity with other R basics, such as [writing functions](https://r4ds.hadley.nz/functions.html). [R for Data Science](https://r4ds.hadley.nz/) is also a good resource for these topics. (For a deeper dive into the R programming language, we recommend [Advanced R](https://adv-r.hadley.nz/index.html), although we don't assume you have mastered its material for this book).\n\nWe'll also use tools from the tidymodels ecosystem, a set of R packages for modeling related to the tidyverse.\nWe don't assume you have used them before.\ntidymodels also focuses on predictive modeling, so many of its tools aren't appropriate for this book.\nNevertheless, if you are interested in this topic, we recommend [Tidy Modeling with R](https://www.tmwr.org).\n\nThere are also several other excellent books on causal inference.\nThis book is different in its focus on R, but it's still helpful to see this area from other perspectives.\nA few books you might like:\n\n-   [*Causal Inference: What If?*](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/)\n-   [*Causal Inference: The Mixtape*](https://mixtape.scunning.com/)\n-   [*The Effect*](https://theeffectbook.net/)\n\nThe first book is focused on epidemiology.\nThe latter two are focused on 
econometrics.\nWe also recommend *The Book of Why* @pearl2018why for more on causal diagrams.\n\n## Conventions\n\n### Modern R Features\n\nWe use two modern R features in R 4.1.0 and above in this book.\nThe first is the native pipe, `|>`.\nThis R feature is similar to the tidyverse's `%>%`, with which you may be more familiar.\nIn typical cases, the two work interchangeably.\nOne notable difference is that `|>` uses the `_` symbol to direct the pipe's results, e.g., `.df |> lm(y ~ x, data = _)`.\nSee [this Tidyverse Blog post](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/) for more on this topic.\n\nAnother modern R feature we use is the native lambda, a way of writing short functions that looks like `\\(.x) do_something(.x)`.\nIt is similar to purrr's `~` lambda notation.\nIt's also helpful to realize the native lambda is identical to `function(.x) do_something(.x)`, where `\\` is shorthand for `function`.\nSee [R for Data Science's chapter on iteration](https://r4ds.hadley.nz/iteration.html) for more on this topic.\n\n## Theming\n\nThe plots in this book use a consistent theme that we don't include in every code chunk, meaning if you run the code for a visualization, you might get a slightly different-looking result.\nWe set the following defaults related to ggplot2:\n\n<!-- TODO: make sure these are up to date -->\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(\n  # set default colors in ggplot2 to colorblind-friendly\n  # Okabe-Ito and Viridis palettes\n  ggplot2.discrete.colour = ggokabeito::palette_okabe_ito(),\n  ggplot2.discrete.fill = ggokabeito::palette_okabe_ito(),\n  ggplot2.continuous.colour = \"viridis\",\n  ggplot2.continuous.fill = \"viridis\",\n  # set theme font and size\n  book.base_family = \"sans\",\n  book.base_size = 14\n)\n\nlibrary(ggplot2)\n\n# set default theme\ntheme_set(\n  theme_minimal(\n    base_size = getOption(\"book.base_size\"),\n    base_family = getOption(\"book.base_family\")\n  ) %+replace%\n    theme(\n    
  panel.grid.minor = element_blank(),\n      legend.position = \"bottom\"\n    )\n)\n```\n:::\n\n\nWe also mask a few functions from ggdag that we like to customize:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntheme_dag <- function() {\n  ggdag::theme_dag(base_family = getOption(\"book.base_family\"))\n}\n\ngeom_dag_label_repel <- function(..., seed = 10) {\n  ggdag::geom_dag_label_repel(\n    aes(x, y, label = label),\n    box.padding = 3.5,\n    inherit.aes = FALSE,\n    max.overlaps = Inf,\n    family = getOption(\"book.base_family\"),\n    seed = seed,\n    label.size = NA,\n    label.padding = 0.1,\n    size = getOption(\"book.base_size\") / 3,\n    ...\n  )\n}\n```\n:::\n",
-    "supporting": [],
+    "markdown": "# Preface {.unnumbered}\n\nWelcome to *Causal Inference in R*.\nAnswering causal questions is critical for scientific and business purposes, but techniques like randomized clinical trials and A/B testing are not always practical or successful.\nThe tools in this book will allow readers to better make causal inferences with observational data with the R programming language.\nBy its end, we hope to help you:\n\n1.  Ask better causal questions.\n2.  Understand the assumptions needed for causal inference\n3.  Identify the target population for which you want to make inferences\n4.  Fit causal models and check their problems\n5.  Conduct sensitivity analyses where the techniques we use might be imperfect\n\nThis book is for both academic researchers and data scientists.\nAlthough the questions may differ between these settings, many techniques are the same: causal inference is as helpful for asking questions about cancer as it is about clicks.\nWe use a mix of examples from medicine, economics, and tech to demonstrate that you need a clear causal question and a willingness to be transparent about your assumptions.\n\nYou'll learn a lot in this book, but ironically, you won't learn much about conducting randomized trials, one of the best tools for causal inferences.\nRandomized trials, and their cousins, A/B tests (standard in the tech world), are compelling because they alleviate many of the assumptions we need to make for valid inferences.\nThey are also sufficiently complex in design to merit their own learning resources.\nInstead, we'll focus on observational data where we don't usually benefit from randomization.\nIf you're interested in randomization techniques, don't put away this resource just yet: many causal inference techniques designed for observational data improve randomized analyses, too.\n\nWe're making a few assumptions about you as a reader:\n\n1.  
You're familiar with the [tidyverse](https://www.tidyverse.org/) ecosystem of R packages and their general philosophy. For instance, we use a lot of dplyr and ggplot2 in this book, but we won't explain their basic grammar. To learn more about starting with the tidyverse, we recommend [*R for Data Science*](https://r4ds.hadley.nz/).\n2.  You're familiar with basic statistical modeling in R. For instance, we'll fit many models with `lm()` and `glm()`, but we won't discuss how they work. If you want to learn more about R's powerful modeling functions, we recommend reading [\"A Review of R Modeling Fundamentals\"](https://www.tmwr.org/base-r.html) in [*Tidy Modeling with R*](https://www.tmwr.org).\n3.  We also assume you have familiarity with other R basics, such as [writing functions](https://r4ds.hadley.nz/functions.html). [*R for Data Science*](https://r4ds.hadley.nz/) is also a good resource for these topics. (For a deeper dive into the R programming language, we recommend [*Advanced R*](https://adv-r.hadley.nz/index.html), although we don't assume you have mastered its material for this book).\n\nWe'll also use tools from the tidymodels ecosystem, a set of R packages for modeling related to the tidyverse.\nWe don't assume you have used them before.\ntidymodels also focuses on predictive modeling, so many of its tools aren't appropriate for this book.\nNevertheless, if you are interested in this topic, we recommend [*Tidy Modeling with R*](https://www.tmwr.org).\n\nThere are also several other excellent books on causal inference.\nThis book is different in its focus on R, but it's still helpful to see this area from other perspectives.\nA few books you might like:\n\n-   [*Causal Inference: What If?*](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/)\n-   [*Causal Inference: The Mixtape*](https://mixtape.scunning.com/)\n-   [*The Effect*](https://theeffectbook.net/)\n\nThe first book is focused on epidemiology.\nThe latter two are focused on 
econometrics.\nWe also recommend *The Book of Why* @pearl2018why for more on causal diagrams.\n\n## Conventions\n\n### Modern R Features\n\nWe use two modern R features in R 4.1.0 and above in this book.\nThe first is the native pipe, `|>`.\nThis R feature is similar to the tidyverse's `%>%`, with which you may be more familiar.\nIn typical cases, the two work interchangeably.\nOne notable difference is that `|>` uses the `_` symbol to direct the pipe's results, e.g., `.df |> lm(y ~ x, data = _)`.\nSee [this Tidyverse Blog post](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/) for more on this topic.\n\nAnother modern R feature we use is the native lambda, a way of writing short functions that looks like `\\(.x) do_something(.x)`.\nIt is similar to purrr's `~` lambda notation.\nIt's also helpful to realize the native lambda is identical to `function(.x) do_something(.x)`, where `\\` is shorthand for `function`.\nSee [R for Data Science's chapter on iteration](https://r4ds.hadley.nz/iteration.html) for more on this topic.\n\n## Theming\n\nThe plots in this book use a consistent theme that we don't include in every code chunk, meaning if you run the code for a visualization, you might get a slightly different-looking result.\nWe set the following defaults related to ggplot2:\n\n<!-- TODO: make sure these are up to date -->\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(\n  # set default colors in ggplot2 to colorblind-friendly\n  # Okabe-Ito and Viridis palettes\n  ggplot2.discrete.colour = ggokabeito::palette_okabe_ito(),\n  ggplot2.discrete.fill = ggokabeito::palette_okabe_ito(),\n  ggplot2.continuous.colour = \"viridis\",\n  ggplot2.continuous.fill = \"viridis\",\n  # set theme font and size\n  book.base_family = \"sans\",\n  book.base_size = 14\n)\n\nlibrary(ggplot2)\n\n# set default theme\ntheme_set(\n  theme_minimal(\n    base_size = getOption(\"book.base_size\"),\n    base_family = getOption(\"book.base_family\")\n  ) %+replace%\n    theme(\n    
  panel.grid.minor = element_blank(),\n      legend.position = \"bottom\"\n    )\n)\n```\n:::\n\n\nWe also mask a few functions from ggdag that we like to customize:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntheme_dag <- function() {\n  ggdag::theme_dag(base_family = getOption(\"book.base_family\"))\n}\n\ngeom_dag_label_repel <- function(..., seed = 10) {\n  ggdag::geom_dag_label_repel(\n    aes(x, y, label = label),\n    box.padding = 3.5,\n    inherit.aes = FALSE,\n    max.overlaps = Inf,\n    family = getOption(\"book.base_family\"),\n    seed = seed,\n    label.size = NA,\n    label.padding = 0.1,\n    size = getOption(\"book.base_size\") / 3,\n    ...\n  )\n}\n```\n:::\n",
+    "supporting": [
+      "index_files"
+    ],
     "filters": [
       "rmarkdown/pagebreak.lua"
     ],

diff --git a/chapters/03-counterfactuals.qmd b/chapters/03-counterfactuals.qmd
@@ -227,7 +227,7 @@ data_observed <- data |>
     # change the exposure to randomized, generate from a binomial distribution
     # with a probability 0.5 for being in either group
     exposure = case_when(
-      rbinom(10, 1, 0.5) == 1 ~ "chocolate",
+      rbinom(n(), 1, 0.5) == 1 ~ "chocolate",
       TRUE ~ "vanilla"
     ),
     observed_outcome = case_when(
@@ -314,8 +314,8 @@ set.seed(11)
 data_observed <- data |>
   mutate(
     exposure_unobserved = case_when(
-      rbinom(10, 1, 0.25) == 1 ~ "chocolate (spoiled)",
-      rbinom(10, 1, 0.25) == 1 ~ "chocolate",
+      rbinom(n(), 1, 0.25) == 1 ~ "chocolate (spoiled)",
+      rbinom(n(), 1, 0.25) == 1 ~ "chocolate",
       TRUE ~ "vanilla"
     ),
     observed_outcome = case_when(
@@ -355,7 +355,7 @@ set.seed(11)
 data_observed <- data |>
   mutate(
     exposure = case_when(
-      rbinom(10, 1, 0.5) == 1 ~ "chocolate",
+      rbinom(n(), 1, 0.5) == 1 ~ "chocolate",
       TRUE ~ "vanilla"
     ),
     exposure_partner =
@@ -452,9 +452,9 @@ data_observed <- data |>
     prefer_chocolate = y_chocolate > y_vanilla,
     exposure = case_when(
       # people who like chocolate more chose that 80% of the time
-      prefer_chocolate ~ ifelse(rbinom(10, 1, 0.8), "chocolate", "vanilla"),
+      prefer_chocolate ~ ifelse(rbinom(n(), 1, 0.8), "chocolate", "vanilla"),
       # people who like vanilla more chose that 80% of the time
-      !prefer_chocolate ~ ifelse(rbinom(10, 1, 0.8), "vanilla", "chocolate")
+      !prefer_chocolate ~ ifelse(rbinom(n(), 1, 0.8), "vanilla", "chocolate")
     ),
     observed_outcome = case_when(
       exposure == "chocolate" ~ y_chocolate,
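
The recurring change in `chapters/03-counterfactuals.qmd` swaps the hardcoded `rbinom(10, ...)` for `rbinom(n(), ...)`, so the number of random draws follows the number of rows in the data. Below is a minimal sketch of that pattern, assuming dplyr is installed. `assign_exposure()` and the `id` column are hypothetical names used only for illustration; they are not part of the book's code.

```r
library(dplyr)

# Hypothetical helper (not from the book): randomize a binary exposure
# for a data frame of any size.
assign_exposure <- function(.df, p = 0.5) {
  .df |>
    mutate(
      exposure = case_when(
        # n() returns the number of rows in the current group,
        # so rbinom() makes exactly one draw per row
        rbinom(n(), 1, p) == 1 ~ "chocolate",
        TRUE ~ "vanilla"
      )
    )
}

set.seed(11)

# The same code works for 10 rows (as in the chapter) and for any other size
small <- assign_exposure(tibble(id = 1:10))
large <- assign_exposure(tibble(id = 1:25))

nrow(small)
nrow(large)
```

With the hardcoded `10`, the same call only fits a 10-row data frame; for any other row count, `mutate()` would be expected to reject the length-10 result rather than recycle it.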