
Commit 6047d4f

final pass
Parent: f6a311e

4 files changed (+103 lines, −76 lines)

man/step_adjust_latency.Rd

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.

vignettes/backtesting.Rmd

Lines changed: 30 additions & 19 deletions
@@ -92,7 +92,10 @@ example, here is a plot that compares monthly snapshots of the data.
 <summary>Code for plotting</summary>
 ```{r plot_revision_example, warn = FALSE, message = FALSE}
 geo_choose <- "ca"
-forecast_dates <- seq(from = as.Date("2020-08-01"), to = as.Date("2021-11-01"), by = "1 month")
+forecast_dates <- seq(
+  from = as.Date("2020-08-01"),
+  to = as.Date("2021-11-01"),
+  by = "1 month")
 percent_cli_data <- bind_rows(
   # Snapshotted data for the version-faithful forecasts
   map(
@@ -110,13 +113,15 @@ percent_cli_data <- bind_rows(
 ) |> as_tibble()
 p0 <- autoplot(
   archive_cases_dv_subset, percent_cli,
-  .versions = seq(ymd("2020-07-01"), ymd("2021-11-30"), by = "1 month"),
+  .versions = forecast_dates,
   .mark_versions = TRUE,
   .facet_filter = (geo_value == "ca")
 ) +
   scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
   labs(x = "", y = "% of doctor's visits with\n Covid-like illness") +
-  scale_color_viridis_c(option = "turbo", guide = guide_legend(reverse=TRUE), direction = -1) +
+  scale_color_viridis_c(
+    option = "viridis",
+    guide = guide_legend(reverse = TRUE), direction = -1) +
   scale_y_continuous(limits = c(0, NA), expand = expansion(c(0, 0.05))) +
   theme(legend.position = "none")
 ```
@@ -133,19 +138,19 @@ For example, the snapshot on March 1st, 2021 is aquamarine, and increases to
 slightly over 10.
 Every series is necessarily to the left of the snapshot date (since all known
 values must happen before the snapshot is taken[^4]).
-The grey line overlaying the various snapshots represents the "final
+The black line overlaying the various snapshots represents the "final
 value", which is just the snapshot at the last version in the archive (the
 `versions_end`).
 
-Comparing with the grey line tells us how much the value at the time of the
+Comparing with the black line tells us how much the value at the time of the
 snapshot differs with what was eventually reported.
 The drop in January 2021 in the snapshot on `2021-02-01` was initially reported
 as much steeper than it eventually turned out to be, while in the period after
 that the values were initially reported as higher than they actually were.
 
 Handling data latency is important in both real-time forecasting and retrospective
 forecasting.
-Looking at the very first snapshot, `2020-08-01` (the red dotted
+Looking at the very first snapshot, `2020-08-01` (the purple dotted
 vertical line), there is a noticeable gap between the forecast date and the end
 of the red time-series to its left.
 In fact, if we take a snapshot and get the last `time_value`,
@@ -308,7 +313,7 @@ To see how the version faithful and un-faithful predictions compare, let's plot
 rates, using the same versioned plotting method as above.
 Note that even though we fit the model on four states (California, Texas, Florida, and
 New York), we'll just display the results for two states, California (CA) and Florida
-(FL), to get a sense of the model performance while keeping the graphic simple.
+(FL), to get a sense of the model performance while keeping the graphic simpler.
 
 <details>
 <summary>Code for plotting</summary>
@@ -326,13 +331,19 @@ plotting_data <- bind_rows(
     mutate(version = max(percent_cli_data$version)) |>
     mutate(version_faithful = "Version faithful")
 )
-p1 <- ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) +
-  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = (time_value)), alpha = 0.4) +
+
+p1 <- ggplot(data = forecasts_filtered,
+             aes(x = target_date, group = time_value)) +
+  geom_ribbon(
+    aes(ymin = `0.05`, ymax = `0.95`, fill = (time_value)),
+    alpha = 0.4) +
   geom_line(aes(y = .pred, color = (time_value)), linetype = 2L) +
   geom_point(aes(y = .pred, color = (time_value)), size = 0.75) +
   # the forecast date
   geom_vline(
-    data = percent_cli_data |> filter(geo_value == geo_choose) |> select(-version_faithful),
+    data = percent_cli_data |>
+      filter(geo_value == geo_choose) |>
+      select(-version_faithful),
     aes(color = version, xintercept = version, group = version),
     lty = 2
   ) +
@@ -345,11 +356,11 @@ p1 <- ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)
   facet_grid(version_faithful ~ geo_value, scales = "free") +
   scale_x_date(breaks = "2 months", date_labels = "%b %Y") +
   scale_y_continuous(expand = expansion(c(0, 0.05))) +
-  labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") +
-  scale_color_viridis_c(option = "turbo", direction = -1) +
-  scale_fill_viridis_c(option = "turbo", direction = -1) +
+  labs(x = "Date",
+       y = "smoothed, day of week adjusted covid-like doctors visits") +
+  scale_color_viridis_c(option = "viridis", direction = -1) +
+  scale_fill_viridis_c(option = "viridis", direction = -1) +
   theme(legend.position = "none")
-p1
 ```
 
 ```{r plot_fl_forecasts, warning = FALSE}
@@ -380,8 +391,8 @@ p2 <-
   scale_x_date(breaks = "2 months", date_labels = "%b %Y") +
   scale_y_continuous(expand = expansion(c(0, 0.05))) +
   labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") +
-  scale_color_viridis_c(option = "turbo", direction = -1) +
-  scale_fill_viridis_c(option = "turbo", direction = -1) +
+  scale_color_viridis_c(option = "viridis", direction = -1) +
+  scale_fill_viridis_c(option = "viridis", direction = -1) +
   theme(legend.position = "none")
 p2
 ```
@@ -391,17 +402,17 @@ p2
 p1
 ```
 
-The version faithful and un-faithful forecasts look moderately similar except for the 1 day horizons
-(although neither approach produces amazingly accurate forecasts).
+There are some weeks when the forecasts are somewhat similar, and others when they are wildly different, although neither approach produces amazingly accurate forecasts.
 
 In the version faithful case for California, the March 2021 forecast (turquoise)
 starts at a value just above 10, which is very well lined up with reported values leading up to that forecast.
 The measured and forecasted trends are also concordant (both increasingly moderately fast).
 
 Because the data for this time period was later adjusted down with a decreasing trend, the March 2021 forecast looks quite bad compared to finalized data.
-
 The equivalent version un-faithful forecast starts at a value of 5, which is in line with the finalized data but would have been out of place compared to the version data.
 
+The October 2021 forecast for the version faithful case floors out at zero, whereas the un-faithful is much closer to the finalized data.
+
 ```{r show-plot2, warning = FALSE, echo=FALSE}
 p2
 ```
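For context, the version-faithful comparisons in these hunks are built from archive snapshots. Here is a minimal sketch of taking one snapshot and measuring its latency, assuming `epix_as_of()` accepts the version date as its second argument (as in current `{epiprocess}`; `epix_as_of_current()` is the same helper used elsewhere in this vignette):

```r
library(epiprocess) # assumed to provide archive_cases_dv_subset, as in the vignette
library(dplyr)

# Data as it was known on the first forecast date vs. the finalized data.
snapshot  <- archive_cases_dv_subset %>% epix_as_of(as.Date("2020-08-01"))
finalized <- archive_cases_dv_subset %>% epix_as_of_current()

# The latency is the gap between the forecast date and the newest
# observation available in that snapshot.
as.Date("2020-08-01") - max(snapshot$time_value)
```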

vignettes/custom_epiworkflows.Rmd

Lines changed: 3 additions & 3 deletions
@@ -347,7 +347,7 @@ There are many ways we could modify `four_week_ahead`. We might consider:
   expect there to be a strong seasonal component to the outcome
 - Scaling by a factor
 
-We will demo a couple of these modifications below.
+We will demonstrate a couple of these modifications below.
 
 ## Growth rate
 
@@ -456,9 +456,9 @@ result_plot
 
 ## Population scaling
 
-Suppose we want to modify our predictions to apply to counts, rather than rates.
+Suppose we want to modify our predictions to return a count prediction, rather than the rate prediction.
 To do that, we can adjust _just_ the `frosting` to perform post-processing on our existing rates forecaster.
-Since rates are calculated as counts per 100 000 people, we will convert back to counts by multiplying rates by the factor $\frac{regional \text{ } population}{100000}$.
+Since rates are calculated as counts per 100,000 people, we will convert back to counts by multiplying rates by the factor $\frac{\text{regional population}}{100{,}000}$.
 
 ```{r rate_scale}
 count_layers <-
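As a sanity check on that factor, here is a self-contained sketch of the rate-to-count arithmetic with made-up numbers; in the vignette itself this is handled inside the frosting (e.g., by a population-scaling layer such as `layer_population_scaling()`):

```r
library(dplyr) # re-exports tibble()

# Hypothetical predictions expressed as rates per 100,000 people.
preds <- tibble(
  geo_value = c("ca", "fl"),
  rate_pred = c(0.42, 0.65)
)
pops <- tibble(
  geo_value  = c("ca", "fl"),
  population = c(39.5e6, 21.5e6) # illustrative populations
)

# count = rate * regional population / 100,000
preds %>%
  left_join(pops, by = "geo_value") %>%
  mutate(count_pred = rate_pred * population / 1e5)
```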

vignettes/epipredict.Rmd

Lines changed: 68 additions & 52 deletions
@@ -325,8 +325,18 @@ Instead, we'll use the fluview ILI dataset, which is weekly influenza like illne
 We'll predict the 2023/24 season using all previous data, including 2020-2022, the two years where there was approximately no seasonal flu, forecasting from the start of the season, `2023-10-08`:
 
 ```{r make-climatological-forecast, warning=FALSE}
-fluview_hhs <- pub_fluview(regions = paste0("hhs", 1:10), epiweeks = epirange(100001,222201))
-fluview <- fluview_hhs %>% select(geo_value = region, time_value = epiweek, issue, ili) %>% as_epi_archive() %>% epix_as_of_current()
+fluview_hhs <- pub_fluview(
+  regions = paste0("hhs", 1:10),
+  epiweeks = epirange(100001, 222201)
+)
+fluview <- fluview_hhs %>%
+  select(
+    geo_value = region,
+    time_value = epiweek,
+    issue,
+    ili) %>%
+  as_epi_archive() %>%
+  epix_as_of_current()
 
 all_climate <- climatological_forecaster(
   fluview %>% filter(time_value < "2023-10-08"),
@@ -343,7 +353,9 @@ results <- all_climate$predictions
 autoplot(
   object = workflow,
   predictions = results,
-  observed_response = fluview %>% filter(time_value >= "2023-10-08", time_value < "2024-05-01") %>% mutate(geo_value = factor(geo_value, levels = paste0("hhs", 1:10)))
+  observed_response = fluview %>%
+    filter(time_value >= "2023-10-08", time_value < "2024-05-01") %>%
+    mutate(geo_value = factor(geo_value, levels = paste0("hhs", 1:10)))
 )
 ```
 
@@ -412,11 +424,10 @@ The accuracy is 50%, since all 4 states were predicted to be in the interval
 
 If multiple keys are set in the `epi_df` as `other_keys`, `arx_forecaster` will
 automatically group by those in addition to the required geographic key.
-For example, predicting the number of graduates in each of the categories in
+For example, predicting the number of graduates in a subset of the categories in
 `grad_employ_subset` from above:
 
 ```{r multi_key_forecast, warning=FALSE}
-# only fitting a subset, otherwise there are ~550 distinct pairs, which is bad for plotting
 edu_quals <- c("Undergraduate degree", "Professional degree")
 geo_values <- c("Quebec", "British Columbia")
 
@@ -478,7 +489,7 @@ all_fits |>
   list_rbind()
 ```
 
-Estimating separate models for each geography is both 56 times slower[^7] than geo-pooling, and uses far less data for each estimate.
+Estimating separate models for each geography uses far less data for each estimate than geo-pooling and is 56 times slower[^7].
 If a dataset contains relatively few observations for each geography, fitting a geo-pooled model is likely to produce better, more stable results.
 However, geo-pooling can only be used if values are comparable in meaning and scale across geographies or can be made comparable, for example by normalization.
 
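One hedged sketch of such a normalization, standardizing each geography's series by its own mean and standard deviation before pooling (illustrative only; rates per 100,000 are themselves one such normalization):

```r
library(dplyr)

# Assumes covid_case_death_rates, the dataset used earlier in this vignette.
# Each geography's death rate is centered and scaled by its own statistics,
# putting all series on a comparable scale before a geo-pooled fit.
normalized <- covid_case_death_rates %>%
  group_by(geo_value) %>%
  mutate(
    death_rate_std = (death_rate - mean(death_rate, na.rm = TRUE)) /
      sd(death_rate, na.rm = TRUE)
  ) %>%
  ungroup()
```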
@@ -488,7 +499,56 @@ workflow](custom_epiworkflows) with geography as a factor.
 
 # Anatomy of a canned forecaster
 
-This section describes the resulting object from `arx_forecaster()`, an `arx_fcast` object, along with a fairly minimal description of the actual mathematical model used for `arx_forecaster()`.
+This section gives a fairly minimal description of the mathematical model used by `arx_forecaster()`, followed by a description of the `arx_fcast` object it returns.
+
+## Mathematical description
+
+Let's look at the mathematical details of the model, using a minimal version of
+`four_week_ahead`:
+
+```{r, four_week_again}
+four_week_small <- arx_forecaster(
+  covid_case_death_rates |> filter(time_value <= forecast_date),
+  outcome = "death_rate",
+  predictors = c("case_rate", "death_rate"),
+  args_list = arx_args_list(
+    lags = list(c(0, 7, 14), c(0, 7, 14)),
+    ahead = 4 * 7,
+    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
+  )
+)
+hardhat::extract_fit_engine(four_week_small$epi_workflow)
+```
+
+If $d_{t,j}$ is the death rate on day $t$ at location $j$ and $c_{t,j}$ is the
+associated case rate, then the corresponding model is:
+
+$$
+\begin{aligned}
+d_{t+28, j} = {} & a_0 + a_1 d_{t,j} + a_2 d_{t-7,j} + a_3 d_{t-14, j} + {} \\
+ & a_4 c_{t, j} + a_5 c_{t-7, j} + a_6 c_{t-14, j} + \varepsilon_{t,j}.
+\end{aligned}
+$$
+
+For example, $a_1$ is `lag_0_death_rate` above, with a value of
+`r round(hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_0_death_rate"], 3)`,
+while $a_5$ is
+`r round(hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_7_case_rate"], 4)`.
+Note that unlike $d_{t,j}$ or $c_{t,j}$, these coefficients *don't* depend on either the time
+$t$ or the location $j$.
+This is what makes it a geo-pooled model.
+
+The training data for estimating the parameters of this linear model is
+constructed within the `arx_forecaster()` function by shifting a series of
+columns the appropriate amount -- based on the requested `lags`.
+Each row containing no `NA` values in the predictors is used as a training
+observation to fit the coefficients $a_0, \ldots, a_6$.
+
+The equation above is only an accurate description of the model for a linear
+engine like `quantile_reg()` or `linear_reg()`; a nonlinear model like
+`rand_forest(mode = "regression")` will use the same input variables and
+training data, but fit the appropriate model for them.
 
 ## Code object
 Let's dissect the forecaster we trained back on the [landing
@@ -551,50 +611,6 @@ An `epi_workflow()` consists of 3 parts:
 See the [Custom Epiworkflows vignette](custom_epiworkflows) for recreating and then
 extending `four_week_ahead` using the custom forecaster framework.
 
-## Mathematical description
-
-Let's look at the mathematical details of the model in more detail, using a minimal version of
-`four_week_ahead`:
-
-```{r, four_week_again}
-four_week_small <- arx_forecaster(
-  covid_case_death_rates |> filter(time_value <= forecast_date),
-  outcome = "death_rate",
-  predictors = c("case_rate", "death_rate"),
-  args_list = arx_args_list(
-    lags = list(c(0, 7, 14), c(0, 7, 14)),
-    ahead = 4 * 7,
-    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
-  )
-)
-hardhat::extract_fit_engine(four_week_small$epi_workflow)
-```
-
-If $d_{t,j}$ is the death rate on day $t$ at location $j$ and $c_{t,j}$ is the
-associated case rate, then the corresponding model is:
-
-$$
-\begin{aligned}
-d_{t+28, j} = & a_0 + a_1 d_{t,j} + a_2 d_{t-7,j} + a_3 d_{t-14, j} +\\
- & a_4 c_{t, j} + a_5 c_{t-7, j} + a_6 c_{t-14, j} + \varepsilon_{t,j}.
-\end{aligned}
-$$
-
-For example, $a_1$ is `lag_0_death_rate` above, with a value of `r
-round(hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_0_death_rate"],
-3)`,
-while $a_5$ is `r
-round(hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_7_case_rate"],
-4) `.
-Note that unlike `d_{t,j}` or `c_{t,j}`, these *don't* depend on either the time
-$t$ or the location $j$.
-This is what make it a geo-pooled model.
-
-The training data for estimating the parameters of this linear model is
-constructed within the `arx_forecaster()` function by shifting a series of
-columns the appropriate amount -- based on the requested `lags`.
-Each row containing no `NA` values in the predictors is used as a training observation to fit the
-coefficients $a_0,\ldots, a_6$.
 
 [^4]: in the case of a `{parsnip}` engine which doesn't explicitly predict
     quantiles, these quantiles are created using `layer_residual_quantiles()`,
@@ -611,4 +627,4 @@
 
 [^8]: It has only a year of data, which is barely enough to run the method without errors, let alone get a meaningful prediction.
 
-[^9]: Though not 28 weeks into the future! Such a forecast will be an absurd extrapolation.
+[^9]: Though not 28 weeks into the future! Such a forecast will likely be absurdly low or high.
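To make the training-data construction described in the new "Mathematical description" section concrete, here is a hedged, by-hand sketch; `arx_forecaster()` does this internally via its preprocessing steps, `lm()` merely stands in for the default `linear_reg()` engine, and it assumes one complete row per day per geography so that `lag(x, 7)` means "7 days ago":

```r
library(dplyr)
library(tidyr)

# Shift each predictor back by the requested lags and the outcome forward
# by the ahead, within each geography.
train <- covid_case_death_rates %>%
  group_by(geo_value) %>%
  arrange(time_value, .by_group = TRUE) %>%
  mutate(
    lag_0_death_rate    = death_rate,
    lag_7_death_rate    = lag(death_rate, 7),
    lag_14_death_rate   = lag(death_rate, 14),
    lag_0_case_rate     = case_rate,
    lag_7_case_rate     = lag(case_rate, 7),
    lag_14_case_rate    = lag(case_rate, 14),
    ahead_28_death_rate = lead(death_rate, 28)
  ) %>%
  ungroup() %>%
  drop_na() # rows with any NA predictor or outcome are not used

# One pooled fit across all geographies yields a_0, ..., a_6.
lm(
  ahead_28_death_rate ~ lag_0_death_rate + lag_7_death_rate + lag_14_death_rate +
    lag_0_case_rate + lag_7_case_rate + lag_14_case_rate,
  data = train
)
```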
