
Commit 6047d4f

final pass
Parent: f6a311e

4 files changed (+103 lines, −76 lines)

man/step_adjust_latency.Rd

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.

vignettes/backtesting.Rmd

Lines changed: 30 additions & 19 deletions
@@ -92,7 +92,10 @@ example, here is a plot that compares monthly snapshots of the data.
 <summary>Code for plotting</summary>
 ```{r plot_revision_example, warn = FALSE, message = FALSE}
 geo_choose <- "ca"
-forecast_dates <- seq(from = as.Date("2020-08-01"), to = as.Date("2021-11-01"), by = "1 month")
+forecast_dates <- seq(
+  from = as.Date("2020-08-01"),
+  to = as.Date("2021-11-01"),
+  by = "1 month")
 percent_cli_data <- bind_rows(
   # Snapshotted data for the version-faithful forecasts
   map(
@@ -110,13 +113,15 @@ percent_cli_data <- bind_rows(
 ) |> as_tibble()
 p0 <- autoplot(
   archive_cases_dv_subset, percent_cli,
-  .versions = seq(ymd("2020-07-01"), ymd("2021-11-30"), by = "1 month"),
+  .versions = forecast_dates,
   .mark_versions = TRUE,
   .facet_filter = (geo_value == "ca")
 ) +
   scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
   labs(x = "", y = "% of doctor's visits with\n Covid-like illness") +
-  scale_color_viridis_c(option = "turbo", guide = guide_legend(reverse=TRUE), direction = -1) +
+  scale_color_viridis_c(
+    option = "viridis",
+    guide = guide_legend(reverse = TRUE), direction = -1) +
   scale_y_continuous(limits = c(0, NA), expand = expansion(c(0, 0.05))) +
   theme(legend.position = "none")
 ```
@@ -133,19 +138,19 @@ For example, the snapshot on March 1st, 2021 is aquamarine, and increases to
 slightly over 10.
 Every series is necessarily to the left of the snapshot date (since all known
 values must happen before the snapshot is taken[^4]).
-The grey line overlaying the various snapshots represents the "final
+The black line overlaying the various snapshots represents the "final
 value", which is just the snapshot at the last version in the archive (the
 `versions_end`).
 
-Comparing with the grey line tells us how much the value at the time of the
+Comparing with the black line tells us how much the value at the time of the
 snapshot differs with what was eventually reported.
 The drop in January 2021 in the snapshot on `2021-02-01` was initially reported
 as much steeper than it eventually turned out to be, while in the period after
 that the values were initially reported as higher than they actually were.
 
 Handling data latency is important in both real-time forecasting and retrospective
 forecasting.
-Looking at the very first snapshot, `2020-08-01` (the red dotted
+Looking at the very first snapshot, `2020-08-01` (the purple dotted
 vertical line), there is a noticeable gap between the forecast date and the end
 of the red time-series to its left.
 In fact, if we take a snapshot and get the last `time_value`,
@@ -308,7 +313,7 @@ To see how the version faithful and un-faithful predictions compare, let's plot
 rates, using the same versioned plotting method as above.
 Note that even though we fit the model on four states (California, Texas, Florida, and
 New York), we'll just display the results for two states, California (CA) and Florida
-(FL), to get a sense of the model performance while keeping the graphic simple.
+(FL), to get a sense of the model performance while keeping the graphic simpler.
 
 <details>
 <summary>Code for plotting</summary>
@@ -326,13 +331,19 @@ plotting_data <- bind_rows(
     mutate(version = max(percent_cli_data$version)) |>
     mutate(version_faithful = "Version faithful")
 )
-p1 <- ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) +
-  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = (time_value)), alpha = 0.4) +
+
+p1 <- ggplot(data = forecasts_filtered,
+             aes(x = target_date, group = time_value)) +
+  geom_ribbon(
+    aes(ymin = `0.05`, ymax = `0.95`, fill = (time_value)),
+    alpha = 0.4) +
   geom_line(aes(y = .pred, color = (time_value)), linetype = 2L) +
   geom_point(aes(y = .pred, color = (time_value)), size = 0.75) +
   # the forecast date
   geom_vline(
-    data = percent_cli_data |> filter(geo_value == geo_choose) |> select(-version_faithful),
+    data = percent_cli_data |>
+      filter(geo_value == geo_choose) |>
+      select(-version_faithful),
     aes(color = version, xintercept = version, group = version),
     lty = 2
   ) +
@@ -345,11 +356,11 @@ p1 <- ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)
   facet_grid(version_faithful ~ geo_value, scales = "free") +
   scale_x_date(breaks = "2 months", date_labels = "%b %Y") +
   scale_y_continuous(expand = expansion(c(0, 0.05))) +
-  labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") +
-  scale_color_viridis_c(option = "turbo", direction = -1) +
-  scale_fill_viridis_c(option = "turbo", direction = -1) +
+  labs(x = "Date",
+       y = "smoothed, day of week adjusted covid-like doctors visits") +
+  scale_color_viridis_c(option = "viridis", direction = -1) +
+  scale_fill_viridis_c(option = "viridis", direction = -1) +
   theme(legend.position = "none")
-p1
 ```
 
 ```{r plot_fl_forecasts, warning = FALSE}
@@ -380,8 +391,8 @@ p2 <-
   scale_x_date(breaks = "2 months", date_labels = "%b %Y") +
   scale_y_continuous(expand = expansion(c(0, 0.05))) +
   labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") +
-  scale_color_viridis_c(option = "turbo", direction = -1) +
-  scale_fill_viridis_c(option = "turbo", direction = -1) +
+  scale_color_viridis_c(option = "viridis", direction = -1) +
+  scale_fill_viridis_c(option = "viridis", direction = -1) +
   theme(legend.position = "none")
 p2
 ```
@@ -391,17 +402,17 @@ p2
 p1
 ```
 
-The version faithful and un-faithful forecasts look moderately similar except for the 1 day horizons
-(although neither approach produces amazingly accurate forecasts).
+There are some weeks when the forecasts are somewhat similar, and others when they are wildly different, although neither approach produces amazingly accurate forecasts.
 
 In the version faithful case for California, the March 2021 forecast (turquoise)
 starts at a value just above 10, which is very well lined up with reported values leading up to that forecast.
 The measured and forecasted trends are also concordant (both increasingly moderately fast).
 
 Because the data for this time period was later adjusted down with a decreasing trend, the March 2021 forecast looks quite bad compared to finalized data.
-
 The equivalent version un-faithful forecast starts at a value of 5, which is in line with the finalized data but would have been out of place compared to the version data.
 
+The October 2021 forecast for the version faithful case floors out at zero, whereas the un-faithful is much closer to the finalized data.
+
 ```{r show-plot2, warning = FALSE, echo=FALSE}
 p2
 ```
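For context, the version-faithful comparisons in these hunks are built from archive snapshots. Here is a minimal sketch of taking one snapshot and measuring its latency, assuming `epix_as_of()` accepts the version date as its second argument (as in current `{epiprocess}`; `epix_as_of_current()` is the same helper used elsewhere in this vignette):

```r
library(epiprocess) # assumed to provide archive_cases_dv_subset, as in the vignette
library(dplyr)

# Data as it was known on the first forecast date vs. the finalized data.
snapshot  <- archive_cases_dv_subset %>% epix_as_of(as.Date("2020-08-01"))
finalized <- archive_cases_dv_subset %>% epix_as_of_current()

# The latency is the gap between the forecast date and the newest
# observation available in that snapshot.
as.Date("2020-08-01") - max(snapshot$time_value)
```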

vignettes/custom_epiworkflows.Rmd

Lines changed: 3 additions & 3 deletions
@@ -347,7 +347,7 @@ There are many ways we could modify `four_week_ahead`. We might consider:
   expect there to be a strong seasonal component to the outcome
 - Scaling by a factor
 
-We will demo a couple of these modifications below.
+We will demonstrate a couple of these modifications below.
 
 ## Growth rate
 
@@ -456,9 +456,9 @@ result_plot
 
 ## Population scaling
 
-Suppose we want to modify our predictions to apply to counts, rather than rates.
+Suppose we want to modify our predictions to return a count prediction, rather than the rate prediction.
 To do that, we can adjust _just_ the `frosting` to perform post-processing on our existing rates forecaster.
-Since rates are calculated as counts per 100 000 people, we will convert back to counts by multiplying rates by the factor $\frac{regional \text{ } population}{100000}$.
+Since rates are calculated as counts per 100,000 people, we will convert back to counts by multiplying rates by the factor $\frac{\text{regional population}}{100{,}000}$.
 
 ```{r rate_scale}
 count_layers <-
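As a sanity check on that factor, here is a self-contained sketch of the rate-to-count arithmetic with made-up numbers; in the vignette itself this is handled inside the frosting (e.g., by a population-scaling layer such as `layer_population_scaling()`):

```r
library(dplyr) # re-exports tibble()

# Hypothetical predictions expressed as rates per 100,000 people.
preds <- tibble(
  geo_value = c("ca", "fl"),
  rate_pred = c(0.42, 0.65)
)
pops <- tibble(
  geo_value  = c("ca", "fl"),
  population = c(39.5e6, 21.5e6) # illustrative populations
)

# count = rate * regional population / 100,000
preds %>%
  left_join(pops, by = "geo_value") %>%
  mutate(count_pred = rate_pred * population / 1e5)
```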

vignettes/epipredict.Rmd

Lines changed: 68 additions & 52 deletions
@@ -325,8 +325,18 @@ Instead, we'll use the fluview ILI dataset, which is weekly influenza like illne
 We'll predict the 2023/24 season using all previous data, including 2020-2022, the two years where there was approximately no seasonal flu, forecasting from the start of the season, `2023-10-08`:
 
 ```{r make-climatological-forecast, warning=FALSE}
-fluview_hhs <- pub_fluview(regions = paste0("hhs", 1:10), epiweeks = epirange(100001,222201))
-fluview <- fluview_hhs %>% select(geo_value = region, time_value = epiweek, issue, ili) %>% as_epi_archive() %>% epix_as_of_current()
+fluview_hhs <- pub_fluview(
+  regions = paste0("hhs", 1:10),
+  epiweeks = epirange(100001, 222201)
+)
+fluview <- fluview_hhs %>%
+  select(
+    geo_value = region,
+    time_value = epiweek,
+    issue,
+    ili) %>%
+  as_epi_archive() %>%
+  epix_as_of_current()
 
 all_climate <- climatological_forecaster(
   fluview %>% filter(time_value < "2023-10-08"),
@@ -343,7 +353,9 @@ results <- all_climate$predictions
 autoplot(
   object = workflow,
   predictions = results,
-  observed_response = fluview %>% filter(time_value >= "2023-10-08", time_value < "2024-05-01") %>% mutate(geo_value = factor(geo_value, levels = paste0("hhs", 1:10)))
+  observed_response = fluview %>%
+    filter(time_value >= "2023-10-08", time_value < "2024-05-01") %>%
+    mutate(geo_value = factor(geo_value, levels = paste0("hhs", 1:10)))
 )
 ```
 
@@ -412,11 +424,10 @@ The accuracy is 50%, since all 4 states were predicted to be in the interval
 
 If multiple keys are set in the `epi_df` as `other_keys`, `arx_forecaster` will
 automatically group by those in addition to the required geographic key.
-For example, predicting the number of graduates in each of the categories in
+For example, predicting the number of graduates in a subset of the categories in
 `grad_employ_subset` from above:
 
 ```{r multi_key_forecast, warning=FALSE}
-# only fitting a subset, otherwise there are ~550 distinct pairs, which is bad for plotting
 edu_quals <- c("Undergraduate degree", "Professional degree")
 geo_values <- c("Quebec", "British Columbia")
 
@@ -478,7 +489,7 @@ all_fits |>
   list_rbind()
 ```
 
-Estimating separate models for each geography is both 56 times slower[^7] than geo-pooling, and uses far less data for each estimate.
+Estimating separate models for each geography uses far less data for each estimate than geo-pooling and is 56 times slower[^7].
 If a dataset contains relatively few observations for each geography, fitting a geo-pooled model is likely to produce better, more stable results.
 However, geo-pooling can only be used if values are comparable in meaning and scale across geographies or can be made comparable, for example by normalization.
 
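One hedged sketch of such a normalization, standardizing each geography's series by its own mean and standard deviation before pooling (illustrative only; rates per 100,000 are themselves one such normalization):

```r
library(dplyr)

# Assumes covid_case_death_rates, the dataset used earlier in this vignette.
# Each geography's death rate is centered and scaled by its own statistics,
# putting all series on a comparable scale before a geo-pooled fit.
normalized <- covid_case_death_rates %>%
  group_by(geo_value) %>%
  mutate(
    death_rate_std = (death_rate - mean(death_rate, na.rm = TRUE)) /
      sd(death_rate, na.rm = TRUE)
  ) %>%
  ungroup()
```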
@@ -488,7 +499,56 @@ workflow](custom_epiworkflows) with geography as a factor.
 
 # Anatomy of a canned forecaster
 
-This section describes the resulting object from `arx_forecaster()`, an `arx_fcast` object, along with a fairly minimal description of the actual mathematical model used for `arx_forecaster()`.
+This section gives a fairly minimal description of the mathematical model used by `arx_forecaster()`, followed by a description of the `arx_fcast` object it returns.
+
+## Mathematical description
+
+Let's look at the mathematical details of the model, using a minimal version of
+`four_week_ahead`:
+
+```{r, four_week_again}
+four_week_small <- arx_forecaster(
+  covid_case_death_rates |> filter(time_value <= forecast_date),
+  outcome = "death_rate",
+  predictors = c("case_rate", "death_rate"),
+  args_list = arx_args_list(
+    lags = list(c(0, 7, 14), c(0, 7, 14)),
+    ahead = 4 * 7,
+    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
+  )
+)
+hardhat::extract_fit_engine(four_week_small$epi_workflow)
+```
+
+If $d_{t,j}$ is the death rate on day $t$ at location $j$ and $c_{t,j}$ is the
+associated case rate, then the corresponding model is:
+
+$$
+\begin{aligned}
+d_{t+28, j} = {} & a_0 + a_1 d_{t,j} + a_2 d_{t-7,j} + a_3 d_{t-14, j} + {} \\
+ & a_4 c_{t, j} + a_5 c_{t-7, j} + a_6 c_{t-14, j} + \varepsilon_{t,j}.
+\end{aligned}
+$$
+
+For example, $a_1$ is `lag_0_death_rate` above, with a value of
+`r round(hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_0_death_rate"], 3)`,
+while $a_5$ is
+`r round(hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_7_case_rate"], 4)`.
+Note that unlike $d_{t,j}$ or $c_{t,j}$, these coefficients *don't* depend on either the time
+$t$ or the location $j$.
+This is what makes it a geo-pooled model.
+
+The training data for estimating the parameters of this linear model is
+constructed within the `arx_forecaster()` function by shifting a series of
+columns the appropriate amount -- based on the requested `lags`.
+Each row containing no `NA` values in the predictors is used as a training
+observation to fit the coefficients $a_0, \ldots, a_6$.
+
+The equation above is only an accurate description of the model for a linear
+engine like `quantile_reg()` or `linear_reg()`; a nonlinear model like
+`rand_forest(mode = "regression")` will use the same input variables and
+training data, but fit the appropriate model for them.
 
 ## Code object
 Let's dissect the forecaster we trained back on the [landing
@@ -551,50 +611,6 @@ An `epi_workflow()` consists of 3 parts:
 See the [Custom Epiworkflows vignette](custom_epiworkflows) for recreating and then
 extending `four_week_ahead` using the custom forecaster framework.
 
-## Mathematical description
-
-Let's look at the mathematical details of the model in more detail, using a minimal version of
-`four_week_ahead`:
-
-```{r, four_week_again}
-four_week_small <- arx_forecaster(
-  covid_case_death_rates |> filter(time_value <= forecast_date),
-  outcome = "death_rate",
-  predictors = c("case_rate", "death_rate"),
-  args_list = arx_args_list(
-    lags = list(c(0, 7, 14), c(0, 7, 14)),
-    ahead = 4 * 7,
-    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
-  )
-)
-hardhat::extract_fit_engine(four_week_small$epi_workflow)
-```
-
-If $d_{t,j}$ is the death rate on day $t$ at location $j$ and $c_{t,j}$ is the
-associated case rate, then the corresponding model is:
-
-$$
-\begin{aligned}
-d_{t+28, j} = & a_0 + a_1 d_{t,j} + a_2 d_{t-7,j} + a_3 d_{t-14, j} +\\
- & a_4 c_{t, j} + a_5 c_{t-7, j} + a_6 c_{t-14, j} + \varepsilon_{t,j}.
-\end{aligned}
-$$
-
-For example, $a_1$ is `lag_0_death_rate` above, with a value of `r
-round(hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_0_death_rate"],
-3)`,
-while $a_5$ is `r
-round(hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_7_case_rate"],
-4) `.
-Note that unlike `d_{t,j}` or `c_{t,j}`, these *don't* depend on either the time
-$t$ or the location $j$.
-This is what make it a geo-pooled model.
-
-The training data for estimating the parameters of this linear model is
-constructed within the `arx_forecaster()` function by shifting a series of
-columns the appropriate amount -- based on the requested `lags`.
-Each row containing no `NA` values in the predictors is used as a training observation to fit the
-coefficients $a_0,\ldots, a_6$.
 
 [^4]: in the case of a `{parsnip}` engine which doesn't explicitly predict
     quantiles, these quantiles are created using `layer_residual_quantiles()`,
@@ -611,4 +627,4 @@
 
 [^8]: It has only a year of data, which is barely enough to run the method without errors, let alone get a meaningful prediction.
 
-[^9]: Though not 28 weeks into the future! Such a forecast will be an absurd extrapolation.
+[^9]: Though not 28 weeks into the future! Such a forecast will likely be absurdly low or high.
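To make the training-data construction described in the new "Mathematical description" section concrete, here is a hedged, by-hand sketch; `arx_forecaster()` does this internally via its preprocessing steps, `lm()` merely stands in for the default `linear_reg()` engine, and it assumes one complete row per day per geography so that `lag(x, 7)` means "7 days ago":

```r
library(dplyr)
library(tidyr)

# Shift each predictor back by the requested lags and the outcome forward
# by the ahead, within each geography.
train <- covid_case_death_rates %>%
  group_by(geo_value) %>%
  arrange(time_value, .by_group = TRUE) %>%
  mutate(
    lag_0_death_rate    = death_rate,
    lag_7_death_rate    = lag(death_rate, 7),
    lag_14_death_rate   = lag(death_rate, 14),
    lag_0_case_rate     = case_rate,
    lag_7_case_rate     = lag(case_rate, 7),
    lag_14_case_rate    = lag(case_rate, 14),
    ahead_28_death_rate = lead(death_rate, 28)
  ) %>%
  ungroup() %>%
  drop_na() # rows with any NA predictor or outcome are not used

# One pooled fit across all geographies yields a_0, ..., a_6.
lm(
  ahead_28_death_rate ~ lag_0_death_rate + lag_7_death_rate + lag_14_death_rate +
    lag_0_case_rate + lag_7_case_rate + lag_14_case_rate,
  data = train
)
```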
