cmu-delphi · dsweber2 · Jan 23, 2025
@@ -39,7 +39,7 @@ Imports:
     lifecycle,
     lubridate,
     magrittr,
-    recipes (>= 1.0.4),
+    recipes (>= 1.1.1),
     rlang (>= 1.1.0),
     stats,
     tibble,
@@ -53,6 +53,7 @@ Suggests:
     epidatr (>= 1.0.0),
     fs,
     grf,
+    here,
     knitr,
     poissonreg,
     purrr,

@@ -35,10 +35,13 @@ R -e 'devtools::document()'
 R -e 'pkgdown::build_site()'
 ```
 
+Note that sometimes the caches from either `pkgdown` or `knitr` can cause
+difficulties. To clear those, run `make`, with either `clean_knitr`,
+`clean_site`, or `clean` (which does both).
+
 If you work without R Studio and want to iterate on documentation, you might
-find [this
-script](https://gist.github.com/gadenbuie/d22e149e65591b91419e41ea5b2e0621)
-helpful.
+find `Rscript pkgdown-watch.R` useful. For updating references, you will need
+to manually call `pkgdown::build_reference()`.
 
 ## Versioning
 

@@ -0,0 +1,14 @@
+##
+# epipredict docs build
+#
+
+# knitr doesn't actually clean its own cache properly; this just deletes any of
+# the article knitr caches in vignettes or the base directory
+clean_knitr:
+	rm -rf *_cache; rm -rf vignettes/*_cache
+clean_site:
+	Rscript -e "pkgdown::clean_cache(); pkgdown::clean_site()"
+# this combines both of the above
+clean: clean_knitr clean_site
+
+# end
@@ -38,6 +38,7 @@ Pre-1.0.0 numbering scheme: 0.x will indicate releases, while 0.0.x will indicat
 - Replace `dist_quantiles()` with `hardhat::quantile_pred()`
 - Allow `quantile()` to threshold to an interval if desired (#434)
 - `arx_forecaster()` detects if there's enough data to predict
+- Add `observed_response` to `autoplot` so that forecasts can be plotted against the values they're predicting
 
 ## Bug fixes
 
@@ -69,7 +70,7 @@ Pre-1.0.0 numbering scheme: 0.x will indicate releases, while 0.0.x will indicat
 - training window step debugged
 - `min_train_window` argument removed from canned forecasters
 - add forecasters
-- implement postprocessing
+- implement post-processing
 - vignettes avaliable
 - arx_forecaster
 - pkgdown

@@ -1,8 +1,106 @@
 #' Direct autoregressive classifier with covariates
 #'
-#' This is an autoregressive classification model for
-#' [epiprocess::epi_df][epiprocess::as_epi_df] data. It does "direct" forecasting, meaning
-#' that it estimates a class at a particular target horizon.
+#'
+#' @description
+#' This is an autoregressive classification model for continuous data. It does
+#'   "direct" forecasting, meaning that it estimates a class at a particular
+#'   target horizon.
+#'
+#' @details
+#' The `arx_classifier()` is an autoregressive classification model for `epi_df`
+#'   data that is used to predict a discrete class for each case under
+#'   consideration.  It is a direct forecaster in that it estimates the classes
+#'   at a specific horizon or ahead value.
+#'
+#' To get a sense of how the `arx_classifier()` works, let's consider a simple
+#'   example with minimal inputs. For this, we will use the built-in
+#'   `covid_case_death_rates` that contains confirmed COVID-19 cases and deaths
+#'   from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this,
+#'   we'll take a subset of data for five states over June 4, 2021 to December
+#'   31, 2021. Our objective is to predict whether the case rates are increasing
+#'   when considering the 0, 7 and 14 day case rates:
+#'
+#' ```{r}
+#' jhu <- covid_case_death_rates %>%
+#'   filter(
+#'     time_value >= "2021-06-04",
+#'     time_value <= "2021-12-31",
+#'     geo_value %in% c("ca", "fl", "tx", "ny", "nj")
+#'   )
+#'
+#' out <- arx_classifier(jhu, outcome = "case_rate", predictors = "case_rate")
+#'
+#' out$predictions
+#' ```
+#'
+#' The key takeaway from the predictions is that there are two prediction
+#'   classes: `(-Inf, 0.25]` and `(0.25, Inf)`. This is because for our goal of
+#'   classification the classes must be discrete. The discretization of the
+#'   real-valued outcome is controlled by the `breaks` argument, which defaults
+#'   to `0.25`. Such breaks will be automatically extended to cover the entire
+#'   real line. For example, the default break of `0.25` is silently extended to
+#'   `breaks = c(-Inf, .25, Inf)` and, therefore, results in two classes:
+#'   `(-Inf, 0.25]` and `(0.25, Inf)`. These two classes are used to discretize
+#'   the outcome. The conversion of the outcome to such classes is handled
+#'   internally. So if discrete classes already exist for the outcome in the
+#'   `epi_df`, then we recommend coding a classifier from scratch using the
+#'   `epi_workflow` framework for more control.
+#'
+#' The `trainer` is a `parsnip` model describing the type of estimation such
+#'   that `mode = "classification"` is enforced. The two typical trainers that
+#'   are used are `parsnip::logistic_reg()` for two classes or
+#'   `parsnip::multinom_reg()` for more than two classes.
+#'
+#' ```{r}
+#' workflows::extract_spec_parsnip(out$epi_workflow)
+#' ```
+#'
+#' From the parsnip model specification, we can see that the trainer used is
+#'   logistic regression, which is expected for our binary outcome. More
+#'   complicated trainers like `parsnip::naive_Bayes()` or
+#'   `parsnip::rand_forest()` may also be used (however, we will stick to the
+#'   basics in this gentle introduction to the classifier).
+#'
+#' If you use the default trainer of logistic regression for binary
+#'   classification and you decide against using the default break of 0.25, then
+#'   you should only input one break so that there are two classification bins
+#'   to properly dichotomize the outcome. For example, let's set a break of 0.5
+#'   instead of relying on the default of 0.25. We can do this by passing 0.5 to
+#'   the `breaks` argument in `arx_class_args_list()` as follows:
+#'
+#' ```{r}
+#' out_break_0.5 <- arx_classifier(
+#'   jhu,
+#'   outcome = "case_rate",
+#'   predictors = "case_rate",
+#'   args_list = arx_class_args_list(
+#'     breaks = 0.5
+#'   )
+#' )
+#'
+#' out_break_0.5$predictions
+#' ```
+#' Indeed, we can observe that the two `.pred_class` values are now
+#'   `(-Inf, 0.5]` and `(0.5, Inf)`. See `help(arx_class_args_list)` for other
+#'   available modifications.
+#'
+#' Additional arguments that may be supplied to `arx_class_args_list()` include
+#'   the expected `lags` and `ahead` arguments for an autoregressive-type model.
+#'   These have default values of 0, 7, and 14 days for the lags of the
+#'   predictors and 7 days ahead of the forecast date for predicting the
+#'   outcome. There is also `n_training` to indicate the upper bound for the
+#'   number of training rows per key. If you would like some practice with using
+#'   this, then remove the filtering command to obtain data within "2021-06-04"
+#'   and "2021-12-31" and instead set `n_training` to be the number of days
+#'   between these two dates, inclusive of the end points. The end results
+#'   should be the same. In addition to `n_training`, there are `forecast_date`
+#'   and `target_date` to specify the date the forecast is created and the date
+#'   it is intended for, respectively. We will not dwell on these here as they
+#'   are not unique to this classifier or absolutely essential to understanding
+#'   how it operates. The remaining arguments will be discussed organically, as
+#'   they are needed to serve our purposes. For information on any remaining
+#'   arguments that are not discussed here, please see the function
+#'   documentation for a complete list and their definitions.
 #'
 #' @inheritParams arx_forecaster
 #' @param outcome A character (scalar) specifying the outcome (in the
@@ -68,9 +166,7 @@ arx_classifier <- function(
   }
   forecast_date <- args_list$forecast_date %||% forecast_date_default
   target_date <- args_list$target_date %||% (forecast_date + args_list$ahead)
-  preds <- forecast(
-    wf,
-  ) %>%
+  preds <- forecast(wf) %>%
     as_tibble() %>%
     select(-time_value)
 
@@ -249,7 +345,7 @@ arx_class_epi_workflow <- function(
 #'   be created using growth rates (as the predictors are) or lagged
 #'   differences. The second case is closer to the requirements for the
 #'   [2022-23 CDC Flusight Hospitalization Experimental Target](https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md).
-#'   See the Classification Vignette for details of how to create a reasonable
+#'   See the [Classification chapter from the forecasting book](https://cmu-delphi.github.io/delphi-tooling-book/arx-classifier.html) for details of how to create a reasonable
 #'   baseline for this case. Selecting `"growth_rate"` (the default) uses
 #'   [epiprocess::growth_rate()] to create the outcome using some of the
 #'   additional arguments below. Choosing `"lag_difference"` instead simply

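The `breaks` extension and the `n_training` exercise described in the `arx_classifier()` documentation above can be checked with a few lines of base R. This is only a sketch: it assumes the internal discretization behaves like base R `cut()` with its default right-closed intervals, which the diff does not confirm, and the `growth` values are made up for illustration.

```r
# Hypothetical growth-rate values, made up for illustration only.
growth <- c(-0.40, 0.10, 0.25, 0.26, 1.30)

# The single default break of 0.25, extended to cover the whole real line.
breaks <- c(-Inf, 0.25, Inf)

# Assumption: discretization works like base R cut(), whose intervals are
# right-closed by default, so a value of exactly 0.25 lands in the lower class.
classes <- cut(growth, breaks = breaks)

nlevels(classes)      # number of classes produced by one break
as.integer(classes)   # class index assigned to each value

# Number of days from 2021-06-04 to 2021-12-31, inclusive of both end points.
# This is the value the documentation suggests trying for `n_training`.
n_days <- as.integer(as.Date("2021-12-31") - as.Date("2021-06-04")) + 1L
n_days                # 211
```

Supplying more than one break yields more than two classes, which is when the documentation suggests `parsnip::multinom_reg()` rather than `parsnip::logistic_reg()`.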
@@ -1,26 +1,29 @@
 #' Direct autoregressive forecaster with covariates
 #'
 #' This is an autoregressive forecasting model for
-#' [epiprocess::epi_df][epiprocess::as_epi_df] data. It does "direct" forecasting, meaning
-#' that it estimates a model for a particular target horizon.
+#' [epiprocess::epi_df][epiprocess::as_epi_df] data. It does "direct"
+#' forecasting, meaning that it estimates a model for a particular target
+#' horizon of `outcome` based on the lags of the `predictors`. See the [Get
+#' started vignette](../articles/epipredict.html) for some worked examples and
+#' [Custom epi_workflows vignette](../articles/custom_epiworkflows.html) for a
+#' recreation using a custom `epi_workflow()`.
 #'
 #'
 #' @param epi_data An `epi_df` object
-#' @param outcome A character (scalar) specifying the outcome (in the
-#'   `epi_df`).
+#' @param outcome A character (scalar) specifying the outcome (in the `epi_df`).
 #' @param predictors A character vector giving column(s) of predictor variables.
-#'   This defaults to the `outcome`. However, if manually specified, only those variables
-#'   specifically mentioned will be used. (The `outcome` will not be added.)
-#'   By default, equals the outcome. If manually specified, does not add the
-#'   outcome variable, so make sure to specify it.
-#' @param trainer A `{parsnip}` model describing the type of estimation.
-#'   For now, we enforce `mode = "regression"`.
-#' @param args_list A list of customization arguments to determine
-#'   the type of forecasting model. See [arx_args_list()].
+#'   This defaults to the `outcome`. However, if manually specified, only those
+#'   variables specifically mentioned will be used: the `outcome` will not be
+#'   added automatically, so make sure to include it if you want it as a
+#'   predictor.
+#' @param trainer A `{parsnip}` model describing the type of estimation.  For
+#'   now, we enforce `mode = "regression"`.
+#' @param args_list A list of customization arguments to determine the type of
+#'   forecasting model. See [arx_args_list()].
 #'
-#' @return A list with (1) `predictions` an `epi_df` of predicted values
-#'   and (2) `epi_workflow`, a list that encapsulates the entire estimation
-#'   workflow
+#' @return An `arx_fcast`, with the fields `predictions` and `epi_workflow`.
+#'   `predictions` is an `epi_df` of predicted values while `epi_workflow` is
+#'   the trained workflow used to make those predictions.
 #' @export
 #' @seealso [arx_fcast_epi_workflow()], [arx_args_list()]
 #'
@@ -29,15 +32,18 @@
 #'   dplyr::filter(time_value >= as.Date("2021-12-01"))
 #'
 #' out <- arx_forecaster(
-#'   jhu, "death_rate",
+#'   jhu,
+#'   "death_rate",
 #'   c("case_rate", "death_rate")
 #' )
 #'
-#' out <- arx_forecaster(jhu, "death_rate",
+#' out <- arx_forecaster(jhu,
+#'   "death_rate",
 #'   c("case_rate", "death_rate"),
 #'   trainer = quantile_reg(),
 #'   args_list = arx_args_list(quantile_levels = 1:9 / 10)
 #' )
+#' out
 arx_forecaster <- function(
     epi_data,
     outcome,
@@ -60,7 +66,7 @@ arx_forecaster <- function(
   forecast_date <- args_list$forecast_date %||% forecast_date_default
 
 
-  preds <- forecast(wf, forecast_date = forecast_date) %>%
+  preds <- forecast(wf) %>%
     as_tibble() %>%
     select(-time_value)
 
@@ -262,10 +268,13 @@ arx_fcast_epi_workflow <- function(
 #' @param quantile_levels Vector or `NULL`. A vector of probabilities to produce
 #'   prediction intervals. These are created by computing the quantiles of
 #'   training residuals. A `NULL` value will result in point forecasts only.
-#' @param symmetrize Logical. The default `TRUE` calculates
-#'   symmetric prediction intervals. This argument only applies when
-#'   residual quantiles are used. It is not applicable with
-#'   `trainer = quantile_reg()`, for example.
+#' @param symmetrize Logical. The default `TRUE` calculates symmetric prediction
+#'   intervals, which is achieved by including both the residuals and their
+#'   negation. This argument only applies when residual quantiles are used. It
+#'   is not applicable with `trainer = quantile_reg()`, for example. Typically,
+#'   one would only want non-symmetric quantiles when increasing trajectories
+#'   are quite different from decreasing ones, such as a strictly positive
+#'   variable near zero.
 #' @param nonneg Logical. The default `TRUE` enforces nonnegative predictions
 #'   by hard-thresholding at 0.
 #' @param quantile_by_key Character vector. Groups residuals by listed keys