diff --git a/.gitignore b/.gitignore
Lines changed: 5 additions & 1 deletion
diff --git a/R/archive.R b/R/archive.R
Lines changed: 9 additions & 8 deletions
diff --git a/R/epi_df.R b/R/epi_df.R
Lines changed: 5 additions & 3 deletions
diff --git a/R/slide.R b/R/slide.R
Lines changed: 69 additions & 1 deletion
diff --git a/README.Rmd b/README.Rmd
Lines changed: 165 additions & 0 deletions
@@ -13,4 +13,8 @@ docs
 renv/
 renv.lock
 .Rprofile
-sandbox.R
+sandbox.R
+# Vignette caches
+*_cache/
+vignettes/*.html
+vignettes/*.R
@@ -147,14 +147,15 @@ next_after.Date <- function(x) x + 1L
 NULL
 
 
-#' Epi Archive
-#'
-#' @title `epi_archive` object
-#'
-#' @description An `epi_archive` is an S3 class which contains a data table
-#'   along with several relevant pieces of metadata. The data table can be seen
-#'   as the full archive (version history) for some signal variables of
-#'   interest.
+#' `epi_archive` object
+#'
+#' The second main data structure for storing time series in `epiprocess`. It is
+#' similar to `epi_df` in that it is fundamentally a table with a few required
+#' columns that stores epidemiological time series data. An `epi_archive`
+#' requires a `geo_value`, `time_value`, and `version` column (and possibly
+#' other key columns) along with measurement values. In brief, an `epi_archive`
+#' is a history of the time series data, where the `version` column tracks the
+#' time at which the data was available. This allows for version-aware forecasting.
 #'
 #' @details An `epi_archive` contains a data table `DT`, of class `data.table`
 #'   from the `data.table` package, with (at least) the following columns:
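
The "history of the time series data" idea in the new `epi_archive` description can be illustrated with a minimal base-R sketch. It is not the `epiprocess` implementation: the helper `snapshot_as_of()` and the toy values are hypothetical, and a plain `data.frame` stands in for the `data.table` that the real class stores. For each `(geo_value, time_value)` key it keeps the most recent row whose `version` is no later than the requested date.

```r
# Toy version history: each row is the value reported for one
# (geo_value, time_value) key in one version. All numbers are made up.
archive <- data.frame(
  geo_value  = c("ca", "ca", "ca", "ca"),
  time_value = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02")),
  version    = as.Date(c("2020-03-02", "2020-03-05", "2020-03-03", "2020-03-04")),
  cases      = c(10, 12, 20, 25)
)

# Hypothetical helper (NOT the epiprocess API): the data as it would have
# looked on `as_of`, i.e. the latest version <= as_of for each key.
snapshot_as_of <- function(archive, as_of) {
  visible <- archive[archive$version <= as_of, ]
  # Sort so the most recent version of each key comes first.
  visible <- visible[order(visible$geo_value, visible$time_value,
                           -as.numeric(visible$version)), ]
  keys <- visible[, c("geo_value", "time_value")]
  latest <- visible[!duplicated(keys), c("geo_value", "time_value", "cases")]
  rownames(latest) <- NULL
  latest
}

snap_early <- snapshot_as_of(archive, as.Date("2020-03-03"))
snap_late  <- snapshot_as_of(archive, as.Date("2020-03-05"))
```

Here `snap_early` only sees the first reports, while `snap_late` picks up the later revisions of both dates. Each snapshot has the shape the `epi_df` description refers to: one row per key, as of a point in time.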
 
@@ -1,8 +1,10 @@
 #' `epi_df` object
 #'
-#' An `epi_df` is a tibble with certain minimal column structure and metadata.
-#'   It can be seen as a snapshot of a data set that contains the most
-#'   up-to-date values of some signal variables of interest, as of a given time.
+#' One of the two main data structures for storing time series in `epiprocess`.
+#' It is simply a tibble with at least two columns, `geo_value` and `time_value`,
+#' that provide the keys for the time series. It can have any other columns,
+#' which can be seen as measured variables at each key. In brief, an `epi_df`
+#' represents a snapshot of an epidemiological data set at a point in time.
 #'
 #' @details An `epi_df` is a tibble with (at least) the following columns:
 #'
 
@@ -37,7 +37,75 @@
 #'   into the constituent columns and those names used. New columns should not
 #'   be given names that clash with the existing columns of `.x`; see details.
 #'
-#' @template basic-slide-details
+#' @details To "slide" means to apply a function or formula over a rolling
+#'   window. The `.window_size` arg determines the width of the window
+#'   (including the reference time) and the `.align` arg governs how the window
+#'   is aligned (see below for examples). The `.ref_time_values` arg controls
+#'   which time values to consider for the slide and `.all_rows` allows you to
+#'   keep NAs around.
+#'
+#'   `epi_slide()` does not require a complete window (such as on the left
+#'   boundary of the dataset) and will attempt to perform the computation
+#'   anyway. The issue of what to do with partial computations (those run on
+#'   incomplete windows) is therefore left up to the user, either through the
+#'   specified function or formula, or through post-processing.
+#'
+#'   Let's look at some window examples, assuming that the reference time value
+#'   is "tv". With .align = "right" and .window_size = 3, the window will be:
+#'
+#'   time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
+#'   window:              tv - 2, tv - 1, tv
+#'
+#'   With .align = "center" and .window_size = 3, the window will be:
+#'
+#'   time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
+#'   window:                      tv - 1, tv, tv + 1
+#'
+#'   With .align = "center" and .window_size = 4, the window will be:
+#'
+#'   time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
+#'   window:              tv - 2, tv - 1, tv, tv + 1
+#'
+#'   With .align = "left" and .window_size = 3, the window will be:
+#'
+#'   time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
+#'   window:                              tv, tv + 1, tv + 2
+#'
+#'   If `.f` is missing, then ["data-masking"][rlang::args_data_masking]
+#'   expression(s) for tidy evaluation can be specified, for example, as in:
+#'   ```
+#'   epi_slide(x, cases_7dav = mean(cases), .window_size = 7)
+#'   ```
+#'   which would be equivalent to:
+#'   ```
+#'   epi_slide(x, function(x, g, t) mean(x$cases), .window_size = 7,
+#'             .new_col_name = "cases_7dav")
+#'   ```
+#'   In a manner similar to [`dplyr::mutate`]:
+#'   * Expressions evaluating to length-1 vectors will be recycled to
+#'     appropriate lengths.
+#'   * `, name_var := value` can be used to set the output column name based on
+#'     a variable `name_var` rather than requiring you to use a hard-coded
+#'     name. (The leading comma is needed to make sure that `.f` is treated as
+#'     missing.)
+#'   * `= NULL` can be used to remove results from previous expressions (though
+#'     we don't allow it to remove pre-existing columns).
+#'   * `, fn_returning_a_data_frame(.x)` will unpack the output of the function
+#'     into multiple columns in the result.
+#'   * Named expressions evaluating to data frames will be placed into
+#'     [`tidyr::pack`]ed columns.
+#'
+#'   In addition to [`.data`] and [`.env`], we make some additional
+#'   "pronoun"-like bindings available:
+#'   * `.x`, which is like `.x` in [`dplyr::group_modify`]; an ordinary object
+#'     like an `epi_df` rather than an rlang [pronoun][rlang::as_data_pronoun]
+#'     like [`.data`]; this allows you to use additional `dplyr`, `tidyr`, and
+#'     `epiprocess` operations. If you have multiple expressions in `...`, this
+#'     won't let you refer to the output of the earlier expressions, but `.data`
+#'     will.
+#'   * `.group_key`, which is like `.y` in [`dplyr::group_modify`].
+#'   * `.ref_time_value`, which is the element of `.ref_time_values` that
+#'     determined the time window for the current computation.
 #'
 #' @importFrom lubridate days weeks
 #' @importFrom dplyr bind_rows group_map group_vars filter select
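
The window examples in the new `@details` text can be reproduced with a small base-R sketch. The helpers `window_offsets()` and `slide_mean_partial()` are hypothetical and are not part of `epiprocess`; they assume evenly spaced time values and only mirror the alignment rules listed in the examples, including the even-sized centered window that takes its extra step before `tv`.

```r
# Hypothetical helper (NOT part of epiprocess): offsets, relative to the
# reference time value tv, covered by a window of `window_size` steps.
window_offsets <- function(window_size, align = c("right", "center", "left")) {
  align <- match.arg(align)
  before <- switch(align,
    right  = window_size - 1L,
    center = window_size %/% 2L, # even sizes put the extra step before tv
    left   = 0L
  )
  seq(-before, window_size - 1L - before)
}

# Trailing ("right"-aligned) mean that still computes on the incomplete
# windows at the left edge of the series instead of returning NA there.
slide_mean_partial <- function(x, window_size) {
  vapply(seq_along(x), function(i) {
    idx <- i + window_offsets(window_size, "right")
    mean(x[idx[idx >= 1L & idx <= length(x)]])
  }, numeric(1))
}

partial_means <- slide_mean_partial(c(1, 2, 3, 4), 3)
```

The first two entries of `partial_means` come from windows with only one and two observations. That is the "partial computation" case that the documentation leaves to the user to handle, for example by dropping or flagging those rows afterwards.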
 
@@ -0,0 +1,165 @@
+---
+output: github_document
+---
+
+<!-- README.md is generated from README.Rmd. Please edit that file -->
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>",
+  fig.path = "man/figures/README-",
+  out.width = "100%"
+)
+ggplot2::theme_set(ggplot2::theme_bw())
+```
+
+# epiprocess
+
+## TODO: Condense these paragraphs
+
+The [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) package works
+with epidemiological time series data to provide situational
+awareness, processing, and transformations in preparation for modeling, and
+version-faithful model backtesting. It contains:
+
+- `epi_df`, a class for working with epidemiological time series data which
+behaves like a tibble (and can be manipulated with
+[`{dplyr}`](https://dplyr.tidyverse.org/)-esque "verbs") but with some
+additional structure;
+- `epi_archive`, a class for working with the version history of such time series data;
+- sample epidemiological data in these formats.
+
+This package is provided by the Delphi group at Carnegie Mellon University. The
+Delphi group provides many tools and also hosts the Delphi Epidata API, which provides access to a wide
+range of epidemiological data sets, including COVID-19 data, flu data, and more.
+This package is designed to work seamlessly with the data in the Delphi Epidata
+API, which can be accessed using the `epidatr` package.
+
+It is part of a broader suite of packages that includes
+[`{epipredict}`](https://cmu-delphi.github.io/epipredict/),
+[`{epidatr}`](https://cmu-delphi.github.io/epidatr/),
+[`{rtestim}`](https://dajmcdon.github.io/rtestim/), and
+[`{epidatasets}`](https://cmu-delphi.github.io/epidatasets/), for accessing,
+analyzing, and forecasting epidemiological time series data. We have expanded
+documentation and demonstrations for some of these packages available in an
+online "book" format [here](https://cmu-delphi.github.io/delphi-tooling-book/).
+
+## Motivation
+
+[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) and
+[`{epipredict}`](https://cmu-delphi.github.io/epipredict/) are designed to lower
+the barrier to entry and implementation cost for epidemiological time series
+analysis and forecasting. Epidemiologists and forecasting groups repeatedly and
+separately have had to rush to implement this type of functionality in a much
+more ad hoc manner; we are trying to save such effort in the future by providing
+well-documented, tested, and general packages that can be called for many common
+tasks instead.
+
+## Installation
+
+To install:
+
+```{r, eval=FALSE}
+# Stable version
+pak::pkg_install("cmu-delphi/epiprocess@main")
+
+# Dev version
+pak::pkg_install("cmu-delphi/epiprocess@dev")
+```
+
+The package is not yet on CRAN.
+
+## Usage
+
+Once `epiprocess` and `epidatr` are installed, you can use the following code to
+get started:
+
+```{r, results=FALSE, warning=FALSE, message=FALSE}
+library(epiprocess)
+library(epidatr)
+library(dplyr)
+library(magrittr)
+```
+
+Get COVID-19 confirmed cumulative case data from JHU CSSE for California,
+Florida, New York, and Texas, from March 1, 2020 to January 31, 2022.
+
+```{r cache=TRUE}
+df <- pub_covidcast(
+  source = "jhu-csse",
+  signals = "confirmed_cumulative_num",
+  geo_type = "state",
+  time_type = "day",
+  geo_values = "ca,fl,ny,tx",
+  time_values = epirange(20200301, 20220131),
+) %>%
+  select(geo_value, time_value, cases_cumulative = value)
+df
+```
+
+Convert the data to an `epi_df` object and sort by `geo_value` and `time_value`. You
+can work with the `epi_df` object like a tibble using `dplyr`.
+
+```{r}
+edf <- df %>%
+  as_epi_df() %>%
+  arrange_canonical() %>%
+  group_by(geo_value) %>%
+  mutate(cases_daily = cases_cumulative - lag(cases_cumulative, default = 0))
+edf
+```
+
+Autoplot the confirmed cumulative cases for each `geo_value`.
+
+```{r}
+edf %>%
+  autoplot(cases_cumulative)
+```
+
+Compute the 7-day moving average of the confirmed daily cases for each `geo_value`.
+
+```{r}
+edf %>%
+  group_by(geo_value) %>%
+  epi_slide_mean(cases_daily, .window_size = 7, na.rm = TRUE)
+```
+
+Compute the growth rate of the confirmed cumulative cases for each `geo_value`.
+
+```{r}
+edf %>%
+  group_by(geo_value) %>%
+  mutate(cases_growth = growth_rate(x = time_value, y = cases_cumulative, method = "rel_change", h = 7))
+```
+
+Detect outliers in the confirmed daily cases for each `geo_value`.
+
+```{r}
+edf %>%
+  group_by(geo_value) %>%
+  mutate(outlier_info = detect_outlr(x = time_value, y = cases_daily)) %>%
+  ungroup()
+```
+
+Add a column to the `epi_df` object with the daily deaths for each `geo_value` and
+compute the correlations between cases and deaths for each `geo_value`.
+
+```{r cache=TRUE}
+df <- pub_covidcast(
+  source = "jhu-csse",
+  signals = "deaths_incidence_num",
+  geo_type = "state",
+  time_type = "day",
+  geo_values = "ca,fl,ny,tx",
+  time_values = epirange(20200301, 20220131),
+) %>%
+  select(geo_value, time_value, deaths_daily = value) %>%
+  as_epi_df() %>%
+  arrange_canonical()
+edf <- inner_join(edf, df, by = c("geo_value", "time_value"))
+edf %>%
+  group_by(geo_value) %>%
+  epi_slide_mean(deaths_daily, .window_size = 7, na.rm = TRUE) %>%
+  epi_cor(cases_daily, deaths_daily)
+```
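
The `cases_daily` step in the README usage chunk differences the cumulative counts within each `geo_value`, using `lag(..., default = 0)`. The base-R sketch below shows the same arithmetic on made-up numbers, with `ave()` standing in for `group_by()` plus `mutate()`. The data frame `toy` is hypothetical and nothing here calls `epiprocess`.

```r
# Made-up cumulative counts for two locations, already sorted by time
# within each geo_value.
toy <- data.frame(
  geo_value        = c("ca", "ca", "ca", "tx", "tx", "tx"),
  cases_cumulative = c(5, 8, 8, 2, 2, 6)
)

# Within each geo_value, subtract the previous cumulative value; the first
# row is differenced against 0, mirroring lag(..., default = 0).
toy$cases_daily <- ave(
  toy$cases_cumulative, toy$geo_value,
  FUN = function(x) diff(c(0, x))
)
```

One consequence of the `default = 0` choice is that the first row of each group carries the entire cumulative count up to that date rather than a single day's increment. That may be worth keeping in mind when the series does not start at the beginning of the outbreak.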