Commit 58bf9d8

wip doc: README and Getting Started
1 parent 63cb820 commit 58bf9d8

16 files changed, with 958 additions and 629 deletions

.gitignore

+5-1
Original file line number | Diff line number | Diff line change
@@ -13,4 +13,8 @@ docs
1313
renv/
1414
renv.lock
1515
.Rprofile
16-
sandbox.R
16+
sandbox.R
17+
# Vignette caches
18+
*_cache/
19+
vignettes/*.html
20+
vignettes/*.R

R/archive.R

+9-8
@@ -147,14 +147,15 @@ next_after.Date <- function(x) x + 1L
147147
NULL
148148

149149

150-
#' Epi Archive
151-
#'
152-
#' @title `epi_archive` object
153-
#'
154-
#' @description An `epi_archive` is an S3 class which contains a data table
155-
#' along with several relevant pieces of metadata. The data table can be seen
156-
#' as the full archive (version history) for some signal variables of
157-
#' interest.
150+
#' `epi_archive` object
151+
#'
152+
#' The second main data structure for storing time series in `epiprocess`. It is
153+
#' similar to `epi_df` in that it is fundamentally a table with a few required
154+
#' columns that stores epidemiological time series data. An `epi_archive`
155+
#' requires a `geo_value`, `time_value`, and `version` column (and possibly
156+
#' other key columns) along with measurement values. In brief, an `epi_archive`
157+
#' is a history of the time series data, where the `version` column tracks the
158+
#' time at which the data was available. This allows for version-aware forecasting.
158159
#'
159160
#' @details An `epi_archive` contains a data table `DT`, of class `data.table`
160161
#' from the `data.table` package, with (at least) the following columns:
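
Review note: the "history of the time series" idea in the new description can be illustrated without the package. The sketch below uses a plain base-R data frame with the three required columns and a hypothetical helper, `snapshot_as_of()` (made up for this note, not an `epiprocess` function), that keeps for each `(geo_value, time_value)` key the latest row whose `version` is no later than a given date. The numbers are invented.

```r
# Toy version history: the value reported for 2020-06-01 was revised once.
archive <- data.frame(
  geo_value  = c("ca", "ca", "ca"),
  time_value = as.Date(c("2020-06-01", "2020-06-01", "2020-06-02")),
  version    = as.Date(c("2020-06-02", "2020-06-05", "2020-06-03")),
  cases      = c(100, 120, 90)
)

# Hypothetical helper (not part of epiprocess): the data as it was known on
# `as_of`, i.e. the most recent version of each key available by that date.
snapshot_as_of <- function(archive, as_of) {
  known <- archive[archive$version <= as_of, , drop = FALSE]
  # Sort so that the newest version of each key comes first, then keep it.
  known <- known[order(known$geo_value, known$time_value,
                       -as.numeric(known$version)), ]
  latest <- known[!duplicated(known[c("geo_value", "time_value")]), ]
  rownames(latest) <- NULL
  latest[c("geo_value", "time_value", "cases")]
}

early <- snapshot_as_of(archive, as.Date("2020-06-03"))  # sees cases = 100
late  <- snapshot_as_of(archive, as.Date("2020-06-05"))  # sees the revision, 120
```

A forecaster backtesting on 2020-06-03 should train on `early`, not `late`; that is the sense in which the `version` column enables version-aware forecasting.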

R/epi_df.R

+5-3
@@ -1,8 +1,10 @@
11
#' `epi_df` object
22
#'
3-
#' An `epi_df` is a tibble with certain minimal column structure and metadata.
4-
#' It can be seen as a snapshot of a data set that contains the most
5-
#' up-to-date values of some signal variables of interest, as of a given time.
3+
#' One of the two main data structures for storing time series in `epiprocess`.
4+
#' It is simply a tibble with at least two columns, `geo_value` and `time_value`,
5+
#' that provide the keys for the time series. It can have any other columns,
6+
#' which can be seen as measured variables at each key. In brief, an `epi_df`
7+
#' represents a snapshot of an epidemiological data set at a point in time.
68
#'
79
#' @details An `epi_df` is a tibble with (at least) the following columns:
810
#'

R/slide.R

+69-1
@@ -37,7 +37,75 @@
3737
#' into the constituent columns and those names used. New columns should not
3838
#' be given names that clash with the existing columns of `.x`; see details.
3939
#'
40-
#' @template basic-slide-details
40+
#' @details To "slide" means to apply a function or formula over a rolling
41+
#' window. The `.window_size` arg determines the width of the window
42+
#' (including the reference time) and the `.align` arg governs how the window
43+
#' is aligned (see below for examples). The `.ref_time_values` arg controls
44+
#' which time values to consider for the slide and `.all_rows` controls
45+
#' whether rows outside `.ref_time_values` are kept in the output (filled with NAs).
46+
#'
47+
#' `epi_slide()` does not require a complete window (such as on the left
48+
#' boundary of the dataset) and will attempt to perform the computation
49+
#' anyway. The issue of what to do with partial computations (those run on
50+
#' incomplete windows) is therefore left up to the user, either through the
51+
#' specified function or formula, or through post-processing.
52+
#'
53+
#' Let's look at some window examples, assuming that the reference time value
54+
#' is "tv". With .align = "right" and .window_size = 3, the window will be:
55+
#'
56+
#' time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
57+
#' window: tv - 2, tv - 1, tv
58+
#'
59+
#' With .align = "center" and .window_size = 3, the window will be:
60+
#'
61+
#' time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
62+
#' window: tv - 1, tv, tv + 1
63+
#'
64+
#' With .align = "center" and .window_size = 4, the window will be:
65+
#'
66+
#' time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
67+
#' window: tv - 2, tv - 1, tv, tv + 1
68+
#'
69+
#' With .align = "left" and .window_size = 3, the window will be:
70+
#'
71+
#' time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
72+
#' window: tv, tv + 1, tv + 2
73+
#'
74+
#' If `.f` is missing, then ["data-masking"][rlang::args_data_masking]
75+
#' expression(s) for tidy evaluation can be specified, for example, as in:
76+
#' ```
77+
#' epi_slide(x, cases_7dav = mean(cases), .window_size = 7)
78+
#' ```
79+
#' which would be equivalent to:
80+
#' ```
81+
#' epi_slide(x, function(x, g, t) mean(x$cases), .window_size = 7,
82+
#' .new_col_name = "cases_7dav")
83+
#' ```
84+
#' In a manner similar to [`dplyr::mutate`]:
85+
#' * Expressions evaluating to length-1 vectors will be recycled to
86+
#' appropriate lengths.
87+
#' * `, name_var := value` can be used to set the output column name based on
88+
#' a variable `name_var` rather than requiring you to use a hard-coded
89+
#' name. (The leading comma is needed to make sure that `.f` is treated as
90+
#' missing.)
91+
#' * `= NULL` can be used to remove results from previous expressions (though
92+
#' we don't allow it to remove pre-existing columns).
93+
#' * `, fn_returning_a_data_frame(.x)` will unpack the output of the function
94+
#' into multiple columns in the result.
95+
#' * Named expressions evaluating to data frames will be placed into
96+
#' [`tidyr::pack`]ed columns.
97+
#'
98+
#' In addition to [`.data`] and [`.env`], we make some other
99+
#' "pronoun"-like bindings available:
100+
#' * .x, which is like `.x` in [`dplyr::group_modify`]; an ordinary object
101+
#' like an `epi_df` rather than an rlang [pronoun][rlang::as_data_pronoun]
102+
#' like [`.data`]; this allows you to use additional `dplyr`, `tidyr`, and
103+
#' `epiprocess` operations. If you have multiple expressions in `...`, this
104+
#' won't let you refer to the output of the earlier expressions, but `.data`
105+
#' will.
106+
#' * .group_key, which is like `.y` in [`dplyr::group_modify`].
107+
#' * .ref_time_value, which is the element of `.ref_time_values` that
108+
#' determined the time window for the current computation.
41109
#'
42110
#' @importFrom lubridate days weeks
43111
#' @importFrom dplyr bind_rows group_map group_vars filter select
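
Review note: the alignment examples and the partial-window behaviour described in the new `@details` text can be checked with a small base-R sketch. Both helpers below, `window_offsets()` and `slide_partial()`, are hypothetical names introduced for illustration; they mirror the documented examples and are not the package's implementation.

```r
# Offsets (relative to the reference time value) covered by a window,
# following the alignment examples in the @details text.
window_offsets <- function(window_size, align = c("right", "center", "left")) {
  align <- match.arg(align)
  before <- switch(align,
    right  = window_size - 1L,
    center = window_size %/% 2L,
    left   = 0L
  )
  after <- window_size - 1L - before
  seq(-before, after)
}

window_offsets(3L, "right")   # -2 -1  0
window_offsets(3L, "center")  # -1  0  1
window_offsets(4L, "center")  # -2 -1  0  1
window_offsets(3L, "left")    #  0  1  2

# A slide that tolerates incomplete windows: offsets that fall outside the
# data are dropped, so results near the boundary use fewer values.
slide_partial <- function(x, f, window_size, align = "right") {
  offsets <- window_offsets(window_size, align)
  vapply(seq_along(x), function(i) {
    idx <- i + offsets
    f(x[idx[idx >= 1L & idx <= length(x)]])
  }, numeric(1))
}

slide_partial(c(1, 2, 3, 4, 5), mean, window_size = 3L)  # 1.0 1.5 2.0 3.0 4.0
```

The first two results are computed on one and two values respectively, which is the kind of partial computation the text leaves to the user to keep, mask, or post-process.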

README.Rmd

+165
@@ -0,0 +1,165 @@
1+
---
2+
output: github_document
3+
---
4+
5+
<!-- README.md is generated from README.Rmd. Please edit that file -->
6+
7+
```{r, include = FALSE}
8+
knitr::opts_chunk$set(
9+
collapse = TRUE,
10+
comment = "#>",
11+
fig.path = "man/figures/README-",
12+
out.width = "100%"
13+
)
14+
ggplot2::theme_set(ggplot2::theme_bw())
15+
```
16+
17+
# epiprocess
18+
19+
## TODO: Condense these paragraphs
20+
21+
The [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) package works
22+
with epidemiological time series data to provide situational
23+
awareness, processing, and transformations in preparation for modeling, and
24+
version-faithful model backtesting. It contains:
25+
26+
- `epi_df`, a class for working with epidemiological time series data which
27+
behaves like a tibble (and can be manipulated with
28+
[`{dplyr}`](https://dplyr.tidyverse.org/)-esque "verbs") but with some
29+
additional structure;
30+
- `epi_archive`, a class for working with the version history of such time series data;
31+
- sample epidemiological data in these formats.
32+
33+
This package is provided by the Delphi group at Carnegie Mellon University. The
34+
Delphi group provides many tools and also hosts the Delphi Epidata API, which provides access to a wide
35+
range of epidemiological data sets, including COVID-19 data, flu data, and more.
36+
This package is designed to work seamlessly with the data in the Delphi Epidata
37+
API, which can be accessed using the `epidatr` package.
38+
39+
It is part of a broader suite of packages that includes
40+
[`{epipredict}`](https://cmu-delphi.github.io/epipredict/),
41+
[`{epidatr}`](https://cmu-delphi.github.io/epidatr/),
42+
[`{rtestim}`](https://dajmcdon.github.io/rtestim/), and
43+
[`{epidatasets}`](https://cmu-delphi.github.io/epidatasets/), for accessing,
44+
analyzing, and forecasting epidemiological time series data. We have expanded
45+
documentation and demonstrations for some of these packages available in an
46+
online "book" format [here](https://cmu-delphi.github.io/delphi-tooling-book/).
47+
48+
## Motivation
49+
50+
[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) and
51+
[`{epipredict}`](https://cmu-delphi.github.io/epipredict/) are designed to lower
52+
the barrier to entry and implementation cost for epidemiological time series
53+
analysis and forecasting. Epidemiologists and forecasting groups repeatedly and
54+
separately have had to rush to implement this type of functionality in a much
55+
more ad hoc manner; we are trying to save such effort in the future by providing
56+
well-documented, tested, and general packages that can be called for many common
57+
tasks instead.
58+
59+
## Installation
60+
61+
To install:
62+
63+
```{r, eval=FALSE}
64+
# Stable version
65+
pak::pkg_install("cmu-delphi/epiprocess@main")
66+
67+
# Dev version
68+
pak::pkg_install("cmu-delphi/epiprocess@dev")
69+
```
70+
71+
The package is not yet on CRAN.
72+
73+
## Usage
74+
75+
Once `epiprocess` and `epidatr` are installed, you can use the following code to
76+
get started:
77+
78+
```{r, results=FALSE, warning=FALSE, message=FALSE}
79+
library(epiprocess)
80+
library(epidatr)
81+
library(dplyr)
82+
library(magrittr)
83+
```
84+
85+
Get COVID-19 confirmed cumulative case data from JHU CSSE for California,
86+
Florida, New York, and Texas, from March 1, 2020 to January 31, 2022:
87+
88+
```{r cache=TRUE}
89+
df <- pub_covidcast(
90+
source = "jhu-csse",
91+
signals = "confirmed_cumulative_num",
92+
geo_type = "state",
93+
time_type = "day",
94+
geo_values = "ca,fl,ny,tx",
95+
time_values = epirange(20200301, 20220131),
96+
) %>%
97+
select(geo_value, time_value, cases_cumulative = value)
98+
df
99+
```
100+
101+
Convert the data to an `epi_df` object, sort it by `geo_value` and `time_value`,
102+
and derive daily cases from the cumulative counts. You can work with an `epi_df` like a tibble using `dplyr`:
103+
104+
```{r}
105+
edf <- df %>%
106+
as_epi_df() %>%
107+
arrange_canonical() %>%
108+
group_by(geo_value) %>%
109+
mutate(cases_daily = cases_cumulative - lag(cases_cumulative, default = 0))
110+
edf
111+
```
112+
113+
Autoplot the confirmed cumulative cases for each `geo_value`:
114+
115+
```{r}
116+
edf %>%
117+
autoplot(cases_cumulative)
118+
```
119+
120+
Compute the 7-day moving average of the confirmed daily cases for each `geo_value`:
121+
122+
```{r}
123+
edf %>%
124+
group_by(geo_value) %>%
125+
epi_slide_mean(cases_daily, .window_size = 7, na.rm = TRUE)
126+
```
127+
128+
Compute the growth rate of the confirmed cumulative cases for each `geo_value`:
129+
130+
```{r}
131+
edf %>%
132+
group_by(geo_value) %>%
133+
mutate(cases_growth = growth_rate(x = time_value, y = cases_cumulative, method = "rel_change", h = 7))
134+
```
135+
136+
Detect outliers in the confirmed daily cases for each `geo_value`:
137+
138+
```{r}
139+
edf %>%
140+
group_by(geo_value) %>%
141+
mutate(outlier_info = detect_outlr(x = time_value, y = cases_daily)) %>%
142+
ungroup()
143+
```
144+
145+
Add a column to the `epi_df` object with the daily deaths for each `geo_value` and
146+
compute the correlations between cases and deaths for each `geo_value`:
147+
148+
```{r cache=TRUE}
149+
df <- pub_covidcast(
150+
source = "jhu-csse",
151+
signals = "deaths_incidence_num",
152+
geo_type = "state",
153+
time_type = "day",
154+
geo_values = "ca,fl,ny,tx",
155+
time_values = epirange(20200301, 20220131),
156+
) %>%
157+
select(geo_value, time_value, deaths_daily = value) %>%
158+
as_epi_df() %>%
159+
arrange_canonical()
160+
edf <- inner_join(edf, df, by = c("geo_value", "time_value"))
161+
edf %>%
162+
group_by(geo_value) %>%
163+
epi_slide_mean(deaths_daily, .window_size = 7, na.rm = TRUE) %>%
164+
epi_cor(cases_daily, deaths_daily)
165+
```
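
Review note: the README's `mutate()` step derives `cases_daily` by differencing the cumulative series within each `geo_value`, with `lag(..., default = 0)` treating the count before the first day as zero. A base-R sketch of the same arithmetic on made-up numbers follows; the helper name `daily_from_cumulative()` is hypothetical.

```r
# Made-up cumulative counts for two locations, already sorted by time.
toy <- data.frame(
  geo_value        = c("ca", "ca", "ca", "ny", "ny", "ny"),
  cases_cumulative = c(10, 15, 15, 3, 8, 20)
)

# Hypothetical helper: first differences, with an implicit 0 before day one.
daily_from_cumulative <- function(x) diff(c(0, x))

# ave() applies the helper separately within each geo_value.
toy$cases_daily <- ave(toy$cases_cumulative, toy$geo_value,
                       FUN = daily_from_cumulative)
toy$cases_daily  # 10 5 0 3 5 12
```

One consequence worth noting: if a series starts after cases have already begun accumulating, the first daily value equals the entire cumulative count to date, so it may be worth dropping or flagging that first row.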
