Add specialized epix_slide for epi_slide_opt #611

Open
wants to merge 47 commits into base: dev
Changes from 40 commits
Commits
d8700fb
refactor: hoist some epi_slide_opt pre-processing to helpers
brookslogan Feb 26, 2025
9effa5c
feat: WIP epix_epi_slide_opt
brookslogan Feb 26, 2025
7f97236
epix_epi_slide_opt: don't re-order input columns in output
brookslogan Feb 28, 2025
e59e2e9
feat: support before = Inf in epix_epi_slide_opt
brookslogan Mar 3, 2025
486b488
feat(epi_slide_opt): improve feedback when .f is forgotten
brookslogan Mar 3, 2025
df36e2c
fix(epix_epi_slide_opt): support ... forwarding
brookslogan Mar 3, 2025
8b94f79
Clean up some comments&code + note bug
brookslogan Mar 3, 2025
0462a09
fix(epix_epi_slide_opt): on data.table `.f`s with `.align != "right"`
brookslogan Mar 3, 2025
29d2e29
Clear out some more comments
brookslogan Mar 3, 2025
d991590
Fix missing ukey_names arg
brookslogan Mar 4, 2025
0e4f3de
Check for missing & improper ukey_names args
brookslogan Mar 4, 2025
9054d1f
Remove some commented ideas that aren't quick wins
brookslogan Mar 4, 2025
0a69ce2
WIP cleaning up approx_equal
brookslogan Mar 5, 2025
5c5b098
fix(apply_compactify): avoid arrange on data.table, `i` parsing issues
brookslogan Mar 5, 2025
6d03a0f
perf: speed up compactification with `approx_equal`
brookslogan Mar 5, 2025
f74485a
fix(approx_equal): missing import
brookslogan Mar 5, 2025
2a42e19
docs(approx_equal): roxygen2 + comment on inconsistencies/bugs
brookslogan Mar 5, 2025
e84f3dc
fix(approx_equal): consistency with vec_slice(na_equal=FALSE)
brookslogan Mar 5, 2025
a3a52c0
fix(approx_equal): on bare numeric matrices
brookslogan Mar 5, 2025
32a1c79
feat: approx_equal on lists
brookslogan Mar 5, 2025
191fc7c
docs(approx_equal): iterate on @return + doc approx_equal0
brookslogan Mar 5, 2025
c02166a
WIP docs(epix_epi_slide_opt_one_epikey): initial
brookslogan Mar 5, 2025
7e36241
refactor: move epi_slide_opt & helpers to its own file
brookslogan Mar 7, 2025
6ffa6a2
Actually turn epi_slide_opt into S3 method
brookslogan Mar 7, 2025
ab82b09
Clean up unnecessary comments and unused helper functions, +@keywords…
brookslogan Mar 6, 2025
08b1783
approx_equal: make "abs_tol=" mandatory, +validation, +docs
brookslogan Mar 6, 2025
2593419
Expand epi_slide_opt_archive_one_epikey example
brookslogan Mar 6, 2025
5bd8d0e
WIP epi_slide_opt.epi_archive tests
brookslogan Mar 6, 2025
1323bf1
More WIP on tests
brookslogan Mar 7, 2025
e2fb79b
Mark renaming TODO on approx_equal
brookslogan Mar 7, 2025
540549d
fix!: as_epi_archive(tibble) key setting; + distrust key if data.table
brookslogan Mar 10, 2025
1be9df7
Make epi_archive key order geo !!!other time version
brookslogan Mar 10, 2025
56d7cb0
docs(new_epi_archive): roxygen2 for new param requirements
brookslogan Mar 10, 2025
0435460
fix(epi_slide_opt.epi_archive): as.data.table(tibble) key setting
brookslogan Mar 10, 2025
dd84924
tests(epi_slide_opt): on example data sets
brookslogan Mar 10, 2025
84c9db0
Fix & test epi_slide_opt.grouped_epi_archive behavior
brookslogan Mar 10, 2025
8149aa2
Rename approx_equal -> vec_approx_equal
brookslogan Mar 10, 2025
5df41cd
Fix missing n_groups import + epiprocess::: CHECK lint
brookslogan Mar 10, 2025
cbe1cb1
Address additional lints, CHECK issues
brookslogan Mar 10, 2025
3bb956c
Address missing ::: in tests
brookslogan Mar 10, 2025
27ff6dd
Fix another missing `:::`
brookslogan Mar 10, 2025
58ce1d0
Fix + add NEWS.md entries
brookslogan Mar 11, 2025
afc5d06
Fix CHECK doc line length lint
brookslogan Mar 11, 2025
df8ad0c
Fix missing library(dplyr) in example
brookslogan Mar 11, 2025
168af56
Fix {epiprocess} -> `{epiprocess}` in roxygen
brookslogan Mar 11, 2025
cc22517
docs: add vec_approx_equal to pkgdown reference index
brookslogan Mar 11, 2025
bd86054
docs(vec_approx_equal): mention vctrs::vec_proxy_equal
brookslogan Mar 18, 2025
3 changes: 3 additions & 0 deletions DESCRIPTION
@@ -97,6 +97,8 @@ Collate:
'correlation.R'
'epi_df.R'
'epi_df_forbidden_methods.R'
'epi_slide_opt_archive.R'
'epi_slide_opt_edf.R'
'epiprocess-package.R'
'group_by_epi_df_methods.R'
'methods-epi_archive.R'
@@ -105,6 +107,7 @@ Collate:
'key_colnames.R'
'methods-epi_df.R'
'outliers.R'
'patch.R'
'reexports.R'
'revision_analysis.R'
'slide.R'
28 changes: 27 additions & 1 deletion NAMESPACE
@@ -27,6 +27,9 @@ S3method(dplyr_col_modify,col_modify_recorder_df)
S3method(dplyr_col_modify,epi_df)
S3method(dplyr_reconstruct,epi_df)
S3method(dplyr_row_slice,epi_df)
S3method(epi_slide_opt,epi_archive)
S3method(epi_slide_opt,epi_df)
S3method(epi_slide_opt,grouped_epi_archive)
S3method(epix_slide,epi_archive)
S3method(epix_slide,grouped_epi_archive)
S3method(epix_truncate_versions_after,epi_archive)
@@ -101,6 +104,7 @@ export(time_column_names)
export(ungroup)
export(unnest)
export(validate_epi_archive)
export(vec_approx_equal)
export(version_column_names)
import(epidatasets)
importFrom(checkmate,anyInfinite)
@@ -117,13 +121,19 @@ importFrom(checkmate,assert_logical)
importFrom(checkmate,assert_number)
importFrom(checkmate,assert_numeric)
importFrom(checkmate,assert_scalar)
importFrom(checkmate,assert_set_equal)
importFrom(checkmate,assert_string)
importFrom(checkmate,assert_subset)
importFrom(checkmate,assert_tibble)
importFrom(checkmate,assert_true)
importFrom(checkmate,checkInt)
importFrom(checkmate,check_atomic)
importFrom(checkmate,check_character)
importFrom(checkmate,check_data_frame)
importFrom(checkmate,check_logical)
importFrom(checkmate,check_names)
importFrom(checkmate,check_null)
importFrom(checkmate,check_numeric)
importFrom(checkmate,expect_class)
importFrom(checkmate,test_int)
importFrom(checkmate,test_set_equal)
@@ -143,6 +153,7 @@ importFrom(data.table,address)
importFrom(data.table,as.data.table)
importFrom(data.table,between)
importFrom(data.table,copy)
importFrom(data.table,fifelse)
importFrom(data.table,frollapply)
importFrom(data.table,frollmean)
importFrom(data.table,frollsum)
@@ -151,6 +162,8 @@ importFrom(data.table,key)
importFrom(data.table,rbindlist)
importFrom(data.table,set)
importFrom(data.table,setDF)
importFrom(data.table,setDT)
importFrom(data.table,setcolorder)
importFrom(data.table,setkeyv)
importFrom(dplyr,"%>%")
importFrom(dplyr,across)
@@ -173,8 +186,8 @@ importFrom(dplyr,if_all)
importFrom(dplyr,if_any)
importFrom(dplyr,if_else)
importFrom(dplyr,is_grouped_df)
importFrom(dplyr,lag)
importFrom(dplyr,mutate)
importFrom(dplyr,n_groups)
importFrom(dplyr,pick)
importFrom(dplyr,pull)
importFrom(dplyr,relocate)
@@ -200,6 +213,7 @@ importFrom(rlang,"%||%")
importFrom(rlang,.data)
importFrom(rlang,.env)
importFrom(rlang,arg_match)
importFrom(rlang,arg_match0)
importFrom(rlang,caller_arg)
importFrom(rlang,caller_env)
importFrom(rlang,check_dots_empty)
@@ -212,6 +226,7 @@ importFrom(rlang,expr_label)
importFrom(rlang,f_env)
importFrom(rlang,f_rhs)
importFrom(rlang,is_bare_integerish)
importFrom(rlang,is_bare_list)
importFrom(rlang,is_bare_numeric)
importFrom(rlang,is_environment)
importFrom(rlang,is_formula)
@@ -235,10 +250,12 @@ importFrom(slider,slide_sum)
importFrom(stats,cor)
importFrom(stats,median)
importFrom(tibble,as_tibble)
importFrom(tibble,is_tibble)
importFrom(tibble,new_tibble)
importFrom(tibble,validate_tibble)
importFrom(tidyr,complete)
importFrom(tidyr,full_seq)
importFrom(tidyr,nest)
importFrom(tidyr,unnest)
importFrom(tidyselect,any_of)
importFrom(tidyselect,eval_select)
@@ -248,15 +265,24 @@ importFrom(tsibble,as_tsibble)
importFrom(utils,capture.output)
importFrom(utils,tail)
importFrom(vctrs,"vec_slice<-")
importFrom(vctrs,obj_is_vector)
importFrom(vctrs,vec_cast)
importFrom(vctrs,vec_cast_common)
importFrom(vctrs,vec_data)
importFrom(vctrs,vec_duplicate_any)
importFrom(vctrs,vec_duplicate_id)
importFrom(vctrs,vec_equal)
importFrom(vctrs,vec_in)
importFrom(vctrs,vec_match)
importFrom(vctrs,vec_order)
importFrom(vctrs,vec_ptype)
importFrom(vctrs,vec_rbind)
importFrom(vctrs,vec_recycle)
importFrom(vctrs,vec_recycle_common)
importFrom(vctrs,vec_rep)
importFrom(vctrs,vec_rep_each)
importFrom(vctrs,vec_seq_along)
importFrom(vctrs,vec_size)
importFrom(vctrs,vec_size_common)
importFrom(vctrs,vec_slice)
importFrom(vctrs,vec_sort)
14 changes: 14 additions & 0 deletions NEWS.md
@@ -2,6 +2,20 @@

Pre-1.0.0 numbering scheme: 0.x will indicate releases, while 0.x.y will indicate PR's.

# epiprocess 0.12

## Breaking changes

- `new_epi_archive()`'s `x` argument has been replaced with a `data_table`
argument, which must be a `data.table` with the key already set appropriately.
The `key()` of its `DT` will also now place `other_keys` before rather than after
`"time_value"`.

## Bug fixes

- `as_epi_archive()` no longer has issues setting its `DT`'s `key` on some
versions of `{data.table}` when `x` is a tibble.

# epiprocess 0.11

## Breaking changes
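As a rough illustration of the breaking change in the NEWS entry above, the sketch below builds a `data.table` whose key follows the new order `c("geo_value", other_keys, "time_value", "version")`. It uses only `{data.table}`; the `age_group` column and the toy values are hypothetical, and the call to `new_epi_archive()` itself is left out because its remaining arguments are not fully shown in this diff.

```r
library(data.table)

# Hypothetical toy updates; `age_group` plays the role of an `other_keys` column.
updates <- data.table(
  geo_value  = c("ca", "ca", "ny"),
  age_group  = c("0-17", "18+", "0-17"),
  time_value = as.Date("2025-01-01") + c(0, 0, 1),
  version    = as.Date("2025-01-05") + c(0, 0, 1),
  cases      = c(10, 20, 5)
)

other_keys <- "age_group"
# New key order per the NEWS entry: `other_keys` now come before "time_value".
key_vars <- c("geo_value", other_keys, "time_value", "version")
setkeyv(updates, key_vars)

key(updates)
```

A table keyed this way should satisfy the `identical(key(data_table), ...)` assertion that the new constructor performs, assuming the other arguments are valid.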
150 changes: 73 additions & 77 deletions R/archive.R
@@ -186,8 +186,8 @@ next_after.Date <- function(x) x + 1L
#' archive. Unexpected behavior may result from modifying the metadata
#' directly.
#'
#' @param x A data.frame, data.table, or tibble, with columns `geo_value`,
#' `time_value`, `version`, and then any additional number of columns.
#' @param data_table a data.table with [`data.table::key()`] equal to
#' `c("geo_value", other_keys, "time_value", "version")`.
#' @param geo_type DEPRECATED Has no effect. Geo value type is inferred from the
#' location column and set to "custom" if not recognized.
#' @param time_type DEPRECATED Has no effect. Time value type inferred from the time
@@ -278,41 +278,22 @@ next_after.Date <- function(x) x + 1L
#' x <- df %>% as_epi_archive(other_keys = "county")
#'
new_epi_archive <- function(
x,
data_table,
geo_type,
time_type,
other_keys,
clobberable_versions_start,
versions_end) {
assert_data_frame(x)
assert_class(data_table, "data.table")
assert_string(geo_type)
assert_string(time_type)
assert_character(other_keys, any.missing = FALSE)
if (any(c("geo_value", "time_value", "version") %in% other_keys)) {
cli_abort("`other_keys` cannot contain \"geo_value\", \"time_value\", or \"version\".")
}
validate_version_bound(clobberable_versions_start, x, na_ok = TRUE)
validate_version_bound(versions_end, x, na_ok = FALSE)

key_vars <- c("geo_value", "time_value", other_keys, "version")
if (!all(key_vars %in% names(x))) {
# Give a more tailored error message than as.data.table would:
cli_abort(c(
"`x` is missing the following expected columns:
{format_varnames(setdiff(key_vars, names(x)))}.",
">" = "You might need to `dplyr::rename()` beforehand
or use `as_epi_archive()`'s renaming feature.",
">" = if (!all(other_keys %in% names(x))) {
"Check also for typos in `other_keys`."
}
))
}

# Create the data table; if x was an un-keyed data.table itself,
# then the call to as.data.table() will fail to set keys, so we
# need to check this, then do it manually if needed
data_table <- as.data.table(x, key = key_vars)
if (!identical(key_vars, key(data_table))) setkeyv(data_table, cols = key_vars)
assert_true(identical(key(data_table), c("geo_value", other_keys, "time_value", "version")))
validate_version_bound(clobberable_versions_start, data_table, na_ok = TRUE)
validate_version_bound(versions_end, data_table, na_ok = FALSE)

structure(
list(
@@ -334,7 +315,7 @@ new_epi_archive <- function(
validate_epi_archive <- function(x) {
assert_class(x, "epi_archive")

ukey_vars1 <- c("geo_value", "time_value", x$other_keys, "version")
ukey_vars1 <- c("geo_value", x$other_keys, "time_value", "version")
ukey_vars2 <- key(x$DT)
if (!identical(ukey_vars1, ukey_vars2)) {
cli_abort(c("`data.table::key(x$DT)` not as expected",
@@ -401,7 +382,7 @@ validate_epi_archive <- function(x) {
#' would be `key(DT)`.
#' @param abs_tol numeric, >=0; absolute tolerance to use on numeric measurement
#' columns when determining whether something can be compactified away; see
#' [`is_locf`]
#' [`vec_approx_equal`]
#'
#' @importFrom data.table is.data.table key
#' @importFrom dplyr arrange filter
@@ -420,10 +401,23 @@ apply_compactify <- function(updates_df, ukey_names, abs_tol = 0) {
}
assert_numeric(abs_tol, len = 1, lower = 0)

if (!is.data.table(updates_df) || !identical(key(updates_df), ukey_names)) {
if (is.data.table(updates_df)) {
if (!identical(key(updates_df), ukey_names)) {
cli_abort(c("`ukey_names` should match `key(updates_df)`",
"i" = "`ukey_names` was {format_chr_deparse(ukey_names)}",
"i" = "`key(updates_df)` was {format_chr_deparse(key(updates_df))}"
))
}
} else {
updates_df <- updates_df %>% arrange(pick(all_of(ukey_names)))
}
updates_df[!update_is_locf(updates_df, ukey_names, abs_tol), ]

# In case updates_df is a data.table, store keep flags in a local: "When the
# first argument inside DT[...] is a single symbol (e.g. DT[var]), data.table
# looks for var in calling scope". In case it's not a data.table, make sure to
# use df[i,] not just df[i].
to_keep <- !update_is_locf(updates_df, ukey_names, abs_tol)
updates_df[to_keep, ]
}
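
The comment in `apply_compactify()` about storing the keep flags in a local variable and always writing `[i, ]` can be illustrated with a small sketch. This is not the package's code; `dt`, `df`, and the hard-coded `to_keep` flags are hypothetical stand-ins for `updates_df` and `!update_is_locf(...)`.

```r
library(data.table)

dt <- data.table(id = 1:4, value = c(1, 1, 2, 2))
df <- as.data.frame(dt)

# Hypothetical keep flags, standing in for `!update_is_locf(...)`.
to_keep <- c(TRUE, FALSE, TRUE, FALSE)

# `to_keep` is a single symbol in `i`, so data.table looks it up in the
# calling scope rather than parsing a more complex expression.
kept_dt <- dt[to_keep, ]

# The trailing comma matters for plain data frames: `df[to_keep, ]` subsets
# rows, whereas `df[to_keep]` would index columns instead.
kept_df <- df[to_keep, ]
```

Writing the subset the same way for both classes is what lets one line of code serve data.table and non-data.table inputs.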

#' get the entries that `compactify` would remove
@@ -460,56 +454,38 @@ update_is_locf <- function(arranged_updates_df, ukey_names, abs_tol) {
ekt_names <- ukey_names[ukey_names != "version"]
val_names <- all_names[!all_names %in% ukey_names]

Reduce(`&`, lapply(updates_col_refs[ekt_names], is_locf, abs_tol, TRUE)) &
Reduce(`&`, lapply(updates_col_refs[val_names], is_locf, abs_tol, FALSE))
}

#' Checks to see if a value in a vector is LOCF
#' @description LOCF meaning last observation carried forward (to later
#' versions). Lags the vector by 1, then compares with itself. If `is_key` is
#' `TRUE`, only values that are exactly the same between the lagged and
#' original are considered LOCF. If `is_key` is `FALSE` and `vec` is a vector
#' of numbers ([`base::is.numeric`]), then approximate equality will be used,
#' checking whether the absolute difference between each pair of entries is
#' `<= abs_tol`; if `vec` is something else, then exact equality is used
#' instead.
#'
#' @details
#'
#' We include epikey-time columns in LOCF comparisons as part of an optimization
#' to avoid slower grouped operations while still ensuring that the first
#' observation for each time series will not be marked as LOCF. We test these
#' key columns for exact equality to prevent chopping off consecutive
#' time_values during flat periods when `abs_tol` is high.
#'
#' We use exact equality for non-`is.numeric` double/integer columns such as
#' dates, datetimes, difftimes, `tsibble::yearmonth`s, etc., as these may be
#' used as part of re-indexing or grouping procedures, and we don't want to
#' change the number of groups for those operations when we remove LOCF data
#' during compactification.
#'
#' @importFrom dplyr lag if_else
#' @importFrom rlang is_bare_numeric
#' @importFrom vctrs vec_equal
#' @keywords internal
is_locf <- function(vec, abs_tol, is_key) { # nolint: object_usage_linter
lag_vec <- lag(vec)
if (is.vector(vec, mode = "numeric") && !is_key) {
# (integer or double vector, no class (& no dims); maybe names, which we'll
# ignore like `vec_equal`); not a key column
unname(if_else(
!is.na(vec) & !is.na(lag_vec),
abs(vec - lag_vec) <= abs_tol,
is.na(vec) & is.na(lag_vec)
))
n_updates <- nrow(arranged_updates_df)
if (n_updates == 0L) {
logical(0L)
} else if (n_updates == 1L) {
FALSE # sole observation is not LOCF
} else {
vec_equal(vec, lag_vec, na_equal = TRUE)
ekts_tbl <- new_tibble(updates_col_refs[ekt_names])
vals_tbl <- new_tibble(updates_col_refs[val_names])
# n_updates >= 2L so we can use `:` naturally (this is the reason for
# separating out n_updates == 1L from this case):
inds1 <- 2L:n_updates
inds2 <- 1L:(n_updates - 1L)
c(
FALSE, # first observation is not LOCF
vec_approx_equal0(ekts_tbl,
inds1 = inds1, ekts_tbl, inds2 = inds2,
# check ekt (key) cols with 0 tolerance:
na_equal = TRUE, abs_tol = 0
) &
vec_approx_equal0(vals_tbl,
inds1 = inds1, vals_tbl, inds2 = inds2,
na_equal = TRUE, abs_tol = abs_tol
)
)
}
}
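
The shifted-index comparison used in the new `update_is_locf()` body can be sketched in base R. This is only an approximation under stated assumptions: `sketch_is_locf()` is a hypothetical helper, it handles a single numeric value vector rather than a table of value columns, it assumes key columns contain no `NA`s, and it does not call the internal `vec_approx_equal0()`.

```r
# Hypothetical helper: flags rows whose key columns exactly match the previous
# row and whose value is within `abs_tol` of the previous row's value.
sketch_is_locf <- function(keys_df, values, abs_tol) {
  n <- nrow(keys_df)
  if (n == 0L) return(logical(0L))
  if (n == 1L) return(FALSE) # sole observation is not LOCF
  inds1 <- 2L:n        # each row from the second onward ...
  inds2 <- 1L:(n - 1L) # ... is compared with the row just before it
  # Key columns are compared exactly (zero tolerance).
  same_keys <- Reduce(`&`, lapply(keys_df, function(col) col[inds1] == col[inds2]))
  v1 <- values[inds1]
  v2 <- values[inds2]
  # Values: two NAs count as equal; otherwise use the absolute tolerance.
  same_vals <- ifelse(is.na(v1) | is.na(v2),
    is.na(v1) & is.na(v2),
    abs(v1 - v2) <= abs_tol
  )
  c(FALSE, same_keys & same_vals) # first observation is never LOCF
}

keys <- data.frame(
  geo_value  = c("ca", "ca", "ca", "ny", "ny"),
  time_value = as.Date("2025-01-01")
)
vals <- c(1, 1.0000001, 2, 2, NA)

flags <- sketch_is_locf(keys, vals, abs_tol = 1e-3)
flags
# FALSE  TRUE FALSE FALSE FALSE
```

Row 4 has the same value as row 3 but a different `geo_value`, so it is not flagged; including the key columns in the comparison is what protects the first observation of each series without a grouped operation.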

#' `as_epi_archive` converts a data frame, data table, or tibble into an
#' `epi_archive` object.
#'
#' @param x A data.frame, data.table, or tibble, with columns `geo_value`,
#' `time_value`, `version`, and then any additional number of columns.
#' @param ... used for specifying column names, as in [`dplyr::rename`]. For
#' example `version = release_date`
#' @param .versions_end location based versions_end, used to avoid prefix
@@ -530,11 +506,32 @@ as_epi_archive <- function(
.versions_end = max_version_with_row_in(x), ...,
versions_end = .versions_end) {
assert_data_frame(x)
# Convert first to data.frame to guard against data.table#6859 and potentially
# other things epiprocess#618:
x_already_copied <- identical(class(x), c("data.table", "data.frame"))
x <- as.data.frame(x)
x <- rename(x, ...)
x <- guess_column_name(x, "time_value", time_column_names())
x <- guess_column_name(x, "geo_value", geo_column_names())
if (!all(other_keys %in% names(x))) {
# Give a more tailored error message than as.data.table would:
cli_abort(c(
"`x` is missing the following expected columns:
{format_varnames(setdiff(other_keys, names(x)))}.",
">" = "You might need to `dplyr::rename()` beforehand
or use `as_epi_archive()`'s renaming feature."
))
}
x <- guess_column_name(x, "time_value", time_column_names())
x <- guess_column_name(x, "version", version_column_names())

# Convert to data.table:
key_vars <- c("geo_value", other_keys, "time_value", "version")
if (x_already_copied) {
setDT(x, key = key_vars)
} else {
x <- as.data.table(x, key = key_vars)
}

if (lifecycle::is_present(geo_type)) {
cli_warn("epi_archive constructor argument `geo_type` is now ignored. Consider removing.")
}
@@ -555,11 +552,10 @@ as_epi_archive <- function(
cli_abort('`compactify` must be `TRUE`, `FALSE`, or `"message"`')
}

data_table <- result$DT
key_vars <- key(data_table)
data_table <- result$DT # probably just `x`, but take no chances

nrow_before_compactify <- nrow(data_table)
# Runs compactify on data frame
# Runs compactify on data_table
if (identical(compactify, TRUE) || identical(compactify, "message")) {
compactified <- apply_compactify(data_table, key_vars, compactify_abs_tol)
} else {
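The conversion path added to `as_epi_archive()` (detour through `as.data.frame()`, then either `setDT()` or `as.data.table()`) can be tried in isolation. The wrapper name `to_keyed_dt()` and the toy inputs are hypothetical; the branch logic mirrors the diff above, and the claim that `as.data.frame()` on a plain `data.table` yields a copy is taken from the diff's `x_already_copied` naming rather than verified against every `{data.table}` version.

```r
library(data.table)

# Hypothetical wrapper around the conversion steps shown in the diff.
to_keyed_dt <- function(x, key_vars) {
  # A plain data.table is treated as already copied by `as.data.frame()`, so
  # it can be converted back by reference instead of being copied again.
  x_already_copied <- identical(class(x), c("data.table", "data.frame"))
  x <- as.data.frame(x)
  if (x_already_copied) {
    setDT(x, key = key_vars)
  } else {
    x <- as.data.table(x, key = key_vars)
  }
  x
}

key_vars <- c("geo_value", "time_value", "version")
df_in <- data.frame(
  geo_value  = c("ny", "ca"),
  time_value = as.Date("2025-01-01"),
  version    = as.Date("2025-01-02"),
  value      = c(1, 2)
)
dt_in <- as.data.table(df_in) # un-keyed data.table input

from_df <- to_keyed_dt(df_in, key_vars)
from_dt <- to_keyed_dt(dt_in, key_vars)
```

Both results are keyed `data.table`s sorted by the key, while the inputs `df_in` and `dt_in` are left as they were.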