Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add specialized epix_slide for epi_slide_opt #611

Open
wants to merge 47 commits into
base: dev
Choose a base branch
from

Conversation

brookslogan
Copy link
Contributor

@brookslogan brookslogan commented Feb 26, 2025

Checklist

Please:

  • Make sure this PR is against "dev", not "main" (unless this is a release
    PR).
  • Request a review from one of the current main reviewers:
    brookslogan, nmdefries.
  • Makes sure to bump the version number in DESCRIPTION. Always increment
    the patch version number (the third number), unless you are making a
    release PR from dev to main, in which case increment the minor version
    number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    1.7.2, then write your changes under the 1.8 heading).
  • See DEVELOPMENT.md for more information on the development
    process.

Change explanations for reviewer

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

@brookslogan
Copy link
Contributor Author

brookslogan commented Feb 26, 2025

@brookslogan brookslogan marked this pull request as ready for review March 10, 2025 20:00
Copy link
Contributor

@nmdefries nmdefries left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Various nits and some questions. The logic looks fine to me, but given the complexity of this change, someone else should review as well.

Comment on lines +45 to +46
"problem with {rlang::expr_label(rlang::caller_arg(f))}",
"i" = "`f` must be one of `data.table`'s rolling functions (`frollmean`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
"problem with {rlang::expr_label(rlang::caller_arg(f))}",
"i" = "`f` must be one of `data.table`'s rolling functions (`frollmean`,
"problem with {rlang::expr_label(rlang::caller_arg(.f))}",
"i" = "`.f` must be one of `data.table`'s rolling functions (`frollmean`,

#' @keywords internal
across_ish_names_info <- function(.x, time_type, col_names_quo, .f_namer,
.window_size, .align, .prefix, .suffix, .new_col_names) {
# The position of a given column can be differ between input `.x` and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
# The position of a given column can be differ between input `.x` and
# The position of a given column can differ between input `.x` and

#'
#' * On `epi_df`s, it will take care of looping over `geo_value`s, temporarily
#' filling in time gaps with `NA`s and other work needed to ensure there are
#' exactly n consecutive time steps per computation, and has some other
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
#' exactly n consecutive time steps per computation, and has some other
#' exactly `n` consecutive time steps per computation, and has some other

#' @description
#'
#' `epi_slide_opt` calculates n-time-step rolling means&sums,
#' cumulative/"running" means&sums, or other operations supported by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
#' cumulative/"running" means&sums, or other operations supported by
#' cumulative/"running" means&sums, and other operations supported by

# Most input validation + handle NULL earlier_snapshot. This is a small function so
# use faster validation variants:
if (!is_tibble(later_tbl)) {
cli_abort("`later_tbl` must be a tibble")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: please add class names to errors.

group_updates <- group_values %>%
nest(.by = version, .key = "subtbl") %>%
arrange(version)
# TODO move nesting inside the helper?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: some TODOs and "XXX"s left. Please either implement, or convert to issues and link to from code.

Comment on lines +68 to +70
# in slide outputs in the range t1-after..t2+before, and to compute those
# slide values, we need to look at the input snapshot from
# t1-after-before..t2+before+after. nolint: commented_code_linter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: I understand how we arrived at t1-after..t2+before, but where did the t1-after-before.... come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppose before = 6, after = 0 and there's an update to the 1d value at time t alone. We need to compute the trailing 7dav for t..t+6. But in order to compute the trailing 7dav for time t, we need 1d data from t-6..t. So we end up needing the union of the ranges t-6..t, t-5..t+1, t-4..t+2, ..., t..t+6, which gives us t-6..t+6.

Comment on lines +84 to +105
if (f_from_package == "data.table") {
for (col_i in seq_along(in_colnames)) {
if (before == Inf) {
slide[[out_colnames[[col_i]]]] <-
f_dots_baked(slide[[in_colnames[[col_i]]]], seq_len(slide_nrow), adaptive = TRUE)
} else {
out_col <- f_dots_baked(slide[[in_colnames[[col_i]]]], before + after + 1L)
if (after != 0L) {
# data.table always puts NAs at tails, even with na.rm = TRUE; chop
# off extra NAs from beginning and place missing NAs at end:
out_col <- c(out_col[seq(after + 1L, slide_nrow)], rep(NA, after))
}
slide[[out_colnames[[col_i]]]] <- out_col
}
}
} else if (f_from_package == "slider") {
for (col_i in seq_along(in_colnames)) {
slide[[out_colnames[[col_i]]]] <- f_dots_baked(slide[[in_colnames[[col_i]]]], before = before, after = after)
}
} else {
cli_abort("epiprocess internal error: `f_from_package` was {format_chr_deparse(f_from_package)}")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: data.table functions can process more than one column at a time and I assume it's faster. So we don't need the column loop into the data.table branch here.

Suggested change
if (f_from_package == "data.table") {
for (col_i in seq_along(in_colnames)) {
if (before == Inf) {
slide[[out_colnames[[col_i]]]] <-
f_dots_baked(slide[[in_colnames[[col_i]]]], seq_len(slide_nrow), adaptive = TRUE)
} else {
out_col <- f_dots_baked(slide[[in_colnames[[col_i]]]], before + after + 1L)
if (after != 0L) {
# data.table always puts NAs at tails, even with na.rm = TRUE; chop
# off extra NAs from beginning and place missing NAs at end:
out_col <- c(out_col[seq(after + 1L, slide_nrow)], rep(NA, after))
}
slide[[out_colnames[[col_i]]]] <- out_col
}
}
} else if (f_from_package == "slider") {
for (col_i in seq_along(in_colnames)) {
slide[[out_colnames[[col_i]]]] <- f_dots_baked(slide[[in_colnames[[col_i]]]], before = before, after = after)
}
} else {
cli_abort("epiprocess internal error: `f_from_package` was {format_chr_deparse(f_from_package)}")
}
if (f_from_package == "data.table") {
if (before == Inf) {
slide[, out_colnames] <-
f_dots_baked(slide[, in_colnames], seq_len(slide_nrow), adaptive = TRUE)
} else {
out_cols <- f_dots_baked(slide[, in_colnames], before + after + 1L)
if (after != 0L) {
# data.table always puts NAs at tails, even with na.rm = TRUE; chop
# off extra NAs from beginning and place missing NAs at end:
out_cols <- purrr::map(out_cols, function(.x) {
c(.x[seq(after + 1L, slide_nrow)], rep(NA, after))
})
}
slide[, out_colnames] <- out_cols
}
} else if (f_from_package == "slider") {
for (col_i in seq_along(in_colnames)) {
slide[[out_colnames[[col_i]]]] <- f_dots_baked(slide[[in_colnames[[col_i]]]], before = before, after = after)
}
} else {
cli_abort("epiprocess internal error: `f_from_package` was {format_chr_deparse(f_from_package)}")
}

if (after != 0L) {
# data.table always puts NAs at tails, even with na.rm = TRUE; chop
# off extra NAs from beginning and place missing NAs at end:
out_col <- c(out_col[seq(after + 1L, slide_nrow)], rep(NA, after))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: This doesn't respect data.table's fill param. If a computation uses fill, these values should not be NA. I think what we really want to do is move the first after values to the end of the result.

epi_slide_opt.epi_df has the same behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! I think I was worried about some garbage-rather-than-NA being in the tails which is why I didn't just copy that approach, but on second look that worry seems unfounded.

prev_inp_snapshot <<- inp_snapshot
prev_out_snapshot <<- out_snapshot
out_diff$version <- version
out_diff
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: this always returns a compactified archive. Will users ever want a non-compactified result? If we add a compactify arg or inherit compactification status from the original archive, the user could toggle whether we return out_snapshot here instead of out_diff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps; some thoughts:

  • Recording all full out_snapshots seems like it will potentially give a very hefty result, and they could get that same result by something like epix_slide(~ .x) on the result or a helper function doing something like that.
  • Recording just out_snapshot restricted to times that were in inp_update plus any times that are required for validity (e.g., an change in just the time t 1d value producing an update to the trailing 7dav for t+1..t+6 as well as t) seems maybe more useful. I'm not sure how hard/slow this would be; I had some issues trying to selectively apply compactify_tol to the 7dav/etc. cols and set aside that idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement smart sparse archive -> archive slide
3 participants