
Step through pre-calculated start times for each group using closure rather than using .real col in epi_slide #397


Merged
nmdefries merged 14 commits into dev from ndefries/f-wrapper-speedup-factory on Jan 19, 2024

Conversation

nmdefries
Contributor

@nmdefries nmdefries commented Jan 17, 2024

Instead of re-calculating `.ref_time_values`, use the pre-calculated values stored in `starts`. For each group, use a closure to keep track of the current position in the `starts` vector. This avoids the slow `.real` filtering and removal steps, and simplifies the code by removing various bits of `.real` handling.

This is ~4x faster than the old version.
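
The closure idea can be sketched in base R as follows. This is a minimal illustration, not the code from this PR: `make_start_stepper` and the example `starts` values are hypothetical names and data chosen for the sketch.

```r
# Hypothetical factory (not from epiprocess): returns a function that
# yields the next pre-calculated start time each time it is called.
make_start_stepper <- function(starts) {
  i <- 0L  # current position in `starts`, private to this closure
  function() {
    i <<- i + 1L   # advance the position stored in the enclosing environment
    starts[[i]]    # errors if called more times than there are start values
  }
}

# Made-up start times for illustration.
starts <- as.Date("2024-01-01") + c(0, 7, 14)

# One independent stepper per group, so positions do not interfere.
next_start_a <- make_start_stepper(starts)
next_start_b <- make_start_stepper(starts)

first_a  <- next_start_a()  # first start for group a
second_a <- next_start_a()  # second start for group a
first_b  <- next_start_b()  # group b is still at its first start
```

Because each call to the factory creates a fresh enclosing environment, every group gets its own counter, and no lookup or filtering is needed to find the current reference time.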

@nmdefries nmdefries mentioned this pull request Jan 17, 2024
@nmdefries
Contributor Author

nmdefries commented Jan 17, 2024

The test failures appear to be a problem with the new withr (3.0.0, released yesterday); tests pass locally.

@nmdefries nmdefries marked this pull request as ready for review January 17, 2024 21:08
@nmdefries nmdefries force-pushed the ndefries/f-wrapper-speedup-factory branch from 7ac9c6e to aae9dee Compare January 18, 2024 16:48
@nmdefries
Contributor Author

All the man/*.Rd files are from styling changes.

`slider::hop_index` doesn't require starts & stops to be in `.i`, and we aren't
actually doing that anyway.

Plus comment to help clarify that we're passing the group key to comps via `...`.
Rename `time_values` to `ref_time_values` or `kept_ref_time_values` depending on
the context.  Does not change the interface of `epi_slide`.
We previously checked `ref_time_values` for nonzero length after filtering them
down to those present in `x$time_value`, but now we require
`all(ref_time_values %in% unique(x$time_value))`.
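
The requirement in the last commit message can be sketched as a standalone check. This is an assumption-laden illustration in base R: `check_ref_time_values`, its error message, and the example dates are hypothetical and are not taken from `epi_slide`.

```r
# Hypothetical helper (not the epiprocess implementation): require every
# requested reference time to be present in the data's time values.
check_ref_time_values <- function(ref_time_values, time_value) {
  ok <- ref_time_values %in% unique(time_value)
  if (!all(ok)) {
    stop(
      "`ref_time_values` not found in `time_value`: ",
      paste(format(ref_time_values[!ok]), collapse = ", ")
    )
  }
  invisible(ref_time_values)
}

# Made-up daily series in which the 10th day (2012-01-10) is missing.
time_value <- as.Date("2012-01-01") + c(0:8, 10:12)

# A date that is present passes silently.
check_ref_time_values(as.Date("2012-01-03"), time_value)

# A date that is absent raises an error; capture its message here.
result <- tryCatch(
  check_ref_time_values(as.Date("2012-01-10"), time_value),
  error = function(e) conditionMessage(e)
)
```

Rejecting absent values up front, rather than silently filtering them out, means there is no emptied vector left to check for nonzero length afterwards.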
Contributor

@brookslogan brookslogan left a comment


Looks good, please see minor TODOs above.

Nice 4x!!! Was this on something like 7-day averaging? [I'm getting something like a 2x doing `jhu_csse_daily_subset %>% group_by(geo_value) %>% epi_slide(before = 6, cases_7davish = mean(cases))`, 3x on an ungrouped version. Still super nice.]

@nmdefries
Contributor Author

My (main) test case was indeed a 7-day average, calling the exploration-tooling rolling mean function on a synthetic dataset:

library(tibble) # provides tibble(), used below

n_days <- 4000
removed_date <- 10
simple_dates <- seq(as.Date("2012-01-01"), by = "day", length.out = n_days)
simple_dates <- simple_dates[-removed_date]
rand_vals <- rnorm(n_days - 1)

# Three states, each with 2 variables. a is linear, going up in one state ("al")
# and down in the other two ("ca", "fl").
# b is random noise, shifted or scaled per state
# note that day 10 is missing
epi_data <- epiprocess::as_epi_df(rbind(tibble(
  geo_value = "al",
  time_value = simple_dates,
  a = 1:(n_days - 1),
  b = rand_vals
), tibble(
  geo_value = "ca",
  time_value = simple_dates,
  a = (n_days - 1):1,
  b = rand_vals + 10
), tibble(
  geo_value = "fl",
  time_value = simple_dates,
  a = (n_days - 1):1,
  b = rand_vals * 2
)))

Surprisingly,

jhu_csse_daily_subset %>% group_by(geo_value) %>% epi_slide(before = 6, ~mean(.x$cases), new_col_name = "cases_7davish")

is twice as fast as

jhu_csse_daily_subset %>% group_by(geo_value) %>% epi_slide(before = 6, cases_7davish = mean(cases))

It looks like converting the quosure to a function is the bottleneck, with this line being particularly slow.

@nmdefries nmdefries changed the title from "Make epi_slide calculation of .ref_time_value faster" to "Step through pre-calculated start times for each group using closure rather than using .real col in epi_slide" Jan 19, 2024
@nmdefries nmdefries merged commit 44e4646 into dev Jan 19, 2024
@nmdefries nmdefries deleted the ndefries/f-wrapper-speedup-factory branch January 19, 2024 22:15