
Allow epi_slide to access ref_time_value #318


Merged: 20 commits, Jun 16, 2023

Conversation


nmdefries commented May 19, 2023

This implements Ryan's suggestion for how to calculate ref_time_value, and makes sure epi_slide computations of all types can access ref_time_value and the group key using the same approaches as in #313 and #317.

I plan to remove the n_mandatory_f_args arg to assert_sufficient_f_args in another PR.

Closes #171.

@nmdefries force-pushed the ndefries/epi-slide-rtv branch from 5d14ddd to 8bfb4d8 on May 23, 2023 21:01

nmdefries commented May 23, 2023

There was some concern over the efficiency of finding time_values and joining them onto the data, then sorting the data.

In the old implementation, we were already doing a direct sort; I've just moved that step a little later in the new version, so its computational complexity hasn't changed.

In the new code here, we do need to identify dates not already in the epi_df. I used a standard %in% for this, since we are already doing the same kind of search elsewhere, so again the complexity is unchanged.

The main addition, i.e. the one potentially slow step that we aren't already doing elsewhere in the epi_slide function, is the handful of bind_rows calls, which copy some data around. I haven't compared this approach to doing a merge/join.
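As a rough sketch of the padding idea described above (all names here are illustrative assumptions, not the actual epiprocess code), the missing dates can be identified with a `%in%` test and attached with bind_rows:

```r
library(dplyr)
library(tibble)

# Illustrative sketch only, not the actual implementation: pad the data
# with placeholder rows for requested time values that are absent, so
# every ref_time_value has a row, and flag padding with `.real = FALSE`
# so it can be filtered back out after the slide computation.
pad_time_values <- function(x, all_time_values) {
  missing_tv <- all_time_values[!(all_time_values %in% x$time_value)]
  padding <- tibble(
    geo_value = x$geo_value[[1L]],
    time_value = missing_tv,
    binary = NA_real_,
    .real = FALSE
  )
  bind_rows(mutate(x, .real = TRUE), padding) %>%
    arrange(time_value)
}
```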


nmdefries commented May 23, 2023

Doing some benchmarking, the slow steps of the new approach are dropping the .real column and filtering by the .real column. Improved (~10x faster) in 5564c0d.
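For intuition (a hypothetical comparison, not the actual change made in 5564c0d), the dplyr verbs for filtering on and dropping a bookkeeping column carry per-call overhead that base subsetting avoids:

```r
# Hypothetical benchmark sketch; `x` is assumed to carry a logical
# `.real` column marking the non-padding rows.
bench::mark(
  dplyr = x %>% dplyr::filter(.real) %>% dplyr::select(-.real),
  base  = { out <- x[x$.real, ]; out$.real <- NULL; out },
  check = FALSE
)
```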

@nmdefries marked this pull request as ready for review May 24, 2023 16:07

nmdefries commented May 24, 2023

When testing that computations can access ref_time_value, I noticed that slide returns date columns in numeric format (e.g. 18672). The Date class is dropped during unlist. Not sure it's worth changing; we don't really expect users to return dates, and they can convert back to Date format themselves.
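For example, this is base R behavior, easy to verify in a session:

```r
# `unlist()` drops the Date class, leaving days since 1970-01-01:
unlist(list(as.Date("2021-02-14")))
#> [1] 18672

# Users can convert back to Date themselves:
as.Date(18672, origin = "1970-01-01")
#> [1] "2021-02-14"
```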

Opened issue #321


nmdefries commented May 30, 2023

The input data

set.seed(100)
n <- 5000L
x <- tibble::tribble(
  ~geo_value, ~time_value, ~binary,
  "x", ceiling(runif(n, 0, 200)), runif(n, 0, 100)^2
) %>%
  tidyr::unnest(c(time_value, binary))

# Randomize order
x <- x[sample(nrow(x)), ] %>% as_epi_df()

Looking at a computation that doesn't require access to ref_time_value so that we can compare between the new and old versions (the new logic calculates ref_time_value even if it isn't used):

bench::mark(
  ref_value_calculated = {epi_slide(x, f = function(x, g, t) sum(x$binary),
                                    before = 2,
                                    new_col_name = "sum_binary")},
  old = {old_epi_slide(x, f = function(x, g, t) sum(x$binary),
                       before = 2,
                       new_col_name = "sum_binary")},
  iterations = 10)
# A tibble: 2 × 13
#   expression                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
# 1 ref_value_calculated   40.8ms   41.9ms      22.8    3.46MB     9.78     7     3
# 2 old                    20.5ms   21.1ms      46.6    2.46MB     5.17     9     1

For this computation, the new code is ~2x slower. With larger data sizes, the difference in runtime gets smaller (new version ~30% slower with 50k rows).


@brookslogan left a comment


Looks pretty good, and I like the more bite-sized test data with more extensive tests! The major things that need action, I think, are:

@nmdefries

I won't be able to finish the last few issues here before my PTO. Will pick those up when I get back.

@nmdefries

Need to update the documentation for the f arg to epi_slide and add epi_slide to the NEWS blurb about ref_time_value being available.

Also add some unrelated comments regarding time-value counting details.

Since we're not potentially dealing with R6 objects, we can just use
`as_data_mask(.x)`. We can also flatten the data mask by installing `.x`,
`.group_key`, and `.ref_time_value` in the same way that the pronouns are
installed (except keeping their original classes rather than converting them to
pronouns), instead of adding another level to the data mask's environment chain.
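A minimal sketch of that flattened-mask idea using rlang (illustrative only; the function name and exact structure here are assumptions, not the package's code):

```r
library(rlang)

# Build a data mask straight from the tibble (no R6 objects involved),
# then install `.x`, `.group_key`, and `.ref_time_value` directly in the
# mask, keeping their original classes instead of wrapping as pronouns.
make_slide_mask <- function(.x, .group_key, .ref_time_value) {
  mask <- as_data_mask(.x)
  mask$.x <- .x
  mask$.group_key <- .group_key
  mask$.ref_time_value <- .ref_time_value
  mask
}

# A captured user expression `quo` would then be evaluated as:
#   eval_tidy(quo, data = make_slide_mask(.x, .group_key, .ref_time_value))
```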
@brookslogan self-requested a review June 16, 2023 19:53

brookslogan commented Jun 16, 2023

An interesting performance experiment: I thought we might speed things up by changing x[x$.real,] to x[x[[".real"]],], since the latter is supposed to be faster based on some R6 benchmarks that include comparisons like these, but it's actually slower and doubles the number of garbage collections!

bench::mark(
  grped %>% epi_slide(cases_7dav = mean(cases), before=6),
  grped %>% curr_epi_slide(cases_7dav = mean(cases), before=6)
)
#> # A tibble: 2 × 13
#>   expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr>    <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 grped %>% ep… 3.13s  3.13s     0.319      26MB     6.38     1    20      3.13s
#> 2 grped %>% cu… 2.96s  2.96s     0.338      26MB     3.04     1     9      2.96s
#> # ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
#> Warning message:
#> Some expressions had a GC in every iteration; so filtering is disabled. 

@brookslogan left a comment


Thanks for clearing up some of my confusion. I've applied some minor updates and am going to merge. Docs and NEWS might still need a little work; I'll try to tackle that after the merge.
