Make epi[x]_slide named data-masking expressions output tibble column bundles #293

Closed
brookslogan opened this issue Mar 29, 2023 · 2 comments
Labels: op-semantics (Operational semantics; many potentially breaking changes here), P2 low priority

Comments

@brookslogan

Our named data-masking expressions don't behave the same way as dplyr::mutate & co. when the named expression evaluates to a tibble: epi[x]_slide makes separate name-prefixed columns by default, while mutate creates a single tibble-type column (a column bundle):

library(dplyr, warn.conflicts=FALSE)
library(epiprocess, warn.conflicts=FALSE)
invisible(withr::local_rng_version("3.5.0"))
invisible(withr::local_seed(295595251L))

edf = new_epi_df(tibble(geo_value="geo1",
                        time_value = as.Date("2020-01-01") + 0:19,
                        x1 = runif(20L),
                        x2 = 0.2*runif(20L),
                        y = 2*x1 + 3*x2 + 10 + rnorm(20L)))

edf %>%
  epi_slide(before = 100L,
            terms =
              predict(lm(y ~ x1 + x2,
                         # not doing a real train-test split
                         tibble(x1, x2, y)),
                      tibble(x1, x2, y) %>% tail(n=1L),
                      type="terms") %>%
              as_tibble() %>%
              mutate(constant = attr(., "constant"))
            )
#> Warning in predict.lm(lm(y ~ x1 + x2, tibble(x1, x2, y)), tibble(x1, x2, :
#> prediction from a rank-deficient fit may be misleading

#> Warning in predict.lm(lm(y ~ x1 + x2, tibble(x1, x2, y)), tibble(x1, x2, :
#> prediction from a rank-deficient fit may be misleading
#> An `epi_df` object, 20 x 7 with metadata:
#> * geo_type  = custom
#> * time_type = day
#> * as_of     = 2023-03-28 17:38:41
#> 
#> # A tibble: 20 × 7
#>    geo_value time_value      x1     x2     y terms_x1 terms_x2
#>  * <chr>     <date>       <dbl>  <dbl> <dbl>    <dbl>    <dbl>
#>  1 geo1      2020-01-01 0.290   0.0803 10.1    0       0      
#>  2 geo1      2020-01-02 0.276   0.0372 10.5    0.183   0      
#>  3 geo1      2020-01-03 0.00323 0.127  11.1    0.889  -0.315  
#>  4 geo1      2020-01-04 0.236   0.0382 11.9   -0.279   0.752  
#>  5 geo1      2020-01-05 0.786   0.0676 12.5    0.939  -0.00634
#>  6 geo1      2020-01-06 0.518   0.180  11.0    0.270  -0.261  
#>  7 geo1      2020-01-07 0.384   0.0372 11.4    0.0457  0.132  
#>  8 geo1      2020-01-08 0.144   0.113  10.6   -0.315  -0.0889 
#>  9 geo1      2020-01-09 0.640   0.0746 12.5    0.589   0.0317 
#> 10 geo1      2020-01-10 0.0360  0.191  12.6   -0.406   0.347  
#> 11 geo1      2020-01-11 0.0532  0.118   9.80  -0.474   0.0719 
#> 12 geo1      2020-01-12 0.132   0.108   9.66  -0.350   0.0347 
#> 13 geo1      2020-01-13 0.950   0.0669 13.2    1.58   -0.104  
#> 14 geo1      2020-01-14 0.862   0.0714 13.3    1.39   -0.0845 
#> 15 geo1      2020-01-15 0.00865 0.190   9.59  -1.03    0.134  
#> 16 geo1      2020-01-16 0.527   0.140  11.3    0.460   0.0229 
#> 17 geo1      2020-01-17 0.784   0.0246 10.9    1.03   -0.174  
#> 18 geo1      2020-01-18 0.621   0.194   9.45   0.413  -0.318  
#> 19 geo1      2020-01-19 0.356   0.189   8.80  -0.0775 -0.517  
#> 20 geo1      2020-01-20 0.661   0.0807  9.87   0.376   0.156

edf %>%
  mutate(terms =
           predict(lm(y ~ x1 + x2,
                      tibble(x1, x2, y)),
                   # everything in-sample
                   tibble(x1, x2, y),
                   type="terms") %>%
           as_tibble() %>%
           mutate(constant = attr(., "constant"))
         )
#> An `epi_df` object, 20 x 6 with metadata:
#> * geo_type  = custom
#> * time_type = day
#> * as_of     = 2023-03-28 17:38:41
#> 
#> # A tibble: 20 × 6
#>    geo_value time_value      x1     x2     y terms$x1      $x2
#>  * <chr>     <date>       <dbl>  <dbl> <dbl>    <dbl>    <dbl>
#>  1 geo1      2020-01-01 0.290   0.0803 10.1   -0.187   0.158  
#>  2 geo1      2020-01-02 0.276   0.0372 10.5   -0.208   0.420  
#>  3 geo1      2020-01-03 0.00323 0.127  11.1   -0.622  -0.123  
#>  4 geo1      2020-01-04 0.236   0.0382 11.9   -0.269   0.414  
#>  5 geo1      2020-01-05 0.786   0.0676 12.5    0.565   0.236  
#>  6 geo1      2020-01-06 0.518   0.180  11.0    0.159  -0.448  
#>  7 geo1      2020-01-07 0.384   0.0372 11.4   -0.0440  0.421  
#>  8 geo1      2020-01-08 0.144   0.113  10.6   -0.408  -0.0379 
#>  9 geo1      2020-01-09 0.640   0.0746 12.5    0.343   0.193  
#> 10 geo1      2020-01-10 0.0360  0.191  12.6   -0.572  -0.514  
#> 11 geo1      2020-01-11 0.0532  0.118   9.80  -0.546  -0.0691 
#> 12 geo1      2020-01-12 0.132   0.108   9.66  -0.427  -0.00755
#> 13 geo1      2020-01-13 0.950   0.0669 13.2    0.813   0.240  
#> 14 geo1      2020-01-14 0.862   0.0714 13.3    0.680   0.213  
#> 15 geo1      2020-01-15 0.00865 0.190   9.59  -0.613  -0.507  
#> 16 geo1      2020-01-16 0.527   0.140  11.3    0.172  -0.205  
#> 17 geo1      2020-01-17 0.784   0.0246 10.9    0.561   0.497  
#> 18 geo1      2020-01-18 0.621   0.194   9.45   0.314  -0.535  
#> 19 geo1      2020-01-19 0.356   0.189   8.80  -0.0868 -0.502  
#> 20 geo1      2020-01-20 0.661   0.0807  9.87   0.376   0.156

Created on 2023-03-28 with reprex v2.0.2

We should probably try to match dplyr here.
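
For reference, here is a minimal sketch of the two layouts using only dplyr and tidyr, without epiprocess. The toy data frame `df` and the column names `a` and `b` are made up for illustration; `tidyr::unpack()` with `names_sep = "_"` is used here only to reproduce the name-prefixed layout seen in the epi_slide output above, not because epi[x]_slide is implemented that way.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Hypothetical toy data, just for illustration
df <- tibble(x = 1:3)

# Named expression that evaluates to a tibble: mutate() keeps it as a single
# tibble-type column (a column bundle) called `terms`.
bundled <- df %>%
  mutate(terms = tibble(a = x, b = 2 * x))
names(bundled)
#> [1] "x"     "terms"

# Unpacking with a name separator gives separate name-prefixed columns,
# analogous to terms_x1 and terms_x2 in the epi_slide output above.
unpacked <- bundled %>%
  unpack(terms, names_sep = "_")
names(unpacked)
#> [1] "x"       "terms_a" "terms_b"
```

So matching dplyr would mean producing the `bundled` shape by default; anyone who wants the prefixed columns could still get them by unpacking afterward.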

See also #255 regarding unnamed data-masking expressions that yield tibbles, which we don't allow and which dplyr turns into separate columns.
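
A small sketch of the dplyr side of that unnamed case, again with a made-up toy data frame `df` and illustrative column names `a` and `b`:

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data, just for illustration
df <- tibble(x = 1:3)

# Unnamed expression that evaluates to a tibble: mutate() adds its columns
# to the result as separate top-level columns, with no prefix.
spliced <- df %>%
  mutate(tibble(a = x, b = 2 * x))
names(spliced)
#> [1] "x" "a" "b"
```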

We probably won't hit this situation very often: since we don't have cur_data() etc. implemented, we're going to reach for the function/formula form first in these cases. So I'm marking this as low priority.

brookslogan added the P2 low priority and op-semantics (Operational semantics; many potentially breaking changes here) labels on Mar 29, 2023
brookslogan changed the title from "Make epi[x]_slide named data-masking expressions output tibble column groups" to "Make epi[x]_slide named data-masking expressions output tibble column bundles" on Mar 29, 2023
@brookslogan

See also #261.

@brookslogan

Completed in 9.0.
