Closed
Description
Our named data-masking expressions don't behave the same way as dplyr::mutate & co. when the named expression evaluates to a tibble: epi[x]_slide makes separate name-prefixed columns by default, while mutate creates a single tibble-type column (a "column bundle"):
library(dplyr, warn.conflicts=FALSE)
library(epiprocess, warn.conflicts=FALSE)
invisible(withr::local_rng_version("3.5.0"))
invisible(withr::local_seed(295595251L))
edf = new_epi_df(tibble(geo_value="geo1",
                        time_value = as.Date("2020-01-01") + 0:19,
                        x1 = runif(20L),
                        x2 = 0.2*runif(20L),
                        y = 2*x1 + 3*x2 + 10 + rnorm(20L)))
edf %>%
  epi_slide(before = 100L,
            terms =
              predict(lm(y ~ x1 + x2,
                         # not doing a real train-test split
                         tibble(x1, x2, y)),
                      tibble(x1, x2, y) %>% tail(n=1L),
                      type="terms") %>%
              as_tibble() %>%
              mutate(constant = attr(., "constant"))
            )
#> Warning in predict.lm(lm(y ~ x1 + x2, tibble(x1, x2, y)), tibble(x1, x2, :
#> prediction from a rank-deficient fit may be misleading
#> Warning in predict.lm(lm(y ~ x1 + x2, tibble(x1, x2, y)), tibble(x1, x2, :
#> prediction from a rank-deficient fit may be misleading
#> An `epi_df` object, 20 x 7 with metadata:
#> * geo_type = custom
#> * time_type = day
#> * as_of = 2023-03-28 17:38:41
#>
#> # A tibble: 20 × 7
#> geo_value time_value x1 x2 y terms_x1 terms_x2
#> * <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 geo1 2020-01-01 0.290 0.0803 10.1 0 0
#> 2 geo1 2020-01-02 0.276 0.0372 10.5 0.183 0
#> 3 geo1 2020-01-03 0.00323 0.127 11.1 0.889 -0.315
#> 4 geo1 2020-01-04 0.236 0.0382 11.9 -0.279 0.752
#> 5 geo1 2020-01-05 0.786 0.0676 12.5 0.939 -0.00634
#> 6 geo1 2020-01-06 0.518 0.180 11.0 0.270 -0.261
#> 7 geo1 2020-01-07 0.384 0.0372 11.4 0.0457 0.132
#> 8 geo1 2020-01-08 0.144 0.113 10.6 -0.315 -0.0889
#> 9 geo1 2020-01-09 0.640 0.0746 12.5 0.589 0.0317
#> 10 geo1 2020-01-10 0.0360 0.191 12.6 -0.406 0.347
#> 11 geo1 2020-01-11 0.0532 0.118 9.80 -0.474 0.0719
#> 12 geo1 2020-01-12 0.132 0.108 9.66 -0.350 0.0347
#> 13 geo1 2020-01-13 0.950 0.0669 13.2 1.58 -0.104
#> 14 geo1 2020-01-14 0.862 0.0714 13.3 1.39 -0.0845
#> 15 geo1 2020-01-15 0.00865 0.190 9.59 -1.03 0.134
#> 16 geo1 2020-01-16 0.527 0.140 11.3 0.460 0.0229
#> 17 geo1 2020-01-17 0.784 0.0246 10.9 1.03 -0.174
#> 18 geo1 2020-01-18 0.621 0.194 9.45 0.413 -0.318
#> 19 geo1 2020-01-19 0.356 0.189 8.80 -0.0775 -0.517
#> 20 geo1 2020-01-20 0.661 0.0807 9.87 0.376 0.156
edf %>%
  mutate(terms =
           predict(lm(y ~ x1 + x2,
                      tibble(x1, x2, y)),
                   # everything in-sample
                   tibble(x1, x2, y),
                   type="terms") %>%
           as_tibble() %>%
           mutate(constant = attr(., "constant"))
         )
#> An `epi_df` object, 20 x 6 with metadata:
#> * geo_type = custom
#> * time_type = day
#> * as_of = 2023-03-28 17:38:41
#>
#> # A tibble: 20 × 6
#> geo_value time_value x1 x2 y terms$x1 $x2
#> * <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 geo1 2020-01-01 0.290 0.0803 10.1 -0.187 0.158
#> 2 geo1 2020-01-02 0.276 0.0372 10.5 -0.208 0.420
#> 3 geo1 2020-01-03 0.00323 0.127 11.1 -0.622 -0.123
#> 4 geo1 2020-01-04 0.236 0.0382 11.9 -0.269 0.414
#> 5 geo1 2020-01-05 0.786 0.0676 12.5 0.565 0.236
#> 6 geo1 2020-01-06 0.518 0.180 11.0 0.159 -0.448
#> 7 geo1 2020-01-07 0.384 0.0372 11.4 -0.0440 0.421
#> 8 geo1 2020-01-08 0.144 0.113 10.6 -0.408 -0.0379
#> 9 geo1 2020-01-09 0.640 0.0746 12.5 0.343 0.193
#> 10 geo1 2020-01-10 0.0360 0.191 12.6 -0.572 -0.514
#> 11 geo1 2020-01-11 0.0532 0.118 9.80 -0.546 -0.0691
#> 12 geo1 2020-01-12 0.132 0.108 9.66 -0.427 -0.00755
#> 13 geo1 2020-01-13 0.950 0.0669 13.2 0.813 0.240
#> 14 geo1 2020-01-14 0.862 0.0714 13.3 0.680 0.213
#> 15 geo1 2020-01-15 0.00865 0.190 9.59 -0.613 -0.507
#> 16 geo1 2020-01-16 0.527 0.140 11.3 0.172 -0.205
#> 17 geo1 2020-01-17 0.784 0.0246 10.9 0.561 0.497
#> 18 geo1 2020-01-18 0.621 0.194 9.45 0.314 -0.535
#> 19 geo1 2020-01-19 0.356 0.189 8.80 -0.0868 -0.502
#> 20 geo1 2020-01-20 0.661 0.0807 9.87 0.376 0.156
Created on 2023-03-28 with reprex v2.0.2
We should probably try to match dplyr here.
See also #255 regarding unnamed data masking expressions yielding tibbles, which we don't allow and which dplyr turns into separate columns.
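For reference, here is a minimal dplyr-only sketch of the two behaviors being compared. It does not use epiprocess at all; the toy data frame and the column names (id, x, a, b) are made up for illustration, and tidyr::unpack with names_sep = "_" is only an analogy for the name-prefixed result that epi_slide produces, not a claim about how epi_slide is implemented.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Toy data; the names here are hypothetical, not from epiprocess.
df <- tibble(id = 1:3, x = c(1, 2, 3))

# Named tibble-valued expression: dplyr keeps it as ONE packed tibble column.
packed <- df %>% mutate(terms = tibble(a = x + 1, b = x * 2))
names(packed)               # "id" "x" "terms"
is.data.frame(packed$terms) # TRUE

# Unnamed tibble-valued expression: dplyr splices its columns in directly.
spliced <- df %>% mutate(tibble(a = x + 1, b = x * 2))
names(spliced)              # "id" "x" "a" "b"

# Unpacking with a separator gives name-prefixed columns, which resembles
# what epi_slide currently returns for a named tibble-valued expression.
prefixed <- packed %>% unpack(terms, names_sep = "_")
names(prefixed)             # "id" "x" "terms_a" "terms_b"
```

Matching dplyr would mean the named case yields something like `packed` rather than `prefixed`; anyone wanting the prefixed form could still get it with an explicit unpack step.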
We may not be in this situation very often: since we don't have cur_data() etc. implemented, we're going to reach for the function/formula form first in these situations. So marking this as low priority.