Consider improving nowcaster/forecaster epi_slide sample in advanced.Rmd #288


Closed
brookslogan opened this issue Mar 27, 2023 · 0 comments · Fixed by #538
Labels
bug (Something isn't working), documentation (Improvements or additions to documentation), P2 (low priority)

Comments

@brookslogan

The problem is described here. Allowing negative after values would not completely resolve that issue, because the test-time prediction would still need to be made; what we are really missing is the train/test split. For example, the version below performs the split (and expands the window to get at most two time steps' worth of training data). However, it fails when a window contains no training data, which was previously not a problem because there was always the test data to train on:

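edf %>%
  epi_slide(function(d, ...) {
    # Within each geo_value, label the latest time_value "test" and earlier ones
    # "train", then split into a named list and drop the label column.
    d_split <- d %>%
      group_by(geo_value) %>%
      mutate(subset = if_else(time_value == max(time_value), "test", "train")) %>%
      ungroup() %>%
      split(.$subset) %>%
      lapply(select, -"subset")
    # d_split$train is NULL when the window holds no earlier time steps.
    obj <- lm(y ~ x, data = d_split$train)
    return(
      as.data.frame(
        predict(obj, newdata = d_split$test,
                interval = "prediction", level = 0.9)
      ))
  }, before = 2, new_col_name = "fc", names_sep = NULL)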

This gives a mysterious error message:

Error in eval(predvars, data, env) : object 'y' not found

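This is due to the carefree coding: when a window has no training rows, split() produces no "train" element, so d_split$train is NULL and lm() looks for y outside the data and fails to find it. We can likely get more appropriate error messages via: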

  • using a factor instead of a string for the subset indicator
  • using group_split, nest_by, nest(..., .by = ...), etc., plus a bunch of awkward indexing (filter, pull, unwrap)
  • filtering all rows once to get the training set, then again to get the test set, e.g.:
edf %>%
  epi_slide(function(d, ...) {
    # Label rows as before, but keep them in one data frame instead of splitting.
    d_split <- d %>%
      group_by(geo_value) %>%
      mutate(subset = if_else(time_value == max(time_value), "test", "train")) %>%
      ungroup()
    # Filtering yields a zero-row data frame (not NULL) when there is no training data.
    obj <- lm(y ~ x, data = d_split %>% filter(subset == "train"))
    return(
      as.data.frame(
        predict(obj, newdata = d_split %>% filter(subset == "test") %>% select(-subset),
                interval = "prediction", level = 0.9)
      ))
  }, before = 2, new_col_name = "fc", names_sep = NULL)

This improves the error message:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  0 (non-NA) cases

However, whether or not the error message is intelligible, we still need manual code to skip or complete the windows that have no training data (see #256).
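To illustrate the first option in the list above (a factor instead of a string), here is a minimal base-R sketch outside of epi_slide. The toy data frame d is made up for illustration, and the labelling uses the overall latest time_value rather than a per-geo one to keep the example short. It only shows how split() treats the two kinds of label; it is not meant as the fix for the vignette.

```r
# Toy window containing a single time step, so every row is labelled "test".
# (Illustrative data only; not taken from the vignette.)
d <- data.frame(
  geo_value = c("ca", "ny"),
  time_value = as.Date(c("2023-01-01", "2023-01-01")),
  x = c(1, 2),
  y = c(2, 4)
)
is_test <- d$time_value == max(d$time_value)

# Character labels: split() only creates groups for values that actually occur.
by_chr <- split(d, ifelse(is_test, "test", "train"))

# Factor labels with both levels declared: the empty "train" level is kept.
by_fct <- split(d, factor(ifelse(is_test, "test", "train"),
                          levels = c("train", "test")))

is.null(by_chr$train)  # TRUE: there is no "train" element at all
nrow(by_fct$train)     # 0: a zero-row data frame with the x and y columns
```

With the factor version, lm() receives a zero-row data frame that still has the y and x columns, which is the same situation as the filter-based variant above.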

For now, I plan to just explain the additional problem noted in the linked issue, and hold off on any improvements. A solution to #256 may give us more options. Another approach would be to switch this example to using epipredict.
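As a rough idea of what the manual handling mentioned above could look like, here is a sketch of a guard that returns NA intervals when there is nothing to train on. The helper name predict_or_na is hypothetical, it assumes the training and test sets have already been separated, and whether returning NA rows like this fits what epi_slide expects from a slide computation has not been checked here.

```r
# Hypothetical helper (not part of epiprocess): fit on `train` and predict for
# `test`, or return NA intervals when `train` has no usable rows.
predict_or_na <- function(train, test, level = 0.9) {
  # Count training rows with both x and y present; NULL counts as zero rows.
  n_ok <- if (is.null(train)) {
    0L
  } else {
    sum(stats::complete.cases(train[, c("x", "y")]))
  }
  if (n_ok == 0L) {
    # Same column names as predict(..., interval = "prediction") produces.
    return(data.frame(
      fit = rep(NA_real_, nrow(test)),
      lwr = rep(NA_real_, nrow(test)),
      upr = rep(NA_real_, nrow(test))
    ))
  }
  obj <- lm(y ~ x, data = train)
  as.data.frame(
    predict(obj, newdata = test, interval = "prediction", level = level)
  )
}

# Example call with made-up data: an empty training set gives one NA row.
predict_or_na(data.frame(x = numeric(0), y = numeric(0)), data.frame(x = 5))
```

The zero-row threshold only covers the case described in this issue; windows with just one or two training rows would still reach lm() and may give degenerate intervals, so a stricter minimum might be preferable.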

@brookslogan added the bug, documentation, and P2 labels on Mar 27, 2023
@dshemetov self-assigned this on Oct 15, 2024