
Adjust ahead #296
Merged: 92 commits merged into dev on Oct 1, 2024

Conversation

@dsweber2 dsweber2 (Contributor) commented Mar 18, 2024

Checklist

Please:

  • Make sure this PR is against "dev", not "main".
  • Request a review from one of the current epipredict main reviewers:
    dajmcdon.
  • Make sure to bump the version number in DESCRIPTION and NEWS.md.
    Always increment the patch version number (the third number), unless you are
    making a release PR from dev to main, in which case increment the minor
    version number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    0.7.2, then write your changes under the 0.8 heading).

Change explanations for reviewer

Implementation of latency adjustments as an argument to step_epi_ahead. The rough idea is to make ahead relative to the as_of date rather than the last day of data. It throws a warning if this pushes the target too far into the future. Only extend_ahead and None are currently implemented. The biggest downside is that it only works for the epi_df that the pipeline was created on.
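
To make the arithmetic concrete, here is a minimal sketch of the extend_ahead idea (illustrative only, not this PR's implementation; the function name and the 10-day warning threshold are made up):

# Illustrative sketch: make `ahead` relative to the as_of date rather than the
# last day of data. Not the actual implementation in this PR.
extend_ahead_sketch <- function(ahead, as_of, max_time_value, max_latency = 10) {
  latency <- as.integer(as_of - max_time_value) # how stale the data is, in days
  if (latency > max_latency) {
    warning("latency of ", latency, " days pushes the target unusually far ahead")
  }
  ahead + latency # the ahead actually applied to the last day of data
}

# With 4 days of latency, an ahead of 7 (relative to as_of) becomes 11
# (relative to the last day of data):
extend_ahead_sketch(7, as.Date("2024-03-18"), as.Date("2024-03-14"))
#> [1] 11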

Still working on:

  • some tests (such as making sure it only works on the first epi_df, and that the end-to-end application works rather than just the specialized function extend_ahead)
  • adjustments to layers to account for the existence of step_adjust_ahead
  • locf and associated tests
  • extend_lags: we should talk about this one; it's a new-ish @lcbrooks idea, sort of the opposite of extend_ahead. Setting the forecast date to be as_of, we adjust the lags per predictor so that they're relative to that predictor's last day of data. E.g. if predictor A has 3 days of latency and the lags are given as (0, 7, 14), set the lags to (3, 10, 17), while if B has 5 days of latency, set the lags for B to (5, 12, 19) (see the sketch after this list). Feel free to move discussion to the issue.
  • demo vignettes
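
As a sketch of the extend_lags idea mentioned above (purely illustrative, not code from this PR):

# Shift each predictor's lags by that predictor's latency, so the lags end up
# relative to its last day of data while the forecast date stays at as_of.
extend_lags_sketch <- function(lags, latency) lags + latency

extend_lags_sketch(c(0, 7, 14), latency = 3) # predictor A: 3 days latent
#> [1]  3 10 17
extend_lags_sketch(c(0, 7, 14), latency = 5) # predictor B: 5 days latent
#> [1]  5 12 19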

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

@dsweber2 dsweber2 self-assigned this Mar 18, 2024
@dsweber2 dsweber2 requested a review from dajmcdon as a code owner March 18, 2024 18:20
@dsweber2 dsweber2 mentioned this pull request Mar 19, 2024
@dsweber2 dsweber2 marked this pull request as draft March 19, 2024 22:03
@dsweber2 dsweber2 (Contributor, Author)

Based on feedback, I'm making this a separate step and making sure full-pipeline tests work. It will stay a draft PR until both of those are done.

@dsweber2 dsweber2 (Contributor, Author)

So I'm still putting the tests in a format that isn't a pile of spaghetti, but I wanted to get this out so we could maybe talk about it.

To implement a separate step, I couldn't find a way to get previous steps, so I basically had to communicate everything relevant through the headers of new_data in the bake step. Hopefully modifying those doesn't cause problems.

While I was looking around I noticed that both step_epi_lag and step_epi_ahead were essentially the same; I made them call a separate function to highlight the actual differences. Couldn't quite reuse that for step_adjust_latency unfortunately, though that has a similar thing where adjust_ahead and adjust_lag basically do the same thing.

There are still some downstream implications for how layer_forecast_date works, specifically in the default case. We may have to add it as an option here too for it to work when someone sets it manually using that step, which seems a bit unfortunate but kinda unavoidable.
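
For context, the "communicate through new_data" trick looks roughly like the sketch below. The attribute name and the latency field are hypothetical; the only part taken from epiprocess is that an epi_df carries its as_of in attr(x, "metadata").

# Sketch only: a bake() method stashes what downstream steps/layers need as an
# attribute on new_data, and a later step reads it back.
bake_sketch <- function(object, new_data, ...) {
  attr(new_data, "latency_info") <- list(       # hypothetical attribute name
    as_of   = attr(new_data, "metadata")$as_of, # epi_df metadata
    latency = object$latency                    # hypothetical step field
  )
  new_data
}
# A downstream step would then do something like:
#   info <- attr(new_data, "latency_info")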

key_cols = keys
)
}) %>%
map(\(x) zoo::na.trim(x)) %>% # TODO need to talk about this
dsweber2 (Contributor, Author)

Two things about this:

  1. Zoo dependency: it seems we don't currently have this. Are we trying to minimize package dependencies? This is the only function I use from it.
  2. This shears NAs off the end of the column. We have a step that explicitly drops NAs, so I'm not sure how much we're trying to avoid removing NAs otherwise. Not sure how else to do this correctly, though.

@brookslogan brookslogan (Contributor) commented May 3, 2024

  • by default, na.trim removes from both the beginning and the end
  • [oops just repeating Daniel here] side note: \( is R >= 4.1. We were recently talking about not trying to support versions 3.* but not sure about 4.0.
  • this function from latency processing code lying around may be helpful:
n_true_at_tail = function(x) {
  match(FALSE, rev(x), nomatch = length(x) + 1L) - 1L
}
# So then to remove NAs at tail you could do the following (or maybe an S3 version of this):
remove_na_tail <- function(x) {
  x[seq_len(length(x) - n_true_at_tail(is.na(x)))]
}
# (don't use `tail` with "negative" `n` unless you know it's actually negative and not zero, which isn't the case here)

remove_na_tail(c(NA, NA, 1:10, NA, 1:10, NA, NA))
remove_na_tail(1:100) # ALTREP
remove_na_tail(1:100 + 1) # non-ALTREP
remove_na_tail(integer(0L))
remove_na_tail(rep(NA_integer_, 5))

dsweber2 (Contributor, Author)

[oops just repeating Daniel here]

Yeah, threading in GitHub issues can be hard to follow, especially when there are a lot of comments >.<

Could you add some docs to those functions?

@dsweber2 dsweber2 (Contributor, Author) commented Apr 2, 2024

Getting a peculiar error where the local snapshot test for population scaling has 4 lines like this

    Code
      p <- predict(wf, latest)
    Message
      Joining with `by = join_by(geo_value)`
      Joining with `by = join_by(geo_value)`
      Joining with `by = join_by(geo_value)`
      Joining with `by = join_by(geo_value)`

whereas the CI is apparently only generating 2. And this apparently changed within the past 2 weeks.

@dajmcdon dajmcdon (Contributor) commented Apr 2, 2024

This was a change in {dplyr} at some point that generates these messages if you don't specify `by =`. I purposely tried to get rid of them. But maybe I missed some? Or your new code is recreating more?
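
For reference, the message comes from joins without an explicit by =; spelling out the keys silences it (toy data below, just to illustrate):

library(dplyr)

# toy tables standing in for the real ones
forecasts  <- tibble(geo_value = c("ca", "ny"), value = c(1.2, 3.4))
population <- tibble(geo_value = c("ca", "ny"), pop = c(39e6, 19e6))

# left_join(forecasts, population) emits "Joining with `by = join_by(geo_value)`";
# an explicit key does not:
left_join(forecasts, population, by = join_by(geo_value))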

@dajmcdon dajmcdon (Contributor) commented Apr 2, 2024

A few other notes (we can discuss synchronously):

  1. The only time a step_*() function has access to the recipe is when it is added. At that point, it can "see" all previous steps. So inside of step_adjust_latency(), you can immediately check to see if there are leads/lags in the current set of recipe instructions, and in principle adjust them if desired (see the sketch after this list). I'm not sure this handles everything, but I suspect it handles some of it.
  2. The \(x) lambda syntax (e.g. map(x, \(x) x + 3)) was added in R 4.1. We've avoided forcing this dependency on users to this point. I'm open to changing this (it's almost 3 years old), but we should discuss. This is also the reason for sticking with the {magrittr} pipe %>% rather than the native pipe |>.
  3. The {zoo} dependency. There are situations where removal is always the right thing (creating NAs on future time_values is probably one of these), and our step_epi_lead() does this currently. However, my previous practice has been to pass any NAs created by steps along. This way, additional steps can potentially be added to handle them (say, any of these, or similar time series fill versions). Simplest is the step_naomit() functions. Is there a reason that this can't be done without the dependency or handled downstream?
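
A rough sketch of what point 1 enables (the step class names and constructor shape here are assumptions for illustration, not checked against {epipredict}):

# When step_adjust_latency() is added, the whole recipe so far is in scope, so
# earlier ahead/lag steps can be inspected immediately.
step_adjust_latency_sketch <- function(recipe, ...) {
  prior_aheads <- Filter(function(s) inherits(s, "step_epi_ahead"), recipe$steps)
  prior_lags   <- Filter(function(s) inherits(s, "step_epi_lag"),   recipe$steps)
  if (length(prior_aheads) == 0 && length(prior_lags) == 0) {
    warning("no step_epi_ahead()/step_epi_lag() found before step_adjust_latency()")
  }
  # ...record what needs adjusting, then add the new step to the recipe...
  recipe
}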

Comment on lines 231 to 220
expect_snapshot(prep <- prep(r, jhu))

expect_snapshot(b <- bake(prep, jhu))
@dsweber2 dsweber2 (Contributor, Author) commented Apr 2, 2024

So these are the tests generating the snapshots. Looking at my local git time machine, this was a set of tests that was previously being skipped, and I added these as snapshot tests. I guess I'll just put this back to skipped for now.

dsweber2 (Contributor, Author)

But yes, it does look like there are some lingering `Joining with by = join_by(geo_value)` messages.

@dsweber2 dsweber2 marked this pull request as ready for review April 30, 2024 22:10
@dajmcdon dajmcdon (Contributor) left a comment

I left a number of questions and comments throughout. It may be easiest to schedule a check-in, or maybe discuss on Slack?

@dsweber2 dsweber2 force-pushed the adjustAhead branch 2 times, most recently from b07656c to 03c2120 Compare May 8, 2024 21:57
@dsweber2 dsweber2 (Contributor, Author) commented May 8, 2024

Overview of new changes:

  1. I dropped stringr and stringi, added purrr.
  2. zoo is still present, waiting for us to decide on whether we want to trim or not.
  3. I moved lag/ahead detection into the step creation thanks to liberal use of recipes_eval_select (see the sketch below). If the template is different from the training data this could cause problems, but I guess that's supposed to be true anyway.
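
Roughly, the detection at step-creation time could look like the sketch below. The recipes_eval_select() call and the step/recipe fields ($terms, $template) are from memory and should be treated as assumptions, not as this PR's actual code:

# Resolve which columns the earlier ahead steps refer to, at the time the
# latency-adjustment step is added, by evaluating their selectors against the
# recipe's template data.
detect_ahead_columns <- function(recipe) {
  ahead_steps <- Filter(function(s) inherits(s, "step_epi_ahead"), recipe$steps)
  lapply(ahead_steps, function(s) {
    recipes::recipes_eval_select(s$terms, data = recipe$template, info = summary(recipe))
  })
}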

@dajmcdon dajmcdon mentioned this pull request May 9, 2024
@dajmcdon dajmcdon (Contributor) left a comment

Ok. I think everything looks potentially fine. It would help for the example to be a bit more elaborate.

Potential for merge conflicts with #331.

@@ -39,14 +39,16 @@ Imports:
quantreg,
recipes (>= 1.0.4),
rlang,
purrr,
Contributor

We don't need it. See below.

Contributor

It looks like you only use purrr::map(); we already have this, so purrr only needs to be in 'Suggests'.

@brookslogan brookslogan (Contributor)

Wishlist item based on some dogfooding: if the minimum reporting latency across signals & geos is really large, say a month or greater, that's worth a warning. E.g., FluSurv-NET has gaps in reporting between seasons; I tried to forecast a new season before the first report had actually been published using ahead adjustment, and it happily tried to forecast the beginning of that season from the end of the preceding season, likely producing poor predictions.
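
A sketch of what such a warning could look like (not implemented here; the 30-day threshold and the grouping are illustrative, and it assumes an epi_df with geo_value/time_value columns and an as_of in its metadata):

library(dplyr)

warn_if_very_latent <- function(epi_df, threshold_days = 30) {
  as_of <- attr(epi_df, "metadata")$as_of
  latencies <- epi_df %>%
    group_by(geo_value) %>%
    summarise(latency = as.integer(as_of - max(time_value)), .groups = "drop")
  if (min(latencies$latency) >= threshold_days) {
    warning(
      "minimum reporting latency is ", min(latencies$latency),
      " days; the ahead adjustment may extrapolate across a reporting gap"
    )
  }
  invisible(latencies)
}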

@dajmcdon dajmcdon (Contributor)

@dsweber2 I'm going to make what may seem like an annoying request:

This processing seems sufficiently complicated that I think it should be on its own, not in arx_forecaster(). I don't think that arx_forecaster() should accommodate every specialized step/layer that we end up creating. So maybe this needs to just be a new canned forecaster? What do you think?

@dsweber2 dsweber2 (Contributor, Author)

I'm not sure it'd be different enough from arx_forecaster to have it as a completely separate version (comparing arx_forecaster, arx_classifier, and... arx_latency_adjusted_forecaster?). What part of the complication pushes you to not include it? Is your concern having too many options for arx_forecaster? Too much complexity in understanding the options? Maintaining the additional feature?

I'd be ok relegating it to just an advanced feature if necessary, though I do think without it, the forecast_date and target_date arguments are kind of half-baked.

@brookslogan brookslogan (Contributor) commented Jun 24, 2024

My 2 cents:

  • We want this adjustment for every forecasting Hub task I'm aware of.
  • We want this adjustment every time we use multiple predictors unless they are perfectly in sync reporting-latency-wise [or have 0 variability in reporting latency for each signal across the requested forecast_dates].

My preference/request would be that there be a canned forecaster that includes this feature. It could be arx_forecaster(), or it could be an arx_forecaster_from_forecast_date() (or better-named) alternative.

Side note: I also share some misgivings about the forecast_date and target_date settings... they always felt like stop-gaps until we had access to this latency/ahead-adjustment feature. I'm not sure they'd be needed anymore --- in either the max time_value-relative or the true-forecast_date-relative approaches --- once the adjustment approach is available; presumably, you'd only use the non-adjusted version when you actually want things to be relative to the max time_value. [Also target_date might make it awkward to support multiple aheads at once, which sounds very convenient + would make a nice uniform interface with smooth quantile_reg approaches.]

  • [David noted that he changed what the forecast_date parameter does, which actually makes forecast_date impact the model for the adjustment approach. That sounds better, though I'm still not sure if people will read it as an epix_as_of() + forecast, which it isn't. It might also be a convenient way to change the meaning of this label when doing max-time_value-relative stuff... maybe the larger problem there is actually the existence of a default forecast_date with a value that might surprise half of readers/users.]

@brookslogan brookslogan (Contributor) commented Jun 26, 2024

From our discussion yesterday, I realized I had overlooked an important possibility: someone newer forecasting without a Hub setup. Then there is no Hub validator to yell at you for forecasting a changing set of (forecast-date-relative) aheads when target latency changes or you pull data too early in the day, but you also may be able to say that different aheads are fine. I think maybe the new forecasting examples Richard and Ryan N. are working on may still be in that regime and may give some insight; I'm not sure if we might also have some experience with epipredict precursors before running things through a Hub validator. Ryan N. already ran into some confusion with (not easily being able to use epi_slide for forecasting and) the epix_slide time_value / forecast_date / target_date labeling, though the obvious issue there was (typing and recycling rules and) poor epiprocess labeling; I'm not sure if there is also an epipredict thing to worry about there.

We still probably don't want to streamline misrepresenting what the real target dates are in this case. (And some surprises might come from: (a) [epipredict] setting the forecast date label to be the max time value and that not being the weekday expected based on a natural reading of the word; (b) [epipredict] using the true forecast date and target dates as labels and seeing the aheads change when trying to retrospectively evaluate or plot, plus potential uneven spacing or duplicate target date values (per epikey-ahead-forecaster); (c) not using target date, assuming the aheads inputted are relative to today in decision-making or plotting code, and being a bit off; and/or (d) running forecasts too early in the day (stale forecast if AR; not sure what all can happen with ARX; it's also an issue for the adjusted version but with different behavior). [(a), (b), and (d)-AR might cause them to ask for support and learn. (c) and (d)-ARX might not be immediately detected and would degrade forecast/decision quality.])

@dsweber2 dsweber2 merged commit 3892a53 into dev Oct 1, 2024
2 checks passed
Successfully merging this pull request may close these issues.

Add latency adjustment