Adjust ahead #296
Conversation
Based on feedback, I'm making this a separate step and making sure full-pipeline tests work. It will stay a draft PR until both of those are done.
So I'm still putting the tests in a format that isn't a pile of spaghetti, but I wanted to get this out so we could maybe talk about it.

To implement a separate step, I couldn't find a way to get previous steps, so I basically had to communicate everything relevant through the headers of …

While I was looking around I noticed that both …

There are still some downstream implications for how …
R/utils-latency.R (outdated)
      key_cols = keys
    )
  }) %>%
  map(\(x) zoo::na.trim(x)) %>% # TODO need to talk about this
Two things about this:
- Zoo dependency: it seems we don't currently have this. Are we trying to minimize package dependencies? This is the only function I use from it.
- This shears NAs off the end of the column. We have a step that explicitly drops NAs, so I'm not sure how much we're trying to avoid removing NAs otherwise. Not sure how to do this correctly otherwise, though.
- By default, na.trim removes from both the beginning and the end.
- [oops, just repeating Daniel here] Side note: \( is R >= 4.1. We were recently talking about not trying to support versions 3.*, but I'm not sure about 4.0.
- This function from latency-processing code lying around may be helpful:
# Count the run of TRUEs at the end of a logical vector: 0 if the last
# element is FALSE, length(x) if everything is TRUE.
n_true_at_tail = function(x) {
  match(FALSE, rev(x), nomatch = length(x) + 1L) - 1L
}

# So then to remove NAs at tail you could do the following (or maybe an S3
# version of this): drop exactly the trailing run of NAs, leaving any
# leading or interior NAs alone.
remove_na_tail <- function(x) {
  x[seq_len(length(x) - n_true_at_tail(is.na(x)))]
}
# (don't use `tail` with "negative" `n` unless you know it's actually
# negative and not zero, which isn't the case here)
remove_na_tail(c(NA, NA, 1:10, NA, 1:10, NA, NA))
remove_na_tail(1:100) # ALTREP
remove_na_tail(1:100 + 1) # non-ALTREP
remove_na_tail(integer(0L))
remove_na_tail(rep(NA_integer_, 5))
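Also, if the zoo dependency does end up staying: na.trim has a sides argument, so the trimming can be restricted to the tail. A quick check of that behavior:

# zoo::na.trim() trims NAs from both ends by default; sides = "right"
# restricts it to the tail, matching remove_na_tail() above.
zoo::na.trim(c(NA, NA, 1:10, NA), sides = "right") # keeps the leading NAs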
[oops just repeating Daniel here]
yeah, threading in github issues can be hard to follow, especially when there are a lot of comments >.<
could you add some docs to those functions?
Getting a peculiar error where the local snapshot test for population scaling has 4 lines like this:

Code
  p <- predict(wf, latest)
Message
  Joining with `by = join_by(geo_value)`
  Joining with `by = join_by(geo_value)`
  Joining with `by = join_by(geo_value)`
  Joining with `by = join_by(geo_value)`

whereas the CI is apparently only generating 2. And this apparently changed within the past 2 weeks.
This was a change in …
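(For context, this illustrates the dplyr behavior rather than this package's code: natural joins emit that message, and supplying by explicitly silences it.)

library(dplyr)
x <- tibble(geo_value = c("ca", "ny"), value = 1:2)
y <- tibble(geo_value = c("ca", "ny"), pop = c(39e6, 19.7e6))
left_join(x, y)                   # message: Joining with `by = join_by(geo_value)`
left_join(x, y, by = "geo_value") # no message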
A few other notes (we can discuss synchronously):
expect_snapshot(prep <- prep(r, jhu))

expect_snapshot(b <- bake(prep, jhu))
So these are the tests generating the snapshots. Looking at my local git time machine, this was a set of tests that was previously being skipped, and I added these as snapshot tests. I guess I'll just put this back to skipped for now.
But yes, it does look like there are some lingering Joining with `by = join_by(geo_value)` messages.
I left a number of questions and comments throughout. It may be easiest to schedule a check-in, or maybe discuss on Slack?
Force-pushed from b07656c to 03c2120
Overview of new changes:
Ok. I think everything looks potentially fine. It would help for the example to be a bit more elaborate.
Potential for merge conflicts with #331.
@@ -39,14 +39,16 @@ Imports:
    quantreg,
    recipes (>= 1.0.4),
    rlang,
    purrr,
We don't need it. See below.
It looks like you only use purrr::map(); we already have this, so purrr only needs to be in 'Suggests'.
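(And if the goal were to avoid a hard purrr dependency entirely, base R covers this particular call; a sketch, with sets as a stand-in name for the mapped list:)

sets <- list(a = c(1, 2, NA), b = c(NA, 3, 4, NA)) # stand-in data
lapply(sets, zoo::na.trim) # same result as purrr::map(sets, \(x) zoo::na.trim(x))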
Wishlist item based on some dogfooding: if the minimum reporting latency across signals & geos is really large, say a month or greater, that's worth a warning. E.g., FluSurv-NET has gaps in reporting between seasons; I tried to forecast a new season before the first report was actually published using ahead adjustment, and it happily tried to forecast the beginning of that season from the end of the preceding season, likely producing poor predictions. A hypothetical sketch of such a check is below.
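(Names and the threshold here are assumptions for illustration, not epipredict API.)

# Warn when even the freshest signal/geo is very stale relative to the
# forecast date, e.g. across a between-season reporting gap.
warn_if_very_latent <- function(last_obs_dates, forecast_date, threshold = 30L) {
  min_latency <- min(as.integer(forecast_date - last_obs_dates))
  if (min_latency >= threshold) {
    warning(
      "minimum reporting latency across signals & geos is ", min_latency,
      " days; ahead adjustment may extrapolate across a reporting gap"
    )
  }
  invisible(min_latency)
}

# FluSurv-NET-style gap: last reports arrive months before the as_of date
warn_if_very_latent(as.Date(c("2023-04-29", "2023-04-22")), as.Date("2023-10-01"))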
@dsweber2 I'm going to make what may seem like an annoying request: this processing seems sufficiently complicated that I think it should be on its own, not in …
I'm not sure it'd be different enough from …

I'd be OK relegating it to just an advanced feature if necessary, though I do think without it, the …
My 2 cents:

My preference/request would be that there be a canned forecaster that includes this feature. Could be …

Side note: I also share some misgivings about the …
From our discussion yesterday, I realized I overlooked an important possibility: someone newer forecasting without a Hub setup. Then there is no Hub validator to yell at you for forecasting a changing set of (forecast-date-relative) aheads when target latency changes / you pull data too early in the day, but you also may be able to say different aheads are fine. I think maybe the new forecasting examples Richard and Ryan N. are working on may still be in that regime and may give some insight; I'm not sure if we might also have some experiences with epipredict precursors before running things in the Hub validator. Ryan N. already ran into some confusion with (not easily being able to use epi_slide for forecasting and) the epix_slide time_value / forecast_date / target_date labeling, though the obvious issue there was (typing and recycling rules and) poor epiprocess labeling; I'm not sure if there is also an epipredict thing to worry about there. Still, we probably don't want to streamline misrepresenting what the real target dates are in this case.

And some surprises might come if/when:
(a) [epipredict] setting the forecast date label to be the max time value and that not being the weekday expected (based on a natural reading of the word);
(b) [epipredict] using the true forecast date and target dates as labels and seeing the aheads change when trying to retrospectively evaluate or plot, plus potential uneven spacing or duplicate target date values (per epikey-ahead-forecaster);
(c) not using target date and assuming the aheads inputted are relative to today in decision-making or plotting code and being a bit off; and/or
(d) running forecasts too early in the day (stale forecast if AR; not sure what all can happen with ARX; it's also an issue for the adjusted version, but with different behavior).

[(a), (b), and (d)-AR might cause them to ask for support and learn. (c) and (d)-ARX might not be immediately detected and would degrade forecast/decision quality.]
Strategy:
1. Add the grid created in ahead/lag steps to the object.
2. At bake time, directly modify these if needed.
Co-authored-by: David Weber <[email protected]>
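(A hypothetical illustration of that strategy, with assumed names rather than the actual PR code: the step stores its shift grid when constructed, and at bake time the grid is shifted by the observed latency before being applied.)

# Hypothetical: shift a stored lag/ahead grid by the latency observed at
# bake time (days between the last time_value and the as_of date).
adjust_shift_grid <- function(shift_grid, as_of, max_time_value) {
  latency <- as.integer(as_of - max_time_value)
  shift_grid$shift <- shift_grid$shift + latency
  shift_grid
}

grid <- data.frame(col = "case_rate", shift = c(0L, 7L, 14L))
adjust_shift_grid(grid, as.Date("2023-10-05"), as.Date("2023-10-02"))
# shifts become 3, 10, 17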
Checklist
Please:
- Request a review from dajmcdon.
- Make sure to bump the version number in DESCRIPTION and NEWS.md. Always increment the patch version number (the third number), unless you are making a release PR from dev to main, in which case increment the minor version number (the second number).
- Describe changes made in NEWS.md, making sure breaking changes (backwards-incompatible changes to the documented interface) are noted. Collect the changes under the next release number (e.g. if you are on 0.7.2, then write your changes under the 0.8 heading).
Change explanations for reviewer

Implementation of latency adjustments as an argument to step_epi_ahead. The rough idea is to make ahead relative to the as_of date, rather than the last day of data. It throws a warning if this is too far in the future. Only extend_ahead and None are currently implemented. Biggest downside is it only works for the epi_df that the pipeline was created on.

Still working on:
- making sure it works on a general epi_df and that the end-to-end application works, rather than just the specialized function (extend_ahead)
- step_adjust_ahead
- locf and associated tests
- extend_lags: we should talk about this one; it's a new-ish @lcbrooks idea, sort of the opposite of extend_aheads. Setting the forecast date to be as_of, per predictor, adjust the lags so that they're relative to the last day of data. E.g. if predictor A has 3 days of latency and the lags are given as (0, 7, 14), set the lags to (3, 10, 17), while if B has 5 days of latency, set the lags for B to (5, 12, 19). Feel free to move discussion to the issue. (A sketch of this arithmetic is below.)

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch
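(A minimal sketch of the extend_lags arithmetic described above, not the PR's implementation: shift each predictor's lags by that predictor's reporting latency, so the lags stay relative to the last day of available data while the forecast date is pinned to as_of. The latency argument here is an assumed name for the days between a predictor's last observation and the as_of date.)

extend_lags <- function(lags, latency) {
  # each lag moves back by the predictor's reporting latency
  lags + latency
}

extend_lags(c(0, 7, 14), latency = 3) # predictor A: 3, 10, 17
extend_lags(c(0, 7, 14), latency = 5) # predictor B: 5, 12, 19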