Docs overhaul #431

Open · wants to merge 45 commits into dev from docsDraft

Conversation

@dsweber2 (Contributor) commented Jan 23, 2025

Checklist

Please:

  • Make sure this PR is against "dev", not "main".
  • Request a review from one of the current epipredict main reviewers:
    dajmcdon.
  • Make sure to bump the version number in DESCRIPTION and NEWS.md.
    Always increment the patch version number (the third number), unless you are
    making a release PR from dev to main, in which case increment the minor
    version number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    0.7.2, then write your changes under the 0.8 heading).
  • Consider pinning the epiprocess version in the DESCRIPTION file if
    • You anticipate breaking changes in epiprocess soon
    • You want to co-develop features in epipredict and epiprocess

Change explanations for reviewer

Draft ready for review:

  • Landing Page
  • Getting Started
  • Customized Forecasters
  • Reference
    • Using the add/update/remove and adjust functions
    • Smooth quantile regression
  • Preprocessing and models examples
  • Backtesting forecasters

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

@dsweber2 dsweber2 requested a review from dajmcdon as a code owner January 23, 2025 20:59
@dajmcdon (Contributor)

/preview-docs

github-actions bot commented Jan 23, 2025

@dshemetov (Contributor) commented Jan 23, 2025

Our setup generates docs in dev/, so the link is off; this one works:
https://6792d4953137ef0ce0547a4f--epipredict.netlify.app/dev/

Also FYI: the bot edits its own comment for links. Each preview is a separate link, and the links stick around for about 90 days. You can see the previous links in the comment edit history.

Edit: this has been fixed on main, so the workaround is no longer necessary.

@dsweber2 dsweber2 force-pushed the docsDraft branch 4 times, most recently from 8044b98 to d35363e on January 27, 2025 22:50
@dsweber2 (Contributor, Author)

/preview-docs

@dsweber2 (Contributor, Author)

So something weird is happening with the plot for `flatline_forecaster`, not really sure why. Going to dig into that next.

I added an option to replace the data for the autoplot, so you can compare with new data instead.
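
A purely hypothetical sketch of how that option might be used; the `plot_data` argument name and the `newer_snapshot` object are invented here for illustration and may not match the actual implementation:

``` r
# Hypothetical usage: plot the fitted forecaster, but overlay a newer
# snapshot of the data rather than the data it was trained on.
# `newer_snapshot` stands in for an updated epi_df; `plot_data` is a
# made-up argument name, not the confirmed API.
autoplot(two_week_ahead, plot_data = newer_snapshot)
```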

@dsweber2 (Contributor, Author) commented Feb 3, 2025

Draft of the getting started page is ready; moving on to a draft of the "guts" page (the name is a placeholder), which is an overview of creating workflows by hand.

@dsweber2 (Contributor, Author) commented Feb 5, 2025

So something weird is happening with the plot for `flatline_forecaster`, not really sure why. Going to dig into that next.
[screenshot of the flatline_forecaster plot]

After some digging, I don't think there are any bugs, just some edge-case behavior that we may not want:

  1. Thresholding and extrapolation don't interact well. In this case, the quantiles it fits are 0.05 and 0.95, and it correctly rounds the 5% quantile up to zero (because of the negative values in the data, it is actually negative without the constraint). But the plot also shows the 2.5% and 97.5% quantiles, which the extrapolation doesn't know should also be zero. This results in quantiles with negative values.
  2. The other thing is that the interpolated quantiles change quite a bit after thresholding if there aren't very many quantiles. For example, the median gets pushed up quite a bit, but the point prediction doesn't reflect that.

My takeaway: never fit just the 5% and 95% quantiles; at least include the 50%. That fixes most of the jank this uncovers.
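
For instance, a minimal sketch of including the median when calling the canned forecaster (assuming the `quantile_levels` argument of `flatline_args_list()`; older versions may name this argument differently, and the outcome column is taken from the dataset used elsewhere in these docs):

``` r
library(epipredict)

# Fit the 50% quantile alongside the extremes so that thresholding at zero
# doesn't leave the interpolated central quantiles badly distorted.
out <- flatline_forecaster(
  covid_case_death_rates,
  outcome = "death_rate",
  args_list = flatline_args_list(quantile_levels = c(0.05, 0.5, 0.95))
)
```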

@dsweber2 dsweber2 self-assigned this Feb 7, 2025
@dsweber2 (Contributor, Author) commented Feb 7, 2025

/preview-docs

@dshemetov (Contributor)

Including 0.5 in the user's selection sounds simple and reasonable to me. They can always filter out what they don't want.

@dsweber2 (Contributor, Author) commented Feb 7, 2025

/preview-docs

@dsweber2 (Contributor, Author)

/preview-docs

@nmdefries this also updates the backtesting vignette; I'm dropping the Canadian example because its data basically had no revisions.

@dsweber2 (Contributor, Author)

/preview-docs


``` r
two_week_ahead <- arx_forecaster(
  covid_case_death_rates,
four_week_ahead <- arx_forecaster(
```
Contributor:

issue: This gives me an error:

```
Error in prep.epi_recipe(blueprint$recipe, training = training, fresh = blueprint$fresh,  :
  object 'validate_training_data' not found
```

I'm using the dev version of epipredict, installed today.

Contributor:

That's internal to {recipes} (unexported). The DESCRIPTION may need to require a higher version; NEWS.md here suggests 1.1.0, but it's worth checking.

Contributor:

(Long-term goal is to remove any dependence on internal functions from other packages.)

Contributor:

recipes 1.1.1 worked, along with an update to hardhat.
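
If bumping the floor is indeed the fix, the DESCRIPTION change would presumably look something like this sketch (the exact minimum version is an assumption based on this thread, not verified):

```
Imports:
    recipes (>= 1.1.1)
```

with a similar minimum on hardhat if the update there also turns out to be required.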

@nmdefries (Contributor) left a comment

Comments/questions so far.

I just need to finish the custom_workflows vignette.

Comment on lines 201 to 202
As truth data, we'll compare with the `epix_as_of()` to generate a snapshot of
the archive at the last date[^1].
Contributor:

question: why are we comparing our forecast to the finalized value?
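
For reference, the snapshot being discussed is presumably generated along these lines (a sketch using `epiprocess`; `archive` is a stand-in for the versioned data object):

``` r
library(epiprocess)

# Take a snapshot of the archive as of the latest version it contains,
# i.e. the most finalized view of the data available.
latest_snapshot <- epix_as_of(archive, archive$versions_end)
```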

Comment on lines +359 to 361
``` r
  data = percent_cli_data |> filter(geo_value == geo_choose),
  aes(x = time_value, y = percent_cli, color = factor(version)),
  inherit.aes = FALSE, na.rm = TRUE
```
Contributor:

issue: On the version-faithful plot, the "finalized" line is bolded and pink, versus gray on the version-un-faithful plot. This makes the two plots hard to compare.

The version-faithful finalized data also doesn't cover the full time period.

``` r
library(epidatr)
```

To get a better handle on custom `epi_workflow()`s, let's recreate and then
Contributor:

suggestion: please explain what an epi_workflow is and why you'd want to use it.
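
For what it's worth, here is a minimal sketch of what an `epi_workflow()` bundles together; the step and layer names follow the usual epipredict pattern, but treat this as an illustration rather than the vignette's exact code:

``` r
library(epipredict)

# Preprocessing: an epi_recipe with lagged predictors and the outcome
# shifted ahead to the forecast horizon.
rec <- epi_recipe(covid_case_death_rates) %>%
  step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 14) %>%
  step_epi_naomit()

# Postprocessing ("frosting"): turn raw model output into a forecast with
# prediction intervals and the relevant dates attached.
post <- frosting() %>%
  layer_predict() %>%
  layer_residual_quantiles() %>%
  layer_add_forecast_date() %>%
  layer_add_target_date()

# The epi_workflow bundles preprocessing, model, and postprocessing into a
# single object that can be fit once and then used to predict.
wf <- epi_workflow(rec, parsnip::linear_reg(), post) %>%
  fit(covid_case_death_rates)
```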

versions you should assume performance is worse than what the test would
otherwise suggest.

[^4]: Until we have a time machine
Contributor:

praise: 😆

Comment on lines +371 to +380

New text in this section:

The version faithful and un-faithful forecasts look moderately similar except for the 1-day horizons (although neither approach produces amazingly accurate forecasts).

In the version faithful case for California, the March 2021 forecast (turquoise) starts at a value just above 10, which lines up well with reported values leading up to that forecast. The measured and forecasted trends are also concordant (both increasing moderately fast).

Because the data for this time period was later adjusted down with a decreasing trend, the March 2021 forecast looks quite bad compared to finalized data.

The equivalent version un-faithful forecast starts at a value of 5, which is in line with the finalized data but would have been out of place compared to the version data.

Removed: the previous "Example using case data from Canada" subsection:

### Example using case data from Canada

<details>
<summary>Data and forecasts. Similar to the above.</summary>

By leveraging the flexibility of `epiprocess`, we can apply the same techniques to data from other sources. Since some collaborators are in British Columbia, Canada, we'll do essentially the same thing for Canada as we did above.

The [COVID-19 Canada Open Data Working Group](https://opencovid.ca/) collects daily time series data on COVID-19 cases, deaths, recoveries, testing and vaccinations at the health region and province levels. Data are collected from publicly available sources such as government datasets and news releases. Unfortunately, there is no simple versioned source, so we have created our own from the GitHub commit history.

First, we load versioned case rates at the provincial level. After converting these to 7-day averages (due to highly variable provincial reporting mismatches), we then convert the data to an `epi_archive` object, and extract the latest version from it. Finally, we run the same forecasting exercise as for the American data, but here we compare the forecasts produced from using simple linear regression with those from using boosted regression trees.

```{r get-can-fc, warning = FALSE}
aheads <- c(7, 14, 21, 28)
canada_archive <- can_prov_cases
canada_archive_faux <- epix_as_of(canada_archive, canada_archive$versions_end) %>%
  mutate(version = time_value) %>%
  as_epi_archive()
# This function will add the 7-day average of the case rate to the data
# before forecasting.
smooth_cases <- function(epi_df) {
  epi_df %>%
    group_by(geo_value) %>%
    epi_slide_mean("case_rate", .window_size = 7, na.rm = TRUE, .suffix = "_{.n}dav")
}
forecast_dates <- seq.Date(
  from = min(canada_archive$DT$version),
  to = max(canada_archive$DT$version),
  by = "1 month"
)
```
Contributor:

comment: I heavily modified this section to focus more on why/how the version faithful and un-faithful forecasts differ, and to "advertise" backtesting as a useful tool. The previous blurb made it sound like version un-faithful forecasts performed better, which of course they do on finalized data, but that is not what we're trying to say here.

Please check over the new version @dsweber2 to see if I left anything out.


So there are 6 steps we will need to recreate.
One thing to note about the extracted recipe is that it has already been
Contributor:

suggestion: confused about recipe vs. workflow (this is probably not the right spot to explain). We should probably link to the {recipes} documentation so that we don't have to get into a lot of detail here.

engines (such as `quantile_reg()`) are

- `layer_quantile_distn()`: adds the specified quantiles.
If they differ from the ones actually fit, they will be interpolated and/or
Contributor:

suggestion: please clarify: differ in what way?
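
A sketch of where that layer sits in a frosting, under the usual epipredict interface (defaults shown; the requested levels would be set via the layer's arguments):

``` r
post <- frosting() %>%
  layer_predict() %>%
  # Adds quantiles from the fitted distribution; if the requested levels
  # differ from the ones the engine actually fit, they are interpolated
  # and/or extrapolated from the fitted quantiles.
  layer_quantile_distn() %>%
  layer_point_from_distn()
```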


## Predicting

To do a prediction, we need to first narrow the dataset down to the relevant
Contributor:

suggestion: clarify why they need to be removed. Won't unused obs just be ignored?
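
For reference, the narrowing step is presumably something like the following sketch (`rec` and `wf` are stand-ins for the recipe and the fitted workflow):

``` r
# Keep only the most recent observations that the recipe's lags require,
# then predict from the fitted workflow on that reduced dataset.
test_data <- get_test_data(recipe = rec, x = covid_case_death_rates)
preds <- predict(wf, new_data = test_data)
```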


The resulting tibble is 800 rows long, however.
This produces forecasts for not just the actual `forecast_date`, but for every
Contributor:

question: Where did we set the forecast_date, and how does the workflow know what it is? What if we want to use a different forecast date? Do we have to re-define and re-compile the whole workflow?
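
(For context, a hedged sketch of the filtering being described; `forecast_date` here stands in for whichever date the vignette designates, set wherever the workflow or recipe defines it.)

``` r
# predict() on the full dataset yields one forecast per (geo_value, time_value);
# keep only the rows generated from the intended forecast date.
preds %>% filter(time_value == forecast_date)
```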

Comment on lines +280 to +281
This can be useful for cases where `get_test_data()` doesn't pull sufficient
data.
Contributor:

issue: This sounds suspicious. If `get_test_data()` decided there wasn't sufficient data, it sounds like the predict -> filter approach is doing something wrong (predicting with insufficient data).

Contributor:

I assumed that the predict -> filter approach would return the same predictions for the forecast_date, but it sounds like it doesn't.

Successfully merging this pull request may close these issues: Vignettes

4 participants