Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add step_/layer_ epi_YeoJohnson #451

Open
wants to merge 7 commits into
base: dev
Choose a base branch
from
Open

feat: add step_/layer_ epi_YeoJohnson #451

wants to merge 7 commits into from

Conversation

dshemetov
Copy link
Contributor

Checklist

Please:

  • Make sure this PR is against "dev", not "main".
  • Request a review from one of the current epipredict main reviewers:
    dajmcdon.
  • Make sure to bump the version number in DESCRIPTION and NEWS.md.
    Always increment the patch version number (the third number), unless you are
    making a release PR from dev to main, in which case increment the minor
    version number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    0.7.2, then write your changes under the 0.8 heading).
  • Consider pinning the epiprocess version in the DESCRIPTION file if
    • You anticipate breaking changes in epiprocess soon
    • You want to co-develop features in epipredict and epiprocess

Change explanations for reviewer

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

  • Resolves #{issue number}

Copy link
Contributor

@dajmcdon dajmcdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far. I've gotten through the layer stuff, but haven't really started on the step. There are a number of tasks we can discuss if I haven't provided enough detail.

Comment on lines 117 to 171
# The `object$terms` is where the user specifies the columns they want to
# untransform. We need to match the outcomes with their lambda columns in our
# parameter table and then apply the inverse transformation.
if (identical(col_names, ".pred")) {
# In this case, we don't get a hint for the outcome column name, so we need
# to infer it from the mold.
if (length(components$mold$outcomes) > 1) {
cli_abort("Only one outcome is allowed when specifying `.pred`.", call = rlang::caller_env())
}
# `outcomes` is a vector of objects like ahead_1_cases, ahead_7_cases, etc.
# We want to extract the cases part.
outcome_cols <- names(components$mold$outcomes) %>%
stringr::str_match("ahead_\\d+_(.*)") %>%
magrittr::extract(, 2)

components$predictions <- components$predictions %>%
rowwise() %>%
mutate(.pred := yj_inverse(.pred, !!sym(paste0(".lambda_", outcome_cols))))
} else if (identical(col_names, character(0))) {
# Wish I could suggest `all_outcomes()` here, but currently it's the same as
# not specifying any terms. I don't want to spend time with dealing with
# this case until someone asks for it.
cli::cli_abort(
"Not specifying columns to layer Yeo-Johnson is not implemented.
If you had a single outcome, you can use `.pred` as a column name.
If you had multiple outcomes, you'll need to specify them like
`.pred_ahead_1_<outcome_col>`, `.pred_ahead_7_<outcome_col>`, etc.
",
call = rlang::caller_env()
)
} else {
# In this case, we assume that the user has specified the columns they want
# transformed here. We then need to determine the lambda columns for each of
# these columns. That is, we need to convert a vector of column names like
# c(".pred_ahead_1_case_rate", ".pred_ahead_7_case_rate") to
# c("lambda_ahead_1_case_rate", "lambda_ahead_7_case_rate").
original_outcome_cols <- stringr::str_match(col_names, ".pred_ahead_\\d+_(.*)")[, 2]
outcomes_wout_ahead <- stringr::str_match(names(components$mold$outcomes), "ahead_\\d+_(.*)")[, 2]
if (any(original_outcome_cols %nin% outcomes_wout_ahead)) {
cli_abort(
"All columns specified in `...` must be outcome columns.
They must be of the form `.pred_ahead_1_<outcome_col>`, `.pred_ahead_7_<outcome_col>`, etc.
",
call = rlang::caller_env()
)
}

for (i in seq_along(col_names)) {
col <- col_names[i]
lambda_col <- paste0(".lambda_", original_outcome_cols[i])
components$predictions <- components$predictions %>%
rowwise() %>%
mutate(!!sym(col) := yj_inverse(!!sym(col), !!sym(lambda_col)))
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is all pretty dangerous.

  1. all_outcomes() is a recipe selector. So you can use that, as long as you look in the workflow.
  2. Alternatively, even without the selector passed in ..., you can look there to see which vars have the role "outcome".
  3. But our typical case would want to apply the transformation to starts_with(.pred) to bring in the quantiles as well as the point prediction. I suspect that would hit your final else case but produce an error.
  4. I'm not sure that we have an example with multiple aheads in separate columns, but if we did, presumably they'd have the same YJ parameters for different aheads of the same variable? This would be detectable from the workflow, which would be much safer than trying to get good behaviour with the string processing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me give this another shot, using workflow and assuming the typical case is starts_with(.pred).

Comment on lines 242 to 247
for (step in this_recipe$steps) {
if (inherits(step, "step_epi_YeoJohnson")) {
lambdas <- step$lambdas
break
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will be buggy if the step is used multiple times. Also, we want to make sure that the step applies to the outcome rather than the predictors (or other roles).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm yea. I see a few ways through this:

  • require that only a single step_yj is allowed; this would mean users have to bundle all their transformation requests in one step and all the transformations have to use the same hyperparameters na_lambda_fill, limits, num_unique, na_rm; that doesn't seem like a big limitation to me
  • do a kind-of parentheses matching approach: the layer will need to count the yj layers prior to it and count backwards in the list of steps to find the step that matches it
    • doable, not super complex, and the general scheme might be handy for future layers, but the feature gain doesn't seem worth the effort right now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for logging:

  1. To access the processed recipe, use hardhat::extract_recipe(workflow).
  2. Inside is last_term_info. This tibble contains the names of all variables that ever existed, along with their roles.

Alternatively:

  1. Access the mold with hardhat::extract_mold(workflow).
  2. all y are in a tibble at $outcomes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second is technically more robust within a workflow, if the preprocessor was other than a recipe (like a formula, for example).

Co-authored-by: Daniel McDonald <[email protected]>
Comment on lines +188 to +198
# If not an epi_df, make it one assuming the template of training data.
# If it is an epi_df, check that the keys match.
# Imitating the pattern in step_adjust_latency().
if (!inherits(new_data, "epi_df") || is.null(attr(new_data, "metadata")$as_of)) {
new_data <- as_epi_df(
new_data,
as_of = object$forecast_date,
other_keys = object$metadata$other_keys %||% character()
)
attr(new_data, "metadata") <- object$metadata
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I removed this guardrail and I ran into an instance where new_data is a tibble in the tests. Seems out of scope to try and fix that, but wanted to mark that this is actually doing work right now.

* vectorize in lambda
* inheritParams in docs
* lambda -> yj_param in many places
@dsweber2 dsweber2 linked an issue Apr 3, 2025 that may be closed by this pull request
@dsweber2 dsweber2 mentioned this pull request Apr 8, 2025
19 tasks
* extend to quantile_dist, exclude multi-output
* Drop by specification and infer from the epi_df
* lint+test: test coverage, handle na lambda case, lint
* fix: quantile_pred arithmetic
* fix: rlang calls

---------

Co-authored-by: Dmitry Shemetov <[email protected]>
@dshemetov dshemetov requested a review from dajmcdon April 10, 2025 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Yeo-Johnson
3 participants