feat: add step_/layer_ epi_YeoJohnson #451

dshemetov · 2025-03-21T20:42:20Z

Checklist

Please:

Make sure this PR is against "dev", not "main".
Request a review from one of the current epipredict main reviewers:
dajmcdon.
Make sure to bump the version number in DESCRIPTION and NEWS.md.
Always increment the patch version number (the third number), unless you are
making a release PR from dev to main, in which case increment the minor
version number (the second number).
Describe changes made in NEWS.md, making sure breaking changes
(backwards-incompatible changes to the documented interface) are noted.
Collect the changes under the next release number (e.g. if you are on
0.7.2, then write your changes under the 0.8 heading).
Consider pinning the epiprocess version in the DESCRIPTION file if
- You anticipate breaking changes in epiprocess soon
- You want to co-develop features in epipredict and epiprocess

Change explanations for reviewer

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

Resolves #{issue number}

dajmcdon

Looking good so far. I've gotten through the layer stuff, but haven't really started on the step. There are a number of tasks we can discuss if I haven't provided enough detail.

air.toml

R/layer_yeo_johnson.R

dajmcdon · 2025-03-25T21:43:38Z

R/layer_yeo_johnson.R

+  # The `object$terms` is where the user specifies the columns they want to
+  # untransform. We need to match the outcomes with their lambda columns in our
+  # parameter table and then apply the inverse transformation.
+  if (identical(col_names, ".pred")) {
+    # In this case, we don't get a hint for the outcome column name, so we need
+    # to infer it from the mold.
+    if (length(components$mold$outcomes) > 1) {
+      cli_abort("Only one outcome is allowed when specifying `.pred`.", call = rlang::caller_env())
+    }
+    # `outcomes` is a vector of objects like ahead_1_cases, ahead_7_cases, etc.
+    # We want to extract the cases part.
+    outcome_cols <- names(components$mold$outcomes) %>%
+      stringr::str_match("ahead_\\d+_(.*)") %>%
+      magrittr::extract(, 2)
+
+    components$predictions <- components$predictions %>%
+      rowwise() %>%
+      mutate(.pred := yj_inverse(.pred, !!sym(paste0(".lambda_", outcome_cols))))
+  } else if (identical(col_names, character(0))) {
+    # Wish I could suggest `all_outcomes()` here, but currently it's the same as
+    # not specifying any terms. I don't want to spend time with dealing with
+    # this case until someone asks for it.
+    cli::cli_abort(
+      "Not specifying columns to layer Yeo-Johnson is not implemented.
+    If you had a single outcome, you can use `.pred` as a column name.
+    If you had multiple outcomes, you'll need to specify them like
+    `.pred_ahead_1_<outcome_col>`, `.pred_ahead_7_<outcome_col>`, etc.
+    ",
+      call = rlang::caller_env()
+    )
+  } else {
+    # In this case, we assume that the user has specified the columns they want
+    # transformed here. We then need to determine the lambda columns for each of
+    # these columns. That is, we need to convert a vector of column names like
+    # c(".pred_ahead_1_case_rate", ".pred_ahead_7_case_rate") to
+    # c("lambda_ahead_1_case_rate", "lambda_ahead_7_case_rate").
+    original_outcome_cols <- stringr::str_match(col_names, ".pred_ahead_\\d+_(.*)")[, 2]
+    outcomes_wout_ahead <- stringr::str_match(names(components$mold$outcomes), "ahead_\\d+_(.*)")[, 2]
+    if (any(original_outcome_cols %nin% outcomes_wout_ahead)) {
+      cli_abort(
+        "All columns specified in `...` must be outcome columns.
+      They must be of the form `.pred_ahead_1_<outcome_col>`, `.pred_ahead_7_<outcome_col>`, etc.
+      ",
+        call = rlang::caller_env()
+      )
+    }
+
+    for (i in seq_along(col_names)) {
+      col <- col_names[i]
+      lambda_col <- paste0(".lambda_", original_outcome_cols[i])
+      components$predictions <- components$predictions %>%
+        rowwise() %>%
+        mutate(!!sym(col) := yj_inverse(!!sym(col), !!sym(lambda_col)))
+    }
+  }


This part is all pretty dangerous.

all_outcomes() is a recipe selector. So you can use that, as long as you look in the workflow.

Alternatively, even without the selector passed in ..., you can look there to see which vars have the role "outcome".

But our typical case would want to apply the transformation to starts_with(.pred) to bring in the quantiles as well as the point prediction. I suspect that would hit your final else case but produce an error.

I'm not sure that we have an example with multiple aheads in separate columns, but if we did, presumably they'd have the same YJ parameters for different aheads of the same variable? This would be detectable from the workflow, which would be much safer than trying to get good behaviour with the string processing.

Let me give this another shot, using workflow and assuming the typical case is starts_with(.pred).

dajmcdon · 2025-03-25T21:45:49Z

R/layer_yeo_johnson.R

+  for (step in this_recipe$steps) {
+    if (inherits(step, "step_epi_YeoJohnson")) {
+      lambdas <- step$lambdas
+      break
+    }
+  }


this will be buggy if the step is used multiple times. Also, we want to make sure that the step applies to the outcome rather than the predictors (or other roles).

Hm yea. I see a few ways through this:

require that only a single step_yj is allowed; this would mean users have to bundle all their transformation requests in one step and all the transformations have to use the same hyperparameters na_lambda_fill, limits, num_unique, na_rm; that doesn't seem like a big limitation to me

do a kind-of parentheses matching approach: the layer will need to count the yj layers prior to it and count backwards in the list of steps to find the step that matches it

doable, not super complex, and the general scheme might be handy for future layers, but the feature gain doesn't seem worth the effort right now

Just for logging:

To access the processed recipe, use hardhat::extract_recipe(workflow).

Inside is last_term_info. This tibble contains the names of all variables that ever existed, along with their roles.

Alternatively:

Access the mold with hardhat::extract_mold(workflow).

all y are in a tibble at $outcomes

The second is technically more robust within a workflow, if the preprocessor was other than a recipe (like a formula, for example).

R/layer_yeo_johnson.R

R/step_yeo_johnson.R

dshemetov · 2025-04-01T01:07:11Z

R/step_yeo_johnson.R

+  # If not an epi_df, make it one assuming the template of training data.
+  # If it is an epi_df, check that the keys match.
+  # Imitating the pattern in step_adjust_latency().
+  if (!inherits(new_data, "epi_df") || is.null(attr(new_data, "metadata")$as_of)) {
+    new_data <- as_epi_df(
+      new_data,
+      as_of = object$forecast_date,
+      other_keys = object$metadata$other_keys %||% character()
+    )
+    attr(new_data, "metadata") <- object$metadata
+  }


So I removed this guardrail and I ran into an instance where new_data is a tibble in the tests. Seems out of scope to try and fix that, but wanted to mark that this is actually doing work right now.

Co-authored-by: Daniel McDonald <[email protected]>

* vectorize in lambda * inheritParams in docs * lambda -> yj_param in many places

* extend to quantile_dist, exclude multi-output * Drop by specification and infer from the epi_df * lint+test: test coverage, handle na lambda case, lint * fix: quantile_pred arithmetic * fix: rlang calls --------- Co-authored-by: Dmitry Shemetov <[email protected]>

dshemetov requested a review from dajmcdon as a code owner March 21, 2025 20:42

dshemetov mentioned this pull request Mar 21, 2025

feat: add yeo-johnson step cmu-delphi/exploration-tooling#179

Closed

6 tasks

dajmcdon reviewed Mar 25, 2025

View reviewed changes

dshemetov force-pushed the ds/step-yj branch from fdd1a44 to e0a84dd Compare March 31, 2025 20:51

This was referenced Mar 31, 2025

enh: add a workflow checker function for slather functions #456

Open

Step Yeo Johnson contains an unused line tidymodels/recipes#1464

Closed

dshemetov commented Apr 1, 2025

View reviewed changes

dsweber2 linked an issue Apr 3, 2025 that may be closed by this pull request

Yeo-Johnson #457

Open

dsweber2 mentioned this pull request Apr 8, 2025

Docs overhaul #431

Open

19 tasks

dshemetov requested a review from dajmcdon April 10, 2025 21:52

dshemetov and others added 7 commits April 15, 2025 12:34

feat: add step_/layer_ epi_YeoJohnson

577a866

feat: add air.toml

b4b0627

doc: version bump + news

2d05117

Update R/layer_yeo_johnson.R

95c50b5

Co-authored-by: Daniel McDonald <[email protected]>

fix: review tweaks

6a24743

* vectorize in lambda * inheritParams in docs * lambda -> yj_param in many places

fix lambda(s), styling

fd76c36

dsweber2 force-pushed the ds/step-yj branch from 2dd1ae0 to 47dce2f Compare April 15, 2025 17:38

dsweber2 mentioned this pull request Apr 21, 2025

Scaling and centering cmu-delphi/exploration-tooling#162

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add step_/layer_ epi_YeoJohnson #451

feat: add step_/layer_ epi_YeoJohnson #451

Uh oh!

dshemetov commented Mar 21, 2025

Uh oh!

dajmcdon left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dajmcdon Mar 25, 2025

Uh oh!

dshemetov Mar 27, 2025

Uh oh!

dajmcdon Mar 25, 2025

Uh oh!

dshemetov Mar 27, 2025

Uh oh!

dajmcdon Mar 28, 2025

Uh oh!

dajmcdon Mar 28, 2025

Uh oh!

Uh oh!

Uh oh!

dshemetov Apr 1, 2025

Uh oh!

Uh oh!

feat: add step_/layer_ epi_YeoJohnson #451

Are you sure you want to change the base?

feat: add step_/layer_ epi_YeoJohnson #451

Uh oh!

Conversation

dshemetov commented Mar 21, 2025

Checklist

Change explanations for reviewer

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

Uh oh!

dajmcdon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dajmcdon Mar 25, 2025

Choose a reason for hiding this comment

Uh oh!

dshemetov Mar 27, 2025

Choose a reason for hiding this comment

Uh oh!

dajmcdon Mar 25, 2025

Choose a reason for hiding this comment

Uh oh!

dshemetov Mar 27, 2025

Choose a reason for hiding this comment

Uh oh!

dajmcdon Mar 28, 2025

Choose a reason for hiding this comment

Uh oh!

dajmcdon Mar 28, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dshemetov Apr 1, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!