-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add step_/layer_ epi_YeoJohnson #451
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far. I've gotten through the layer stuff, but haven't really started on the step. There are a number of tasks we can discuss if I haven't provided enough detail.
R/layer_yeo_johnson.R
Outdated
# The `object$terms` is where the user specifies the columns they want to | ||
# untransform. We need to match the outcomes with their lambda columns in our | ||
# parameter table and then apply the inverse transformation. | ||
if (identical(col_names, ".pred")) { | ||
# In this case, we don't get a hint for the outcome column name, so we need | ||
# to infer it from the mold. | ||
if (length(components$mold$outcomes) > 1) { | ||
cli_abort("Only one outcome is allowed when specifying `.pred`.", call = rlang::caller_env()) | ||
} | ||
# `outcomes` is a vector of objects like ahead_1_cases, ahead_7_cases, etc. | ||
# We want to extract the cases part. | ||
outcome_cols <- names(components$mold$outcomes) %>% | ||
stringr::str_match("ahead_\\d+_(.*)") %>% | ||
magrittr::extract(, 2) | ||
|
||
components$predictions <- components$predictions %>% | ||
rowwise() %>% | ||
mutate(.pred := yj_inverse(.pred, !!sym(paste0(".lambda_", outcome_cols)))) | ||
} else if (identical(col_names, character(0))) { | ||
# Wish I could suggest `all_outcomes()` here, but currently it's the same as | ||
# not specifying any terms. I don't want to spend time with dealing with | ||
# this case until someone asks for it. | ||
cli::cli_abort( | ||
"Not specifying columns to layer Yeo-Johnson is not implemented. | ||
If you had a single outcome, you can use `.pred` as a column name. | ||
If you had multiple outcomes, you'll need to specify them like | ||
`.pred_ahead_1_<outcome_col>`, `.pred_ahead_7_<outcome_col>`, etc. | ||
", | ||
call = rlang::caller_env() | ||
) | ||
} else { | ||
# In this case, we assume that the user has specified the columns they want | ||
# transformed here. We then need to determine the lambda columns for each of | ||
# these columns. That is, we need to convert a vector of column names like | ||
# c(".pred_ahead_1_case_rate", ".pred_ahead_7_case_rate") to | ||
# c("lambda_ahead_1_case_rate", "lambda_ahead_7_case_rate"). | ||
original_outcome_cols <- stringr::str_match(col_names, ".pred_ahead_\\d+_(.*)")[, 2] | ||
outcomes_wout_ahead <- stringr::str_match(names(components$mold$outcomes), "ahead_\\d+_(.*)")[, 2] | ||
if (any(original_outcome_cols %nin% outcomes_wout_ahead)) { | ||
cli_abort( | ||
"All columns specified in `...` must be outcome columns. | ||
They must be of the form `.pred_ahead_1_<outcome_col>`, `.pred_ahead_7_<outcome_col>`, etc. | ||
", | ||
call = rlang::caller_env() | ||
) | ||
} | ||
|
||
for (i in seq_along(col_names)) { | ||
col <- col_names[i] | ||
lambda_col <- paste0(".lambda_", original_outcome_cols[i]) | ||
components$predictions <- components$predictions %>% | ||
rowwise() %>% | ||
mutate(!!sym(col) := yj_inverse(!!sym(col), !!sym(lambda_col))) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is all pretty dangerous.
all_outcomes()
is a recipe selector. So you can use that, as long as you look in theworkflow
.- Alternatively, even without the selector passed in
...
, you can look there to see which vars have the role"outcome"
. - But our typical case would want to apply the transformation to
starts_with(.pred)
to bring in the quantiles as well as the point prediction. I suspect that would hit your finalelse
case but produce an error. - I'm not sure that we have an example with multiple aheads in separate columns, but if we did, presumably they'd have the same YJ parameters for different aheads of the same variable? This would be detectable from the workflow, which would be much safer than trying to get good behaviour with the string processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me give this another shot, using workflow
and assuming the typical case is starts_with(.pred)
.
R/layer_yeo_johnson.R
Outdated
for (step in this_recipe$steps) { | ||
if (inherits(step, "step_epi_YeoJohnson")) { | ||
lambdas <- step$lambdas | ||
break | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will be buggy if the step is used multiple times. Also, we want to make sure that the step applies to the outcome rather than the predictors (or other roles).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm yea. I see a few ways through this:
- require that only a single step_yj is allowed; this would mean users have to bundle all their transformation requests in one step and all the transformations have to use the same hyperparameters
na_lambda_fill, limits, num_unique, na_rm
; that doesn't seem like a big limitation to me - do a kind-of parentheses matching approach: the layer will need to count the yj layers prior to it and count backwards in the list of steps to find the step that matches it
- doable, not super complex, and the general scheme might be handy for future layers, but the feature gain doesn't seem worth the effort right now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for logging:
- To access the processed recipe, use
hardhat::extract_recipe(workflow)
. - Inside is
last_term_info
. This tibble contains the names of all variables that ever existed, along with their roles.
Alternatively:
- Access the mold with
hardhat::extract_mold(workflow)
. - all
y
are in a tibble at$outcomes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second is technically more robust within a workflow, if the preprocessor was other than a recipe (like a formula, for example).
Co-authored-by: Daniel McDonald <[email protected]>
# If not an epi_df, make it one assuming the template of training data. | ||
# If it is an epi_df, check that the keys match. | ||
# Imitating the pattern in step_adjust_latency(). | ||
if (!inherits(new_data, "epi_df") || is.null(attr(new_data, "metadata")$as_of)) { | ||
new_data <- as_epi_df( | ||
new_data, | ||
as_of = object$forecast_date, | ||
other_keys = object$metadata$other_keys %||% character() | ||
) | ||
attr(new_data, "metadata") <- object$metadata | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I removed this guardrail and I ran into an instance where new_data
is a tibble in the tests. Seems out of scope to try and fix that, but wanted to mark that this is actually doing work right now.
* vectorize in lambda * inheritParams in docs * lambda -> yj_param in many places
* extend to quantile_dist, exclude multi-output * Drop by specification and infer from the epi_df * lint+test: test coverage, handle na lambda case, lint * fix: quantile_pred arithmetic * fix: rlang calls --------- Co-authored-by: Dmitry Shemetov <[email protected]>
Checklist
Please:
dajmcdon.
DESCRIPTION
andNEWS.md
.Always increment the patch version number (the third number), unless you are
making a release PR from dev to main, in which case increment the minor
version number (the second number).
(backwards-incompatible changes to the documented interface) are noted.
Collect the changes under the next release number (e.g. if you are on
0.7.2, then write your changes under the 0.8 heading).
epiprocess
version in theDESCRIPTION
file ifepiprocess
soonepipredict
andepiprocess
Change explanations for reviewer
Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch