first pass at postprocessing proof-of-concept #225


Merged: 31 commits merged into main on May 31, 2024

Conversation

simonpcouch

@simonpcouch simonpcouch commented Apr 23, 2024

Mostly just pattern-matches recipe and model_fit implementations and hooks into the existing post-processing infrastructure. See tests for up-to-date examples.

Previous PR description, outdated

A fast and loose proof-of-concept for integrating a postprocessing container.

library(workflows)
library(parsnip)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(container)
library(modeldata)

table(credit_data$Status)
#> 
#>  bad good 
#> 1254 3200

wflow_class <- fit(workflow(Status ~ ., logistic_reg()), credit_data)

predict(wflow_class, credit_data) %>% table()
#> .pred_class
#>  bad good 
#>  681 3358
 
post <- container(mode = "classification", type = "binary") %>% 
  adjust_probability_threshold(.1)

wflow_class_container <- workflow(Status ~ ., logistic_reg(), post)
wflow_class_container <- fit(wflow_class_container, credit_data)

predict(wflow_class_container, credit_data) %>% table()
#> .pred_class
#>  bad good 
#> 2659 1380

Created on 2024-04-23 with reprex v2.1.0

* move container to Suggests
* document extraction generic
* document `workflow(postprocessor)` argument

R/fit.R Outdated
#' @rdname workflows-internals
#' @export
.fit_post <- function(workflow, data) {
  action_post <- workflow[["post"]][["actions"]][["container"]]
use extract_postprocessor() here?
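The suggested change would look roughly like this (a sketch only; it assumes `extract_postprocessor()` returns what `.fit_post()` needs, which may differ from the raw action stored in the workflow internals):

```r
#' @rdname workflows-internals
#' @export
.fit_post <- function(workflow, data) {
  # use the extraction generic rather than reaching into workflow internals,
  # so the access path stays correct if the internal slot layout changes
  action_post <- extract_postprocessor(workflow)
  # ... fit the postprocessor against `data` as before ...
}
```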

@topepo topepo left a comment (Member)

Looks great.

I don't see any issues but I wonder if we should start an issue to make sure that we have the right plumbing and arguments for when we get around to implementing postprocessors for survival models. @hfrick should have a good eye for that here and in general.

@simonpcouch

ec0effa surfaces an important point; removing/updating a postprocessor from an otherwise trained workflow need not remove the preprocessor and model fits, as they won't be affected by the removal of the postprocessor. This introduces the possibility of a "partially trained" workflow, where a workflow with trained preprocessor and model but untrained postprocessor should be able to fit without issue.
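A hedged illustration of that partially trained state, reusing the objects from the reprex above (`remove_postprocessor()` is an assumed helper name for this sketch; the actual API in this PR may differ):

```r
# fully trained: preprocessor, model, and postprocessor are all fitted
wflow_fit <- fit(workflow(Status ~ ., logistic_reg(), post), credit_data)

# dropping the postprocessor should leave the preprocessor and model fits
# intact, since neither is affected by the postprocessor's removal
wflow_partial <- remove_postprocessor(wflow_fit)  # hypothetical helper

# if a new postprocessor is added, fitting again should only need to train
# the postprocessor, not refit the preprocessor or model
```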

to align with the analogous argument in `extract_recipe()`; extract the fitted object if it's available, otherwise extract the specification.

also updates the slot name from `post` to `fit` to indicate that the slot contains the trained object
@hfrick hfrick left a comment (Member)

Okay so here are my thoughts on what plumbing post-processing for survival analysis would need.

Basic assumptions

  • We might tailor both predictions of survival time and predictions of survival probability.
  • If we tailor survival probability, we will do so at a specific single time point (hello eval_time our old friend).

I have not yet validated either assumption. Max, do you have a sense of whether they are valid?

For the implementation, this implies

  • How/where do we specify the predictions?
  • How/where do we specify eval_time?

Specifying the predictions

  • tailor() could take survival time .pred_time via the estimate argument -- the documentation already suggests that.
  • tailor() could take survival probability in the tibble form of .pred (containing .eval_time and .pred_survival) via the probabilities argument.

Specifying eval_time

  • [tailor] tailor() would drop time (its documentation, "the predicted event time", contradicts the documentation for estimate, which lists .pred_time) and instead take eval_time. The alternative would be to derive it from .pred within a given postprocessing operation and default to the first value if we need a single one. Either way, time would disappear.
  • [workflows] We can currently pass eval_time to the predict() and augment() methods for workflows, but since we don't need it earlier, there is no eval_time argument to the fit() method for workflows or to the specification via workflow().
    • If the post-processing operation doesn't need estimation, only predict.workflow() having an eval_time argument should be fine.
    • If it does need estimation, it could come from the tailor() specification.
  • [tune] The tuning functions have an eval_time argument which is required for dynamic and integrated survival metrics. If we need a single eval time to optimize for, we use the first one.
    • If a user includes a tailor in the workflow to be tuned, it would be nice to not make them specify eval_time twice: when making the tailor() and when calling the tuning function.
    • We could pass it through workflows (and make it gain an argument) or make a function to update the tailor. I lean towards updating the tailor.
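The "update the tailor" option could look something like this (purely hypothetical: neither an eval_time field on tailors nor an `update_eval_time()` helper exists at this point; this only sketches the idea of injecting the value at tune time):

```r
library(tailor)

# the user specifies the postprocessor without committing to eval times
post <- tailor() %>%
  adjust_probability_calibration("logistic")

# the tuning function already receives eval_time, so it could inject the
# values into the tailor itself, sparing the user from specifying it twice
post <- update_eval_time(post, eval_time = c(5, 10))  # hypothetical helper
```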

Comment on lines +121 to +125
check_conflicts.action_tailor <- function(action, x, ..., call = caller_env()) {
  post <- x$post

  invisible(action)
}

Do we need this function? We are not doing anything with post so we could just fall back to the default method. We also don't envision different types of post-processors, analogous to how we have different preprocessors (recipe, formula, ...).

Comment on lines +92 to +96
estimate = tidyselect::any_of(c(".pred", ".pred_class")),
probabilities = c(
tidyselect::contains(".pred_"),
-tidyselect::matches("^\\.pred$|^\\.pred_class$")
)

This would need changing if we pass predictions of survival time and survival probability to these arguments. Main catch: the probabilities are in .pred. (The survival times are in .pred_time.)

@hfrick

hfrick commented May 8, 2024

After chatting with Max:

  • Max agrees with the basic assumptions except for the single eval time point: he thinks we might want to calibrate/post-process at multiple time points. I agree we might want that in general; I'm just not sure whether it would be specified/done in one calibration operation (if it requires multiple calibration models to be fitted). Either way, it doesn't change where we need eval time values, just how many -- and how many is a decision we can make later on.

  • In light of Simon's and my thoughts on specifying the information for the data split needed to fit a workflow with a post-processor that needs fitting on a separate dataset, I've considered adding eval_time there (in add_tailor()) but I think specifying eval_time in tailor() directly is still the right move: it's needed for fitting the post-processor so can't only be in a workflow.

  • Max and I agree we should have an idea of what infrastructure we'd need for post-processing for survival, but not include any placeholder arguments at this point. Hence we can remove the time argument in tailors: tailor#16.

@simonpcouch

With an eye for reducing Remotes hoopla, I'm going to go ahead and merge and open issues for smaller todos.

@simonpcouch simonpcouch marked this pull request as ready for review May 31, 2024 14:20
@simonpcouch simonpcouch merged commit 05be14a into main May 31, 2024
9 checks passed
@simonpcouch simonpcouch deleted the postprocessing branch May 31, 2024 14:23

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jun 15, 2024