What else do we need for postprocessing? #68

Open
topepo opened this issue Jan 22, 2025 · 2 comments

@topepo

topepo commented Jan 22, 2025

I have a specific argument to make regarding two potential adjustments. However, it would also be good to get a broader set of opinions from others. Maybe @ryantibs and/or @dajmcdon have thoughts.

My thought: there are three things that we might consider being optional arguments to the tailor (or an individual adjustment):

  • a reference data set (the training set or some other data, such as a calibration set),
  • the mold from the workflow, and
  • the workflow itself.

Why? Two similar calibration tools prompted these ideas. To demonstrate, let's look at what Cubist does to postprocess, which is discussed and illustrated in this blog post. The other tool is discussed in #67 and has requirements similar to those of the Cubist adjustment.

After the supervised model predicts, Cubist finds the new sample's nearest neighbors in the training set. It then adjusts the prediction based on the distances to those neighbors and the training set predictions for them.

We don't have to use the training set; it could conceivably be a calibration set. To generalize, I'll call it the reference data set.
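Roughly, the adjustment blends the model's prediction with distance-weighted information from the reference set. A minimal sketch in base R (a simplified stand-in, not Cubist's exact formula; all of the inputs are assumed to exist already):

```r
# A simplified nearest-neighbor correction, assuming `ref_x` (processed reference
# predictors as a matrix), `ref_y` (reference outcomes), `ref_pred` (model
# predictions for the reference rows), and `new_x`/`new_pred` for the sample
# being adjusted.
adjust_one <- function(new_x, new_pred, ref_x, ref_y, ref_pred, neighbors = 5) {
  dists <- sqrt(colSums((t(ref_x) - new_x)^2))   # Euclidean distance to each reference row
  nn    <- order(dists)[seq_len(neighbors)]      # indices of the closest rows
  wts   <- 1 / (dists[nn] + 0.5)                 # closer neighbors get more weight
  # Each neighbor "votes" with its observed outcome, shifted by the difference
  # between the new prediction and that neighbor's prediction.
  votes <- ref_y[nn] + (new_pred - ref_pred[nn])
  sum(wts * votes) / sum(wts)
}
```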

To do this with a tailor, we would already have the current prediction from the model (which may have already been adjusted by other postprocessors) and, if we prepare properly, the reference set predictions as well.

To find the neighbors, we will need to process both the reference set and the new predictors in the same way as the data was given to the supervised model. For this, we'd need the mold from the workflow.
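Concretely, this could use the mold's blueprint with hardhat::forge(). A small sketch, assuming `mold` was pulled from the fitted workflow and `reference_data`/`new_data` hold raw predictors:

```r
library(hardhat)

# Process the reference set and the new predictors exactly as the supervised
# model saw them, using the blueprint stored in the workflow's mold.
processed_ref <- forge(reference_data, blueprint = mold$blueprint)$predictors
processed_new <- forge(new_data,       blueprint = mold$blueprint)$predictors
```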

When making the tailor, we could specify the number of neighbors and pass the reference data set and the mold. We could require the predictions for the reference set to be in the reference set data frame, avoiding the need for the workflow.

The presence of the workflow is a little dangerous; it would likely include the tailor itself. Apart from the infinite recursion of adding a workflow to a tailor adjustment that lives inside that same workflow, we would want to avoid people accidentally misapplying the workflow. Let's exclude the workflow as an input to a tailor adjustment but keep the idea of adding a data set of predictors and/or the workflow's mold.

Where would we specify the mold or data? In the main tailor() call or to the adjustments? The mold is independent of the data set and would not vary from adjustment to adjustment, so an option to tailor() would be my suggestion.

The data set probably belongs in the adjustments. Unfortunately, multiple data sets could be needed, depending on what is computed after the model prediction (#4 is relevant here).
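To make the placement concrete, a hypothetical sketch (adjust_nearest_neighbors() and the mold argument to tailor() do not exist; they only illustrate where the pieces might live):

```r
library(tailor)
library(workflows)

# Hypothetical placement: the mold as an option to tailor(), the reference data
# (with its predictions already attached) as an option to the adjustment.
post <- tailor(mold = extract_mold(fitted_wflow)) |>  # assumes a previously fitted workflow `fitted_wflow`
  adjust_nearest_neighbors(                           # hypothetical adjustment, for illustration only
    reference = calibration_data,                     # predictors plus a .pred column
    neighbors = 5
  )
```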

@dajmcdon

Hey @topepo! I'm not entirely sure I understand how this is meant to work, but I've poked around the package some, so I'll do my best. In part, I'll reference the API of the postprocessor that we've been building.

Big picture: it seems to me that it's not the tailor that would potentially need these objects but the fit() and/or predict() methods.

Proceeding along the lines of the Cubist example and checking my understanding of the tailor API (an aside¹: should the tailor() actually make the predictions as the first step?):

  1. On creation of the adjustment, you would add the reference data (or perhaps not) and specify the number of neighbours.
  2. Fitting is by dispatch to fit.tailor() --> fit.[class]().
  3. Then, new predictions are by dispatch to predict.tailor() --> predict.[class]().

It seems like:

  • The mold is required at step 2 but not really necessary at step 1. You'd need it in the high-level fit.tailor() and then as a possibly unused argument in all of the adjustments, so that it could be used if needed or ignored. Perhaps it's safe to have ... here? (See the sketch after this list.)
  • Supposing the user did not specify any reference data, you could allow step 3 to access the workflows::extract_fit_parsnip() result. Then step 2 (which didn't have any data and doesn't need to refit the model) would be a no-op, and at step 3 you'd extract the fitted values, calculate the neighbours, and adjust the predictions. In that case, the predict loop would also need ... (or would just expect the parsnip fit as a mandatory argument).
  • It's not clear to me that you need the rest of the workflow, but maybe it's worth considering it, too, as a mandatory argument to fit.tailor()/predict.tailor(), possibly passed along to the class methods.
  • FWIW, in my implementation, I do give the postprocessor access to (a) the hardhat mold; (b) the result of hardhat::forge() on the new data with the blueprint from a; (c) the new data itself. Then when predict is called on the workflow, the predict method has access to the whole workflow. So there's no potential issue of recursion.
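For concreteness, a rough sketch of how such dispatch might pass the mold and the parsnip fit through ...; the adjustment class name and the extra arguments are hypothetical, not part of the current tailor API:

```r
# Hypothetical S3 methods for a neighbor-based adjustment; `mold` and
# `parsnip_fit` are assumed to be supplied by the workflow machinery via `...`
# rather than by the user.
fit.nearest_neighbor_adj <- function(object, data, ..., mold = NULL) {
  # With no user-supplied reference data, fall back to the predictors stored in the mold.
  if (is.null(object$reference) && !is.null(mold)) {
    object$reference <- mold$predictors
  }
  object
}

predict.nearest_neighbor_adj <- function(object, new_data, ..., parsnip_fit = NULL) {
  # Adjust the existing .pred column using the reference set and, if needed,
  # fitted values pulled from `parsnip_fit` (details elided in this sketch).
  new_data
}
```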

Footnotes

  1. This is the way that I've coded up our "postprocessor" (called frosting to really stretch the prep/bake recipe conceit). The general intention is not to do the sorts of things you're doing here (there is never any training for example), but to prepare the predicted values for other tasks. The most common is to invert preprocessing steps. So for example, if I want my predictions on the scale of the data, but I used step_normalize() on the outcome, then I need to undo that transformation. So, typically, the first thing I would do is call predict() on my model object and then access the recipe or other parts of the workflow to undo steps (or other similar ideas transformations). However, there are cases in which you would want to process your test data differently from your training data before predicting. The most salient is imputing missing data. It's easy to imagine a case where training data has no missingness, but testing data does. Of course you could do this outside the workflow, but in a production environment (especially with time series) it can be pretty important.
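To make the "undo the transformation" idea concrete, a minimal sketch (how the center/scale values are retrieved from the prepped recipe is left as an assumption):

```r
# Invert step_normalize() applied to the outcome, assuming the training-set
# mean (`y_mean`) and standard deviation (`y_sd`) of the outcome were stored
# when the recipe was prepped.
unnormalize <- function(pred, center, scale) {
  pred * scale + center
}

# e.g., put predictions back on the original scale of the outcome:
# preds$.pred <- unnormalize(preds$.pred, center = y_mean, scale = y_sd)
```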

@ryantibs

ryantibs commented Feb 5, 2025

Sorry for such a high-level comment in what is a thread about details.

Broadly speaking, for traditional "batch" prediction problems (i.e., NOT sequential prediction, as in time series), I think it makes sense to allow the post-processor to access what it needs in order to make new predictions on a calibration set. It sounds like what you're describing accommodates this. (P.S. Thanks for the pointer to the Quinlan 1993a paper, which I learned about from your blog post --- vaguely, it seems like this may be seen as a batch analog of the online calibration/debiasing methods that others and I have been working on recently.)

For "online" prediction problems (i.e., sequential prediction, as in time series), there's a different set of for post-processing methods which recalibrate based on the past data and past predictions themselves. Minimally, for some methods, you actually just need single most recent data point, and the most recent corresponding prediction (along with say, the most recent parameter value, in some recalibration model that you're using). So it's really pretty minimal, and you don't really need all of the framework you're describing. That said, if you're also looking to accommodate online prediction here, then it would be nice to allow this to fit in somehow: I just need to pass in y_t (data), \hat{y}_t (prediction), and \theta_t (parameter) to the post-processor at time t.
