docs: clarify how variables are filtered out in step_corr() #1518

erictleung · 2025-06-23T19:14:30Z

I was curious about how step_corr() filters out variables and found one answer from Max here: https://forum.posit.co/t/how-does-step-corr-pick-which-variable-to-keep/46496/3

In general, the filter tries to prioritize predictors for removal based on the global affect on the overall correlation structure. If you had two identical predictors, there is no real rule on which one to retain (it probably gets rid of the first one or something like that).

Thought I'd add some of the prose into the help(step_corr) page in this commit/PR so others know a little bit about how it works.

I was also curious about how it actually chooses, and went into the code itself https://github.com/tidymodels/recipes/blob/602dd48ecfe95535457ee422d856054a45833df4/R/corr.R#L248C1-L257C36:

    averageCorr <- colMeans(abs(x))
    averageCorr <- as.numeric(as.factor(averageCorr))
    x[lower.tri(x, diag = TRUE)] <- NA
    combsAboveCutoff <- which(abs(x) > cutoff)

    colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
    rowsToCheck <- combsAboveCutoff %% nrow(x)

    colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
    rowsToDiscard <- !colsToDiscard

It took a bit to figure out that it really is kind of random (it is based on the "meaningless" numeric value assigned to levels of factors). I didn't want to add all the details of this matrix filtering to the help documentation, so I left code comments in there in case anyone else wants to read and understand it. Feel free to suggest removing these comments if they feel unnecessary.
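To make the index arithmetic in the excerpt above easier to follow, here is a minimal sketch on a toy matrix. The matrix values and the snake_case variable names are invented for illustration and are not part of recipes; only the arithmetic mirrors the quoted lines.

```r
# Toy 4 x 4 correlation matrix (values invented for illustration).
x <- matrix(
  c(1.00, 0.95, 0.10, 0.20,
    0.95, 1.00, 0.15, 0.92,
    0.10, 0.15, 1.00, 0.30,
    0.20, 0.92, 0.30, 1.00),
  nrow = 4, byrow = TRUE,
  dimnames = list(letters[1:4], letters[1:4])
)
cutoff <- 0.9

# Keep only the strict upper triangle so each pair is considered once.
x[lower.tri(x, diag = TRUE)] <- NA

# which() returns column-major linear indices of the pairs above the cutoff.
combs_above_cutoff <- which(abs(x) > cutoff)

# Recover column and row positions from the linear indices, as in the excerpt.
# The modulo never yields 0 here because the last row has no entries in the
# strict upper triangle.
cols_to_check <- ceiling(combs_above_cutoff / nrow(x))
rows_to_check <- combs_above_cutoff %% nrow(x)

# Base R's arrayInd() gives the same positions and serves as a cross-check.
pos <- arrayInd(combs_above_cutoff, dim(x))

pairs <- data.frame(
  row = rownames(x)[rows_to_check],
  col = colnames(x)[cols_to_check]
)
print(pairs)
```

Each row of `pairs` is one highly correlated pair; the comparison of `averageCorr` values then decides which member of the pair is dropped.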

Thanks!

Add more context in help page on which variables are filtered out. Also add some comments in the code on how variables are filtered. Reference: https://forum.posit.co/t/how-does-step-corr-pick-which-variable-to-keep/46496/3

EmilHvitfeldt · 2025-06-23T21:36:57Z

Hello @erictleung 👋

I think this is useful but I don't like the wording, especially "meaningless".

If two predictors are identical, step_corr() removes the first one, which seems like a fine choice. One could argue that we should remove the second one, as it feels more like the "duplicate", but it shouldn't really matter since they are identical. That being said, it should be documented.

    library(recipes)

    mtcars$mpg_dup <- mtcars$mpg

    recipe(~., mtcars) |>
      step_corr(all_predictors()) |>
      prep() |>
      bake(new_data = NULL)
    #> # A tibble: 32 × 10
    #>     disp    hp  drat    wt  qsec    vs    am  gear  carb mpg_dup
    #>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
    #>  1  160    110  3.9   2.62  16.5     0     1     4     4    21  
    #>  2  160    110  3.9   2.88  17.0     0     1     4     4    21  
    #>  3  108     93  3.85  2.32  18.6     1     1     4     1    22.8
    #>  4  258    110  3.08  3.22  19.4     1     0     3     1    21.4
    #>  5  360    175  3.15  3.44  17.0     0     0     3     2    18.7
    #>  6  225    105  2.76  3.46  20.2     1     0     3     1    18.1
    #>  7  360    245  3.21  3.57  15.8     0     0     3     4    14.3
    #>  8  147.    62  3.69  3.19  20       1     0     4     2    24.4
    #>  9  141.    95  3.92  3.15  22.9     1     0     4     2    22.8
    #> 10  168.   123  3.92  3.44  18.3     1     0     4     4    19.2
    #> # ℹ 22 more rows

    mtcars <- mtcars |>
      dplyr::relocate(mpg_dup, .before = mpg)

    recipe(~., mtcars) |>
      step_corr(all_predictors()) |>
      prep() |>
      bake(new_data = NULL)
    #> # A tibble: 32 × 10
    #>      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #>  1  21    160    110  3.9   2.62  16.5     0     1     4     4
    #>  2  21    160    110  3.9   2.88  17.0     0     1     4     4
    #>  3  22.8  108     93  3.85  2.32  18.6     1     1     4     1
    #>  4  21.4  258    110  3.08  3.22  19.4     1     0     3     1
    #>  5  18.7  360    175  3.15  3.44  17.0     0     0     3     2
    #>  6  18.1  225    105  2.76  3.46  20.2     1     0     3     1
    #>  7  14.3  360    245  3.21  3.57  15.8     0     0     3     4
    #>  8  24.4  147.    62  3.69  3.19  20       1     0     4     2
    #>  9  22.8  141.    95  3.92  3.15  22.9     1     0     4     2
    #> 10  19.2  168.   123  3.92  3.44  18.3     1     0     4     4
    #> # ℹ 22 more rows
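The "first one is removed" behaviour for identical predictors can be traced with a minimal sketch that wraps the lines quoted earlier from R/corr.R. `sketch_filter()` is a hypothetical helper, not a recipes function, and its last line (collecting the indices to drop) is an assumption about what happens after the quoted excerpt. The toy data are invented.

```r
# Hypothetical helper wrapping the lines quoted from R/corr.R; the final
# line (collecting the indices to drop) is an assumption, not a quote.
sketch_filter <- function(x, cutoff = 0.9) {
  average_corr <- colMeans(abs(x))
  average_corr <- as.numeric(as.factor(average_corr))
  x[lower.tri(x, diag = TRUE)] <- NA
  combs <- which(abs(x) > cutoff)
  cols <- ceiling(combs / nrow(x))
  rows <- combs %% nrow(x)
  # Strict ">" means a tie never discards the column; the row goes instead.
  drop_col <- average_corr[cols] > average_corr[rows]
  unique(c(cols[drop_col], rows[!drop_col]))
}

# Invented toy data with one exact duplicate column.
dat <- data.frame(a = c(1, 4, 2, 8, 5), z = c(3, 1, 4, 1, 5))
dat$a_dup <- dat$a

# Duplicate placed last: the original column `a` comes first in the matrix.
x1 <- cor(dat[, c("a", "z", "a_dup")])
dropped_1 <- colnames(x1)[sketch_filter(x1)]
print(dropped_1)

# Duplicate placed first: now `a_dup` comes first in the matrix.
x2 <- cor(dat[, c("a_dup", "a", "z")])
dropped_2 <- colnames(x2)[sketch_filter(x2)]
print(dropped_2)
```

In both orderings the two identical columns receive the same rank, the strict comparison is FALSE, and the member of the pair that appears earlier (the row index in the upper triangle) is the one dropped.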

Secondly, regarding the following comment:

    # Discard columns based on whether the underlying numeric (integer)
    # representation of the factor (see above), which is meaningless, is larger
    # than its corresponding row value.

I would not call those numbers meaningless. If we run the following code

    library(recipes)

    data(biomass, package = "modeldata")

    set.seed(3535)
    biomass$duplicate <- biomass$carbon + rnorm(nrow(biomass))
    rec <- recipe(
      HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur + duplicate,
      data = biomass
    )

    rec |>
      step_corr(all_numeric_predictors()) |>
      prep()

we see that averageCorr holds two different kinds of values in turn. First, it holds the mean of the absolute values of the correlations between each predictor and the other predictors: high values mean the predictor is correlated with a lot of things, low values mean it is not that correlated with other variables.

Then we call as.numeric(as.factor(averageCorr)), which encodes the order of those values. The first value, 5, states that carbon has the 5th smallest value, hydrogen has the 3rd smallest value, and so on. I would not call this meaningless.

    # carbon  hydrogen    oxygen  nitrogen    sulfur duplicate 
    #  0.531     0.406     0.546     0.303     0.327     0.530 

    # 5 3 6 1 2 4
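The rank encoding can be reproduced in isolation with a small sketch. The numbers are the rounded averages shown above; the variable names and the tie example are added for illustration only.

```r
# Average absolute correlations copied from the output above (rounded).
average_corr <- c(
  carbon = 0.531, hydrogen = 0.406, oxygen = 0.546,
  nitrogen = 0.303, sulfur = 0.327, duplicate = 0.530
)

# as.factor() sorts the distinct values into levels; as.numeric() then
# returns each value's level index, i.e. its position in that ordering.
rank_code <- as.numeric(as.factor(average_corr))
names(rank_code) <- names(average_corr)
print(rank_code)

# The same ordering written explicitly with base R.
explicit <- match(average_corr, sort(unique(average_corr)))
stopifnot(all(rank_code == explicit))

# Tied values share a level, so neither is strictly greater than the other.
tied <- as.numeric(as.factor(c(0.5, 0.5, 0.2)))
print(tied)
```

Because ties collapse to the same level, the strict `>` comparison in the filter cannot separate two identical predictors, which is consistent with position deciding which one is removed.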

docs: clarify how variables are filtered out in step_corr()

a8558a8
