docs: clarify how variables are filtered out in step_corr() #1518

Status: Open (1 commit, base: main)

erictleung

I was curious about how step_corr() filters out variables and found one answer from Max here: https://forum.posit.co/t/how-does-step-corr-pick-which-variable-to-keep/46496/3

In general, the filter tries to prioritize predictors for removal based on the global effect on the overall correlation structure. If you had two identical predictors, there is no real rule on which one to retain (it probably gets rid of the first one or something like that).

I thought I'd add some of that prose to the help(step_corr) page in this PR so others know a little about how it works.

I was also curious about how it actually chooses, so I went into the code itself (https://github.com/tidymodels/recipes/blob/602dd48ecfe95535457ee422d856054a45833df4/R/corr.R#L248C1-L257C36):

    averageCorr <- colMeans(abs(x))
    averageCorr <- as.numeric(as.factor(averageCorr))
    x[lower.tri(x, diag = TRUE)] <- NA
    combsAboveCutoff <- which(abs(x) > cutoff)

    colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
    rowsToCheck <- combsAboveCutoff %% nrow(x)

    colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
    rowsToDiscard <- !colsToDiscard

It took me a while to figure out that the choice really is kind of random (it is based on the "meaningless" numeric value assigned to the levels of a factor). I didn't want to put all the details of this matrix filtering in the help documentation, so I left code comments in the function in case anyone else wants to read and understand it. Feel free to suggest removing these comments if they feel unnecessary.
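For anyone following along, here is a minimal sketch of just the index arithmetic in that excerpt. The 3 x 3 matrix, the names a, b, c, and the cutoff are made up for illustration and are not from recipes. It shows that `ceiling(i / nrow(x))` and `i %% nrow(x)` recover the column and row of each upper-triangle cell that exceeds the cutoff, which can be cross-checked with base R's `arrayInd()`.

```r
# Made-up correlation matrix (an assumption for illustration only).
nms <- c("a", "b", "c")
x <- matrix(
  c(1.00, 0.95, 0.20,
    0.95, 1.00, 0.30,
    0.20, 0.30, 1.00),
  nrow = 3, dimnames = list(nms, nms)
)
cutoff <- 0.9

# Keep only the strict upper triangle so each pair is seen once.
x[lower.tri(x, diag = TRUE)] <- NA

# which() returns column-major linear indices; NA cells are skipped.
combsAboveCutoff <- which(abs(x) > cutoff)

# Convert each linear index back to a column and a row.
colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
rowsToCheck <- combsAboveCutoff %% nrow(x)

# Cross-check against base R's own conversion (row in column 1, col in column 2).
pairs <- arrayInd(combsAboveCutoff, dim(x))
stopifnot(all(pairs[, 1] == rowsToCheck), all(pairs[, 2] == colsToCheck))
```

The modulo never returns 0 here because a cell in the strict upper triangle always has a row index smaller than `nrow(x)`.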

Thanks!

Add more context in help page on which variables are filtered out.

Also add some comments in the code on how variables are filtered.

Reference: https://forum.posit.co/t/how-does-step-corr-pick-which-variable-to-keep/46496/3
@EmilHvitfeldt
Member

Hello @erictleung 👋

I think this is useful but I don't like the wording, especially "meaningless".

If two predictors are identical, then step_corr() will remove the first one, which seems like a fine choice. One could argue that we should remove the second one since it feels more like the "duplicate", but it shouldn't really matter because they are identical. That being said, it should be documented.

library(recipes)

mtcars$mpg_dup <- mtcars$mpg

recipe(~., mtcars) |>
  step_corr(all_predictors()) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 10
#>     disp    hp  drat    wt  qsec    vs    am  gear  carb mpg_dup
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1  160    110  3.9   2.62  16.5     0     1     4     4    21  
#>  2  160    110  3.9   2.88  17.0     0     1     4     4    21  
#>  3  108     93  3.85  2.32  18.6     1     1     4     1    22.8
#>  4  258    110  3.08  3.22  19.4     1     0     3     1    21.4
#>  5  360    175  3.15  3.44  17.0     0     0     3     2    18.7
#>  6  225    105  2.76  3.46  20.2     1     0     3     1    18.1
#>  7  360    245  3.21  3.57  15.8     0     0     3     4    14.3
#>  8  147.    62  3.69  3.19  20       1     0     4     2    24.4
#>  9  141.    95  3.92  3.15  22.9     1     0     4     2    22.8
#> 10  168.   123  3.92  3.44  18.3     1     0     4     4    19.2
#> # ℹ 22 more rows

mtcars <- mtcars |>
  dplyr::relocate(mpg_dup, .before = mpg)

recipe(~., mtcars) |>
  step_corr(all_predictors()) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 10
#>      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21    160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21    160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows
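As far as I can tell from the excerpt in the PR description, this follows from how ties are handled. Below is a rough sketch on a hand-written correlation matrix; the names `first`, `second`, `other`, the cutoff, and the final `flagged` line are illustrative assumptions and not part of recipes. Two identical predictors get the same rank code, so the column is not strictly greater than the row and the row (the earlier predictor) is the one flagged.

```r
# Hypothetical correlation matrix: `first` and `second` are identical predictors.
nms <- c("first", "second", "other")
x <- matrix(
  c(1.0, 1.0, 0.6,
    1.0, 1.0, 0.6,
    0.6, 0.6, 1.0),
  nrow = 3, dimnames = list(nms, nms)
)
cutoff <- 0.9

# Lines below mirror the excerpt quoted in the PR description.
averageCorr <- colMeans(abs(x))
averageCorr <- as.numeric(as.factor(averageCorr)) # tied means share one code
x[lower.tri(x, diag = TRUE)] <- NA
combsAboveCutoff <- which(abs(x) > cutoff)

colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
rowsToCheck <- combsAboveCutoff %% nrow(x)

# A tie makes the strict comparison FALSE, so the row side is discarded.
colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
rowsToDiscard <- !colsToDiscard

# Illustrative only: name the member of the pair that was flagged.
flagged <- nms[c(colsToCheck[colsToDiscard], rowsToCheck[rowsToDiscard])]
flagged
```

In this toy case `flagged` is `"first"`, which matches the mtcars output above where whichever of mpg and mpg_dup comes first is dropped.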

Secondly, regarding the following comment:

    # Discard columns based on whether the underlying numeric (integer)
    # representation of the factor (see above), which is meaningless, is larger
    # than its corresponding row value.

I would not call those numbers meaningless. If we run the following code

library(recipes)

data(biomass, package = "modeldata")

set.seed(3535)
biomass$duplicate <- biomass$carbon + rnorm(nrow(biomass))
rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur + duplicate,
  data = biomass
)

rec |>
  step_corr(all_numeric_predictors()) |>
  prep()

we see that averageCorr holds two different kinds of values. At first it holds, for each predictor, the mean of the absolute values of the correlations between it and the other predictors. High values mean the predictor is correlated with a lot of things; low values mean it is not that correlated with the other variables.

Then we call as.numeric(as.factor(averageCorr)), which replaces each mean with its position in the sorted order. The first value, 5, says that carbon has the 5th smallest mean, hydrogen has the 3rd smallest, and so on. I would not call this meaningless.

# carbon  hydrogen    oxygen  nitrogen    sulfur duplicate 
#  0.531     0.406     0.546     0.303     0.327     0.530 

# 5 3 6 1 2 4
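To make the ordering point concrete, here is a small sketch using the rounded means printed above (typed in by hand, so they are approximations of what the function sees). It compares the factor codes with a dense rank computed in base R via `match()` against the sorted unique values.

```r
# Rounded mean absolute correlations copied from the printed output above.
averageCorr <- c(
  carbon = 0.531, hydrogen = 0.406, oxygen = 0.546,
  nitrogen = 0.303, sulfur = 0.327, duplicate = 0.530
)

# What the function does: factor levels are the sorted unique values,
# so the integer codes are positions in that sorted order.
codes <- as.numeric(as.factor(averageCorr))

# The same thing written as an explicit dense rank.
dense_rank <- match(averageCorr, sort(unique(averageCorr)))

stopifnot(all(codes == dense_rank))
codes
#> [1] 5 3 6 1 2 4
```

Because tied means map to the same factor level, they also get the same code, which is why identical predictors compare as equal in the later `>` check.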
