Error in step_pls using 'options = list(scale = FALSE)' #1512

Open
@SimoneCrrt

The problem

According to the documentation for step_pls (https://recipes.tidymodels.org/reference/step_pls.html), it is possible to tell mixOmics::pls(), or other similar functions from the same package, not to scale each predictor by its standard deviation by specifying the argument 'options = list(scale = FALSE)'. This is useful when someone wants to scale the predictors by a quantity other than the sd, or to skip feature scaling completely (which is quite common for spectroscopic data).
However, when I do this, I cannot prep the recipe: an error is returned, whereas the recipe with the standard PLS, i.e. with feature scaling, works smoothly.

Reproducible example

# Example taken from: https://recipes.tidymodels.org/reference/step_pls.html.

library(tidyverse)
#> Warning: package 'tidyr' was built under R version 4.3.2
#> Warning: package 'readr' was built under R version 4.3.2
#> Warning: package 'purrr' was built under R version 4.3.3
#> Warning: package 'dplyr' was built under R version 4.3.2
#> Warning: package 'lubridate' was built under R version 4.3.3
library(tidymodels)
#> Warning: package 'tidymodels' was built under R version 4.3.3
#> Warning: package 'dials' was built under R version 4.3.3
#> Warning: package 'modeldata' was built under R version 4.3.3
#> Warning: package 'parsnip' was built under R version 4.3.3
#> Warning: package 'tune' was built under R version 4.3.3
#> Warning: package 'yardstick' was built under R version 4.3.3
# NOTE: the package mixOmics needs to be installed.


# Import the dataset and divide in training set and test set.
data(biomass, package = "modeldata")

biom_tr <-
    biomass |>
    filter(dataset == "Training") |>
    select(-dataset, -sample)
biom_te <-
    biomass |>
    filter(dataset == "Testing") |>
    select(-dataset, -sample, -HHV)


# Standard PLS recipe (with both mean centering and scaling for standard deviation)
# This one works correctly
recipe_pls <-
    recipe(HHV ~ ., data = biom_tr) |>
    step_pls(all_numeric_predictors(), outcome = HHV, num_comp = 3) |>
    prep()


# PLS recipe without scaling (only mean centering)
# This one does return the error
recipe_pls_CENTERING <-
    recipe(HHV ~ ., data = biom_tr) |>
    step_pls(all_numeric_predictors(), outcome = HHV, num_comp = 3,
             options = list(scale = FALSE)) |>
    prep()
#> Warning in max(cumDim[cumDim <= lstats]): no non-missing arguments to max;
#> returning -Inf
#> Error in `step_pls()`:
#> Caused by error in `array()`:
#> ! 'data' must be of a vector type, was 'NULL'

# Created on 2025-06-03 with [reprex v2.1.1](https://reprex.tidyverse.org/)

Where and why the error occurs

By running rlang::last_trace(), it seems that the error occurs inside recipes:::pls_project, where the 'sweep' function is applied to divide each predictor by the sd stored in the recipe (line 321 of the file 'pls.R'). Since I requested step_pls NOT to apply scaling, that sd object does not exist, so NULL is passed to 'sweep' instead and the error is returned.
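The failure mode described above can be reproduced with base R alone. The sketch below is not the code from 'pls.R': the matrix `x`, the vector `mu`, and the name `sd_missing` are made-up stand-ins for the predictors and the statistics the step would store. It only shows that calling `sweep()` with a NULL statistic fails in the same way as the reprex.

```r
# Toy stand-ins (assumptions, not objects from the recipes package).
set.seed(1)
x  <- matrix(rnorm(20), nrow = 5, ncol = 4)
mu <- colMeans(x)
sd_missing <- NULL  # what is available when scaling was switched off

# Centering works: `mu` is a numeric vector with one entry per column.
centered <- sweep(x, 2, mu, "-")

# Dividing by a NULL statistic fails inside sweep(), which passes it to array().
# The warning raised just before the error is suppressed so only the error is kept.
result <- tryCatch(
  suppressWarnings(sweep(centered, 2, sd_missing, "/")),
  error = function(e) conditionMessage(e)
)
print(result)
```

If this matches what happens inside the step, the printed message should be the same `array()` complaint about 'data' being NULL that appears in the reprex output.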

What I suggest

To avoid the error, I would suggest replacing lines 320 - 321 of the file 'pls.R' with the ones I report below, or something similar. This way, the function 'pls_project' first checks whether object$sd, produced by recipes:::butcher_pls, is NULL (the situation where no scaling was requested) and builds a suitable object, which I simply called "scaling_vector", to be passed to the function 'scale'. Then 'scale' is applied to mean-center the predictors and, if scaling was requested, divide them by the sd. This also avoids applying the 'sweep' function twice in a row.
With this change, the second recipe in the previous example can be prepped as well, with no problems.

if (is.null(object$sd)) {
  scaling_vector <- FALSE
} else {
  scaling_vector <- object$sd
}

z <- scale(x, center = object$mu, scale = scaling_vector)
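As a quick sanity check of this idea outside the package, the sketch below wraps the same logic in a hypothetical helper, `project_scale()`, and compares it with the `sweep()`-based computation on toy data. The helper name, its arguments, and the toy matrix are assumptions for illustration only; they are not part of recipes.

```r
# Toy data standing in for the predictors and the stored statistics.
set.seed(1)
x   <- matrix(rnorm(20), nrow = 5, ncol = 4)
mu  <- colMeans(x)
sds <- apply(x, 2, sd)

# Hypothetical helper mirroring the suggested replacement lines.
project_scale <- function(x, mu, sdev = NULL) {
  # scale = FALSE tells scale() to skip the division step entirely.
  scaling_vector <- if (is.null(sdev)) FALSE else sdev
  scale(x, center = mu, scale = scaling_vector)
}

z_center_only <- project_scale(x, mu)        # no sd stored: centering only
z_scaled      <- project_scale(x, mu, sds)   # sd stored: center, then scale

# Compare against the two-step sweep() approach, ignoring the attributes
# that scale() attaches to its result.
same_centering <- isTRUE(all.equal(
  z_center_only, sweep(x, 2, mu, "-"), check.attributes = FALSE
))
same_scaling <- isTRUE(all.equal(
  z_scaled, sweep(sweep(x, 2, mu, "-"), 2, sds, "/"), check.attributes = FALSE
))
print(c(same_centering, same_scaling))
```

One difference to keep in mind: `scale()` attaches "scaled:center" (and, when scaling, "scaled:scale") attributes to its result, which may or may not matter for the downstream matrix multiplication.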

A small final note

Sorry if I made some mistake: this is the first time I have reported an issue on GitHub for a function. I plan to use this recipe with some custom scaling steps and classification models, and I discovered this problem during some trials, so I wanted to let you know.

Metadata

Labels: bug (an unexpected problem or unintended behavior)
