Problem crating a lightgbm model #11

Open
Keyeoh opened this issue May 17, 2024 · 4 comments

Keyeoh commented May 17, 2024

Hi,

First of all, thank you for your amazing work. I have been able to log a lightgbm model in MLflow using a "crated" function. The problem is that when I load the crated model in a new, clean session, it fails because some of its dependencies are missing.

I started declaring by hand each dependency that raised an error, to see whether I could reach a decent compromise, but I run into trouble once I reach some FFI code.

Say I have a fitted workflow model...

> final_fit
══ Workflow [trained] ═══════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ──────────────────
0 Recipe Steps

── Model ────────────────
LightGBM Model (1174 trees)
Objective: binary
Fitted to dataset with 25 columns
> 

...and I try to crate a function for prediction in the following way:

c_model <- crate(
    function(new_obs) get_predictions(model, new_obs),
    model = final_fit,
    get_predictions = rlang::set_env(rchurn2:::get_predictions),
    predict = rlang::set_env(workflows:::predict.workflow),
    is_trained_workflow = rlang::set_env(workflows:::is_trained_workflow),
    validate_is_workflow = rlang::set_env(workflows:::validate_is_workflow),
    check_dots_empty = rlang::set_env(rlang::check_dots_empty),
    ellipsis_dots = rlang::set_env(rlang:::ellipsis_dots),
    ffi_ellipsis_dots = rlang:::ffi_ellipsis_dots,
    caller_env = rlang::set_env(rlang::caller_env)
)

Then, if I try to call it in a clean session, it gives me an error I cannot understand:

callr::r(
    function (d, cmod) {
        cmod(d)
    },
    args = list(
        d = splits[['tst']],
        cmod = c_model
    )
)

Error: 
! in callr subprocess.
Caused by error in `.Call(ffi_ellipsis_dots, env)`:
! NULL value passed as symbol address
Type .Last.error to see the more details.
> 

I just wanted to know whether I am trying to do too much (that is, whether it is possible to crate functions that depend on FFI code), or whether it would be better simply to ensure that all needed dependencies are available on the machine that will run the inference part. The latter would be the easy and comfortable path, but I wanted to ask, because I think crate is a great tool, and this is not such a corner case.

Thanks a lot in advance.

Gus.

@simonpcouch

Noting that there are some possibly related conversations at rstudio/bundle#55 and linked issues, though this error seems more crate/rlang-related. Here's a reprex:

library(tidymodels)
library(bonsai)
library(carrier)

fit <- 
  boost_tree("classification", engine = "lightgbm") %>%
  fit(Class ~ A + B, two_class_dat)

fit
#> parsnip model object
#> 
#> LightGBM Model (100 trees)
#> Objective: binary
#> Fitted to dataset with 2 columns

c_model <- crate(
  predict,
  model = fit,
  predict = rlang::set_env(workflows:::predict.workflow),
  is_trained_workflow = rlang::set_env(workflows:::is_trained_workflow),
  validate_is_workflow = rlang::set_env(workflows:::validate_is_workflow),
  check_dots_empty = rlang::set_env(rlang::check_dots_empty),
  ellipsis_dots = rlang::set_env(rlang:::ellipsis_dots),
  ffi_ellipsis_dots = rlang:::ffi_ellipsis_dots,
  caller_env = rlang::set_env(rlang::caller_env)
)

callr::r(
  function(d, cmod) {
    cmod(d)
  },
  args = list(
    d = two_class_dat,
    cmod = c_model
  )
)
#> Error: ! in callr subprocess.
#> Caused by error in `.Call(ffi_ellipsis_dots, env)`:
#> ! NULL value passed as symbol address

Created on 2024-05-17 with reprex v2.1.0
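
One plausible reading of this error, which the thread does not confirm: a registered native routine carries an external pointer to compiled code, and that pointer does not survive serialization, which is how callr passes arguments to the subprocess. The sketch below uses only base R. `stats:::C_rnorm` is a stand-in chosen for illustration, on the assumption that `rlang:::ffi_ellipsis_dots` is the same kind of object.

```r
# Stand-in for rlang:::ffi_ellipsis_dots (assumption: both are registered
# native routines that hold an external pointer to compiled code).
sym <- stats:::C_rnorm

# In the original session the pointer is live, so the call succeeds.
live_result <- .Call(sym, 1L, 0, 1)

# A round trip through serialize() mimics shipping the object to another
# process, as callr does with its arguments.
restored <- unserialize(serialize(sym, NULL))

# The restored object keeps its name and class, but calling it now fails;
# capture the error message instead of stopping.
restored_msg <- tryCatch(
  .Call(restored, 1L, 0, 1),
  error = function(e) conditionMessage(e)
)
print(restored_msg)
```

If that reading is right, crating the symbol itself cannot work, whereas a namespace-qualified call inside the crated function lets the target session look the routine up again from its own installed package.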

@simonpcouch

Something like this should do the trick!

library(tidymodels)
library(bonsai)
library(carrier)

fit <- 
  workflow(
    Class ~ A + B,
    boost_tree("classification", engine = "lightgbm")
  ) %>%
  fit(two_class_dat)


fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: boost_tree()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Class ~ A + B
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> LightGBM Model (100 trees)
#> Objective: binary
#> Fitted to dataset with 2 columns

c_model <- crate(
  function(new_data, ...) workflows:::predict.workflow(model, new_data, ...),
  model = fit
)

callr::r(
  function(d, cmod) {
    cmod(d)
  },
  args = list(
    d = two_class_dat,
    cmod = c_model
  )
)
#> # A tibble: 791 × 1
#>    .pred_class
#>    <fct>      
#>  1 Class1     
#>  2 Class1     
#>  3 Class2     
#>  4 Class2     
#>  5 Class1     
#>  6 Class2     
#>  7 Class2     
#>  8 Class2     
#>  9 Class1     
#> 10 Class2     
#> # ℹ 781 more rows

Created on 2024-05-17 with reprex v2.1.0

Keyeoh commented May 19, 2024

Thanks for your answer, @simonpcouch! It does work out of the box.

The problem is, and maybe I am missing something, that I still cannot achieve what I had in mind at first (that is, having a function that contains all of its dependencies, so that a vanilla R installation on the target system could run it as-is).

I have serialized the model and the data used in your example:

library(tidymodels)
library(bonsai)
library(callr)
library(carrier)
library(lightgbm)
#> 
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice

fit <-
    workflow(Class ~ A + B, boost_tree("classification", engine = "lightgbm")) %>%
    fit(two_class_dat)

fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: boost_tree()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Class ~ A + B
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> LightGBM Model (100 trees)
#> Objective: binary
#> Fitted to dataset with 2 columns

c_model <- crate(
    function(new_data, ...) workflows:::predict.workflow(model, new_data, ...),
    model = fit
)

callr::r(
    function(d, cmod) {
        cmod(d)
    },
    args = list(
        d = two_class_dat,
        cmod = c_model
    )
)
#> # A tibble: 791 × 1
#>    .pred_class
#>    <fct>      
#>  1 Class1     
#>  2 Class1     
#>  3 Class2     
#>  4 Class2     
#>  5 Class1     
#>  6 Class2     
#>  7 Class2     
#>  8 Class2     
#>  9 Class1     
#> 10 Class2     
#> # ℹ 781 more rows

saveRDS(c_model, 'model.rds')
saveRDS(two_class_dat, 'data.rds')

Created on 2024-05-19 with reprex v2.1.0

Then I de-serialized them in a new, clean session without the packages installed, and I received an error asking me to install the dependencies. This seems logical, and is of course more informative than the FFI-related error we started with, but it makes me wonder again whether what I want to achieve makes any sense.

d <- readRDS('../prueba_crate/data.rds')
m <- readRDS('../prueba_crate/model.rds')
m(d)
#> Error in `check_installs()`:
#> ! This engine requires some package installs: 'lightgbm, bonsai'

Created on 2024-05-19 with reprex v2.1.0

I guess, in conclusion, that the idea I had in mind is too complex and maybe not worth the effort. As I have renv.lock and requirements.txt files that I am using in the training phase, I think I can reproduce the environment for inference in the same way.

Thanks a lot for your help. Anyway, if you have any tips on the feasibility of such an approach, I'm all ears. :) Always happy to learn from you!!! :)

Regards,
Gus.

@simonpcouch
Copy link

Glad the answer was helpful!

One tool that may be helpful as you put together your production environment: workflows (and other objects in tidymodels) have required_pkgs() methods, which return the packages needed to predict() with the supplied object.
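
As a rough sketch of how that list could be used on the inference machine, the helper below reports which packages from a character vector are not installed. `missing_pkgs()` is a hypothetical name introduced here for illustration and is not part of tidymodels. The `required_pkgs()` call appears only in a comment, since it needs the fitted workflow from the examples above.

```r
# Hypothetical helper (not part of any package): return the subset of `pkgs`
# that cannot be loaded in the current R installation.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# In the training session, with a fitted workflow `fit`, one could record:
#   pkgs <- required_pkgs(fit)
# and then, on the target machine, check before calling the crated function:
#   missing_pkgs(pkgs)

# Runnable demonstration with packages that ship with R:
print(missing_pkgs(c("stats", "utils")))
```

An empty result means every listed package can be loaded. Anything returned still has to be installed, for instance by restoring the same renv.lock used for training.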
