CRAN release 2.6.0

@mayer79 released this 17 Aug 16:07
f3cd709

Major bug fix

Fixes a major bug that caused responses to be used as covariates in the random forests. Thanks to @flystar233 for reporting it, see #78.
As a result, you can expect different and better imputations.

Major feature

Out-of-sample application is now possible! Thanks to @jeandigitale for pushing the idea in #58.

This means you can run imp <- missRanger(..., keep_forests = TRUE) and then apply its models to new data via predict(imp, newdata). The "missRanger" object can be saved and loaded as a binary file, e.g., via saveRDS()/readRDS(), for later use.
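
A minimal sketch of this workflow, assuming the iris data as a stand-in for real data. The train/new split, the number of trees, and the temporary file path are arbitrary illustrative choices.

```r
library(missRanger)

# Training data with about 10% of values set to missing
# (generateNA() ships with missRanger)
train <- generateNA(iris[6:150, ], p = 0.1, seed = 1)

# Fit and keep the forests so they can be reused on new data
imp <- missRanger(train, num.trees = 50, keep_forests = TRUE, seed = 1, verbose = 0)

# Store the fitted "missRanger" object and reload it later
path <- tempfile(fileext = ".rds")
saveRDS(imp, path)
imp2 <- readRDS(path)

# New rows with at most one missing value each (the "easy case")
new_rows <- iris[1:5, ]
new_rows$Sepal.Length[1:2] <- NA

# Apply the stored models to the new rows
new_imputed <- predict(imp2, newdata = new_rows)
head(new_imputed)
```

The imputed training data itself is available as imp$data, since data_only defaults to FALSE when keep_forests = TRUE.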

Note that out-of-sample imputation works best for rows in newdata with only one missing value (counting only missing values in variables used as covariates in the random forests). We call this the "easy case". In the "hard case", where a row has several such missing values, even multiple iterations (set by iter) can lead to unsatisfactory results.

The out-of-sample algorithm works as follows:

  1. Univariately impute all relevant columns by randomly drawing values
    from the original unimputed data. This step will only impact "hard case" rows.
  2. Replace univariate imputations by predictions of random forests. This is done
    sequentially over variables, where the variables are sorted to minimize the impact
    of univariate imputations. Optionally, this is followed by predictive mean matching (PMM).
  3. Repeat Step 2 for "hard case" rows multiple times.
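
The three steps can be sketched in base R as follows. This is not the package's implementation: lm() stands in for the stored random forests, the toy data and the fixed count of four passes are made up, and ordering variables by their number of missing values is only an assumed heuristic for "minimizing the impact of univariate imputations". PMM is omitted.

```r
# Hypothetical sketch: lm() models stand in for the stored random forests
set.seed(1)
train <- data.frame(x = rnorm(200))
train$y <- 2 * train$x + rnorm(200, sd = 0.1)
train$z <- train$x - train$y + rnorm(200, sd = 0.1)

models <- list(
  x = lm(x ~ y + z, data = train),
  y = lm(y ~ x + z, data = train),
  z = lm(z ~ x + y, data = train)
)

new <- data.frame(x = c(1, NA, NA), y = c(NA, 0.5, NA), z = c(-1, 0, 1))
is_na <- is.na(new)          # remember which cells were missing
hard <- rowSums(is_na) > 1   # "hard case": more than one missing value in a row

# Step 1: univariate imputation by random draws from the training data
out <- new
for (v in names(new)) {
  miss <- is_na[, v]
  if (any(miss)) {
    out[miss, v] <- sample(train[[v]], sum(miss), replace = TRUE)
  }
}

# Steps 2 and 3: replace the draws by model predictions, one variable at a time.
# Variables with fewer missing values go first (assumed ordering heuristic).
n_miss <- colSums(is_na)
ord <- names(sort(n_miss[n_miss > 0]))
for (i in seq_len(4)) {      # 4 passes is an arbitrary choice for this sketch
  # The first pass covers all rows, later passes only the "hard case" rows
  rows <- if (i == 1) rep(TRUE, nrow(new)) else hard
  for (v in ord) {
    idx <- is_na[, v] & rows
    if (any(idx)) {
      out[idx, v] <- predict(models[[v]], newdata = out[idx, , drop = FALSE])
    }
  }
}
out
```

Rows 1 and 2 are settled after the first pass because their covariates are fully observed. Row 3 depends on its own earlier imputations, which is why repeated passes are applied to it.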

Possibly breaking changes

  • Columns of special types such as date/time can no longer be imputed. You will need to convert them to numeric before imputation.
  • pmm() is stricter: xtrain and xtest must both be either numeric, logical, or factor (with identical levels).
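
One possible way to handle a date column, shown with base R only. The data frame and its column names are hypothetical, and the imputation call itself is left as a comment.

```r
# Hypothetical data with a Date column
df <- data.frame(
  visit = as.Date(c("2024-01-05", NA, "2024-03-10")),
  value = c(1.2, 3.4, NA)
)

# Convert the Date column to numeric (days since 1970-01-01) before imputation
df$visit <- as.numeric(df$visit)

# ... impute here, e.g. df <- missRanger(df, ...) ...

# Convert back afterwards. Imputed values may be fractional, hence round().
df$visit <- as.Date(round(df$visit), origin = "1970-01-01")
```

Date-time columns (POSIXct) could be treated analogously with as.numeric() and as.POSIXct(..., origin = "1970-01-01"), keeping the time zone in mind.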

Minor changes in output object

  • Added original data as data_raw.
  • Renamed visit_seq to to_impute.
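
A short sketch of how these two elements might be inspected, assuming the iris data and data_only = FALSE so that the full "missRanger" object is returned instead of just the imputed data.

```r
library(missRanger)

dat <- generateNA(iris, p = 0.1, seed = 1)
imp <- missRanger(dat, num.trees = 20, data_only = FALSE, seed = 1, verbose = 0)

head(imp$data_raw)  # original data, still containing missing values
imp$to_impute       # variables to impute (formerly visit_seq)
head(imp$data)      # imputed data
```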

Other changes

  • Now requires ranger >= 0.16.0.
  • More compact vignettes.
  • Better examples and README.
  • Many relevant ranger() arguments are now explicit arguments in missRanger() to improve the tab-completion experience:
    • num.trees = 500
    • mtry = NULL
    • min.node.size = NULL
    • min.bucket = NULL
    • max.depth = NULL
    • replace = TRUE
    • sample.fraction = if (replace) 1 else 0.632
    • case.weights = NULL
    • num.threads = NULL
    • save.memory = FALSE
  • For variables that can't be used, more information is printed.
  • If keep_forests = TRUE, the argument data_only is set to FALSE by default.
  • "missRanger" object now stores pmm.k.
  • verbose argument is passed to ranger() as well.
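
As an illustration of the now explicit ranger() arguments, a call could look like the sketch below. The specific values are arbitrary and not recommendations.

```r
library(missRanger)

dat <- generateNA(iris, p = 0.1, seed = 1)

# Forest settings are passed directly as named arguments of missRanger()
imp <- missRanger(
  dat,
  num.trees = 100,
  max.depth = 6,
  min.node.size = 5,
  replace = FALSE,
  sample.fraction = 0.5,
  num.threads = 1,
  seed = 1,
  verbose = 0
)
head(imp)
```

Because keep_forests is left at its default here, data_only stays TRUE and the call returns the imputed data frame directly.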