diff --git a/README.md b/README.md
index 4c8e9ee..0faf3d7 100644
--- a/README.md
+++ b/README.md
@@ -2,9 +2,9 @@
-[![CRAN status](http://www.r-pkg.org/badges/version/missRanger)](https://cran.r-project.org/package=missRanger)
-[![R-CMD-check](https://github.com/mayer79/missRanger/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mayer79/missRanger/actions)
-[![Codecov test coverage](https://codecov.io/gh/mayer79/missRanger/graph/badge.svg)](https://app.codecov.io/gh/mayer79/missRanger?branch=main)
+[![R-CMD-check](https://github.com/mayer79/missRanger/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mayer79/missRanger/actions/workflows/R-CMD-check.yaml)
+[![Codecov test coverage](https://codecov.io/gh/mayer79/missRanger/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mayer79/missRanger?branch=main)
+[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/missRanger)](https://cran.r-project.org/package=missRanger)
 
 [![](https://cranlogs.r-pkg.org/badges/missRanger)](https://cran.r-project.org/package=missRanger)
 [![](https://cranlogs.r-pkg.org/badges/grand-total/missRanger?color=orange)](https://cran.r-project.org/package=missRanger)
@@ -13,9 +13,7 @@
 
 ## Overview
 
-{missRanger} uses the {ranger} package to do fast missing value imputation by chained random forest. As such, it serves as an alternative implementation of the beautiful 'MissForest' algorithm, see vignette.
-
-The main function `missRanger()` offers the option to combine random forest imputation with predictive mean matching. This firstly avoids the generation of values not present in the original data (like a value 0.3334 in a 0-1 coded variable). Secondly, this step tends to raise the variance in the resulting conditional distributions to a realistic level, a crucial element to apply multiple imputation frameworks.
+{missRanger} is a **multivariate imputation algorithm** based on random forests. It is a fast alternative to the famous 'MissForest' algorithm (Stekhoven and Buehlmann, 2012), and uses the {ranger} package (Wright and Ziegler, 2017) to fit the random forests.
 
 ## Installation
 
@@ -29,62 +27,40 @@ devtools::install_github("mayer79/missRanger")
 
 ## Usage
 
-We first generate a data set with about 10% missing values in each column.
-Then those gaps are filled by `missRanger()`. In the end, the resulting data frame is displayed.
-
-``` r
+```r
 library(missRanger)
-
-# Generate data with missing values in all columns
-irisWithNA <- generateNA(iris, seed = 347)
-
-# Impute missing values
-irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)
-
-# Check results
-head(irisImputed)
-head(irisWithNA)
-head(iris)
-
-# Replace random forest by extremely randomized trees
-irisImputed_et <- missRanger(
-  irisWithNA,
-  pmm.k = 3,
-  splitrule = "extratrees",
-  num.trees = 100
-)
-
-# Using the pipe...
-iris |>
-  generateNA() |>
-  missRanger(pmm.k = 5, verbose = 0) |>
-  head()
-
-# More infos via `data_only = FALSE`
-imp <- missRanger(irisWithNA, pmm.k = 3, data_only = FALSE, seed = 3)
-summary(imp)
-
-# missRanger object. Extract imputed data via $data
-# - best iteration: 3
-# - best average OOB imputation error: 0.02058243
-#
-# Sequence of OOB prediction errors:
-#
-#      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
-# [1,]   1.00000000  1.03868004  0.267209559 0.103679645 0.08148148
-# [2,]   0.02948771  0.05997235  0.005676231 0.007813704 0.00000000
-# [3,]   0.02709505  0.06268752  0.004921649 0.008207934 0.00000000
-# [4,]   0.02673459  0.06504868  0.005183209 0.008761418 0.00000000
-#
-# Corresponding means:
-# [1] 0.49821014 0.02059000 0.02058243 0.02114558
-#
-# First rows of imputed data:
-#
-#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
-# 1          5.1         3.5          1.4         0.2  setosa
-# 2          4.9         3.0          1.4         0.2  setosa
-# 3          4.7         3.2          1.6         0.2  setosa
+
+set.seed(3)
+
+iris_NA <- generateNA(iris, p = 0.1)
+head(iris_NA)
+# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
+#          5.1         3.5          1.4         0.2  setosa
+#          4.9         3.0          1.4          NA  setosa
+#          4.7         3.2          1.3         0.2  setosa
+#          4.6         3.1          1.5         0.2    <NA>
+#           NA         3.6          1.4         0.2  setosa
+#          5.4         3.9          1.7         0.4    <NA>
+
+iris_filled <- missRanger(iris_NA, pmm.k = 5, num.trees = 100)
+head(iris_filled)
+
+# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
+#          5.1         3.5          1.4         0.2  setosa
+#          4.9         3.0          1.4         0.2  setosa
+#          4.7         3.2          1.3         0.2  setosa
+#          4.6         3.1          1.5         0.2  setosa
+#          5.7         3.6          1.4         0.2  setosa
+#          5.4         3.9          1.7         0.4  setosa
 ```
 
-Check out the vignettes for more info.
+## How it works
+
+The algorithm iterates until the average out-of-bag (OOB) error of the forests stops improving. The missing values are filled by OOB predictions of the best iteration, optionally followed by predictive mean matching (PMM). The PMM step avoids values not present in the original data (like a value 0.3334 in a 0-1 coded variable). Furthermore, PMM raises the variance in the resulting conditional distributions to a more realistic level, a crucial property for **multiple imputation**.
+
+Check-out the vignettes for more info, and for how to use `missRanger()` in multiple imputation.
+
+## References
+
+- Stekhoven D. J., Buehlmann, P. (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.
+- Marvin N. Wright, Andreas Ziegler (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi:10.18637/jss.v077.i01