Update README
mayer79 committed Jul 24, 2024
1 parent af0d58a commit 3c1de63
Showing 1 changed file with 38 additions and 62 deletions.
100 changes: 38 additions & 62 deletions README.md
@@ -2,9 +2,9 @@

<!-- badges: start -->

[![CRAN status](http://www.r-pkg.org/badges/version/missRanger)](https://cran.r-project.org/package=missRanger)
[![R-CMD-check](https://github.com/mayer79/missRanger/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mayer79/missRanger/actions)
[![Codecov test coverage](https://codecov.io/gh/mayer79/missRanger/graph/badge.svg)](https://app.codecov.io/gh/mayer79/missRanger?branch=main)
[![R-CMD-check](https://github.com/mayer79/missRanger/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mayer79/missRanger/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/mayer79/missRanger/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mayer79/missRanger?branch=main)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/missRanger)](https://cran.r-project.org/package=missRanger)

[![](https://cranlogs.r-pkg.org/badges/missRanger)](https://cran.r-project.org/package=missRanger)
[![](https://cranlogs.r-pkg.org/badges/grand-total/missRanger?color=orange)](https://cran.r-project.org/package=missRanger)
@@ -13,9 +13,7 @@

## Overview

{missRanger} uses the {ranger} package to do fast missing value imputation by chained random forest. As such, it serves as an alternative implementation of the beautiful 'MissForest' algorithm, see vignette.

The main function `missRanger()` offers the option to combine random forest imputation with predictive mean matching. This firstly avoids the generation of values not present in the original data (like a value 0.3334 in a 0-1 coded variable). Secondly, this step tends to raise the variance in the resulting conditional distributions to a realistic level, a crucial element to apply multiple imputation frameworks.
{missRanger} is a **multivariate imputation algorithm** based on random forests. It is a fast alternative to the famous 'MissForest' algorithm (Stekhoven and Buehlmann, 2012), and uses the {ranger} package (Wright and Ziegler, 2017) to fit the random forests.

## Installation

@@ -29,62 +27,40 @@ devtools::install_github("mayer79/missRanger")

## Usage

We first generate a data set with about 10% missing values in each column.
Then those gaps are filled by `missRanger()`. In the end, the resulting data frame is displayed.

``` r
```r
library(missRanger)

# Generate data with missing values in all columns
irisWithNA <- generateNA(iris, seed = 347)

# Impute missing values
irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)

# Check results
head(irisImputed)
head(irisWithNA)
head(iris)

# Replace random forest by extremely randomized trees
irisImputed_et <- missRanger(
  irisWithNA,
  pmm.k = 3,
  splitrule = "extratrees",
  num.trees = 100
)

# Using the pipe...
iris |>
  generateNA() |>
  missRanger(pmm.k = 5, verbose = 0) |>
head()

# More info via `data_only = FALSE`
imp <- missRanger(irisWithNA, pmm.k = 3, data_only = FALSE, seed = 3)
summary(imp)

# missRanger object. Extract imputed data via $data
# - best iteration: 3
# - best average OOB imputation error: 0.02058243
#
# Sequence of OOB prediction errors:
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# [1,] 1.00000000 1.03868004 0.267209559 0.103679645 0.08148148
# [2,] 0.02948771 0.05997235 0.005676231 0.007813704 0.00000000
# [3,] 0.02709505 0.06268752 0.004921649 0.008207934 0.00000000
# [4,] 0.02673459 0.06504868 0.005183209 0.008761418 0.00000000
#
# Corresponding means:
# [1] 0.49821014 0.02059000 0.02058243 0.02114558
#
# First rows of imputed data:
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.6 0.2 setosa

set.seed(3)

iris_NA <- generateNA(iris, p = 0.1)
head(iris_NA)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#          5.1         3.5          1.4         0.2  setosa
#          4.9         3.0          1.4          NA  setosa
#          4.7         3.2          1.3         0.2  setosa
#          4.6         3.1          1.5         0.2    <NA>
#           NA         3.6          1.4         0.2  setosa
#          5.4         3.9          1.7         0.4    <NA>

iris_filled <- missRanger(iris_NA, pmm.k = 5, num.trees = 100)
head(iris_filled)

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#          5.1         3.5          1.4         0.2  setosa
#          4.9         3.0          1.4         0.2  setosa
#          4.7         3.2          1.3         0.2  setosa
#          4.6         3.1          1.5         0.2  setosa
#          5.7         3.6          1.4         0.2  setosa
#          5.4         3.9          1.7         0.4  setosa
```

Check out the vignettes for more info.
## How it works

The algorithm iterates until the average out-of-bag (OOB) error of the forests stops improving. The missing values are filled by OOB predictions of the best iteration, optionally followed by predictive mean matching (PMM). The PMM step avoids values not present in the original data (like a value 0.3334 in a 0-1 coded variable). Furthermore, PMM raises the variance in the resulting conditional distributions to a more realistic level, a crucial property for **multiple imputation**.
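
The two ideas in this paragraph can be illustrated with a small base R sketch. It is not the package's implementation: `pmm_sketch()` and its arguments are hypothetical names chosen for illustration, the predictions are made-up toy numbers, and only numeric variables are handled. The average OOB errors are the "Corresponding means" from the `summary()` output in the previous version of the README.

```r
# 1) Choosing the best iteration from the average OOB errors per iteration.
#    Values taken from the summary() output of the earlier README example.
oob_means <- c(0.49821014, 0.02059000, 0.02058243, 0.02114558)
best_iter <- which.min(oob_means)  # iteration 4 is worse than 3, so 3 is kept
best_iter
# [1] 3

# 2) Predictive mean matching (PMM), minimal sketch (hypothetical helper).
#    pred_obs: predictions for rows where the variable is observed
#    y_obs:    observed values of the variable for those rows
#    pred_mis: predictions for rows where the variable is missing
#    k:        number of nearest candidates to draw from
pmm_sketch <- function(pred_obs, y_obs, pred_mis, k = 3L) {
  stopifnot(length(pred_obs) == length(y_obs), k >= 1L)
  k <- min(k, length(y_obs))
  vapply(
    pred_mis,
    function(p) {
      # The k observed cases whose predictions are closest to p
      nearest <- order(abs(pred_obs - p))[seq_len(k)]
      # Pick one of them at random and return its *observed* value
      y_obs[nearest[sample.int(k, 1L)]]
    },
    FUN.VALUE = numeric(1)
  )
}

set.seed(1)
y_obs    <- c(0, 0, 1, 1, 1)                 # a 0-1 coded variable
pred_obs <- c(0.10, 0.25, 0.70, 0.80, 0.95)  # toy predictions (assumed)
pred_mis <- c(0.3334, 0.90)                  # raw predictions for missing rows

imputed <- pmm_sketch(pred_obs, y_obs, pred_mis, k = 3L)
all(imputed %in% y_obs)  # imputed values are always observed values
# [1] TRUE
```

Because each imputed value is drawn from the observed values of nearby cases, a raw prediction such as 0.3334 is never returned for the 0-1 coded variable, and the random draw among `k` candidates is what adds variance back into the imputations.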

Check out the vignettes for more info, and for how to use `missRanger()` in multiple imputation.

## References

- Stekhoven, D. J., Buehlmann, P. (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.
- Wright, M. N., Ziegler, A. (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi:10.18637/jss.v077.i01
