Merge pull request #44 from mayer79/update_readme
Preparing CRAN 2.2.0
mayer79 authored Mar 25, 2023
2 parents 2af0837 + 741596e commit 382bb46
Showing 22 changed files with 286 additions and 236 deletions.
2 changes: 2 additions & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
Expand Up @@ -10,3 +10,5 @@
^.*\.Rproj$
^\.Rproj\.user$
^\.github$
+^revdep$
+^CRAN-SUBMISSION$
1 change: 1 addition & 0 deletions .gitignore
Expand Up @@ -7,3 +7,4 @@ Meta
/doc/
/Meta/
inst/doc
+revdep
3 changes: 3 additions & 0 deletions CRAN-SUBMISSION
@@ -0,0 +1,3 @@
+Version: 2.2.0
+Date: 2023-03-24 19:48:56 UTC
+SHA: 7213048cfb0727fc7f705715bf603ab86550dc61
5 changes: 1 addition & 4 deletions DESCRIPTION
@@ -1,6 +1,6 @@
Package: missRanger
Title: Fast Imputation of Missing Values
-Version: 2.1.5.9000
+Version: 2.2.0
Authors@R:
person(given = "Michael",
family = "Mayer",
Expand Down Expand Up @@ -31,9 +31,6 @@ Imports:
stats,
utils
Suggests:
-mice,
-dplyr,
-survival,
knitr,
rmarkdown,
testthat (>= 3.0.0)
Expand Down
17 changes: 15 additions & 2 deletions NEWS.md
@@ -1,12 +1,25 @@
-# missRanger 2.1.5
+# missRanger 2.2.0
+
+## Less dependencies
+
+- Removed {mice} from "suggested" packages.
+- Removed {dplyr} from "suggested" packages.
+- Removed {survival} from "suggested" packages.
+
+## Maintenance
+
+- Adding Github pages.
+- Introduction of Github actions.
+
+# missRanger 2.1.5 (not on CRAN)

Maintenance release,

- switching to testthat 3,
- changing the package structure, and
- bringing vignettes into right order.

-# missRanger 2.1.4
+# missRanger 2.1.4 (not on CRAN)

## Minor changes

Expand Down
4 changes: 3 additions & 1 deletion R/generateNA.R
Expand Up @@ -3,7 +3,9 @@
#' Takes a vector, matrix or \code{data.frame} and replaces some values by \code{NA}.
#'
#' @param x A vector, matrix or \code{data.frame}.
-#' @param p Proportion of missing values to add to \code{x}. In case \code{x} is a \code{data.frame}, \code{p} can also be a vector of probabilities per column or a named vector (see examples).
+#' @param p Proportion of missing values to add to \code{x}.
+#' In case \code{x} is a \code{data.frame}, \code{p} can also be a vector of
+#' probabilities per column or a named vector (see examples).
#' @param seed An integer seed.
#'
#' @return \code{x} with missing values.
Expand Down
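The `p` parameter documented above accepts one proportion per column when `x` is a data frame. The base R sketch below illustrates that idea under simple assumptions. `sketch_generate_na()` is a hypothetical helper written for this note, not the package's `generateNA()`; it assumes that exactly `round(p * n)` values are blanked per column and it does not handle named vectors for `p`.

```r
# Hypothetical helper (not missRanger's generateNA()): replace a share of the
# values in each column of a data frame by NA.
sketch_generate_na <- function(x, p = 0.1, seed = NULL) {
  stopifnot(is.data.frame(x), all(p >= 0), all(p <= 1))
  if (!is.null(seed)) set.seed(seed)
  # Recycle a single proportion to one value per column
  p <- rep_len(p, ncol(x))
  for (j in seq_along(x)) {
    n_na <- round(p[j] * nrow(x))
    # Assumption: an exact number of rows is blanked in each column
    x[sample.int(nrow(x), n_na), j] <- NA
  }
  x
}

# One proportion per column of iris (five columns)
iris_na <- sketch_generate_na(iris, p = c(0, 0.1, 0.2, 0.3, 0.4), seed = 1)
colMeans(is.na(iris_na))
```

Fixing the number of missing values per column keeps the realised share equal to `p` up to rounding, which makes the effect of a per-column vector easy to check.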
6 changes: 4 additions & 2 deletions R/imputeUnivariate.R
@@ -1,9 +1,11 @@
#' Univariate Imputation
#'
-#' Fills missing values of a vector, matrix or data frame by sampling with replacement from the non-missing values. For data frames, this sampling is done within column.
+#' Fills missing values of a vector, matrix or data frame by sampling with replacement
+#' from the non-missing values. For data frames, this sampling is done within column.
#'
#' @param x A vector, matrix or data frame.
-#' @param v A character vector of column names to impute (only relevant if \code{x} is a data frame). The default \code{NULL} imputes all columns.
+#' @param v A character vector of column names to impute (only relevant if \code{x}
+#' is a data frame). The default \code{NULL} imputes all columns.
#' @param seed An integer seed.
#'
#' @return \code{x} with imputed values.
Expand Down
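The description above (sampling with replacement from the non-missing values, within column for data frames) can be sketched in base R as follows. `sketch_impute_univariate()` is a hypothetical name used only here; it is a simplified illustration, not the package's `imputeUnivariate()`, and it omits the `v` argument.

```r
# Hypothetical helper (not missRanger's imputeUnivariate()): fill NAs by
# sampling with replacement from the observed values.
sketch_impute_univariate <- function(x, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  impute_vec <- function(z) {
    miss <- is.na(z)
    if (any(miss) && !all(miss)) {
      # Sample donor positions, so that a single observed value is not
      # misinterpreted by sample() as the range 1:value
      donors <- which(!miss)
      picks <- sample.int(length(donors), sum(miss), replace = TRUE)
      z[miss] <- z[donors[picks]]
    }
    z
  }
  if (is.data.frame(x)) {
    # Data frames: sample within each column
    x[] <- lapply(x, impute_vec)
    x
  } else {
    # Vectors and matrices: sample from all observed values
    impute_vec(x)
  }
}

x <- c(1, NA, 3, NA)
filled <- sketch_impute_univariate(x, seed = 1)
filled
```

Every imputed value is, by construction, one that already occurs in the data, and observed values are left untouched.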
108 changes: 44 additions & 64 deletions R/missRanger.R
@@ -1,27 +1,58 @@
#' Fast Imputation of Missing Values by Chained Random Forests
#'
-#' Uses the "ranger" package (Wright & Ziegler) to do fast missing value imputation by chained random forests, see Stekhoven & Buehlmann and Van Buuren & Groothuis-Oudshoorn.
-#' Between the iterative model fitting, it offers the option of predictive mean matching. This firstly avoids imputation with values not present in the original data (like a value 0.3334 in a 0-1 coded variable). Secondly, predictive mean matching tries to raise the variance in the resulting conditional distributions to a realistic level. This allows to do multiple imputation when repeating the call to missRanger().
-#' The iterative chaining stops as soon as \code{maxiter} is reached or if the average out-of-bag estimate of performance stops improving. In the latter case, except for the first iteration, the second last (i.e. best) imputed data is returned.
+#' Uses the "ranger" package (Wright & Ziegler) to do fast missing value imputation by
+#' chained random forests, see Stekhoven & Buehlmann and Van Buuren & Groothuis-Oudshoorn.
+#' Between the iterative model fitting, it offers the option of predictive mean matching.
+#' This firstly avoids imputation with values not present in the original data
+#' (like a value 0.3334 in a 0-1 coded variable).
+#' Secondly, predictive mean matching tries to raise the variance in the resulting
+#' conditional distributions to a realistic level. This allows to do multiple imputation
+#' when repeating the call to \code{missRanger()}.
+#' The iterative chaining stops as soon as \code{maxiter} is reached or if the average
+#' out-of-bag estimate of performance stops improving.
+#' In the latter case, except for the first iteration, the second last (i.e. best)
+#' imputed data is returned.
#'
-#' A note on `mtry`: Be careful when passing a non-default `mtry` to `ranger()` because the number of available covariables might be growing during the first iteration, depending on the missing pattern. Values \code{NULL} (default) and 1 are safe choices. Additionally, recent versions of `ranger()` allow `mtry` to be a single-argument function of the number of available covariables, e.g. `mtry = function(m) max(1, m %/% 3)`.
+#' A note on \code{mtry}: Be careful when passing a non-default \code{mtry} to
+#' \code{ranger()} because the number of available covariates might be growing during
+#' the first iteration, depending on the missing pattern.
+#' Values \code{NULL} (default) and 1 are safe choices.
+#' Additionally, recent versions of \code{ranger()} allow \code{mtry} to be a
+#' single-argument function of the number of available covariables,
+#' e.g. \code{mtry = function(m) max(1, m %/% 3)}.
#'
#' @importFrom stats var reformulate terms.formula predict setNames
#' @importFrom ranger ranger
#' @importFrom utils setTxtProgressBar txtProgressBar
#' @param data A \code{data.frame} or \code{tibble} with missing values to impute.
-#' @param formula A two-sided formula specifying variables to be imputed (left hand side) and variables used to impute (right hand side). Defaults to . ~ ., i.e. use all variables to impute all variables.
-#' If e.g. all variables (with missings) should be imputed by all variables except variable "ID", use . ~ . - ID. Note that a "." is evaluated separately for each side of the formula. Further note that variables
-#' with missings must appear in the left hand side if they should be used on the right hand side.
-#' @param pmm.k Number of candidate non-missing values to sample from in the predictive mean matching steps. 0 to avoid this step.
+#' @param formula A two-sided formula specifying variables to be imputed
+#' (left hand side) and variables used to impute (right hand side).
+#' Defaults to \code{. ~ .}, i.e. use all variables to impute all variables.
+#' If e.g. all variables (with missings) should be imputed by all variables
+#' except variable "ID", use \code{. ~ . - ID}. Note that a "." is evaluated
+#' separately for each side of the formula. Further note that variables with missings
+#' must appear in the left hand side if they should be used on the right hand side.
+#' @param pmm.k Number of candidate non-missing values to sample from in the
+#' predictive mean matching steps. 0 to avoid this step.
#' @param maxiter Maximum number of chaining iterations.
#' @param seed Integer seed to initialize the random generator.
-#' @param verbose Controls how much info is printed to screen. 0 to print nothing. 1 (default) to print a progress bar per iteration, 2 to print the OOB prediction error per iteration and variable (1 minus R-squared for regression).
-#' Furthermore, if \code{verbose} is positive, the variables used for imputation are listed as well as the variables to be imputed (in the imputation order). This will be useful to detect if some variables are unexpectedly skipped.
-#' @param returnOOB Logical flag. If TRUE, the final average out-of-bag prediction error is added to the output as attribute "oob". This does not work in the special case when the variables are imputed univariately.
+#' @param verbose Controls how much info is printed to screen.
+#' 0 to print nothing. 1 (default) to print a progress bar per iteration,
+#' 2 to print the OOB prediction error per iteration and variable
+#' (1 minus R-squared for regression).
+#' Furthermore, if \code{verbose} is positive, the variables used for imputation are
+#' listed as well as the variables to be imputed (in the imputation order).
+#' This will be useful to detect if some variables are unexpectedly skipped.
+#' @param returnOOB Logical flag. If TRUE, the final average out-of-bag prediction error
+#' is added to the output as attribute "oob". This does not work in the special case
+#' when the variables are imputed univariately.
#' @param case.weights Vector with non-negative case weights.
-#' @param ... Arguments passed to \code{ranger()}. If the data set is large, better use less trees (e.g. \code{num.trees = 20}) and/or a low value of \code{sample.fraction}.
-#' The following arguments are e.g. incompatible with \code{ranger}: \code{write.forest}, \code{probability}, \code{split.select.weights}, \code{dependent.variable.name}, and \code{classification}.
+#' @param ... Arguments passed to \code{ranger()}. If the data set is large,
+#' better use less trees (e.g. \code{num.trees = 20}) and/or a low value of
+#' \code{sample.fraction}.
+#' The following arguments are e.g. incompatible:
+#' \code{write.forest}, \code{probability}, \code{split.select.weights},
+#' \code{dependent.variable.name}, and \code{classification}.
#'
#' @return An imputed \code{data.frame}.
#'
Expand All @@ -38,57 +69,6 @@
#' irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)
#' head(irisImputed)
#' head(irisWithNA)
-#'
-#' \dontrun{
-#' # With extra trees algorithm
-#' irisImputed_et <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100, splitrule = "extratrees")
-#' head(irisImputed_et)
-#'
-#' # Passing `mtry` as a function of the number of covariables
-# irisImputed_mtry <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100,
-# mtry = function(m) max(1, m %/% 3))
-# head(irisImputed_mtry)
-#'
-#' # Do not impute Species. Note: Since this variable contains missings, it won't be used
-#' # for imputing other variables.
-#' head(irisImputed <- missRanger(irisWithNA, . - Species ~ ., pmm.k = 3, num.trees = 100))
-#'
-#' # Impute univariately only.
-#' head(irisImputed <- missRanger(irisWithNA, . ~ 1))
-#'
-#' # Use Species and Petal.Length to impute Species and Petal.Length.
-#' head(irisImputed <- missRanger(irisWithNA, Species + Petal.Length ~ Species + Petal.Length,
-#' pmm.k = 3, num.trees = 100))
-#'
-#' # Multiple imputation: Fill data 20 times, run 20 analyses and pool their results.
-#' require(mice)
-#' filled <- replicate(20, missRanger(irisWithNA, verbose = 0, num.trees = 100, pmm.k = 5),
-#' simplify = FALSE)
-#' models <- lapply(filled, function(x) lm(Sepal.Length ~ ., x))
-#' summary(pooled_fit <- pool(models)) # Realistically inflated standard errors and p values
-#'
-#' # A data set with logicals, numerics, characters and factors.
-#' n <- 100
-#' X <- data.frame(x1 = seq_len(n),
-#' x2 = log(seq_len(n)),
-#' x3 = sample(LETTERS[1:3], n, replace = TRUE),
-#' x4 = factor(sample(LETTERS[1:3], n, replace = TRUE)),
-#' x5 = seq_len(n) > 50)
-#' head(X)
-#' X_NA <- generateNA(X, p = seq(0, 0.8, by = .2))
-#' head(X_NA)
-#'
-#' head(X_imp <- missRanger(X_NA))
-#' head(X_imp <- missRanger(X_NA, pmm = 3))
-#' head(X_imp <- missRanger(X_NA, pmm = 3, verbose = 0))
-#' head(X_imp <- missRanger(X_NA, pmm = 3, verbose = 2, returnOOB = TRUE))
-#' attr(X_imp, "oob") # OOB prediction errors per column.
-#'
-#' # The formula interface
-#' head(X_imp <- missRanger(X_NA, x2 ~ x2 + x3, pmm = 3)) # Does not use x3 because of NAs
-#' head(X_imp <- missRanger(X_NA, x2 + x3 ~ x2 + x3, pmm = 3))
-#' head(X_imp <- missRanger(X_NA, x2 + x3 ~ 1, pmm = 3)) # Univariate imputation
-#' }
missRanger <- function(data, formula = . ~ ., pmm.k = 0L, maxiter = 10L,
seed = NULL, verbose = 1, returnOOB = FALSE,
case.weights = NULL, ...) {
Expand Down
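The stopping rule described in the documentation above (stop at `maxiter` or once the average out-of-bag error stops improving, then return the second last result) can be illustrated with a small base R sketch. It models only the loop control, not the imputation itself. `sketch_kept_iteration()` and its input `oob_per_iter` are hypothetical: the vector stands in for the average OOB errors that successive chaining iterations would produce, and the function is not taken from the package source.

```r
# Hypothetical illustration of the stopping rule: given the average OOB error
# of each chaining iteration, return the index of the iteration whose imputed
# data would be kept.
sketch_kept_iteration <- function(oob_per_iter, maxiter = 10L) {
  stopifnot(length(oob_per_iter) >= 1L, maxiter >= 1L)
  best <- Inf
  n_iter <- min(maxiter, length(oob_per_iter))
  for (i in seq_len(n_iter)) {
    if (oob_per_iter[i] >= best) {
      # No improvement: keep the previous (second last) iteration
      return(i - 1L)
    }
    best <- oob_per_iter[i]
  }
  # Every iteration improved until the iteration limit was reached
  n_iter
}

# Error rises in iteration 3, so iteration 2 is kept
kept_a <- sketch_kept_iteration(c(0.50, 0.40, 0.45, 0.30))
# Error keeps falling, so the limit maxiter = 2 decides
kept_b <- sketch_kept_iteration(c(0.50, 0.40, 0.35), maxiter = 2L)
c(kept_a, kept_b)
```

Starting from `best <- Inf` means the first iteration is always accepted, which mirrors the "except for the first iteration" clause in the documentation.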
15 changes: 11 additions & 4 deletions R/pmm.R
@@ -1,13 +1,20 @@
#' Predictive Mean Matching
#'
-#' For each value in the prediction vector \code{xtest}, one of the closest \code{k} values in the prediction vector \code{xtrain} is randomly chosen and its observed value in \code{ytrain} is returned.
+#' For each value in the prediction vector \code{xtest}, one of the closest \code{k}
+#' values in the prediction vector \code{xtrain} is randomly chosen and its observed
+#' value in \code{ytrain} is returned.
#'
#' @importFrom stats rmultinom
#' @importFrom FNN knnx.index
#'
-#' @param xtrain Vector with predicted values in the training data. Can be of type logical, numeric, character, or factor.
-#' @param xtest Vector as \code{xtrain} with predicted values in the test data. Missing values are not allowed.
-#' @param ytrain Vector of the observed values in the training data. Must be of same length as \code{xtrain}. Missing values in either of \code{xtrain} or \code{ytrain} will be dropped in a pairwise manner.
+#' @param xtrain Vector with predicted values in the training data.
+#' Can be of type logical, numeric, character, or factor.
+#' @param xtest Vector as \code{xtrain} with predicted values in the test data.
+#' Missing values are not allowed.
+#' @param ytrain Vector of the observed values in the training data.
+#' Must be of same length as \code{xtrain}.
+#' Missing values in either of \code{xtrain} or \code{ytrain} will be dropped
+#' in a pairwise manner.
#' @param k Number of nearest neighbours to sample from.
#' @param seed Integer random seed.
#'
Expand Down
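To make the matching step described above concrete, here is a minimal base R sketch of predictive mean matching for numeric vectors. `sketch_pmm()` is a hypothetical helper for illustration only: the package relies on `FNN::knnx.index()` for the neighbour search, whereas this sketch swaps in a plain `order()` on absolute distances and supports numeric input only.

```r
# Hypothetical, numeric-only illustration of predictive mean matching.
# Not the package's pmm(): neighbours are found with order() instead of FNN.
sketch_pmm <- function(xtrain, xtest, ytrain, k = 1L, seed = NULL) {
  stopifnot(length(xtrain) == length(ytrain), !anyNA(xtest), k >= 1L)
  if (!is.null(seed)) set.seed(seed)
  # Drop pairs with a missing prediction or a missing observed value
  ok <- !is.na(xtrain) & !is.na(ytrain)
  stopifnot(any(ok))
  xtrain <- xtrain[ok]
  ytrain <- ytrain[ok]
  k <- min(k, length(xtrain))
  vapply(xtest, function(x0) {
    # Positions of the k training predictions closest to x0
    nearest <- order(abs(xtrain - x0))[seq_len(k)]
    # Pick one of them at random and return its observed value
    ytrain[nearest[sample.int(k, 1L)]]
  }, FUN.VALUE = numeric(1))
}

# With k = 1 the result is deterministic: each test prediction receives the
# observed value of its single nearest training prediction.
matched <- sketch_pmm(
  xtrain = c(0.2, 0.3, 0.8),
  xtest = c(0.1, 0.7),
  ytrain = c(0, 0, 1),
  k = 1L
)
matched
```

Because only observed values from `ytrain` are returned, a 0-1 coded variable can never be imputed with an in-between value such as 0.3334. Larger `k` adds randomness, which is what raises the variance of the imputed values.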
16 changes: 7 additions & 9 deletions README.md
Expand Up @@ -13,7 +13,7 @@

## Overview

-The {missRanger} package uses the {ranger} package to do fast missing value imputation by chained random forest. As such, it serves as an alternative implementation of the beautiful 'MissForest' algorithm, see vignette.
+{missRanger} uses the {ranger} package to do fast missing value imputation by chained random forest. As such, it serves as an alternative implementation of the beautiful 'MissForest' algorithm, see vignette.

The main function `missRanger()` offers the option to combine random forest imputation with predictive mean matching. This firstly avoids the generation of values not present in the original data (like a value 0.3334 in a 0-1 coded variable). Secondly, this step tends to raise the variance in the resulting conditional distributions to a realistic level, a crucial element to apply multiple imputation frameworks.

Expand All @@ -38,28 +38,26 @@ library(missRanger)
# Generate data with missing values in all columns
irisWithNA <- generateNA(iris, seed = 347)

-# Impute missing values with missRanger
+# Impute missing values
irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)

# Check results
head(irisImputed)
head(irisWithNA)
head(iris)

-# With extra trees algorithm
+# Replace random forest by extremely randomized trees
irisImputed_et <- missRanger(
  irisWithNA,
  pmm.k = 3,
  splitrule = "extratrees",
  num.trees = 100
)

-# With "dplyr" syntax
-library(dplyr)
-
-iris %>%
-  generateNA() %>%
-  missRanger(verbose = 0, pmm.k = 5) %>%
+# Using the pipe...
+iris |>
+  generateNA() |>
+  missRanger(pmm.k = 5, verbose = 0) |>
  head()
```

Expand Down
34 changes: 17 additions & 17 deletions cran-comments.md
@@ -1,25 +1,25 @@
This is a maintenance release, switching to
# missRanger 2.2.0

- testthat 3,
- modifying vignette order,
- improving the way how the package is being updated/generated.
- removed suggested dependencies dplyr, mice, survival
- improved documentation

## R CMD check results seem okay
## R CMD check

checking for unstated dependencies in examples ... OK
WARNING

WARNING
'qpdf' is needed for checks on size reduction of PDFs

## Online check results seem okay (2 notes below)
checking for future file timestamps ... NOTE
unable to verify current time

## RHub

- check_win_devel()
- check_rhub()
Note: lastMiKTeXException

## Reverse dependency check of 7 packages

Found the following (possibly) invalid DOIs:
DOI: 10.1093/bioinformatics/btr597
From: DESCRIPTION
Status: Forbidden
Message: 403
* checking for detritus in the temp directory ... NOTE
Found the following files/directories:
'lastMiKTeXException'
- hdImpute 0.1.1 -- E: 0 | W: 0 | N: 0
- marginaleffects 0.11.0 -- E: 0 | W: 0 | N: 0
- mlim 0.3.0 -- E: 0 | W: 0 | N: 0
- NADIA 0.4.2 -- E: 0 | W: 0 | N: 1
- outForest 0.1.2 -- E: 0 | W: 0 | N: 0
- wiseR 1.0.1 -- E: 0 | W: 0 | N: 3
- worcs 0.1.10 -- E: 0 | W: 0 | N: 0
OK: 7
BROKEN: 0
4 changes: 3 additions & 1 deletion man/generateNA.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 4 additions & 2 deletions man/imputeUnivariate.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
