Merge pull request #44 from mayer79/update_readme

Preparing CRAN 2.2.0
mayer79 · Mar 25, 2023 · 382bb46
2 parents 2af0837 + 741596e
commit 382bb46
Showing 22 changed files with 286 additions and 236 deletions.
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -10,3 +10,5 @@
 ^.*\.Rproj$
 ^\.Rproj\.user$
 ^\.github$
+^revdep$
+^CRAN-SUBMISSION$
diff --git a/.gitignore b/.gitignore
@@ -7,3 +7,4 @@ Meta
 /doc/
 /Meta/
 inst/doc
+revdep
diff --git a/CRAN-SUBMISSION b/CRAN-SUBMISSION
@@ -0,0 +1,3 @@
+Version: 2.2.0
+Date: 2023-03-24 19:48:56 UTC
+SHA: 7213048cfb0727fc7f705715bf603ab86550dc61
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: missRanger
 Title: Fast Imputation of Missing Values
-Version: 2.1.5.9000
+Version: 2.2.0
 Authors@R: 
     person(given = "Michael",
            family = "Mayer",
@@ -31,9 +31,6 @@ Imports:
     stats,
     utils
 Suggests: 
-    mice,
-    dplyr,
-    survival,
     knitr,
     rmarkdown,
     testthat (>= 3.0.0)

diff --git a/NEWS.md b/NEWS.md
@@ -1,12 +1,25 @@
-# missRanger 2.1.5
+# missRanger 2.2.0
+
+## Fewer dependencies
+
+- Removed {mice} from "suggested" packages.
+- Removed {dplyr} from "suggested" packages.
+- Removed {survival} from "suggested" packages.
+
+## Maintenance
+
+- Adding GitHub Pages.
+- Introduction of GitHub Actions.
+
+# missRanger 2.1.5 (not on CRAN)
 
 Maintenance release,
 
 - switching to testthat 3,
 - changing the package structure, and
 - bringing vignettes into right order.
 
-# missRanger 2.1.4
+# missRanger 2.1.4 (not on CRAN)
 
 ## Minor changes
 

diff --git a/R/generateNA.R b/R/generateNA.R
@@ -3,7 +3,9 @@
 #' Takes a vector, matrix or \code{data.frame} and replaces some values by \code{NA}. 
 #' 
 #' @param x A vector, matrix or \code{data.frame}.
-#' @param p Proportion of missing values to add to \code{x}. In case \code{x} is a \code{data.frame}, \code{p} can also be a vector of probabilities per column or a named vector (see examples).
+#' @param p Proportion of missing values to add to \code{x}. 
+#' In case \code{x} is a \code{data.frame}, \code{p} can also be a vector of 
+#' probabilities per column or a named vector (see examples).
 #' @param seed An integer seed.
 #'
 #' @return \code{x} with missing values.
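The reflowed `p` documentation describes a per-column or named vector of proportions. As a rough illustration of that behaviour, here is a base-R sketch. The function `generate_na_sketch()` is hypothetical and is not the package's `generateNA()` implementation; it assumes a `data.frame` input and rounds the number of missing values per column.

```r
# Hypothetical base-R sketch; NOT the package's generateNA() code.
generate_na_sketch <- function(x, p = 0.1, seed = NULL) {
  stopifnot(is.data.frame(x), all(p >= 0), all(p <= 1))
  if (!is.null(seed)) set.seed(seed)
  if (!is.null(names(p))) {
    # A named vector touches only the named columns
    cols <- names(p)
  } else {
    # An unnamed vector is recycled across all columns
    p <- rep_len(p, ncol(x))
    cols <- names(x)
  }
  n <- nrow(x)
  for (j in seq_along(cols)) {
    idx <- sample.int(n, round(p[[j]] * n))
    x[idx, cols[j]] <- NA
  }
  x
}

X <- data.frame(a = 1:100, b = letters[rep(1:4, 25)], c = runif(100))
X_na <- generate_na_sketch(X, p = c(a = 0.2, c = 0.5), seed = 1)
na_share <- colMeans(is.na(X_na))
na_share  # a = 0.2, b = 0, c = 0.5
```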

diff --git a/R/imputeUnivariate.R b/R/imputeUnivariate.R
@@ -1,9 +1,11 @@
 #' Univariate Imputation
 #'
-#' Fills missing values of a vector, matrix or data frame by sampling with replacement from the non-missing values. For data frames, this sampling is done within column.
+#' Fills missing values of a vector, matrix or data frame by sampling with replacement
+#' from the non-missing values. For data frames, this sampling is done within column.
 #' 
 #' @param x A vector, matrix or data frame.
-#' @param v A character vector of column names to impute (only relevant if \code{x} is a data frame). The default \code{NULL} imputes all columns.
+#' @param v A character vector of column names to impute (only relevant if \code{x} 
+#' is a data frame). The default \code{NULL} imputes all columns.
 #' @param seed An integer seed.
 #'
 #' @return \code{x} with imputed values.
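The description above (sampling with replacement from the non-missing values, per column for data frames) can be sketched in base R as follows. `impute_univariate_sketch()` is a hypothetical name used for illustration only; it ignores the `v` argument and may differ from the package's `imputeUnivariate()` in edge cases.

```r
# Hypothetical base-R sketch; NOT the package's imputeUnivariate() code.
impute_univariate_sketch <- function(x, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  fill <- function(z) {
    miss <- is.na(z)
    if (any(miss) && !all(miss)) {
      donors <- z[!miss]
      # Sample donor positions, which avoids the sample(x) pitfall
      # when only a single numeric donor is available
      z[miss] <- donors[sample.int(length(donors), sum(miss), replace = TRUE)]
    }
    z
  }
  if (is.data.frame(x)) {
    x[] <- lapply(x, fill)  # sampling is done within each column
    x
  } else {
    fill(x)
  }
}

x <- c(1, NA, 3, NA, 5)
out <- impute_univariate_sketch(x, seed = 1)
out
```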

diff --git a/R/missRanger.R b/R/missRanger.R
@@ -1,27 +1,58 @@
 #' Fast Imputation of Missing Values by Chained Random Forests
 #' 
-#' Uses the "ranger" package (Wright & Ziegler) to do fast missing value imputation by chained random forests, see Stekhoven & Buehlmann and Van Buuren & Groothuis-Oudshoorn.
-#' Between the iterative model fitting, it offers the option of predictive mean matching. This firstly avoids imputation with values not present in the original data (like a value 0.3334 in a 0-1 coded variable). Secondly, predictive mean matching tries to raise the variance in the resulting conditional distributions to a realistic level. This allows to do multiple imputation when repeating the call to missRanger(). 
-#' The iterative chaining stops as soon as \code{maxiter} is reached or if the average out-of-bag estimate of performance stops improving. In the latter case, except for the first iteration, the second last (i.e. best) imputed data is returned.
+#' Uses the "ranger" package (Wright & Ziegler) to do fast missing value imputation by 
+#' chained random forests, see Stekhoven & Buehlmann and Van Buuren & Groothuis-Oudshoorn.
+#' Between the iterative model fitting, it offers the option of predictive mean matching. 
+#' This firstly avoids imputation with values not present in the original data 
+#' (like a value 0.3334 in a 0-1 coded variable). 
+#' Secondly, predictive mean matching tries to raise the variance in the resulting 
+#' conditional distributions to a realistic level. This allows multiple imputation 
+#' by repeating the call to \code{missRanger()}. 
+#' The iterative chaining stops as soon as \code{maxiter} is reached or if the average 
+#' out-of-bag estimate of performance stops improving. 
+#' In the latter case, except for the first iteration, the second last (i.e. best) 
+#' imputed data is returned.
 #' 
-#' A note on `mtry`: Be careful when passing a non-default `mtry` to `ranger()` because the number of available covariables might be growing during the first iteration, depending on the missing pattern. Values \code{NULL} (default) and 1 are safe choices. Additionally, recent versions of `ranger()` allow `mtry` to be a single-argument function of the number of available covariables, e.g. `mtry = function(m) max(1, m %/% 3)`.
+#' A note on \code{mtry}: Be careful when passing a non-default \code{mtry} to 
+#' \code{ranger()} because the number of available covariates might be growing during 
+#' the first iteration, depending on the missing pattern. 
+#' Values \code{NULL} (default) and 1 are safe choices. 
+#' Additionally, recent versions of \code{ranger()} allow \code{mtry} to be a 
+#' single-argument function of the number of available covariates, 
+#' e.g. \code{mtry = function(m) max(1, m %/% 3)}.
 #' 
 #' @importFrom stats var reformulate terms.formula predict setNames
 #' @importFrom ranger ranger
 #' @importFrom utils setTxtProgressBar txtProgressBar
 #' @param data A \code{data.frame} or \code{tibble} with missing values to impute.
-#' @param formula A two-sided formula specifying variables to be imputed (left hand side) and variables used to impute (right hand side). Defaults to . ~ ., i.e. use all variables to impute all variables. 
-#' If e.g. all variables (with missings) should be imputed by all variables except variable "ID", use . ~ . - ID. Note that a "." is evaluated separately for each side of the formula. Further note that variables 
-#' with missings must appear in the left hand side if they should be used on the right hand side.
-#' @param pmm.k Number of candidate non-missing values to sample from in the predictive mean matching steps. 0 to avoid this step.
+#' @param formula A two-sided formula specifying variables to be imputed 
+#' (left hand side) and variables used to impute (right hand side). 
+#' Defaults to \code{. ~ .}, i.e. use all variables to impute all variables. 
+#' If e.g. all variables (with missings) should be imputed by all variables 
+#' except variable "ID", use \code{. ~ . - ID}. Note that a "." is evaluated 
+#' separately for each side of the formula. Further note that variables with missings 
+#' must appear in the left hand side if they should be used on the right hand side.
+#' @param pmm.k Number of candidate non-missing values to sample from in the 
+#' predictive mean matching steps. 0 to avoid this step.
 #' @param maxiter Maximum number of chaining iterations.
 #' @param seed Integer seed to initialize the random generator.
-#' @param verbose Controls how much info is printed to screen. 0 to print nothing. 1 (default) to print a progress bar per iteration, 2 to print the OOB prediction error per iteration and variable (1 minus R-squared for regression).
-#' Furthermore, if \code{verbose} is positive, the variables used for imputation are listed as well as the variables to be imputed (in the imputation order). This will be useful to detect if some variables are unexpectedly skipped.
-#' @param returnOOB Logical flag. If TRUE, the final average out-of-bag prediction error is added to the output as attribute "oob". This does not work in the special case when the variables are imputed univariately.
+#' @param verbose Controls how much info is printed to screen. 
+#' 0 to print nothing. 1 (default) to print a progress bar per iteration, 
+#' 2 to print the OOB prediction error per iteration and variable 
+#' (1 minus R-squared for regression).
+#' Furthermore, if \code{verbose} is positive, the variables used for imputation are 
+#' listed as well as the variables to be imputed (in the imputation order). 
+#' This will be useful to detect if some variables are unexpectedly skipped.
+#' @param returnOOB Logical flag. If TRUE, the final average out-of-bag prediction error
+#' is added to the output as attribute "oob". This does not work in the special case 
+#' when the variables are imputed univariately.
 #' @param case.weights Vector with non-negative case weights.
-#' @param ... Arguments passed to \code{ranger()}. If the data set is large, better use less trees (e.g. \code{num.trees = 20}) and/or a low value of \code{sample.fraction}. 
-#' The following arguments are e.g. incompatible with \code{ranger}: \code{write.forest}, \code{probability}, \code{split.select.weights}, \code{dependent.variable.name}, and \code{classification}. 
+#' @param ... Arguments passed to \code{ranger()}. If the data set is large, 
+#' better use fewer trees (e.g. \code{num.trees = 20}) and/or a low value of 
+#' \code{sample.fraction}. 
+#' The following arguments are e.g. incompatible: 
+#' \code{write.forest}, \code{probability}, \code{split.select.weights}, 
+#' \code{dependent.variable.name}, and \code{classification}. 
 #'
 #' @return An imputed \code{data.frame}.
 #' 
@@ -38,57 +69,6 @@
 #' irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)
 #' head(irisImputed)
 #' head(irisWithNA)
-#'
-#' \dontrun{
-#' # With extra trees algorithm
-#' irisImputed_et <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100, splitrule = "extratrees")
-#' head(irisImputed_et)
-#' 
-#' # Passing `mtry` as a function of the number of covariables
-# irisImputed_mtry <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100, 
-#                                mtry = function(m) max(1, m %/% 3))
-# head(irisImputed_mtry)
-#' 
-#' # Do not impute Species. Note: Since this variable contains missings, it won't be used
-#' # for imputing other variables.
-#' head(irisImputed <- missRanger(irisWithNA, . - Species ~ ., pmm.k = 3, num.trees = 100))
-#' 
-#' # Impute univariately only.
-#' head(irisImputed <- missRanger(irisWithNA, . ~ 1))
-#' 
-#' # Use Species and Petal.Length to impute Species and Petal.Length.
-#' head(irisImputed <- missRanger(irisWithNA, Species + Petal.Length ~ Species + Petal.Length, 
-#'                                pmm.k = 3, num.trees = 100))
-#'                                
-#' # Multiple imputation: Fill data 20 times, run 20 analyses and pool their results.
-#' require(mice)
-#' filled <- replicate(20, missRanger(irisWithNA, verbose = 0, num.trees = 100, pmm.k = 5), 
-#'                     simplify = FALSE)
-#' models <- lapply(filled, function(x) lm(Sepal.Length ~ ., x))
-#' summary(pooled_fit <- pool(models)) # Realistically inflated standard errors and p values
-#' 
-#' # A data set with logicals, numerics, characters and factors.
-#' n <- 100
-#' X <- data.frame(x1 = seq_len(n), 
-#'                 x2 = log(seq_len(n)), 
-#'                 x3 = sample(LETTERS[1:3], n, replace = TRUE),
-#'                 x4 = factor(sample(LETTERS[1:3], n, replace = TRUE)),
-#'                 x5 = seq_len(n) > 50)
-#' head(X)
-#' X_NA <- generateNA(X, p = seq(0, 0.8, by = .2))
-#' head(X_NA)
-#' 
-#' head(X_imp <- missRanger(X_NA))
-#' head(X_imp <- missRanger(X_NA, pmm = 3))
-#' head(X_imp <- missRanger(X_NA, pmm = 3, verbose = 0))
-#' head(X_imp <- missRanger(X_NA, pmm = 3, verbose = 2, returnOOB = TRUE))
-#' attr(X_imp, "oob") # OOB prediction errors per column.
-#' 
-#' # The formula interface
-#' head(X_imp <- missRanger(X_NA, x2 ~ x2 + x3, pmm = 3)) # Does not use x3 because of NAs
-#' head(X_imp <- missRanger(X_NA, x2 + x3 ~ x2 + x3, pmm = 3))
-#' head(X_imp <- missRanger(X_NA, x2 + x3 ~ 1, pmm = 3)) # Univariate imputation
-#' }
 missRanger <- function(data, formula = . ~ ., pmm.k = 0L, maxiter = 10L, 
                        seed = NULL, verbose = 1, returnOOB = FALSE, 
                        case.weights = NULL, ...) {
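The note on `mtry` in the documentation above suggests passing a function of the number of available covariates. A small base-R check of the documented example `function(m) max(1, m %/% 3)` shows what it returns for different covariate counts. The names `mtry_fun` and `mtry_table` are illustrative only, and no call to `ranger()` is made here.

```r
# The mtry rule quoted in the roxygen note, evaluated for m = 1..10 covariates
mtry_fun <- function(m) max(1, m %/% 3)

m <- 1:10
mtry_table <- data.frame(m = m, mtry = vapply(m, mtry_fun, numeric(1)))
mtry_table
# mtry stays at 1 up to m = 5, then 2 for m = 6..8, then 3 for m = 9..10
```

Because the result never drops below 1 and never exceeds `m`, such a rule stays valid even while the set of usable covariates is still growing during the first iteration.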

diff --git a/R/pmm.R b/R/pmm.R
@@ -1,13 +1,20 @@
 #' Predictive Mean Matching
 #'
-#' For each value in the prediction vector \code{xtest}, one of the closest \code{k} values in the prediction vector \code{xtrain} is randomly chosen and its observed value in \code{ytrain} is returned. 
+#' For each value in the prediction vector \code{xtest}, one of the closest \code{k} 
+#' values in the prediction vector \code{xtrain} is randomly chosen and its observed 
+#' value in \code{ytrain} is returned. 
 #' 
 #' @importFrom stats rmultinom
 #' @importFrom FNN knnx.index
 #' 
-#' @param xtrain Vector with predicted values in the training data. Can be of type logical, numeric, character, or factor.
-#' @param xtest Vector as \code{xtrain} with predicted values in the test data. Missing values are not allowed.
-#' @param ytrain Vector of the observed values in the training data. Must be of same length as \code{xtrain}. Missing values in either of \code{xtrain} or \code{ytrain} will be dropped in a pairwise manner.
+#' @param xtrain Vector with predicted values in the training data. 
+#' Can be of type logical, numeric, character, or factor.
+#' @param xtest Vector as \code{xtrain} with predicted values in the test data. 
+#' Missing values are not allowed.
+#' @param ytrain Vector of the observed values in the training data. 
+#' Must be of same length as \code{xtrain}. 
+#' Missing values in either of \code{xtrain} or \code{ytrain} will be dropped 
+#' in a pairwise manner.
 #' @param k Number of nearest neighbours to sample from.
 #' @param seed Integer random seed.
 #'

diff --git a/README.md b/README.md
@@ -13,7 +13,7 @@
 
 ## Overview
 
-The {missRanger} package uses the {ranger} package to do fast missing value imputation by chained random forest. As such, it serves as an alternative implementation of the beautiful 'MissForest' algorithm, see vignette.
+{missRanger} uses the {ranger} package to do fast missing value imputation by chained random forest. As such, it serves as an alternative implementation of the beautiful 'MissForest' algorithm, see vignette.
 
 The main function `missRanger()` offers the option to combine random forest imputation with predictive mean matching. This firstly avoids the generation of values not present in the original data (like a value 0.3334 in a 0-1 coded variable). Secondly, this step tends to raise the variance in the resulting conditional distributions to a realistic level, a crucial element to apply multiple imputation frameworks.
 
@@ -38,28 +38,26 @@ library(missRanger)
 # Generate data with missing values in all columns
 irisWithNA <- generateNA(iris, seed = 347)
 
-# Impute missing values with missRanger
+# Impute missing values
 irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)
 
 # Check results
 head(irisImputed)
 head(irisWithNA)
 head(iris)
 
-# With extra trees algorithm
+# Replace random forest by extremely randomized trees
 irisImputed_et <- missRanger(
   irisWithNA, 
   pmm.k = 3, 
   splitrule = "extratrees", 
   num.trees = 100
 )
 
-# With "dplyr" syntax
-library(dplyr)
-
-iris %>% 
-  generateNA() %>% 
-  missRanger(verbose = 0, pmm.k = 5) %>% 
+# Using the pipe...
+iris |> 
+  generateNA() |> 
+  missRanger(pmm.k = 5, verbose = 0) |> 
   head()
 ```
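The README now uses the native pipe `|>` instead of the {dplyr}/{magrittr} pipe `%>%`. The native pipe needs R 4.1.0 or later; whether the package declares that R version is not visible in this diff. The following base-R snippet, which does not depend on {missRanger}, illustrates that a native-pipe chain is just another way of writing nested calls.

```r
# Nested calls versus the native pipe (requires R >= 4.1.0)
nested <- head(subset(iris, Species == "setosa"), 3)

piped <- iris |>
  subset(Species == "setosa") |>
  head(3)

same <- identical(nested, piped)
same  # TRUE
```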
 

diff --git a/cran-comments.md b/cran-comments.md
@@ -1,25 +1,25 @@
-This is a maintenance release, switching to 
+# missRanger 2.2.0
 
-- testthat 3,
-- modifying vignette order,
-- improving the way how the package is being updated/generated.
+- removed suggested dependencies dplyr, mice, survival
+- improved documentation
 
-## R CMD check results seem okay
+## R CMD check
 
 checking for unstated dependencies in examples ... OK
-   WARNING
+
+WARNING
   'qpdf' is needed for checks on size reduction of PDFs
 
-## Online check results seem okay (2 notes below)
+checking for future file timestamps ... NOTE
+  unable to verify current time
+
+## RHub 
 
-- check_win_devel()
-- check_rhub()
+Note: lastMiKTeXException
+
+## Reverse dependency check of 7 packages
 
-Found the following (possibly) invalid DOIs:
-  DOI: 10.1093/bioinformatics/btr597
-    From: DESCRIPTION
-    Status: Forbidden
-    Message: 403
-* checking for detritus in the temp directory ... NOTE
-Found the following files/directories:
-  'lastMiKTeXException'
+- hdImpute 0.1.1 -- E: 0 | W: 0 | N: 0
+- marginaleffects 0.11.0 -- E: 0 | W: 0 | N: 0
+- mlim 0.3.0 -- E: 0 | W: 0 | N: 0
+- NADIA 0.4.2 -- E: 0 | W: 0 | N: 1
+- outForest 0.1.2 -- E: 0 | W: 0 | N: 0
+- wiseR 1.0.1 -- E: 0 | W: 0 | N: 3
+- worcs 0.1.10 -- E: 0 | W: 0 | N: 0
+OK: 7
+BROKEN: 0
diff --git a/man/generateNA.Rd b/man/generateNA.Rd
diff --git a/man/imputeUnivariate.Rd b/man/imputeUnivariate.Rd