R GSoC 2015: Subsetted and parallel computations in matrixStats
Below is a project proposal for the R Project's participation in the Google Summer of Code 2015. The application deadline for students is on March 27, 2015, but please also note that there is a "soft" deadline to submit project tests on March 20.
The matrixStats package provides fast and memory-conservative functions for calculating summaries and statistics on rows and columns of matrices as well as vectors. For instance,
y <- matrixStats::colVars(X)
is much faster (~10-60 times) than
y <- apply(X, MARGIN = 2, FUN = var)
For benchmark results on all functions of the package, see Benchmark Reports.
There are no known projects with the same goals as the matrixStats package, except for some individual functions that are part of larger packages scattered across CRAN and Bioconductor. In contrast, the matrixStats package focuses on providing a light-weight and cross-platform solution that is straightforward to use and install (with a minimal amount of package dependencies). The matrixStats package was created in 2007 and released on CRAN in 2009. It is well tested (96% code coverage), currently used by 55 CRAN and Bioconductor packages, and among the top 2% most installed packages (RStudio CRAN logs).
The idea of this project is to extend the matrixStats package with mechanisms for calculating matrixStats summaries and statistics
- on a subset of data (e.g. rows or columns), and
- chunked parallel processing.
Efficient processing of data in subsets will boost the performance of some use cases commonly seen by end users as well as developers. Furthermore, support for subsetted calculations provides a natural framework for processing data in parallel. As one of the examples below illustrates, subsetted calculations are also useful for calculating summaries across groups of rows or columns.
The following proof-of-concept examples illustrate what the outcome of this project may be.
Calculations on a subset of rows and columns:
y <- colVars(X, rows = -(1:20), cols = 101:400)
Row summaries across groups of columns:
colGroups <- split(seq_along(fac), fac)  ## 'fac': a factor with one level per column group
y <- sapply(colGroups, FUN = function(idxs) rowVars(X, cols = idxs))
Chunked parallel processing:
cl <- parallel::makeForkCluster(nnodes = 4)
matrixStats::useCluster(cl)
y <- colVars(X)
Below are further details and motivation for this project.
Consider the following 1000-by-1000 numeric matrix:
X <- matrix(rnorm(1000^2), ncol = 1000, nrow = 1000)
Assume that we want to find the minimum value for a subset of the columns. This can be done using
y <- matrixStats::colMins(X[, 101:400])
Although the colMins() function
has been optimized to use a minimal amount
of memory, we cannot avoid the creation of the intermediate object
X[, 101:400]
in the above call. The creation of this intermediate
object introduces overhead due to (i) memory allocation of a
1000-by-300 matrix and (ii) the need to copy values. It also adds
(iii) extra work to the garbage collector which has to spend time
cleaning it out afterward. In addition, the intermediate object
(iv) increases the overall memory footprint.
The first goal of this project is to extend the API by adding
optional arguments rows and cols to the matrix functions, e.g.
y <- matrixStats::colMins(X, cols = 101:400)
as well as
y <- matrixStats::rowMins(X, rows = -(1:20), cols = 101:400)
This way the subsetting can instead be done internally and "on-the-fly" by native code.
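As a plain-R sketch of the intended semantics (the function name is hypothetical, and note that this R-level version still creates the intermediate copy that the actual C implementation would avoid):

```r
## Reference semantics for the proposed 'rows'/'cols' arguments.
## NOTE: this sketch still materializes the subset; the point of the
## project is to do the subsetting "on-the-fly" in native code instead.
colMins_ref <- function(X, rows = NULL, cols = NULL) {
  if (!is.null(rows)) X <- X[rows, , drop = FALSE]
  if (!is.null(cols)) X <- X[, cols, drop = FALSE]
  apply(X, MARGIN = 2L, FUN = min)
}
```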
Here is a proof-of-concept benchmark that illustrates the costs of
subsetting with X[, 101:400]
as well as the approach where the
computations are done on the complete object and subsetting is done
afterward:
X <- matrix(rnorm(1e6), ncol = 1000, nrow = 1000)
X_S <- X[, 101:400] ## Emulate planned subsetting
stats <- microbenchmark::microbenchmark(
  colMins(X_S),
  colMins(X)[101:400],
  colMins(X[, 101:400]),
  "apply()+min()" = apply(X[, 101:400], MARGIN = 2L, FUN = min)
)
print(stats)
Unit: microseconds
expr min lq mean median uq max
colMins(X_S) 335.3 347.6 429 406.5 419.8 684.4
colMins(X)[101:400] 3540.4 3707.7 4006 3723.5 3794.1 6610.8
colMins(X[, 101:400]) 5822.0 6060.1 7453 6267.6 9121.8 12733.8
apply()+min() 11914.3 14266.5 17614 15178.5 17035.9 92841.3
This shows that the potential speedup in this particular case is at
least 10 times compared with current colMins()
and about 40 times
compared with the traditional apply() + min()
approach. If these
calculations are done 10,000s of times (e.g. across human gene expressions or in bootstrap analyses), then the overall processing time saved with the proposed methods will be substantial.
Analogously, vector-based functions can gain from subsetting, e.g.
y <- weightedMean(x, w = w, idxs = 201:400)
Note that here subsetting is applied to two objects.
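A plain-R sketch of those semantics (function name illustrative only; the key point is that `idxs` subsets both `x` and `w`):

```r
## Reference semantics for the proposed 'idxs' argument of
## weightedMean(): the subset is applied to both 'x' and 'w'.
weightedMean_ref <- function(x, w, idxs = NULL) {
  if (!is.null(idxs)) {
    x <- x[idxs]
    w <- w[idxs]
  }
  sum(w * x) / sum(w)
}
```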
The parallel package has been part of the core R distribution since R 2.14.0 (October 2011). It provides various mechanisms for parallel processing. With the above subsetting API in place, it will immediately be possible to do efficient matrixStats calculations on data chunks in parallel, e.g.
library('matrixStats')
library('parallel')
X <- matrix(rnorm(1e6), ncol = 1000, nrow = 1000)
cl <- makeForkCluster(nnodes = 4L)
subsets <- splitIndices(ncol(X), ncl = 4L)
y <- parLapply(cl, subsets, fun = function(cols) {
  matrixStats::colMins(X, cols = cols)
})
y <- Reduce(c, y)
stopCluster(cl)
The second goal of this project is to simplify this and allow for calls such as:
y <- colMins(X, cl = cl)
or, better, by telling matrixStats via a global setting to use parallel processing, e.g.
matrixStats::useCluster(cl)
such that matrixStats calls look the same regardless of whether parallel processing is used or not, e.g.
yc <- colMins(X)
yr <- rowMins(X)
The latter, referred to as "implicit parallel processing", would provide any existing code and packages already using matrixStats with an immediate performance boost without the need for modifications/updates.
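One way such implicit parallel processing could be wired up is via a global option that each function consults. The sketch below (all names are hypothetical, not the final API) falls back to serial processing when no cluster is registered:

```r
## Sketch of implicit parallel processing via a global option.
## Names ('useCluster', 'matrixStats.cluster', 'colMins2') are
## hypothetical placeholders for the API this project would design.
useCluster <- function(cl) {
  options(matrixStats.cluster = cl)  ## register (or NULL to unregister)
  invisible(cl)
}

colMins2 <- function(X) {
  cl <- getOption("matrixStats.cluster", NULL)
  if (is.null(cl)) {
    ## Serial fallback; a real implementation would call the C code.
    return(apply(X, MARGIN = 2L, FUN = min))
  }
  ## Chunk the columns, one chunk per cluster node, and combine.
  subsets <- parallel::splitIndices(ncol(X), ncl = length(cl))
  y <- parallel::parLapply(cl, subsets, fun = function(cols) {
    apply(X[, cols, drop = FALSE], MARGIN = 2L, FUN = min)
  })
  Reduce(c, y)
}
```

With this design, existing calls such as `colMins2(X)` need no changes; registering a cluster once is enough to switch them to chunked parallel processing.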
The overall objective of this project, and of the matrixStats package, is to provide the larger research community and industry with an awesome and reliable tool.
Examples of tasks, challenges and goals (some may be optional):
-
Implement support for optional arguments 'rows' and 'cols' (and 'idxs' for vector functions) in existing matrixStats functions, both at the R level and the C level. This needs to be done in a way such that the added overhead is kept at a minimum, also when subsetting is not performed.
-
Write unit tests for common cases as well as corner cases (e.g. single-element objects, missing values, and integer-overflow calculations). Correctness of the calculations of matrixStats is critical for users.
-
Implement support for chunked parallel processing. Overhead added should be kept at a minimum, also in the case when parallel processing is not used.
-
Explore options for implementing the parallel support in native code in order to further minimize any overhead.
-
Consider how to handle early stopping and exceptions due to failed parallel processing (e.g. timeouts).
-
Sometimes the overhead of parallel processing is greater than the gain, e.g. for small data objects. Add size options/thresholds to control when parallel processing should be skipped. Explore the possibility of tuning (estimating and setting) these thresholds automatically for the currently running R setup.
-
Write benchmark reports.
-
Embrace bug reports and reproducible research.
-
Never be afraid to ask questions or propose alternative ideas/solutions - we all learn from asking questions and making mistakes, and behind every great idea there is a large number of "failed" ideas.
-
Pay attention to, and maybe even engage with, the "official" efforts in R and Bioconductor on parallel and distributed processing in R.
-
Learn new things and have fun!
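The size-threshold idea above could be prototyped as a small predicate consulted before dispatching to a cluster. In this sketch the option names and the default threshold are hypothetical; in practice the threshold could be estimated by benchmarking the running R setup:

```r
## Hypothetical sketch: decide whether parallel processing is worthwhile.
## Both option names and the 1e6-element default are placeholders.
useParallel <- function(X,
                        threshold = getOption("matrixStats.threshold", 1e6)) {
  cl <- getOption("matrixStats.cluster", NULL)
  ## Parallelize only if a cluster is registered AND the object is
  ## large enough for the gain to outweigh the dispatch overhead.
  !is.null(cl) && length(X) >= threshold
}
```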
Any interested students should get in touch with Henrik Bengtsson ([email protected]; Google Melange username/id: henrikb) or Héctor Corrada Bravo ([email protected]; Google Melange username/id: hcorrada) as soon as possible so we can begin work on a more precise project proposal with specific goals and target completion dates. If anything is unclear, don't hesitate to drop us a message.
In order to succeed with this project, the following skills are required or very highly recommended:
- R package development, e.g. building, testing and testing and testing.
- C-level programming, since modifications to existing native code will be needed.
- Experience with GitHub is a major advantage, since that is how you will access existing code and share your own.
- If you already know how to use valgrind, that will also be a major advantage. If not, you will most likely learn how to use it since it is a very useful tool for validating C code.
Students, please send us results (links) of the below tests. You are not required to do all tests, but doing more tests and more difficult tests makes you more likely to be selected for this project. These tests also serve as a means for you (and us) to get a feeling for whether this project is a good match for you. The more of the tests that you understand, attack, and even solve, the more confident you can be that you will be able to deliver on this project.
-
Easy: Git and R package creation
- Fork the matrixStats repository, build the package, and check it with R CMD check --as-cran.
- Check out the 'develop' branch and redo the same.
- If possible, also check with R CMD check --use-valgrind.
-
Easy: Prototyping in R
- Write an R function rowMins_R() that takes an argument cols such that one can call:
X <- matrix(1:50, nrow = 5, ncol = 10)
y <- rowMins_R(X)
y <- rowMins_R(X, cols = 6:8)
Demonstrate that the output is correct.
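For orientation, one possible plain-R solution is sketched below (a minimal sketch only; a full answer should also consider corner cases such as single-element or zero-column input):

```r
## Minimal plain-R prototype of rowMins_R() with an optional 'cols'
## argument, as asked for in the test above.
rowMins_R <- function(X, cols = NULL) {
  if (!is.null(cols)) X <- X[, cols, drop = FALSE]
  apply(X, MARGIN = 1L, FUN = min)
}
```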
-
Easy/Medium: Miscellaneous
- If you have an idea on how to improve this project or matrixStats, we would love to hear about it.
- If you have developed your own R package that uses C/C++ code (or other related code) which you are proud of, please don't hesitate to share it with us.
-
Medium: C programming
- Using the inline package, add a prototype of your rowMins_R() function above, implemented in C, and name it rowMins_C().
-
Medium: R programming
- Using your rowMins_C() (alternatively rowMins_R()) function above, verify that you can find the row minima for a set of column groups of a matrix (similar to one of the examples in the 'Project ideas' section):
X <- matrix(1:50, nrow = 5, ncol = 10)
factors <- as.factor(c(3, 1, 1, 2, 2, 1, 1, 2, 3, 1))
groups <- split(seq_along(factors), factors)
Y <- sapply(groups, FUN = function(idxs) rowMins_R(X, cols = idxs))
Demonstrate its correctness.
-
Medium/Hard: C programming
- Implement weightedVar(x, w), which currently calculates the weighted variance of vector x in plain R, in C. Demonstrate its correctness.
Hint: You might find it helpful to use the weightedMean() function, with its high-level R and low-level C functions, as a template and modify accordingly.
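When porting weightedVar() to C, a plain-R reference is useful for validating the C version. The sketch below uses the frequency-weight estimator (weighted mean, then weighted squared deviations over sum(w) - 1); the exact definition to match should be taken from the current R implementation in matrixStats:

```r
## Plain-R reference for a weighted variance (frequency-weight form);
## whether this matches matrixStats::weightedVar() exactly must be
## checked against that function's current R code.
weightedVar_ref <- function(x, w) {
  m <- sum(w * x) / sum(w)            ## weighted mean
  sum(w * (x - m)^2) / (sum(w) - 1)   ## weighted variance
}
```

With this form, integer weights behave like repeated observations, which gives a simple correctness check against var().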
-
Hard: Begin work on the project.