Intro to Vectorization in R

A.K.A. Do Nots USES da LOOPS

Created for August 2012 DC R Users Group Meetup

For me, one of the largest stumbling blocks in R was the idea of vectorization. The for loop is one of the most intuitive ideas in programming. If you program in bash or are a heavy Linux user, it is second nature.

In R, it is to be avoided. You do not want to explicitly call a for loop when a vectorized alternative exists.

Here is a toy example:

df = data.frame(col1 = rnorm(1000), col2 = rnorm(1000, 10), col3 = rnbinom(n = 1000, 
    size = 3, mu = 30))
# A simple data.frame
head(df)
##        col1   col2 col3
## 1 -0.626870  8.106   31
## 2  0.496667  9.577   50
## 3  0.593562 10.729   83
## 4  0.306505  8.315   45
## 5 -0.352709  9.043   41
## 6  0.006334 10.249   41
# Say we want the column-wise mean of the data frame
out <- c()
for (i in 1:dim(df)[2]) {
    out[i] <- mean(df[, i])
    names(out)[i] <- colnames(df)[i]
}

print(out)
##     col1     col2     col3 
##  0.01494  9.99296 30.53100 

In the words of the Bruno, nish-nish.

We are actually committing two R-sins here: non-vectorized code and growing objects. What we want is to vectorize using one of the apply-family functions:

apply(df, MARGIN = 2, FUN = mean)
##     col1     col2     col3 
##  0.01494  9.99296 30.53100 

Or, more simply, with the user-friendly sapply variant:

sapply(df, mean)
##     col1     col2     col3 
##  0.01494  9.99296 30.53100 

Let's look at what that just did.

'apply' functions are used to apply functions over arrays, matrices or lists. In R you can pass functions as parameters. The fancy-dancy CS term is that in R functions are first-class citizens. Get comfortable with it because it is used all over the place in R. This is akin to a callback function in async JavaScript.

// example with jQuery
$.getJSON('http://url/', function(data){
//do fun stuff with data
});
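The same idea can be sketched in R itself. This is only a minimal illustration: `apply_twice` and `square` are hypothetical names made up here, not part of R or of the original talk.

```r
# Hypothetical helper: takes a function `f` as an argument and calls it twice.
apply_twice <- function(f, x) {
  f(f(x))
}

# Hypothetical function to pass in.
square <- function(x) x^2

# Pass a named function, just like passing `mean` to sapply().
res_named <- apply_twice(square, 3)

# Pass an anonymous function defined on the spot, like the JavaScript callback.
res_anon <- apply_twice(function(x) x + 1, 3)

# Functions are ordinary objects: they can be stored in a list and looped over.
summaries <- list(mean = mean, median = median, max = max)
res_list <- sapply(summaries, function(f) f(c(1, 2, 3, 10)))
print(res_list)
```

Note that `square` and `mean` are written without parentheses when passed in: we hand over the function itself, not the result of calling it.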

If this is Greek to you, don't worry. Just be aware that when we call sapply(df, mean), the mean that we are passing in is not a numeric object but a function. This is a flavor of functional languages that is mixed into the R-soup. When we say no loops, obviously somewhere a little computer gnome has to loop through the data (that's how computers work, right?). But this looping is done in the C/FORTRAN code that underlies R and is generally faster. Now back to the apply function.

The three parameters in the apply function are:

apply(
 X = the object we are "looping over"
 MARGIN = the 'axis' we are interested in traversing (1 = row-wise, 2 = column-wise)
 FUN = the function we want to apply
    )
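To make MARGIN concrete, here is a small sketch on a matrix tiny enough to check by hand. The matrix `m` and the result names are made up for illustration.

```r
# A small matrix so the results can be checked by hand.
m <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
#    a b c
# r1 1 3 5
# r2 2 4 6

# MARGIN = 1 traverses rows: one result per row.
row_sums <- apply(m, MARGIN = 1, FUN = sum)

# MARGIN = 2 traverses columns: one result per column.
col_sums <- apply(m, MARGIN = 2, FUN = sum)

# Extra arguments after FUN are passed on to FUN itself (here, na.rm goes to mean).
col_means <- apply(m, MARGIN = 2, FUN = mean, na.rm = TRUE)

print(row_sums)
print(col_sums)
print(col_means)
```

Here `row_sums` has one entry per row (9 and 12) and `col_sums` one per column (3, 7 and 11).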

Truly, if there is one point that I would like you to take home, it is this: regardless of the spurious lies your mother and Montessori teachers told you over the years, you are not that special or smart. Whatever you are doing, it likely has been done before. So if you find yourself reinventing the R-wheel, make sure you poke around first.

# the sane way of getting column-wise means
colMeans(df)
##     col1     col2     col3 
##  0.01494  9.99296 30.53100 
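As a quick sanity check, the three approaches can be compared directly, and colMeans() has a few base-R relatives worth knowing. This is a sketch on freshly simulated data in the same shape as the toy example; the `set.seed(1)` call is an addition here for reproducibility, so the numbers will not match the output printed above.

```r
# Simulated data in the same shape as the toy example
# (set.seed is an assumption added here, only to make the sketch reproducible).
set.seed(1)
df <- data.frame(col1 = rnorm(1000), col2 = rnorm(1000, 10),
                 col3 = rnbinom(n = 1000, size = 3, mu = 30))

via_apply   <- apply(df, MARGIN = 2, FUN = mean)
via_sapply  <- sapply(df, mean)
via_builtin <- colMeans(df)

# All three agree up to floating-point tolerance.
print(all.equal(via_apply, via_builtin))
print(all.equal(via_sapply, via_builtin))

# Base-R relatives of colMeans() that replace other common loops.
col_totals <- colSums(df)
row_totals <- rowSums(df)
row_avgs   <- rowMeans(df)
```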

So if you are thinking of a loop, don't.
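For the rare case where a loop really cannot be avoided, the second sin from earlier (growing objects) can still be fixed by preallocating the result. A minimal sketch, again on simulated data with an added `set.seed(1)` that is not in the original example:

```r
# Same data shape as the toy example (set.seed added only for reproducibility).
set.seed(1)
df <- data.frame(col1 = rnorm(1000), col2 = rnorm(1000, 10),
                 col3 = rnbinom(n = 1000, size = 3, mu = 30))

# Preallocate the result at its final length instead of growing it from c().
out <- numeric(ncol(df))
names(out) <- colnames(df)

# seq_along() is safer than 1:n, which counts 1, 0 when n is 0.
for (i in seq_along(df)) {
  out[i] <- mean(df[[i]])
}

# vapply() is a stricter sapply(): the result type and length are declared up front.
out_vapply <- vapply(df, mean, FUN.VALUE = numeric(1))

print(out)
print(out_vapply)
```

Both versions give the same named vector as colMeans(df); the built-in is still the one to reach for first.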

If you are doing this kind of data manipulation regularly, it is well worth your time to investigate the plyr package. Excellent resource.

More reading

  • R Inferno - Highly recommend even if you're not a Dante fan
    • Especially Ch. 3 and 4
  • Great intro to the apply() family
  • plyr - The R prophet Hadley Wickham's (peace be upon his name) excellent data manipulation package for data frames
    • Most data reshaping and back-flipping can be done with this great package.