For me, one of the largest stumbling blocks in R was the idea of vectorization. The for loop is one of the most intuitive ideas in programming. If you program in bash or are a heavy Linux user, it is second nature.
Here is a toy example:
df = data.frame(col1 = rnorm(1000), col2 = rnorm(1000, 10), col3 = rnbinom(n = 1000,
size = 3, mu = 30))
# A simple data.frame
head(df)
## col1 col2 col3
## 1 -0.626870 8.106 31
## 2 0.496667 9.577 50
## 3 0.593562 10.729 83
## 4 0.306505 8.315 45
## 5 -0.352709 9.043 41
## 6 0.006334 10.249 41
# Say we want the column-wise mean of the data frame
out <- c()
for (i in 1:dim(df)[2]) {
  out[i] <- mean(df[, i])
  names(out)[i] <- colnames(df)[i]
}
print(out)
## col1 col2 col3
## 0.01494 9.99296 30.53100
In the words of Bruno, nish-nish.
We are actually committing two R-sins here: non-vectorized code and growing objects. What we want is to vectorize using one of the apply-family functions:
apply(df, MARGIN = 2, FUN = mean)
## col1 col2 col3
## 0.01494 9.99296 30.53100
Or, more simply, with the user-friendly sapply variant:
sapply(df, mean)
## col1 col2 col3
## 0.01494 9.99296 30.53100
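If a loop really is the clearest way to say something, the growing-object sin can at least be avoided by allocating the result up front. Here is a minimal sketch, assuming a small stand-in data frame (`toy` and its values are made up for illustration, it is not the `df` above):

```r
# Hypothetical stand-in data frame, small enough to check by hand
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Allocate the full result once instead of growing it inside the loop
out <- numeric(ncol(toy))
names(out) <- colnames(toy)

# seq_along() also behaves when there are zero columns, unlike 1:ncol(toy)
for (i in seq_along(toy)) {
  out[i] <- mean(toy[[i]])
}
print(out)

# vapply() is a stricter cousin of sapply(): it checks that every
# result is a single numeric value
checked <- vapply(toy, mean, FUN.VALUE = numeric(1))
print(checked)
```

The loop is still a loop, but R no longer has to copy `out` every time it gets one element longer.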
The 'apply' functions are used to apply a function over arrays, matrices or lists. In R you can pass functions as parameters. The fancy-dancy CS term is that in R functions are first-class citizens. Get comfortable with it, because it is used all over the place in R. This is akin to a callback function in asynchronous JavaScript.
// example with jQuery
$.getJSON('http://url/', function(data) {
  // do fun stuff with data
});
If this is Greek to you, don't worry. Just be aware that when we call sapply(df, mean), the mean that we are passing in is not a numeric object but a function. This is a flavor of functional languages that is mixed into the R-soup. When we say no loops, obviously somewhere a little computer gnome has to loop through the data (that's how computers work, right?). But for the built-in vectorized functions this looping is done in the C/FORTRAN code that underlies R and is generally faster. Now back to the apply function.
The three parameters in the apply function are:
apply(
X = the object we are "looping over"
MARGIN = the 'axis' we are interested in traversing (1 = row-wise, 2 = column-wise)
FUN = the function we want to apply
)
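To make MARGIN concrete, here is a small sketch on a hypothetical 2 x 3 matrix (the name `m` and its values are made up for illustration). MARGIN = 1 walks the rows, MARGIN = 2 walks the columns, and FUN can be any function, including one defined on the spot:

```r
# Hypothetical 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

# MARGIN = 1: apply FUN to each row
row_sums <- apply(m, MARGIN = 1, FUN = sum)

# MARGIN = 2: apply FUN to each column
col_sums <- apply(m, MARGIN = 2, FUN = sum)

# FUN does not have to be a built-in; an anonymous function works too
col_ranges <- apply(m, MARGIN = 2, FUN = function(x) max(x) - min(x))

print(row_sums)
print(col_sums)
print(col_ranges)
```

The result has one element per row when MARGIN = 1 and one per column when MARGIN = 2, which is a handy sanity check when you forget which is which.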
Truly, if there is one point I would like you to take home, it is this: regardless of the spurious lies your mother and Montessori teachers told you over the years, you are not that special or smart. Whatever you are doing, it has likely been done before. So if you find yourself reinventing the R-wheel, make sure you poke around first.
# the sane way of getting column-wise means
colMeans(df)
## col1 col2 col3
## 0.01494 9.99296 30.53100
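If you don't trust that the built-in gives the same answer as the apply versions, you can compare them directly. Here is a sketch on a hypothetical stand-in data frame (`toy` is made up for illustration); colSums, rowMeans and rowSums are the base R siblings of colMeans:

```r
# Hypothetical stand-in data frame
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

via_sapply   <- sapply(toy, mean)
via_apply    <- apply(toy, MARGIN = 2, FUN = mean)
via_colMeans <- colMeans(toy)

# all.equal() compares values and names, tolerating tiny floating-point noise
same <- isTRUE(all.equal(via_sapply, via_colMeans)) &&
  isTRUE(all.equal(via_apply, via_colMeans))
print(same)

# The siblings work the same way
print(colSums(toy))
print(rowMeans(toy))
print(rowSums(toy))
```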
If you are doing this kind of data manipulation regularly, it is well worth your time to investigate the plyr package.
Excellent resources:
- R Inferno - Highly recommended, even if you're not a Dante fan
- Especially Ch. 3 and 4
- Great intro to the apply() family
- plyr - The R prophet Hadley Wickham's (peace be upon his name) excellent data manipulation package for data frames
- Most data reshaping and back-flipping can be done with this great package.