Apply
The “apply family” of functions (apply, tapply, lapply and
others) and related functions such as aggregate are central to using
R. They provide a concise, elegant and efficient way to apply
(sometimes referred to as “to map”) a function to a set of cases, be
they rows or columns in a matrix or data.frame, or elements in a list.
apply
Consider matrix m
m <- matrix(1:15, nrow=5, ncol=3)
m
##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15
Computation with matrices is ‘vectorized’. For example, you can do
m * 5 to multiply all values of m by 5, or do m^2 or m * m
to square the values of m. But often we need to compute values for
the margins of a matrix, that is, a single value for each row or column.
The apply function can be used for that:
# sum values in each row
apply(m, 1, sum)
## [1] 18 21 24 27 30
# get mean for each column
apply(m, 2, mean)
## [1] 3 8 13
apply needs at least three arguments: a matrix or
data.frame, a value that is either 1 or 2 indicating whether the
computation is for rows (1) or for columns (2), and a function that computes a
new value (or values) for each row or column. You can read more about
this in the help file of the function (type ?apply). If the data contain
missing values you will often also add the argument na.rm=TRUE, which apply
passes on to the function, because a computation such as mean or sum
that includes an NA returns NA. In this
case we used the existing basic functions mean and sum, but we could
also supply a function that we wrote ourselves.
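The two points above (passing na.rm=TRUE and supplying your own function) can be sketched as follows. This is only an illustration: m2 and value_range are names made up for this example and are not used elsewhere in the text.

```r
m <- matrix(1:15, nrow = 5, ncol = 3)

# copy of m with one missing value (m2 is a hypothetical name)
m2 <- m
m2[1, 2] <- NA

# without na.rm, the mean of the column that has an NA is NA
col_means_na <- apply(m2, 2, mean)
col_means_na
## [1]  3 NA 13

# arguments given after the function are passed on to that function
col_means <- apply(m2, 2, mean, na.rm = TRUE)
col_means
## [1]  3.0  8.5 13.0

# a function we wrote ourselves (hypothetical): largest minus smallest value
value_range <- function(x) max(x) - min(x)
row_ranges <- apply(m, 1, value_range)
row_ranges
## [1] 10 10 10 10 10
```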
Note that apply (and related functions such as tapply and
sapply) are ways to avoid writing a loop. In the apply examples
above you could have written a loop to do the computations row by row
(or column by column) but using apply is more compact and efficient.
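For comparison, this is roughly what the row sums would look like as an explicit loop. It is a minimal sketch; row_totals is a name chosen for this example.

```r
m <- matrix(1:15, nrow = 5, ncol = 3)

# create an empty vector to hold one result per row
row_totals <- numeric(nrow(m))

# loop over the rows and sum the values in each
for (i in seq_len(nrow(m))) {
  row_totals[i] <- sum(m[i, ])
}
row_totals
## [1] 18 21 24 27 30

# the same computation in a single line
apply(m, 1, sum)
## [1] 18 21 24 27 30
```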
The rowSums and colSums functions are fast shorthand for apply(m, 1, sum) and apply(m, 2, sum):
rowSums(m)
## [1] 18 21 24 27 30
tapply
tapply can be used to compute a summary statistic, e.g. a mean
value, for groups of rows in a data.frame. You need one column that
indicates the group, and then you can compute, for example, the mean
value for that group.
colnames(m) <- c('v1', 'v2', 'v3')
d <- data.frame(name=c('Yi', 'Yi', 'Yi', 'Er', 'Er'), m, stringsAsFactors=FALSE)
d$v2[1] <- NA
d
##   name v1 v2 v3
## 1   Yi  1 NA 11
## 2   Yi  2  7 12
## 3   Yi  3  8 13
## 4   Er  4  9 14
## 5   Er  5 10 15
Imagine that you would like to compute the average value of v1,
v2 and v3 for each individual (name). You can use tapply
for that, one variable at a time.
tapply(d$v1, d$name, mean)
## Er Yi
## 4.5 2.0
tapply(d$v1, d$name, max)
## Er Yi
## 5 3
tapply(d$v2, d$name, mean)
## Er Yi
## 9.5 NA
tapply(d$v2, d$name, mean, na.rm=TRUE)
## Er Yi
## 9.5 7.5
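If you want the group means for all three variables at once while still using tapply, one possible approach is to wrap it in sapply, which calls the function once for each column of the data.frame. This is a sketch rather than the recommended way; group_means is a name made up for this example.

```r
m <- matrix(1:15, nrow = 5, ncol = 3)
colnames(m) <- c("v1", "v2", "v3")
d <- data.frame(name = c("Yi", "Yi", "Yi", "Er", "Er"), m, stringsAsFactors = FALSE)
d$v2[1] <- NA

# for each of the three columns, compute the mean by name
group_means <- sapply(d[, c("v1", "v2", "v3")],
                      function(x) tapply(x, d$name, mean, na.rm = TRUE))
group_means
##     v1  v2   v3
## Er 4.5 9.5 14.5
## Yi 2.0 7.5 12.0
```

The result is a matrix with one row for each group and one column for each variable.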
aggregate
aggregate is similar to tapply but more convenient if you want
to compute a summary statistic for multiple variables. It does have the
annoying problem that the second argument (by) must be a list; a plain vector is not accepted:
aggregate(d[, c("v1", "v2", "v3")], d$name, mean, na.rm=TRUE)
## Error in aggregate.data.frame(d[, c("v1", "v2", "v3")], d$name, mean, : 'by' must be a list
You can fix that in two ways:
aggregate(d[, c("v1", "v2", "v3")], d[, "name", drop=FALSE], mean, na.rm=TRUE)
##   name  v1  v2   v3
## 1   Er 4.5 9.5 14.5
## 2   Yi 2.0 7.5 12.0
# or
aggregate(d[, c("v1", "v2", "v3")], list(d$name), mean, na.rm=TRUE)
##   Group.1  v1  v2   v3
## 1      Er 4.5 9.5 14.5
## 2      Yi 2.0 7.5 12.0
As explained before, this is why the first one works: when you extract a
single column from a matrix or data.frame, the structure (class)
“drops” to a simpler form and becomes a vector. drop=FALSE stops
that from happening, so the result is still a data.frame (which is also a list).
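You can check the effect of drop=FALSE yourself with class and is.list. The sketch below also shows a small variation on the second fix: giving the list element a name, so that the grouping column is not called Group.1. The name res is made up for this example.

```r
m <- matrix(1:15, nrow = 5, ncol = 3)
colnames(m) <- c("v1", "v2", "v3")
d <- data.frame(name = c("Yi", "Yi", "Yi", "Er", "Er"), m, stringsAsFactors = FALSE)
d$v2[1] <- NA

# a single column drops to a plain vector
class(d[, "name"])
## [1] "character"

# with drop=FALSE it stays a data.frame, and a data.frame is a list
class(d[, "name", drop = FALSE])
## [1] "data.frame"
is.list(d[, "name", drop = FALSE])
## [1] TRUE

# a named list element sets the name of the grouping column
res <- aggregate(d[, c("v1", "v2", "v3")], list(name = d$name), mean, na.rm = TRUE)
names(res)
## [1] "name" "v1"   "v2"   "v3"
```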
sapply and lapply
To iterate over a list, we can use lapply or sapply. The
difference is that lapply always returns a list while sapply
tries to simplify the result to a vector or matrix.
names <- list("Antoinette", "Mary", "Duncan", "Obalaya", "Jojo")
nchar("Jim")
## [1] 3
lapply(names, nchar)
## [[1]]
## [1] 10
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 7
##
## [[5]]
## [1] 4
sapply(names, nchar)
## [1] 10 4 6 7 4
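The difference in return type can be verified directly. The sketch below also uses vapply, a base R function that is not discussed in this text: it behaves like sapply, but you state the type and length that each result should have, and you get an error if a result does not match. The object names as_list, as_vector and checked are made up for this example.

```r
names <- list("Antoinette", "Mary", "Duncan", "Obalaya", "Jojo")

as_list <- lapply(names, nchar)
as_vector <- sapply(names, nchar)

# lapply returns a list, sapply simplified the result to a vector
is.list(as_list)
## [1] TRUE
is.list(as_vector)
## [1] FALSE

# vapply: each result must be a single integer
checked <- vapply(names, nchar, FUN.VALUE = integer(1))
checked
## [1] 10  4  6  7  4
```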
In all these cases, with apply, tapply, sapply, lapply and aggregate (and many more
functions), we provided some data and a function, such as mean or
nchar. You can also provide your own custom function. For example:
shortname <- function(name) {
  if (nchar(name) < 5) {
    name <- toupper(name)
    return(name)
  } else {
    name <- substr(name, 1, 5)
    return(name)
  }
}
sapply(names, shortname)
## [1] "Antoi" "MARY" "Dunca" "Obala" "JOJO"
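Two variations on this pattern may be useful. As with na.rm=TRUE earlier, additional arguments given after the function are passed on to it, and a short function can be written directly inside the call without giving it a name. The function shortname_n and its argument n are hypothetical and only serve to illustrate this.

```r
names <- list("Antoinette", "Mary", "Duncan", "Obalaya", "Jojo")

# hypothetical variant of shortname with the length as an argument
shortname_n <- function(name, n = 5) {
  if (nchar(name) < n) {
    toupper(name)
  } else {
    substr(name, 1, n)
  }
}

# n = 3 is passed on to shortname_n for each element
short3 <- sapply(names, shortname_n, n = 3)
short3
## [1] "Ant" "Mar" "Dun" "Oba" "Joj"

# an anonymous function: is the name longer than five characters?
is_long <- sapply(names, function(x) nchar(x) > 5)
is_long
## [1]  TRUE FALSE  TRUE  TRUE FALSE
```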
More examples: https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/