Apply
The “apply family” of functions (apply, tapply, lapply and others) and related functions such as aggregate are central to using R. They provide a concise, elegant and efficient way to apply (sometimes referred to as “to map”) a function to a set of cases, be they rows or columns in a matrix or data.frame, or elements in a list.
apply
Consider matrix m
m <- matrix(1:15, nrow=5, ncol=3)
m
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 2 7 12
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15
Computation with matrices is ‘vectorized’. For example, you can do m * 5 to multiply all values of m by 5, or do m^2 or m * m to square the values of m. But often we need to compute values for the margins of a matrix, that is, a single value for each row or column. The apply function can be used for that:
# sum values in each row
apply(m, 1, sum)
## [1] 18 21 24 27 30
# get mean for each column
apply(m, 2, mean)
## [1] 3 8 13
apply
needs at least three arguments: a matrix
or
data.frame
, a value that is either 1 or 2 indicating whether the
computation is for rows or for columns, and a function that computes a
new value (or values) for each row or column. You can read more about
this in the help file of the function (type ?apply
). In most cases
you will also add the argument na.rm=TRUE
to remove missing values
as any computation that includes an NA
will return NA
. In this
case we used existing basic functions mean
and sum
but we could
also supply a function that we wrote ourselves.
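As a minimal sketch of both points, the example below supplies a function we wrote ourselves and passes na.rm=TRUE through apply. The names spread and m2 are made up for this illustration; they are not part of the original example.

```r
# same matrix as in the text
m <- matrix(1:15, nrow=5, ncol=3)

# a function we wrote ourselves: the difference between the
# largest and the smallest value ('spread' is a hypothetical name)
spread <- function(x, na.rm=FALSE) {
  max(x, na.rm=na.rm) - min(x, na.rm=na.rm)
}

# spread of the values in each row
row_spread <- apply(m, 1, spread)
row_spread
## [1] 10 10 10 10 10

# copy of m with one missing value (hypothetical, for illustration)
m2 <- m
m2[1, 2] <- NA

# without na.rm the second column mean is NA
col_mean <- apply(m2, 2, mean)
col_mean
## [1]  3 NA 13

# arguments after the function are passed on to that function
col_mean_narm <- apply(m2, 2, mean, na.rm=TRUE)
col_mean_narm
## [1]  3.0  8.5 13.0
```

Any argument that follows the function in the call to apply is handed to that function, which is why na.rm=TRUE works here for mean and would work for spread as well.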
Note that apply (and related functions such as tapply and sapply) is a way to avoid writing a loop. In the apply examples above you could have written a loop to do the computations row by row (or column by column), but using apply is more compact and easier to read.
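For comparison, here is a sketch of what the loop version of the row sums could look like. The variable names row_sums and i are our own choice for this illustration.

```r
# same matrix as in the text
m <- matrix(1:15, nrow=5, ncol=3)

# loop version: create an empty result vector, then fill it row by row
row_sums <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  row_sums[i] <- sum(m[i, ])
}
row_sums
## [1] 18 21 24 27 30

# the apply version needs one line and gives the same values
same <- all(row_sums == apply(m, 1, sum))
same
## [1] TRUE
```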
The rowSums and colSums functions are (fast) shorthand functions for apply( , 1, sum) and apply( , 2, sum):
rowSums(m)
## [1] 18 21 24 27 30
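The column version works the same way. Base R also has rowMeans and colMeans, which are not mentioned above; this short sketch checks that they agree with the corresponding apply calls.

```r
# same matrix as in the text
m <- matrix(1:15, nrow=5, ncol=3)

# sum of each column
col_sums <- colSums(m)
col_sums
## [1] 15 40 65

# mean of each column
col_means <- colMeans(m)
col_means
## [1]  3  8 13

# same values as the apply calls
sums_agree <- all(col_sums == apply(m, 2, sum))
means_agree <- all(col_means == apply(m, 2, mean))
c(sums_agree, means_agree)
## [1] TRUE TRUE
```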
tapply
tapply can be used to compute a summary statistic, e.g. a mean value, for groups of rows in a data.frame. You need one column that indicates the group, and then you can compute, for example, the mean value for each group.
colnames(m) <- c('v1', 'v2', 'v3')
d <- data.frame(name=c('Yi', 'Yi', 'Yi', 'Er', 'Er'), m, stringsAsFactors=FALSE)
d$v2[1] <- NA
d
## name v1 v2 v3
## 1 Yi 1 NA 11
## 2 Yi 2 7 12
## 3 Yi 3 8 13
## 4 Er 4 9 14
## 5 Er 5 10 15
Imagine that you would like to compute the average value of v1, v2 and v3 for each individual (name). You can use tapply for that.
tapply(d$v1, d$name, mean)
## Er Yi
## 4.5 2.0
tapply(d$v1, d$name, max)
## Er Yi
## 5 3
tapply(d$v2, d$name, mean)
## Er Yi
## 9.5 NA
tapply(d$v2, d$name, mean, na.rm=TRUE)
## Er Yi
## 9.5 7.5
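tapply is not limited to built-in functions. As a sketch, the example below counts the non-missing values of v2 for each name with a small function written on the spot, and then looks up the result for one group by its name. The variable names n_valid and v2_mean are made up for this illustration.

```r
# rebuild the data.frame from the text
m <- matrix(1:15, nrow=5, ncol=3)
colnames(m) <- c('v1', 'v2', 'v3')
d <- data.frame(name=c('Yi', 'Yi', 'Yi', 'Er', 'Er'), m, stringsAsFactors=FALSE)
d$v2[1] <- NA

# number of non-missing values of v2 for each name
n_valid <- tapply(d$v2, d$name, function(x) sum(!is.na(x)))
n_valid
## Er Yi
##  2  2

# the result carries the group names, so you can look up one group
v2_mean <- tapply(d$v2, d$name, mean, na.rm=TRUE)
v2_mean[["Yi"]]
## [1] 7.5
```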
aggregate
aggregate is similar to tapply but more convenient if you want to compute a summary statistic for multiple variables. It does have the annoying problem that the second argument must be a list; it cannot be a vector:
aggregate(d[, c("v1", "v2", "v3")], d$name, mean, na.rm=TRUE)
## Error in aggregate.data.frame(d[, c("v1", "v2", "v3")], d$name, mean, : 'by' must be a list
You can fix that in two ways:
aggregate(d[, c("v1", "v2", "v3")], d[, "name", drop=FALSE], mean, na.rm=TRUE)
## name v1 v2 v3
## 1 Er 4.5 9.5 14.5
## 2 Yi 2.0 7.5 12.0
# or
aggregate(d[, c("v1", "v2", "v3")], list(d$name), mean, na.rm=TRUE)
## Group.1 v1 v2 v3
## 1 Er 4.5 9.5 14.5
## 2 Yi 2.0 7.5 12.0
As explained before, this is why the first one works: when you extract a single column from a matrix or data.frame, the structure (class) “drops” to a simpler form; it becomes a vector. drop=FALSE stops that from happening, so the result is still a data.frame, and a data.frame is a list.
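You can check this yourself. The sketch below compares the class of the extracted column with and without drop=FALSE. It also shows a third variant, which is our own addition: naming the list element, so that the grouping column is called name rather than Group.1.

```r
# rebuild the data.frame from the text
m <- matrix(1:15, nrow=5, ncol=3)
colnames(m) <- c('v1', 'v2', 'v3')
d <- data.frame(name=c('Yi', 'Yi', 'Yi', 'Er', 'Er'), m, stringsAsFactors=FALSE)
d$v2[1] <- NA

# without drop=FALSE the column becomes a plain vector
class(d[, "name"])
## [1] "character"
is.list(d[, "name"])
## [1] FALSE

# with drop=FALSE it stays a data.frame, which is a list
class(d[, "name", drop=FALSE])
## [1] "data.frame"
is.list(d[, "name", drop=FALSE])
## [1] TRUE

# a named list gives the grouping column that name
a <- aggregate(d[, c("v1", "v2", "v3")], list(name=d$name), mean, na.rm=TRUE)
a
##   name  v1  v2   v3
## 1   Er 4.5 9.5 14.5
## 2   Yi 2.0 7.5 12.0
```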
sapply and lapply
To iterate over a list, we can use lapply or sapply. The difference is that lapply always returns a list while sapply tries to simplify the result to a vector or matrix.
names <- list("Antoinette", "Mary", "Duncan", "Obalaya", "Jojo")
nchar("Jim")
## [1] 3
lapply(names, nchar)
## [[1]]
## [1] 10
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 7
##
## [[5]]
## [1] 4
sapply(names, nchar)
## [1] 10 4 6 7 4
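To make the difference concrete, this sketch inspects what the two functions return. It also shows sapply simplifying to a matrix when the function returns more than one value. The names n_list, n_vec and first_last, and the small function that extracts the first and last letter, are made up for this illustration.

```r
names <- list("Antoinette", "Mary", "Duncan", "Obalaya", "Jojo")

n_list <- lapply(names, nchar)
n_vec <- sapply(names, nchar)
class(n_list)
## [1] "list"
class(n_vec)
## [1] "integer"

# unlist turns the lapply result into the same vector
identical(unlist(n_list), n_vec)
## [1] TRUE

# a function that returns two values for each element:
# sapply then returns a matrix with one column per element
first_last <- sapply(names, function(x) {
  c(first=substr(x, 1, 1), last=substr(x, nchar(x), nchar(x)))
})
dim(first_last)
## [1] 2 5
first_last["first", ]
## [1] "A" "M" "D" "O" "J"
```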
In all cases, with (t)(s)(l)apply and aggregate (and many more functions), we provided some data and a function, such as mean or nchar. You can also provide your own custom function. For example:
shortname <- function(name) {
  if (nchar(name) < 5) {
    # short names are returned in upper case
    name <- toupper(name)
    return(name)
  } else {
    # longer names are cut to their first five characters
    name <- substr(name, 1, 5)
    return(name)
  }
}
sapply(names, shortname)
## [1] "Antoi" "MARY" "Dunca" "Obala" "JOJO"
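If the function is only needed once, you do not have to give it a name. The sketch below writes the same logic as an anonymous function inside the call. It also uses vapply, a stricter relative of sapply that is not covered above: you state the type and length each result must have, and it fails if a result does not match. The variable names short and short_checked are our own.

```r
names <- list("Antoinette", "Mary", "Duncan", "Obalaya", "Jojo")

# same logic as shortname, written as an anonymous function
short <- sapply(names, function(name) {
  if (nchar(name) < 5) toupper(name) else substr(name, 1, 5)
})
short
## [1] "Antoi" "MARY"  "Dunca" "Obala" "JOJO"

# vapply: each result must be a single character string
short_checked <- vapply(names, function(name) {
  if (nchar(name) < 5) toupper(name) else substr(name, 1, 5)
}, FUN.VALUE=character(1))
identical(short, short_checked)
## [1] TRUE
```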
More examples: https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/