# 9. Apply

The “apply family” of functions (`apply`, `tapply`, `lapply` and
others) and related functions such as `aggregate` are central to using
*R*. They provide a concise, elegant and efficient way to apply
(sometimes referred to as “to map”) a function to a set of cases, be
they rows or columns in a matrix or data.frame, or elements in a list.

## apply

Consider matrix `m`

```
m <- matrix(1:15, nrow=5, ncol=3)
m
##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15
```

Computation with matrices is ‘vectorized’. For example, you can do
`m * 5` to multiply all values of `m` by 5, or do `m^2` or `m * m` to
square the values of `m`. But often we need to compute values for
the margins of a matrix, that is, a single value for each row or column.
The `apply` function can be used for that:

```
# sum values in each row
apply(m, 1, sum)
## [1] 18 21 24 27 30
# get mean for each column
apply(m, 2, mean)
## [1] 3 8 13
```

Note that `apply` takes at least three arguments: a matrix, a 1 or 2
indicating whether the computation is for rows or for columns, and a
function that computes a new value (or values) for each row or column.
You can read more about this in the help file of the function (type
`?apply`). When your data may contain `NA` (missing) values, you will
often also add the argument `na.rm=TRUE`, because functions such as
`sum` and `mean` return `NA` if any of their input values is `NA`. In
this case we used the existing basic functions `mean` and `sum`, but we
could also supply a function that we wrote ourselves.
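As a minimal sketch of both points, the example below passes a self-written function to `apply` and then shows the effect of `na.rm=TRUE`. The helper `spread` and the copy `m2` are names made up for this illustration.

```r
m <- matrix(1:15, nrow=5, ncol=3)

# hypothetical helper: difference between the largest and smallest value
spread <- function(x) {
  max(x) - min(x)
}
row_spread <- apply(m, 1, spread)
row_spread
## [1] 10 10 10 10 10

# make a copy with one missing value
m2 <- m
m2[1, 2] <- NA

# without na.rm the first row sum is NA
row_sum <- apply(m2, 1, sum)
row_sum
## [1] NA 21 24 27 30

# additional arguments after the function are passed on to it
row_sum_narm <- apply(m2, 1, sum, na.rm=TRUE)
row_sum_narm
## [1] 12 21 24 27 30
```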

Note that `apply` (and related functions such as `tapply` and `sapply`)
are ways to avoid writing a loop. In the `apply` examples above you
could have written a loop to do the computations row by row (or column
by column), but using `apply` is more compact and usually easier to
read.
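For comparison, here is a sketch of what the row sums could look like when written as an explicit loop. The variable names `rs_loop` and `rs_apply` are only used for this illustration.

```r
m <- matrix(1:15, nrow=5, ncol=3)

# loop version: create the output vector first, then fill it row by row
rs_loop <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  rs_loop[i] <- sum(m[i, ])
}
rs_loop
## [1] 18 21 24 27 30

# apply version: one line, same values
rs_apply <- apply(m, 1, sum)
all(rs_loop == rs_apply)
## [1] TRUE
```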

The `rowSums` and `colSums` functions are fast shorthand functions for `apply(m, 1, sum)` and `apply(m, 2, sum)`:

```
rowSums(m)
## [1] 18 21 24 27 30
```
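The column version works the same way, and base R also has `rowMeans` and `colMeans` for averages. The short check below is only a sketch to confirm that these functions return the same values as the corresponding `apply` calls.

```r
m <- matrix(1:15, nrow=5, ncol=3)

colSums(m)
## [1] 15 40 65
# same values as the apply version
all(colSums(m) == apply(m, 2, sum))
## [1] TRUE

# shorthand for apply(m, 1, mean) and apply(m, 2, mean)
rowMeans(m)
## [1]  6  7  8  9 10
colMeans(m)
## [1]  3  8 13
```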

## tapply

`tapply` can be used to compute a summary statistic, e.g. a mean
value, for groups of rows in a data.frame. You need one column that
indicates the group, and then you can compute, for example, the mean
value for that group.

```
colnames(m) <- c('v1', 'v2', 'v3')
d <- data.frame(name=c('Yi', 'Yi', 'Yi', 'Er', 'Er'), m, stringsAsFactors=FALSE)
d$v2[1] <- NA
d
##   name v1 v2 v3
## 1   Yi  1 NA 11
## 2   Yi  2  7 12
## 3   Yi  3  8 13
## 4   Er  4  9 14
## 5   Er  5 10 15
```

Imagine that you would like to compute the average value of `v1`,
`v2` and `v3` for each individual (`name`). You can use `tapply`
for that.

```
tapply(d$v1, d$name, mean)
##  Er  Yi
## 4.5 2.0
tapply(d$v1, d$name, max)
## Er Yi
##  5  3
tapply(d$v2, d$name, mean)
##  Er  Yi
## 9.5  NA
tapply(d$v2, d$name, mean, na.rm=TRUE)
##  Er  Yi
## 9.5 7.5
```

## aggregate

`aggregate` is similar to `tapply` but more convenient if you want
to compute a summary statistic for multiple variables. It does have the
annoying problem that the second argument (`by`) must be a list; it
cannot be a vector:

```
aggregate(d[, c('v1', 'v2', 'v3')], d$name, mean, na.rm=TRUE)
## Error in aggregate.data.frame(d[, c("v1", "v2", "v3")], d$name, mean, : 'by' must be a list
```

You can fix that in two ways:

```
aggregate(d[, c('v1', 'v2', 'v3')], d[, 'name', drop=FALSE], mean, na.rm=TRUE)
##   name  v1  v2   v3
## 1   Er 4.5 9.5 14.5
## 2   Yi 2.0 7.5 12.0
# or
aggregate(d[, c('v1', 'v2', 'v3')], list(d$name), mean, na.rm=TRUE)
##   Group.1  v1  v2   v3
## 1      Er 4.5 9.5 14.5
## 2      Yi 2.0 7.5 12.0
```

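As explained before, this is why the first one works: when you extract a
single column from a `matrix` or `data.frame`, the structure (class)
“drops” to a simpler form; it becomes a vector. `drop=FALSE` stops
that from happening, so the result is still a `data.frame`. Because a
`data.frame` is also a list, `aggregate` accepts it as the `by`
argument.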
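A quick way to see the difference is to look at the class of the result with and without `drop=FALSE`. This sketch builds a small two-column version of `d`; the names `x` and `y` are only used here.

```r
d <- data.frame(name=c('Yi', 'Yi', 'Yi', 'Er', 'Er'), v1=1:5, stringsAsFactors=FALSE)

# default behaviour: a single column "drops" to a vector
x <- d[, 'name']
class(x)
## [1] "character"

# with drop=FALSE the result remains a data.frame, which is also a list
y <- d[, 'name', drop=FALSE]
class(y)
## [1] "data.frame"
is.list(y)
## [1] TRUE
```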

## sapply and lapply

To iterate over a list, we can use `lapply` or `sapply`. The
difference is that `lapply` always returns a list while `sapply`
tries to simplify the result to a vector or matrix.

```
names <- list('Antoinette', 'Mary', 'Duncan', 'Obalaya', 'Jojo')
nchar('Jim')
## [1] 3
lapply(names, nchar)
## [[1]]
## [1] 10
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 7
##
## [[5]]
## [1] 4
sapply(names, nchar)
## [1] 10 4 6 7 4
```
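The example above simplifies to a vector because `nchar` returns a single value for each name. When the function returns several values of the same length, `sapply` simplifies to a matrix with one column per list element. The helper `len2` below is a made-up function, used only to illustrate this.

```r
names <- list('Antoinette', 'Mary', 'Duncan', 'Obalaya', 'Jojo')

# hypothetical helper that returns two values for each name
len2 <- function(x) {
  n <- nchar(x)
  c(n=n, double=2 * n)
}

s <- sapply(names, len2)
s
##        [,1] [,2] [,3] [,4] [,5]
## n        10    4    6    7    4
## double   20    8   12   14    8
is.matrix(s)
## [1] TRUE

# lapply never simplifies; the result is a list of five vectors
is.list(lapply(names, len2))
## [1] TRUE
```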

In all cases (`apply`, `tapply`, `sapply`, `lapply`, `aggregate` and
many more functions) we provided some data and a function, such as
`mean` or `nchar`. You can also provide your own custom function. For
example:

```
# names shorter than 5 characters become upper case; longer names are cut to 5 characters
shortname <- function(name) {
  if (nchar(name) < 5) {
    name <- toupper(name)
    return(name)
  } else {
    name <- substr(name,1,5)
    return(name)
  }
}
sapply(names, shortname)
## [1] "Antoi" "MARY" "Dunca" "Obala" "JOJO"
```
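For a short, one-off computation you do not need to name the function first. As a sketch, the call below defines the function inside the `sapply` call (an “anonymous” function) to get the first letter of each name in upper case. The variable `initials` is only used for this illustration.

```r
names <- list('Antoinette', 'Mary', 'Duncan', 'Obalaya', 'Jojo')

# the function is defined inside the call and not stored anywhere
initials <- sapply(names, function(x) toupper(substr(x, 1, 1)))
initials
## [1] "A" "M" "D" "O" "J"
```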

More examples: https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/