# Apply

The “apply family” of functions (`apply`, `tapply`, `lapply` and others) and related functions such as `aggregate` are central to using R. They provide a concise, elegant and efficient way to apply a function to a set of cases (this is sometimes called “mapping” a function), be they rows or columns in a matrix or data.frame, or elements in a list.

## apply

Consider matrix `m`:

```
m <- matrix(1:15, nrow=5, ncol=3)
m
##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15
```

Computation with matrices is ‘vectorized’. For example, you can do `m * 5` to multiply all values of `m` by 5, or do `m^2` or `m * m` to square the values of `m`. But often we need to compute values for the margins of a matrix, that is, a single value for each row or column. The `apply` function can be used for that:

```
# sum values in each row
apply(m, 1, sum)
## [1] 18 21 24 27 30
# get mean for each column
apply(m, 2, mean)
## [1]  3  8 13
```

`apply` needs at least three arguments: a `matrix` or `data.frame`, a value that is either 1 or 2 indicating whether the computation is for rows (1) or for columns (2), and a function that computes a new value (or values) for each row or column. You can read more about this in the help file of the function (type `?apply`). In many cases you will also add the argument `na.rm=TRUE` to remove missing values, because functions such as `sum` and `mean` return `NA` when their input includes an `NA`. Here we used the existing basic functions `mean` and `sum`, but we could also supply a function that we wrote ourselves.
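
As a minimal sketch of how `na.rm=TRUE` is passed along, assume a copy of the matrix with one missing value. The name `m2` is hypothetical and used only for this illustration. Arguments that follow the function in the call to `apply` are handed on to that function.

```r
# m2 is a hypothetical copy of m with one missing value
m2 <- matrix(1:15, nrow=5, ncol=3)
m2[1, 2] <- NA

# without na.rm, the sum of the row that contains the NA is NA
s1 <- apply(m2, 1, sum)

# arguments after the function are passed on to it (here to sum)
s2 <- apply(m2, 1, sum, na.rm=TRUE)
```

Here the first element of `s1` is `NA`, while the first element of `s2` is the sum of the two remaining values in that row.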

Note that `apply` (and related functions such as `tapply` and `sapply`) are ways to avoid writing a loop. In the `apply` examples above you could have written a loop to do the computations row by row (or column by column), but using `apply` is more compact and efficient.
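
For comparison, here is a sketch of what the loop version of the row sums might look like. The variable names `rs` and `ra` are made up for this illustration.

```r
m <- matrix(1:15, nrow=5, ncol=3)

# loop version: create an empty result, then fill it row by row
rs <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  rs[i] <- sum(m[i, ])
}

# apply version: one line that computes the same values
ra <- apply(m, 1, sum)
```

Both `rs` and `ra` hold the same five row sums. The loop needs more bookkeeping to get there.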

The `rowSums` and `colSums` functions are (fast) shorthand functions for `apply(m, 1, sum)` and `apply(m, 2, sum)`:

```
rowSums(m)
## [1] 18 21 24 27 30
```
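
The column version works the same way. Base R also has `rowMeans` and `colMeans` for means. A short sketch that checks these against `apply` (the result names are hypothetical):

```r
m <- matrix(1:15, nrow=5, ncol=3)

cs <- colSums(m)    # same values as apply(m, 2, sum)
cm <- colMeans(m)   # same values as apply(m, 2, mean)

# both comparisons are TRUE
same_sums  <- all(cs == apply(m, 2, sum))
same_means <- all(cm == apply(m, 2, mean))
```

Like `sum` and `mean`, these shorthand functions accept `na.rm=TRUE`.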

## tapply

`tapply` can be used to compute a summary statistic, e.g. a mean value, for groups of rows in a data.frame. You need one column that indicates the group, and then you can compute, for example, the mean value for that group.

```
colnames(m) <- c('v1', 'v2', 'v3')
d <- data.frame(name=c('Yi', 'Yi', 'Yi', 'Er', 'Er'), m, stringsAsFactors=FALSE)
# set only the first value of v2 to missing
d$v2[1] <- NA
d
##   name v1 v2 v3
## 1   Yi  1 NA 11
## 2   Yi  2  7 12
## 3   Yi  3  8 13
## 4   Er  4  9 14
## 5   Er  5 10 15
```

Imagine that you would like to compute the average value of `v1`, `v2` and `v3` for each individual (`name`). You can use `tapply` for that.

```
tapply(d$v1, d$name, mean)
##  Er  Yi
## 4.5 2.0
tapply(d$v1, d$name, max)
## Er Yi
##  5  3
tapply(d$v2, d$name, mean)
##  Er  Yi
## 9.5  NA
tapply(d$v2, d$name, mean, na.rm=TRUE)
##  Er  Yi
## 9.5 7.5
```
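
The calls above handle one variable at a time. One possible way to cover several variables in a single step is to wrap `tapply` in `sapply`, as sketched here. The data frame is rebuilt so that the block runs on its own, and the names `vars` and `res` are hypothetical.

```r
# same data as above, rebuilt so this block is self-contained
d <- data.frame(name=c('Yi', 'Yi', 'Yi', 'Er', 'Er'),
                v1=1:5, v2=c(NA, 7:10), v3=11:15,
                stringsAsFactors=FALSE)

# hypothetical helper vector with the columns to summarize
vars <- c("v1", "v2", "v3")

# call tapply once per column; sapply binds the results into a matrix
# with one row per group and one column per variable
res <- sapply(vars, function(v) tapply(d[[v]], d$name, mean, na.rm=TRUE))
```

The result `res` can be indexed by group and variable name, for example `res["Yi", "v2"]`.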

## aggregate

`aggregate` is similar to `tapply` but more convenient if you want to compute a summary statistic for multiple variables. It does have the annoying problem that the second argument cannot be a vector:

```
aggregate(d[, c("v1", "v2", "v3")], d$name, mean, na.rm=TRUE)
## Error in aggregate.data.frame(d[, c("v1", "v2", "v3")], d$name, mean, : 'by' must be a list
```

You can fix that in two ways:

```
aggregate(d[, c("v1", "v2", "v3")], d[, "name", drop=FALSE], mean, na.rm=TRUE)
##   name  v1  v2   v3
## 1   Er 4.5 9.5 14.5
## 2   Yi 2.0 7.5 12.0
# or
aggregate(d[, c("v1", "v2", "v3")], list(d$name), mean, na.rm=TRUE)
##   Group.1  v1  v2   v3
## 1      Er 4.5 9.5 14.5
## 2      Yi 2.0 7.5 12.0
```

As explained before, this is why the first one works: when you extract a single column from a `matrix` or `data.frame`, the structure (class) “drops” to a simpler form and becomes a vector. `drop=FALSE` stops that from happening, so the result is still a `data.frame`, which is a kind of list.
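
A small sketch that makes the difference visible. The data frame is a cut-down, self-contained version of `d`, and the names `a` and `b` are hypothetical.

```r
d <- data.frame(name=c('Yi', 'Yi', 'Yi', 'Er', 'Er'), v1=1:5,
                stringsAsFactors=FALSE)

a <- d[, "name"]               # drops to a character vector
b <- d[, "name", drop=FALSE]   # stays a data.frame with one column

class(a)     # "character"
class(b)     # "data.frame"
is.list(a)   # FALSE, so aggregate rejects it as 'by'
is.list(b)   # TRUE, because a data.frame is a list of columns
```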

## sapply and lapply

To iterate over a list, we can use `lapply` or `sapply`. The difference is that `lapply` always returns a list, while `sapply` tries to simplify the result to a vector or matrix.

```
names <- list("Antoinette", "Mary", "Duncan", "Obalaya", "Jojo")
nchar("Jim")
## [1] 3
lapply(names, nchar)
## [[1]]
## [1] 10
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 7
##
## [[5]]
## [1] 4
sapply(names, nchar)
## [1] 10  4  6  7  4
```
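
The simplification to a matrix happens when the function returns more than one value per element, and every result has the same length. A minimal sketch, with made-up example lists:

```r
# hypothetical example list with two named elements
x <- list(a=1:3, b=4:6)

r1 <- lapply(x, range)   # a list of two vectors (min and max of each element)
r2 <- sapply(x, range)   # a 2 x 2 matrix with one column per list element

# when the results differ in length, sapply cannot simplify
# and returns a list, just like lapply
r3 <- sapply(list(1:2, 1:3), rev)
```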

In all of these cases (`apply`, `tapply`, `sapply`, `lapply`, `aggregate`, and many more functions) we provided some data and a function, such as `mean` or `nchar`. You can also provide your own custom function. For example:

```
shortname <- function(name) {
  if (nchar(name) < 5) {
    name <- toupper(name)
    return(name)
  } else {
    name <- substr(name, 1, 5)
    return(name)
  }
}
sapply(names, shortname)
## [1] "Antoi" "MARY"  "Dunca" "Obala" "JOJO"