3. Analysing species distribution data

2.1 Introduction

In this case-study I show some techniques that can be used to analyze species distribution data with R. Before going through this document you should at least be somewhat familiar with R and spatial data manipulation in *R*. This document is based on an analysis of the distribution of wild potato species by Hijmans and Spooner (2001). Wild potatoes (Solanaceae; Solanum sect. Petota are relatives of the cultivated potato. There are nearly 200 different species that occur in the Americas.

2.2 Import and prepare data

First download the data (in a zip file), and extract the files from the zip file.

filename <- 'WILDPOT.txt'
if (!file.exists(filename)) {
    download.file('http://rspatial.org/data/cases/diva_ex1_data.zip', 'wildpot.zip')
    unzip('wildpot.zip')
}
## Warning in download.file("http://rspatial.org/data/cases/
## diva_ex1_data.zip", : cannot open URL 'http://rspatial.org/data/cases/
## diva_ex1_data.zip': HTTP status was '404 Not Found'
## Error in download.file("http://rspatial.org/data/cases/diva_ex1_data.zip", : cannot open URL 'http://rspatial.org/data/cases/diva_ex1_data.zip'

The extracted file is a tab delimited text file. Normally, you would read such a file with something like:

d <- read.table('WILDPOT.txt', header=TRUE)
## Warning in file(file, "rt"): cannot open file 'WILDPOT.txt': No such file
## or directory
## Error in file(file, "rt"): cannot open the connection

But that does not work in this case because some lines are incomplete. So we have to resort to some more complicated tricks.

# read all lines using UTF-8 encoding
d <- readLines('WILDPOT.txt', encoding='UTF-8')
## Warning in file(con, "r"): cannot open file 'WILDPOT.txt': No such file or
## directory
## Error in file(con, "r"): cannot open the connection
# split each line into elements using the tabs
dd <- strsplit(d, '\t')
## Error in strsplit(d, "\t"): object 'd' not found
# show that the number of elements varies
table(sapply(dd, length))
## Error in lapply(X = X, FUN = FUN, ...): object 'dd' not found

# function to complete each line to 22 items
fun <- function(x) {
    r <- rep("", 22)
    r[1:length(x)] <- x
    r
}

# apply function to each element of the list
ddd <- lapply(dd, fun)
## Error in lapply(dd, fun): object 'dd' not found
# row bind all elements (into a matrix)
v <- do.call(rbind, ddd)
## Error in do.call(rbind, ddd): object 'ddd' not found
head(v)
## Error in head(v): object 'v' not found

#set the column names and remove them from the data
colnames(v) <- v[1,]
## Error in eval(expr, envir, enclos): object 'v' not found
v <- v[-1,]
## Error in eval(expr, envir, enclos): object 'v' not found

# coerce into a data.frame and change the type of some variables
# to numeric (instead of character)
v <- data.frame(v, stringsAsFactors=FALSE)
## Error in data.frame(v, stringsAsFactors = FALSE): object 'v' not found

The coordinate data is in degrees, minutes, seconds (in separate columns, fortunately), so we need to compute longitude and latitude as single numbers.

# first coerce character values to numbers
for (i in c('LongD', 'LongM', 'LongS', 'LatD', 'LatM', 'LatS')) {
    v[, i] <- as.numeric(v[,i])
}
## Error: object 'v' not found
v$lon <- -1 * (v$LongD + v$LongM / 60 + v$LongS / 3600)
## Error in eval(expr, envir, enclos): object 'v' not found
v$lat <- v$LatD + v$LatM / 60 + v$LatS / 3600
## Error in eval(expr, envir, enclos): object 'v' not found

# Southern hemisphere gets a negative sign
v$lat[v$LatH == 'S'] <- -1 * v$lat[v$LatH == 'S']
## Error in eval(expr, envir, enclos): object 'v' not found
head(v)
## Error in head(v): object 'v' not found

Download and read a shapefile with most of the countries of the Americas.

if (!file.exists("pt_countries.shp")) {
    download.file('data/pt_countries.zip', 'pt_countries.zip')
    unzip('pt_countries.zip')
}
## Error in download.file("data/pt_countries.zip", "pt_countries.zip"): scheme not supported in URL 'data/pt_countries.zip'


library(raster)
cn <- shapefile('pt_countries.shp')
## Error: file.exists(extension(x, ".shp")) is not TRUE
proj4string(cn) <- CRS("+proj=longlat +datum=WGS84")
## Error in proj4string(cn) <- CRS("+proj=longlat +datum=WGS84"): object 'cn' not found
class(cn)
## Error in eval(expr, envir, enclos): object 'cn' not found

Make a quick map

plot(cn, xlim=c(-120, -40), ylim=c(-40,40), axes=TRUE)
## Error in plot(cn, xlim = c(-120, -40), ylim = c(-40, 40), axes = TRUE): object 'cn' not found
points(v$lon, v$lat, cex=.5, col='red')
## Error in points(v$lon, v$lat, cex = 0.5, col = "red"): object 'v' not found

And create a SpatialPointsDataFrame for the potato data with the formula approach

sp <- v
## Error in eval(expr, envir, enclos): object 'v' not found
coordinates(sp) <- ~lon + lat
## Error in coordinates(sp) <- ~lon + lat: object 'sp' not found
proj4string(sp) <- CRS("+proj=longlat +datum=WGS84")
## Error in proj4string(sp) <- CRS("+proj=longlat +datum=WGS84"): object 'sp' not found

Alternatively, you can do

sp <- SpatialPoints( v[, c('lon', 'lat')],
             proj4string=CRS("+proj=longlat +datum=WGS84") )
## Error in coordinates(coords): object 'v' not found
sp <- SpatialPointsDataFrame(sp, v)
## Error in is(coords, "SpatialPoints"): object 'sp' not found

2.3 Summary statistics

We are first going to summarize the data by country. We can use the country variable in the data, or extract that from the countries SpatialPolygonsDataFrame.

table(v$COUNTRY)
## Error in table(v$COUNTRY): object 'v' not found
# note Peru and PERU
v$COUNTRY <- toupper(v$COUNTRY)
## Error in toupper(v$COUNTRY): object 'v' not found
table(v$COUNTRY)
## Error in table(v$COUNTRY): object 'v' not found

# same fix for the SpatialPointsDataFrame
sp$COUNTRY <- toupper(sp$COUNTRY)
## Error in toupper(sp$COUNTRY): object 'sp' not found

Below we determine the country using a spatial query, using the “over” function.

ov <- over(sp, cn)
## Error in over(sp, cn): object 'sp' not found
colnames(ov) <- 'name'
## Error in colnames(ov) <- "name": object 'ov' not found
head(ov)
## Error in head(ov): object 'ov' not found
v <- cbind(v, ov)
## Error in cbind(v, ov): object 'v' not found
table(v$COUNTRY)
## Error in table(v$COUNTRY): object 'v' not found

This table is similar to the previous table, but it is not the same. Let’s find the records that are not in the same country according to the original data and the spatial query.

# some fixes first
# apparantly in the ocean (small island missing from polygon data)
v$name[is.na(v$name)] <- ''
## Error in v$name[is.na(v$name)] <- "": object 'v' not found
# some spelling differenes
v$name[v$name=="UNITED STATES, THE"] <- "UNITED STATES"
## Error in v$name[v$name == "UNITED STATES, THE"] <- "UNITED STATES": object 'v' not found
v$name[v$name=="BRASIL"] <- "BRAZIL"
## Error in v$name[v$name == "BRASIL"] <- "BRAZIL": object 'v' not found

i <- which(toupper(v$name) != v$COUNTRY)
## Error in toupper(v$name): object 'v' not found
i
## [1] "LongD"
plot(cn, xlim=c(-120, -40), ylim=c(-40,40), axes=TRUE)
## Error in plot(cn, xlim = c(-120, -40), ylim = c(-40, 40), axes = TRUE): object 'cn' not found
points(sp, cex=.25, pch='+', col='blue')
## Error in points(sp, cex = 0.25, pch = "+", col = "blue"): object 'sp' not found
points(sp[i,], col='red', pch='x', cex=1.5)
## Error in points(sp[i, ], col = "red", pch = "x", cex = 1.5): object 'sp' not found

All observations that are in a different country than their attribute data suggests are very close to an international border, or in the water. That suggests that the coordinates of the potato locations are not very precise (or the borders are inexact). Otherwise, this is reassuring (and a-typical). There are often are several inconsistencies, and it can be hard to find out whether the locality coordinates are wrong or whether the borders are wrong; but further inspection is warranted in those cases.

We can compute the number of species for each country.

spc <- tapply(v$SPECIES, sp$COUNTRY, function(x)length(unique(x)) )
## Error in tapply(v$SPECIES, sp$COUNTRY, function(x) length(unique(x))): object 'sp' not found
spc <- data.frame(COUNTRY=names(spc), nspp = spc)
## Error in data.frame(COUNTRY = names(spc), nspp = spc): object 'spc' not found

# merge with country SpatialPolygonsDataFrame
cn <- merge(cn, spc, by='COUNTRY')
## Error in merge(cn, spc, by = "COUNTRY"): object 'cn' not found
print(spplot(cn, 'nspp', col.regions=rev(terrain.colors(25))))
## Error in spplot(cn, "nspp", col.regions = rev(terrain.colors(25))): object 'cn' not found

The map shows that Peru is the country with most potato species, followed by Bolivia and Mexico. We can also tabulate the number of occurrences of each species by each country.

tb <- table(v[ c('COUNTRY', 'SPECIES')])
## Error in table(v[c("COUNTRY", "SPECIES")]): object 'v' not found
# a big table
dim(tb)
## Error in eval(expr, envir, enclos): object 'tb' not found
# show two columns
tb[,2:3]
## Error in eval(expr, envir, enclos): object 'tb' not found

Because the countries have such different sizes and shapes, the comparison is not fair (larger countries will have more species, on average, than smaller countries). Some countries are also very large, hiding spatial variation. The map the number of species, it is in most cases better to use a raster (grid) with cells of equal area, and that is what we will do next.

2.4 Projecting spatial data

To use a raster with equal-area cells, the data need to be projected to an equal-area coordinate reference system (CRS). If the longitude/latitude date were used, cells of say 1 square degree would get smaller as you move away from the equator: think of the meridians (vertical lines) on the globe getting closer to each other as you go towards the poles.

For small areas, particularly if they only span a few degrees of longitude, UTM can be a good CRS, but it this case we will use a CRS that can be used for a complete hemisphere: Lambert Equal Area Azimuthal. For this CRS, you must choose a map origin for your data. This should be somewhere in the center of the points, to minimize the distance (and hence distortion) from any point to the origin. In this case, a reasonable location is (-80, 0).

library(rgdal)
# "proj.4" notation of CRS
projection(cn) <- "+proj=longlat +datum=WGS84"
## Error in projection(cn) <- "+proj=longlat +datum=WGS84": object 'cn' not found
# the CRS we want
laea <- CRS("+proj=laea  +lat_0=0 +lon_0=-80")
clb <- spTransform(cn, laea)
## Error in spTransform(cn, laea): object 'cn' not found
pts <- spTransform(sp, laea)
## Error in spTransform(sp, laea): object 'sp' not found
plot(clb, axes=TRUE)
## Error in plot(clb, axes = TRUE): object 'clb' not found
points(pts, col='red', cex=.5)
## Error in points(pts, col = "red", cex = 0.5): object 'pts' not found

Note that the shape of the countries is now much more similar to their shape on a globe than before we projected You can also see that the coordinate system has changed by looking at the numbers of the axes. These express the distance from the origin (-80, 0) in meters.

2.5 Species richness

Let’s determine the distribution of species richness using a raster. First we need an empty ‘template’ raster that has the correct extent and resolution. Here I use 200 by 200 km cells.

r <- raster(clb)
## Error in raster(clb): object 'clb' not found
# 200 km = 200000 m
res(r) <- 200000
## Error in res(r) <- 2e+05: object 'r' not found

Now compute the number of observations and the number of species richness for each cell.

rich <- rasterize(pts, r, 'SPECIES', function(x, ...) length(unique(na.omit(x))))
## Error in rasterize(pts, r, "SPECIES", function(x, ...) length(unique(na.omit(x)))): object 'pts' not found
plot(rich)
## Error in plot(rich): object 'rich' not found
plot(clb, add=TRUE)
## Error in plot(clb, add = TRUE): object 'clb' not found

Now we make a raster of the number of observations.

obs <- rasterize(pts, r, field='SPECIES', fun=function(x, ...)length((na.omit(x))) )
## Error in rasterize(pts, r, field = "SPECIES", fun = function(x, ...) length((na.omit(x)))): object 'pts' not found
plot(obs)
## Error in plot(obs): object 'obs' not found
plot(clb, add=TRUE)
## Error in plot(clb, add = TRUE): object 'clb' not found

A cell by cell comparison of the number of species and the number of observations.

plot(obs, rich, cex=1, xlab='Observations', ylab='Richness')
## Error in plot(obs, rich, cex = 1, xlab = "Observations", ylab = "Richness"): object 'obs' not found

Clearly there is an association between the number of observations and the number of species. It may be that the number of species in some places is inflated just because more research was done there.

The problem is that this association will almost always exist. When there are only few species in an area, researchers will not continue to go there to increase the number of (redundant) observations. However, in this case, the relationship is not as strong as it can be, and there is a clear pattern in species richness maps, it is not characterized by sudden random like changes in richness (it looks like there is spatial autocorrelation, which is a good thing). Ways to correct for this ‘collector-bias’ include the use of techniques such as ‘rarefaction’ and ‘richness estimators’.

There are often gradients of species richness over latitude and altitude. Here is how you can make a plot of the latitudinal gradient in species richness.

d <- v[, c('lat', 'SPECIES')]
## Error in eval(expr, envir, enclos): object 'v' not found
d$lat <- round(d$lat)
## Error in eval(expr, envir, enclos): object 'd' not found
g <- tapply(d$SPECIES, d$lat, function(x) length(unique(na.omit(x))) )
## Error in tapply(d$SPECIES, d$lat, function(x) length(unique(na.omit(x)))): object 'd' not found
plot(names(g), g)
## Error in plot(names(g), g): object 'g' not found
# moving average
lines(names(g), movingFun(g, 3))
## Error in lines(names(g), movingFun(g, 3)): object 'g' not found

** Question ** The distribution of species richness has two peaks. What would explain the low species richness between -5 and 15 degrees?

2.6 Range size

Let’s estimate range sizes of the species. Hijmans and Spooner use two ways: (1) maxD, the maximum distance between any pair of points for a species, and CA50 the total area covered by circles of 50 km around each species. Here, I also add the convex hull. I am using the projected coordinates, but it is also possible to compute these things from the original longitude/latitude data.

# get the (Lambert AEA) coordinates from the SpatialPointsDataFrame
xy <- coordinates(pts)
## Error in coordinates(pts): object 'pts' not found
# list of species
sp <- unique(pts$SPECIES)
## Error in unique(pts$SPECIES): object 'pts' not found

Compute maxD for each species

maxD <- vector(length=length(sp))
## Error in vector(length = length(sp)): object 'sp' not found
for (s in 1:length(sp)) {
    # get the coordinates for species 's'
    p <- xy[pts$SPECIES == sp[s], ]
    # distance matrix
    d <- as.matrix(dist(p))
    # ignore the distance of a point to itself
    diag(d) <- NA
    # get max value
    maxD[s] <- max(d, na.rm=TRUE)
}
## Error in eval(expr, envir, enclos): object 'sp' not found

# Note the typical J shape
plot(rev(sort(maxD))/1000, ylab='maxD (km)')
## Error in sort(maxD): object 'maxD' not found

Compute CA

library(dismo)
library(rgeos)
CA <- vector(length=length(sp))
## Error in vector(length = length(sp)): object 'sp' not found
for (s in 1:length(sp)) {
    p <- xy[pts$SPECIES == sp[s], ,drop=FALSE]
    # run "circles" model
    m <- circles(p, d=50000, lonlat=FALSE)
    CA[s] <- area(polygons(m))
}
## Error in eval(expr, envir, enclos): object 'sp' not found
# standardize to the size of one circle
CA <- CA / (pi * 50000^2)
## Error in eval(expr, envir, enclos): object 'CA' not found
plot(rev(sort(CA)), ylab='CA50')
## Error in sort(CA): object 'CA' not found

Make convex hull range polygons

hull <- list()
for (s in 1:length(sp)) {
    p <- unique(xy[pts$SPECIES == sp[s], ,drop=FALSE])
    # need at least three points for hull
    if (nrow(p) > 3) {
        h <- convHull(p, lonlat=FALSE)
        pol <- polygons(h)
        hull[[s]] <- pol
    }
}
## Error in eval(expr, envir, enclos): object 'sp' not found

Plot the hulls. First remove the empty hulls (you cannot make a hull if you do not have at least three points).

# which elements are NULL
i <- which(!sapply(hull, is.null))
h <- hull[i]
# combine them
hh <- do.call(h, bind)
## Error in do.call(h, bind): second argument must be a list
plot(hh)
## Error in plot(hh): object 'hh' not found

Get the area for each hull, taking care of the fact that some are NULL.

a <- sapply(hull, function(i) ifelse(is.null(i), 0, area(i)))
plot(rev(sort(a))/1000, ylab='Area of convex hull')
## Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...): 'x' must be atomic

Compare all three measures

  d <- cbind(maxD,CA,hull)
## Error in cbind(maxD, CA, hull): object 'maxD' not found
  pairs(d)
## Error in pairs(d): object 'd' not found

2.7 Exercises

Exercise 1. Mapping species richness at different resolutions

Make maps of the number of observations and of species richness at 50, 100, 250, and 500 km resolution. Discuss the differences.

Exercise 2. Mapping diversity

Make a map of Shannon Diversity H for the potato data, at 200 km resolution.

  1. First make a function that computes Shannon Diversity (H) from a vector of species names

H = -SUM(p * ln(p))

Where p is proportion of each species

To get p, you can do

vv <- as.vector(table(v$SPECIES)) p <- vv / sum(vv)

  1. now use the function

Exercise 3. Mapping traits

There is information about two traits in the data set in field PRLV (tolerance to Potato Leaf Roll Virus) and frost (frost tolerance). Make a map of average frost tolerance.

2.8 References

Hijmans, R.J., and D.M. Spooner, 2001. Geographic distribution of wild potato species. American Journal of Botany 88:2101-2112