5. Spatial distribution models

This page shows how you can use the Random Forest algorithm to make spatial predictions. This approach is widely used, for example to classify remote sensing data into different land cover classes. But here our objective is to predict the entire range of a species based on a set of locations where it has been observed. As an example, we use the hominid species Imaginus magnapedum (also known under the vernacular names of “bigfoot” and “sasquatch”). This species is so hard to find (at least by scientists) that its very existence is commonly denied by the mainstream media! For more information about this controversy, see the article by Lozier, Aniello and Hickerson: Predicting the distribution of Sasquatch in western North America: anything goes with ecological niche modelling.

We want to find out

  1. What the complete range of the species might be.
  2. How good (general) our model is by predicting the range of the Eastern sub-species, with data from the Western sub-species.
  3. Predict where in Mexico the creature is likely to occur.
  4. How climate change might affect its distribution.

In this context, this type of analysis is often referred to as ‘species distribution modeling’ or ‘ecological niche modeling’. Here is a more in-depth discussion of this technique.

Data

Observations

Here is the data for the reported observations, extracted from http://www.bfro.net/ in May 2012.

bf <- read.csv("data/bigfoot.csv")
## Warning in file(file, "rt"): cannot open file 'data/bigfoot.csv': No such
## file or directory
## Error in file(file, "rt"): cannot open the connection
dim(bf)
## Error in eval(expr, envir, enclos): object 'bf' not found
head(bf)
## Error in head(bf): object 'bf' not found

Plot the locations

plot(bf[,1:2], cex=0.5, col='red')
## Error in plot(bf[, 1:2], cex = 0.5, col = "red"): object 'bf' not found
library(maptools)
## Checking rgeos availability: TRUE
data(wrld_simpl)
plot(wrld_simpl, add=TRUE)
## Error in polypath(x = mcrds[, 1], y = mcrds[, 2], border = border, col = col, : plot.new has not been called yet

Predictors

Supervised classification often uses predictor data obtained from satellite remote sensing. But here, as is common in species distribution modeling, we use climate data. Specifically, we use ‘bioclimatic variables’, see: http://www.worldclim.org/bioclim

library(raster)
wc <- getData('worldclim', res=10, var='bio')
plot(wc[[c(1, 12)]], nr=2)

image0

Now extract climate data for the locations of our observations. That is, get data about the climate that the species likes, apparently.

bfc <- extract(wc, bf[,1:2])
## Error in extract(wc, bf[, 1:2]): object 'bf' not found
head(bfc)
## Error in head(bfc): object 'bfc' not found

# Any missing values?
i <- which(is.na(bfc[,1]))
## Error in which(is.na(bfc[, 1])): object 'bfc' not found
i
## [1] 2
plot(bf[,1:2], cex=0.5, col='blue')
## Error in plot(bf[, 1:2], cex = 0.5, col = "blue"): object 'bf' not found
plot(wrld_simpl, add=TRUE)
## Error in polypath(x = mcrds[, 1], y = mcrds[, 2], border = border, col = col, : plot.new has not been called yet
points(bf[i, ], pch=20, cex=3, col='red')
## Error in points(bf[i, ], pch = 20, cex = 3, col = "red"): object 'bf' not found

Here is a plot that illustrates a component of the ecological niche of our species of interest.

plot(bfc[ ,'bio1'] / 10, bfc[, 'bio12'], xlab='Annual mean temperature (C)',
         ylab='Annual precipitation (mm)')
## Error in plot(bfc[, "bio1"]/10, bfc[, "bio12"], xlab = "Annual mean temperature (C)", : object 'bfc' not found

Background data

Normally, one would build a model that would compare the values of the predictor variables as the locations where something was observed, with those values at the locations where it was not observed. But we do not have data from a systematic survey that determined presence and absence. We have presence-only data. (And, determining absence is not that simple. It is here now, it is gone tomorrow).

The common trick to deal with this is to not model presence versus absence, but presence versus a ‘random expectation’. This random expectation (also referred to as ‘background’, or ‘random-absence’ data) is what you would get if the species had no preference for any of the predictor variables (or to other variables that are not in the model, but correlated with the predictor variables).

There is not much point in taking absence data from very far away (tropical Africa or Antarctica). Typically they are taken from more or less the entire study area for which we have presences data.

library(dismo)
# extent of all points
e <- extent(SpatialPoints(bf[, 1:2]))
## Error in coordinates(coords): object 'bf' not found
e
## Error in eval(expr, envir, enclos): object 'e' not found

# 5000 random samples (excluding NA cells) from extent e
set.seed(0)
bg <- sampleRandom(wc, 5000, ext=e)
## Error in .local(x, size, ...): object 'e' not found
dim(bg)
## Error in eval(expr, envir, enclos): object 'bg' not found
head(bg)
## Error in head(bg): object 'bg' not found

Combine presence and background

d <- rbind(cbind(pa=1, bfc), cbind(pa=0, bg))
## Error in cbind(pa = 1, bfc): object 'bfc' not found
d <- data.frame(d)
## Error in data.frame(d): object 'd' not found
dim(d)
## Error in eval(expr, envir, enclos): object 'd' not found

Fit a model

Now we have the data to fit a model. But I am going to split the data into East and West. Let’s say I believe these are actually are different, albeit related, sub-species (The Eastern Sasquatch is darker and less hairy). I am principally interested in the western sub-species.

de <- d[bf[,1] > -102, ]
## Error in eval(expr, envir, enclos): object 'd' not found
dw <- d[bf[,1] <= -102, ]
## Error in eval(expr, envir, enclos): object 'd' not found

CART

Let’s first look at a Classification and Regression Trees (CART) model.

library(rpart)
cart <- rpart(pa~., data=dw)
## Error in is.data.frame(data): object 'dw' not found
printcp(cart)
## Error in printcp(cart): object 'cart' not found
plotcp(cart)
## Error in plotcp(cart): object 'cart' not found
plot(cart, uniform=TRUE, main="Regression Tree")
## Error in plot(cart, uniform = TRUE, main = "Regression Tree"): object 'cart' not found
text(cart, cex=.8)
## Error in text(cart, cex = 0.8): object 'cart' not found
# text(cart, use.n=TRUE, all=TRUE, cex=.8)

Question 1: Describe the conditions under which you have the highest probability of finding our beloved species?

Random Forest

CART gives us a nice result to look at that can be easily interpreted (as you just illustrated with your answer to Question 1). But the approach suffers from high variance (meaning that the model will be over-fit, it is different each time a somewhat different datasets are used). Random Forest does not have that problem as much. Above, with CART, we use regression, let’s do both regression and classification here. First classification.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
# create a factor to indicated that we want classification
fpa <- as.factor(dw[, 'pa'])
## Error in as.factor(dw[, "pa"]): object 'dw' not found

Now fit the RandomForest model

crf <- randomForest(dw[, 2:ncol(dw)], fpa)
## Error in randomForest(dw[, 2:ncol(dw)], fpa): object 'dw' not found
crf
## Error in eval(expr, envir, enclos): object 'crf' not found
plot(crf)
## Error in plot(crf): object 'crf' not found

The variable importance plot shows which variables are most important in fitting the model. This is computing by randomizing each variable one by one and then computing the decline in model prediction.

varImpPlot(crf)
## Error in varImpPlot(crf): object 'crf' not found

Now we use regression, rather than classification. First we tune a parameter.

trf <- tuneRF(dw[, 2:ncol(dw)], dw[, 'pa'])
## Error in is.factor(y): object 'dw' not found
trf
## Error in eval(expr, envir, enclos): object 'trf' not found
mt <- trf[which.min(trf[,2]), 1]
## Error in eval(expr, envir, enclos): object 'trf' not found
mt
## Error in eval(expr, envir, enclos): object 'mt' not found

Question 2: What did tuneRF help us find? What does the values of mt represent?

rrf <- randomForest(dw[, 2:ncol(d)], dw[, 'pa'], mtry=mt)
## Error in randomForest(dw[, 2:ncol(d)], dw[, "pa"], mtry = mt): object 'dw' not found
rrf
## Error in eval(expr, envir, enclos): object 'rrf' not found
plot(rrf)
## Error in plot(rrf): object 'rrf' not found
varImpPlot(rrf)
## Error in varImpPlot(rrf): object 'rrf' not found

Predict

We can use the model to make predictions to any other place for which we have values for the predictor variables. Our climate data is global so we could find suitable places for bigfoot in Australia. At first I only want to predict to our study region, which I define as follows.

# Extent of the western points
ew <- extent(SpatialPoints(bf[bf[,1] <= -102, 1:2]))
## Error in coordinates(coords): object 'bf' not found
ew
## Error in eval(expr, envir, enclos): object 'ew' not found

Regression

rp <- predict(wc, rrf, ext=ew)
## Error in inherits(model, "DistModel"): object 'rrf' not found
plot(rp)
## Error in plot(rp): object 'rp' not found

Note that the regression predictions are well-behaved, in the sense that they are between 0 and 1. However, they are continuous within that range, and if you wanted presence/absence, you would need a threshold. To get the optimal threshold, you would normally have a hold out data set, but here I used the training data for simplicity.

eva <- evaluate(dw[dw$pa==1, ], dw[dw$pa==0, ], rrf)
## Error in evaluate(dw[dw$pa == 1, ], dw[dw$pa == 0, ], rrf): object 'dw' not found
eva
## Error in eval(expr, envir, enclos): object 'eva' not found

We can make a ROC plot

plot(eva, 'ROC')
## Error in plot(eva, "ROC"): object 'eva' not found

Find a good threshold to determine presence/absence and plot the prediction.

tr <- threshold(eva)
## Error in threshold(eva): object 'eva' not found
tr
## Error in eval(expr, envir, enclos): object 'tr' not found
plot(rp > tr[1, 'spec_sens'])
## Error in plot(rp > tr[1, "spec_sens"]): object 'rp' not found

Classification

We can also use the classification Random Forest model to make a prediction.

rc <- predict(wc, crf, ext=ew)
## Error in .local(object, ...): object 'crf' not found
plot(rc)
## Error in plot(rc): object 'rc' not found

You can also get probabilities for the classes

rc2 <- predict(wc, crf, ext=ew, type='prob', index=2)
## Error in .local(object, ...): object 'crf' not found
plot(rc2)
## Error in plot(rc2): object 'rc2' not found

Extrapolation

Now, let’s see if our model is general enough to predict the distribution of the Eastern species.

de <- na.omit(de)
eva2 <- evaluate(de[de$pa==1, ], de[de$pa==0, ], rrf)
## Error in de$pa: object of type 'closure' is not subsettable
eva2
## Error in eval(expr, envir, enclos): object 'eva2' not found
plot(eva2, 'ROC')
## Error in plot(eva2, "ROC"): object 'eva2' not found

We can also look at it on a map.

eus <- extent(SpatialPoints(bf[, 1:2]))
## Error in coordinates(coords): object 'bf' not found
eus
## Error in eval(expr, envir, enclos): object 'eus' not found
rcusa <- predict(wc, rrf, ext=eus)
## Error in .local(object, ...): object 'rrf' not found
plot(rcusa)
## Error in plot(rcusa): object 'rcusa' not found
points(bf[,1:2], cex=.25)
## Error in points(bf[, 1:2], cex = 0.25): object 'bf' not found

Question 4: Why would it be that the model does not extrapolate well?

An important question in the biogeography of the western species is why it does not occur in Mexico. Or if it does, where would that be?

Let’s see.

mex <- getData('GADM', country='MEX', level=1)
pm <- predict(wc, rrf, ext=mex)
## Error in .local(object, ...): object 'rrf' not found
pm <- mask(pm, mex)
## Error in mask(pm, mex): object 'pm' not found
plot(pm)
## Error in plot(pm): object 'pm' not found

Question 5: Where in Mexico are you most likely to encounter western bigfoot?

We can also estimate range shifts due to climate change

fut <- getData('CMIP5', res=10, var='bio', rcp=85, model='AC', year=70)
names(fut)
##  [1] "ac85bi701"  "ac85bi702"  "ac85bi703"  "ac85bi704"  "ac85bi705"
##  [6] "ac85bi706"  "ac85bi707"  "ac85bi708"  "ac85bi709"  "ac85bi7010"
## [11] "ac85bi7011" "ac85bi7012" "ac85bi7013" "ac85bi7014" "ac85bi7015"
## [16] "ac85bi7016" "ac85bi7017" "ac85bi7018" "ac85bi7019"
names(wc)
##  [1] "bio1"  "bio2"  "bio3"  "bio4"  "bio5"  "bio6"  "bio7"  "bio8"
##  [9] "bio9"  "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16"
## [17] "bio17" "bio18" "bio19"
names(fut) <- names(wc)
futusa <- predict(fut, rrf, ext=eus, progress='window')
## Error in .local(object, ...): object 'rrf' not found
plot(futusa)
## Error in plot(futusa): object 'futusa' not found

Question 6: Make a map to show where conditions are improving for western bigfoot, and where they are not. Is the species headed toward extinction?

Further reading

More on Species distribution modeling with R; and on the use of boosted regression trees in the same context.