5. Spatial distribution models

This page shows how you can use the Random Forest algorithm to make spatial predictions. This approach is widely used, for example to classify remote sensing data into different land cover classes. But here our objective is to predict the entire range of a species based on a set of locations where it has been observed. As an example, we use the hominid species Imaginus magnapedum (also known under the vernacular names of “bigfoot” and “sasquatch”). This species is so hard to find (at least by scientists) that its very existence is commonly denied by the mainstream media! For more information about this controversy, see the article by Lozier, Aniello and Hickerson: Predicting the distribution of Sasquatch in western North America: anything goes with ecological niche modelling.

We want to find out

  1. What the complete range of the species might be.
  2. How good (general) our model is by predicting the range of the Eastern sub-species, with data from the Western sub-species.
  3. Predict where in Mexico the creature is likely to occur.
  4. How climate change might affect its distribution.

In this context, this type of analysis is often referred to as ‘species distribution modeling’ or ‘ecological niche modeling’. Here is a more in-depth discussion of this technique.

Data

Observations

Here is the data for the reported observations, extracted from http://www.bfro.net/ in May 2012.

bf <- read.csv("data/bigfoot.csv")
dim(bf)
## [1] 3092    3
head(bf)
##         lon      lat Class
## 1 -142.9000 61.50000     A
## 2 -132.7982 55.18720     A
## 3 -132.8202 55.20350     A
## 4 -141.5667 62.93750     A
## 5 -149.7853 61.05950     A
## 6 -141.3165 62.77335     A

Plot the locations

plot(bf[,1:2], cex=0.5, col='red')
library(maptools)
## Checking rgeos availability: TRUE
data(wrld_simpl)
plot(wrld_simpl, add=TRUE)

Predictors

Supervised classification often uses predictor data obtained from satellite remote sensing. But here, as is common in species distribution modeling, we use climate data. Specifically, we use ‘bioclimatic variables’, see: http://www.worldclim.org/bioclim

library(raster)
wc <- getData('worldclim', res=10, var='bio')
plot(wc[[c(1, 12)]], nr=2)

Now extract climate data for the locations of our observations. That is, get data about the climate that the species likes, apparently.

bfc <- extract(wc, bf[,1:2])
head(bfc)
##      bio1 bio2 bio3  bio4 bio5 bio6 bio7 bio8 bio9 bio10 bio11 bio12 bio13
## [1,]  -14  102   27  9672  174 -197  371   51  -11   108  -137   973   119
## [2,]   62   55   31  4136  157  -17  174   43   98   118    15  2602   385
## [3,]   62   55   31  4136  157  -17  174   43   98   118    15  2602   385
## [4,]  -57  125   23 15138  206 -332  538  127 -129   127  -256   282    67
## [5,]   10   80   25  8308  174 -140  314   66    5   119   -91   532    81
## [6,]  -59  128   23 14923  204 -334  538  122 -130   122  -255   322    75
##      bio14 bio15 bio16 bio17 bio18 bio19
## [1,]    43    30   332   156   290   210
## [2,]   128    33   953   407   556   721
## [3,]   128    33   953   407   556   721
## [4,]     6    81   163    22   163    27
## [5,]    22    41   215    72   159   117
## [6,]     8    79   183    28   183    32

# Any missing values?
i <- which(is.na(bfc[,1]))
i
## [1]  862 2667
plot(bf[,1:2], cex=0.5, col='blue')
plot(wrld_simpl, add=TRUE)
points(bf[i, ], pch=20, cex=3, col='red')

Here is a plot that illustrates a component of the ecological niche of our species of interest.

plot(bfc[ ,'bio1'] / 10, bfc[, 'bio12'], xlab='Annual mean temperature (C)',
         ylab='Annual precipitation (mm)')

Background data

Normally, one would build a model that would compare the values of the predictor variables as the locations where something was observed, with those values at the locations where it was not observed. But we do not have data from a systematic survey that determined presence and absence. We have presence-only data. (And, determining absence is not that simple. It is here now, it is gone tomorrow).

The common trick to deal with this is to not model presence versus absence, but presence versus a ‘random expectation’. This random expectation (also referred to as ‘background’, or ‘random-absence’ data) is what you would get if the species had no preference for any of the predictor variables (or to other variables that are not in the model, but correlated with the predictor variables).

There is not much point in taking absence data from very far away (tropical Africa or Antarctica). Typically they are taken from more or less the entire study area for which we have presences data.

library(dismo)
# extent of all points
e <- extent(SpatialPoints(bf[, 1:2]))
e
## class       : Extent
## xmin        : -156.75
## xmax        : -64.4627
## ymin        : 25.141
## ymax        : 69.5

# 5000 random samples (excluding NA cells) from extent e
set.seed(0)
bg <- sampleRandom(wc, 5000, ext=e)
dim(bg)
## [1] 5000   19
head(bg)
##      bio1 bio2 bio3  bio4 bio5 bio6 bio7 bio8 bio9 bio10 bio11 bio12 bio13
## [1,]  157  126   60  2935  262   55  207  124  191   197   122   379    88
## [2,]  -54  105   28  9244  142 -223  365   57  -62    68  -165   639    79
## [3,]  -57  104   20 14227  198 -317  515  106 -227   118  -247   473    71
## [4,]    1  119   24 12335  231 -251  482  138  -91   150  -168   844   104
## [5,]  208  169   44  7641  404   28  376  304  239   307   114   198    31
## [6,]  -89  111   23 12931  160 -316  476   78 -174    78  -248   476    76
##      bio14 bio15 bio16 bio17 bio18 bio19
## [1,]     0   100   225     2     4   222
## [2,]    28    30   226   101   219   138
## [3,]    17    46   197    55   194    59
## [4,]    34    33   301   128   291   137
## [5,]     2    50    73    11    52    62
## [6,]    25    40   193    79   193    82

Combine presence and background

d <- rbind(cbind(pa=1, bfc), cbind(pa=0, bg))
d <- data.frame(d)
dim(d)
## [1] 8092   20

Fit a model

Now we have the data to fit a model. But I am going to split the data into East and West. Let’s say I believe these are actually are different, albeit related, sub-species (The Eastern Sasquatch is darker and less hairy). I am principally interested in the western sub-species.

de <- d[bf[,1] > -102, ]
dw <- d[bf[,1] <= -102, ]

CART

Let’s first look at a Classification and Regression Trees (CART) model.

library(rpart)
cart <- rpart(pa~., data=dw)
printcp(cart)
##
## Regression tree:
## rpart(formula = pa ~ ., data = dw)
##
## Variables actually used in tree construction:
## [1] bio10 bio14 bio15 bio18 bio3  bio4  bio5  bio6  bio8
##
## Root node error: 762.45/3246 = 0.23489
##
## n= 3246
##
##         CP nsplit rel error  xerror     xstd
## 1 0.410197      0   1.00000 1.00048 0.008909
## 2 0.137588      1   0.58980 0.59041 0.014191
## 3 0.044259      2   0.45222 0.45474 0.016586
## 4 0.029121      3   0.40796 0.42701 0.016572
## 5 0.018954      4   0.37884 0.40239 0.016586
## 6 0.018324      5   0.35988 0.39294 0.016337
## 7 0.010113      6   0.34156 0.37926 0.015777
## 8 0.010008      7   0.33144 0.36821 0.016014
## 9 0.010000      9   0.31143 0.36821 0.016014
plotcp(cart)
plot(cart, uniform=TRUE, main="Regression Tree")
text(cart, cex=.8)
# text(cart, use.n=TRUE, all=TRUE, cex=.8)

Question 1: Describe the conditions under which you have the highest probability of finding our beloved species?

Random Forest

CART gives us a nice result to look at that can be easily interpreted (as you just illustrated with your answer to Question 1). But the approach suffers from high variance (meaning that the model will be over-fit, it is different each time a somewhat different datasets are used). Random Forest does not have that problem as much. Above, with CART, we use regression, let’s do both regression and classification here. First classification.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
# create a factor to indicated that we want classification
fpa <- as.factor(dw[, 'pa'])

Now fit the RandomForest model

crf <- randomForest(dw[, 2:ncol(dw)], fpa)
crf
##
## Call:
##  randomForest(x = dw[, 2:ncol(dw)], y = fpa)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
##
##         OOB estimate of  error rate: 9.8%
## Confusion matrix:
##      0    1 class.error
## 0 1864  158  0.07814045
## 1  160 1064  0.13071895
plot(crf)

The variable importance plot shows which variables are most important in fitting the model. This is computing by randomizing each variable one by one and then computing the decline in model prediction.

varImpPlot(crf)

Now we use regression, rather than classification. First we tune a parameter.

trf <- tuneRF(dw[, 2:ncol(dw)], dw[, 'pa'])
## Warning in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, :
## The response has five or fewer unique values. Are you sure you want to do
## regression?
## mtry = 6  OOB error = 0.07100579
## Searching left ...
## Warning in randomForest.default(x, y, mtry = mtryCur, ntree = ntreeTry, :
## The response has five or fewer unique values. Are you sure you want to do
## regression?
## mtry = 3     OOB error = 0.07010267
## 0.01271897 0.05
## Searching right ...
## Warning in randomForest.default(x, y, mtry = mtryCur, ntree = ntreeTry, :
## The response has five or fewer unique values. Are you sure you want to do
## regression?
## mtry = 12    OOB error = 0.0708133
## 0.002710948 0.05
trf
##    mtry   OOBError
## 3     3 0.07010267
## 6     6 0.07100579
## 12   12 0.07081330
mt <- trf[which.min(trf[,2]), 1]
mt
## [1] 3

Question 2: What did tuneRF help us find? What does the values of mt represent?

rrf <- randomForest(dw[, 2:ncol(d)], dw[, 'pa'], mtry=mt)
## Warning in randomForest.default(dw[, 2:ncol(d)], dw[, "pa"], mtry = mt):
## The response has five or fewer unique values. Are you sure you want to do
## regression?
rrf
##
## Call:
##  randomForest(x = dw[, 2:ncol(d)], y = dw[, "pa"], mtry = mt)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
##
##           Mean of squared residuals: 0.06855845
##                     % Var explained: 70.81
plot(rrf)
varImpPlot(rrf)

Predict

We can use the model to make predictions to any other place for which we have values for the predictor variables. Our climate data is global so we could find suitable places for bigfoot in Australia. At first I only want to predict to our study region, which I define as follows.

# Extent of the western points
ew <- extent(SpatialPoints(bf[bf[,1] <= -102, 1:2]))
ew
## class       : Extent
## xmin        : -156.75
## xmax        : -102.3881
## ymin        : 30.77722
## ymax        : 69.5

Regression

rp <- predict(wc, rrf, ext=ew)
plot(rp)

Note that the regression predictions are well-behaved, in the sense that they are between 0 and 1. However, they are continuous within that range, and if you wanted presence/absence, you would need a threshold. To get the optimal threshold, you would normally have a hold out data set, but here I used the training data for simplicity.

eva <- evaluate(dw[dw$pa==1, ], dw[dw$pa==0, ], rrf)
eva
## class          : ModelEvaluation
## n presences    : 1224
## n absences     : 2022
## AUC            : 0.9993919
## cor            : 0.9646471
## max TPR+TNR at : 0.4870895

We can make a ROC plot

plot(eva, 'ROC')

Find a good threshold to determine presence/absence and plot the prediction.

tr <- threshold(eva)
tr
##                kappa spec_sens no_omission prevalence equal_sens_spec
## thresholds 0.4870895 0.4870895    0.393281  0.3814001        0.518784
##            sensitivity
## thresholds   0.7154833
plot(rp > tr[1, 'spec_sens'])

Classification

We can also use the classification Random Forest model to make a prediction.

rc <- predict(wc, crf, ext=ew)
plot(rc)

You can also get probabilities for the classes

rc2 <- predict(wc, crf, ext=ew, type='prob', index=2)
plot(rc2)

Extrapolation

Now, let’s see if our model is general enough to predict the distribution of the Eastern species.

de <- na.omit(de)
eva2 <- evaluate(de[de$pa==1, ], de[de$pa==0, ], rrf)
eva2
## class          : ModelEvaluation
## n presences    : 1866
## n absences     : 2978
## AUC            : 0.4294745
## cor            : -0.2609383
## max TPR+TNR at : 0.000208
plot(eva2, 'ROC')

We can also look at it on a map.

eus <- extent(SpatialPoints(bf[, 1:2]))
eus
## class       : Extent
## xmin        : -156.75
## xmax        : -64.4627
## ymin        : 25.141
## ymax        : 69.5
rcusa <- predict(wc, rrf, ext=eus)
plot(rcusa)
points(bf[,1:2], cex=.25)

Question 4: Why would it be that the model does not extrapolate well?

An important question in the biogeography of the western species is why it does not occur in Mexico. Or if it does, where would that be?

Let’s see.

mex <- getData('GADM', country='MEX', level=1)
pm <- predict(wc, rrf, ext=mex)
pm <- mask(pm, mex)
plot(pm)

Question 5: Where in Mexico are you most likely to encounter western bigfoot?

We can also estimate range shifts due to climate change

fut <- getData('CMIP5', res=10, var='bio', rcp=85, model='AC', year=70)
names(fut)
##  [1] "ac85bi701"  "ac85bi702"  "ac85bi703"  "ac85bi704"  "ac85bi705"
##  [6] "ac85bi706"  "ac85bi707"  "ac85bi708"  "ac85bi709"  "ac85bi7010"
## [11] "ac85bi7011" "ac85bi7012" "ac85bi7013" "ac85bi7014" "ac85bi7015"
## [16] "ac85bi7016" "ac85bi7017" "ac85bi7018" "ac85bi7019"
names(wc)
##  [1] "bio1"  "bio2"  "bio3"  "bio4"  "bio5"  "bio6"  "bio7"  "bio8"
##  [9] "bio9"  "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16"
## [17] "bio17" "bio18" "bio19"
names(fut) <- names(wc)
futusa <- predict(fut, rrf, ext=eus, progress='window')
## Loading required namespace: tcltk
plot(futusa)

Question 6: Make a map to show where conditions are improving for western bigfoot, and where they are not. Is the species headed toward extinction?

Further reading

More on Species distribution modeling with R; and on the use of boosted regression trees in the same context.