Spatial distribution models

This page shows how you can use the Random Forest algorithm to make spatial predictions. This approach is widely used, for example to classify remote sensing data into different land cover classes. But here our objective is to predict the entire range of a species based on a set of locations where it has been observed. As an example, we use the hominid species Imaginus magnapedum (also known under the vernacular names of “bigfoot” and “sasquatch”). This species is believed to occur in the United States, but it is so hard to find by scientists that its very existence is commonly denied by the mainstream media — despite the many reports on Twitter! For more information about this controversy, see the article by Lozier, Aniello and Hickerson: Predicting the distribution of Sasquatch in western North America: anything goes with ecological niche modelling.

We will use “citizen-science” data to find out:

  1. What the complete range of the species might be.

  2. How good (general) our model is by predicting the range of the Eastern sub-species, with data from the Western sub-species.

  3. How climate change might affect its distribution.

In this context, this type of analysis is often referred to as ‘species distribution modeling’ or ‘ecological niche modeling’. Here is a more in-depth discussion of this technique.

First make sure we have the packages needed:

if (!require("rspat")) remotes::install_github("rspatial/rspat")
## Loading required package: rspat
## Loading required package: terra
## terra 1.8.8
if (!require("predicts")) install.packages("predicts")
## Loading required package: predicts
if (!require("geodata")) install.packages("geodata")
## Loading required package: geodata

Data

Observations

We get a data set of reported Bigfoot observations

library(terra)
library(rspat)
bf <- spat_data("bigfoot")
dim(bf)
## [1] 3092    3
head(bf)
##         lon      lat Class
## 1 -142.9000 61.50000     A
## 2 -132.7982 55.18720     A
## 3 -132.8202 55.20350     A
## 4 -141.5667 62.93750     A
## 5 -149.7853 61.05950     A
## 6 -141.3165 62.77335     A

It is always good to first plot the locations to see what we are dealing with.

plot(bf[,1:2], cex=0.5, col="red")
library(geodata)
wrld <- geodata::world(path=".")
bnds <- wrld[wrld$NAME_0 %in% c("Canada", "Mexico", "United States"), ]
lines(bnds)

image1

So the are in Canada and in the United States, but no reports from Mexico, so far.

Predictor variables

Here, as is common in species distribution modeling, we use climate data as predictor variables in our model. Specifically, we use “bioclimatic variables”, see: https://www.worldclim.org/data/bioclim.html. Here we used a spatial resolution of 10 minutes (one sixt of a degree). That is relatively coarse but it makes the download and processing faster.

wc <- geodata::worldclim_global("bio", res=10, ".")
plot(wc[[c(1, 12)]], nr=2)

image2

Now extract climate data for the locations of our observations. In that way, we can find out what the climate conditions are that the species likes, apparently.

bfc <- extract(wc, bf[,1:2])
head(bfc, 3)
##   ID wc2.1_10m_bio_1 wc2.1_10m_bio_2 wc2.1_10m_bio_3 wc2.1_10m_bio_4
## 1  1       -1.832979       12.504708        28.95899       1152.4308
## 2  2        6.360650        5.865935        32.27475        462.5731
## 3  3        6.360650        5.865935        32.27475        462.5731
##   wc2.1_10m_bio_5 wc2.1_10m_bio_6 wc2.1_10m_bio_7 wc2.1_10m_bio_8
## 1        20.34075      -22.840000        43.18075        5.327750
## 2        16.65505       -1.519947        18.17500        3.964495
## 3        16.65505       -1.519947        18.17500        3.964495
##   wc2.1_10m_bio_9 wc2.1_10m_bio_10 wc2.1_10m_bio_11 wc2.1_10m_bio_12
## 1      -0.6887083         11.80792       -16.038542              991
## 2      10.4428196         12.28183         1.467686             3079
## 3      10.4428196         12.28183         1.467686             3079
##   wc2.1_10m_bio_13 wc2.1_10m_bio_14 wc2.1_10m_bio_15 wc2.1_10m_bio_16
## 1              120               42         31.32536              337
## 2              448              141         35.27518             1127
## 3              448              141         35.27518             1127
##   wc2.1_10m_bio_17 wc2.1_10m_bio_18 wc2.1_10m_bio_19
## 1              157              288              216
## 2              468              630              873
## 3              468              630              873

I remove the first column with the ID that we do not need.

bfc <- bfc[,-1]

Now we can plot the species’ distribution in a part of the environmental space. Here is a plot of temperature vs rainfall of sites where Bigfoot was observed.

plot(bfc[ ,"wc2.1_10m_bio_1"], bfc[, "wc2.1_10m_bio_12"], col="red",
        xlab="Annual mean temperature (°C)", ylab="Annual precipitation (mm)")

image3

Background data

Normally, one would build a model that would compare the values of the predictor variables at the locations where something was observed, with those values at the locations where it was not observed. But we do not have data from a systematic survey that determined presence and absence. We have presence-only data. (And, determining absence is not that simple. You blink and Bigfoot is gone!).

The common approach to deal with these type of data is to not model presence versus absence, but presence versus “background”. The “background” is the random (or maximum entropy) expectation; it is what you would get if the species had no preference for any of the predictor variables (or to other variables that are not in the model, but correlated with the predictor variables).

There is not much point in taking absence data from very far away (tropical Africa or Antarctica). Typically they are taken from more or less the entire study area for which we have presences data.

To do so, I first get the extent of all points

ext_bf <- ext(vect(bf[, 1:2])) + 1
ext_bf
## SpatExtent : -157.75, -63.4627, 24.141, 70.5 (xmin, xmax, ymin, ymax)

And then I take 5000 random samples (excluding NA cells) from SpatExtent e, by using it as a “window” (blacking out all other areas) on the climate SpatRaster.

set.seed(0)
window(wc) <- ext_bf
bg <- spatSample(wc, 5000, "random", na.rm=TRUE, xy=TRUE)
head(bg)
##           x        y wc2.1_10m_bio_1 wc2.1_10m_bio_2 wc2.1_10m_bio_3
## 1  -99.2500 66.75000     -13.2934895        7.870646        14.96619
## 2 -111.9167 46.58333       6.7605939       14.135854        35.23372
## 3 -106.9167 54.75000       0.4086979       11.528605        24.43290
## 4 -118.2500 67.08333      -9.1363859        8.185354        16.34505
## 5 -127.2500 66.75000      -7.6868439        9.455063        18.23471
## 6 -136.5833 68.25000      -9.6288643        7.821146        17.48318
##   wc2.1_10m_bio_4 wc2.1_10m_bio_5 wc2.1_10m_bio_6 wc2.1_10m_bio_7
## 1       1638.6833        15.42850       -37.16100        52.58950
## 2        927.7927        28.14375       -11.97650        40.12025
## 3       1290.1088        22.55225       -24.63250        47.18475
## 4       1567.0846        17.46575       -32.61275        50.07850
## 5       1618.9269        19.98025       -31.87175        51.85200
## 6       1386.4282        15.41575       -29.31950        44.73525
##   wc2.1_10m_bio_8 wc2.1_10m_bio_9 wc2.1_10m_bio_10 wc2.1_10m_bio_11
## 1        6.484917      -31.617332         7.518209        -31.76942
## 2       15.638333       -4.921750        18.186209         -4.92175
## 3       15.417084      -13.864500        15.417084        -16.31392
## 4        8.609292      -21.353209        10.573625        -27.49783
## 5        9.909041       -9.595167        12.655958        -26.49579
## 6        8.850875      -20.850916         8.850875        -24.08154
##   wc2.1_10m_bio_12 wc2.1_10m_bio_13 wc2.1_10m_bio_14 wc2.1_10m_bio_15
## 1              171               33                4         70.29919
## 2              293               48                9         53.40759
## 3              471               86               16         58.32499
## 4              223               43                7         61.21693
## 5              255               39                8         48.71323
## 6              279               52               10         56.79042
##   wc2.1_10m_bio_16 wc2.1_10m_bio_17 wc2.1_10m_bio_18 wc2.1_10m_bio_19
## 1               90               13               78               13
## 2              129               35              115               35
## 3              220               53              220               56
## 4              108               27               93               29
## 5              108               29              104               46
## 6              126               34              126               35

Instead of using window you could also subset the climate data like this wc <- crop(wc, ext_bf)

Above, with spatSample, I used the argument xy=TRUE to be able to show were these points are from:

plot(bg[, c("x", "y")])

image4

But we otherwise do not need them so I remove them again,

bg <- bg[, -c(1:2)]

We can now compare the climate of the presence and background points, for example, for temperature and rainfall

plot(bg[,1], bg[,12], xlab="Annual mean temperature (°C)",
          ylab="Annual precipitation (mm)", cex=.8)
points(bfc[,1], bfc[,12], col="red", cex=.6, pch="+")
legend("topleft", c("observed", "background"), col=c("red", "black"), pch=c("+", "o"), pt.cex=c(.6, .8))

image5

So we see that while Bigfoot is widespread, it is not common in cold areas, nor in hot and dry areas.

East vs West

I am first going to split the data into East and West. This is because I believe there are two sub-species (The Eastern Sasquatch is darker, less hairy, and has more pointy ears). I am principally interested in the western sub-species. Note how I use the original coordinates to subset the climate data. We can do this because they are in the same order.

#eastern points
bfe <- bfc[bf[,1] > -102, ]
#western points
bfw <- bfc[bf[,1] <= -102, ]

And now I combine the presence (“1”) with the background (“0”) data (I use the same background data for both subspecies)

dw <- rbind(cbind(pa=1, bfw), cbind(pa=0, bg))
de <- rbind(cbind(pa=1, bfe), cbind(pa=0, bg))
dw <- data.frame(dw)
de <- data.frame(na.omit(de))
dim(dw)
## [1] 6224   20
dim(de)
## [1] 6866   20

Fit a model

Now we have the data to fit a model. Let’s first look at a Classification and Regression Trees (CART) model.

CART

library(rpart)
cart <- rpart(pa~., data=dw)

The “complexity parameter” can be used as a stopping parameter to avoid overfitting.

printcp(cart)
##
## Regression tree:
## rpart(formula = pa ~ ., data = dw)
##
## Variables actually used in tree construction:
## [1] wc2.1_10m_bio_15 wc2.1_10m_bio_19 wc2.1_10m_bio_3  wc2.1_10m_bio_9
##
## Root node error: 983.29/6224 = 0.15798
##
## n= 6224
##
##         CP nsplit rel error  xerror     xstd
## 1 0.550212      0   1.00000 1.00019 0.019351
## 2 0.056952      1   0.44979 0.46018 0.016884
## 3 0.035083      2   0.39284 0.40372 0.014433
## 4 0.014062      3   0.35775 0.37740 0.013960
## 5 0.011122      4   0.34369 0.36276 0.014174
## 6 0.010419      5   0.33257 0.35991 0.014393
## 7 0.010000      6   0.32215 0.35034 0.014476
plotcp(cart)

image6

Fit the model again, with fewer splits

cart <- rpart(pa~., data=dw, cp=0.02)

And here is the tree

library(rpart.plot)
rpart.plot(cart, uniform=TRUE, main="Regression Tree")

image7

Question 1: Describe the environmental conditions that Bigfoot appears to enjoy most?

And now we can use the model to show how attractive the climate is for this species.

x <- predict(wc, cart)
x <- mask(x, wc[[1]])
x <- round(x, 2)
plot(x, type="class", plg=list(x="bottomleft"))

image8

Notice that there are six values, because the regression tree has six leaves.

Random Forest

CART gives us a nice result to look at that can be easily interpreted (as you just illustrated with your answer to Question 1). But the approach suffers from high variance (meaning that the model tends to be over-fit, it is different each time a somewhat different datasets are used); and the quality of its predictions suffers from that. Random Forest does not have that problem as much. Above, with CART, we use regression, let’s do both regression and classification here.

But first I set some points aside for validation (normally we would do k-fold cross-validation, but we keep it simple here).

set.seed(123)
i <- sample(nrow(dw), 0.2 * nrow(dw))
test <- dw[i,]
train <- dw[-i,]

First we do classification, by making a categorical variable for presence/background.

fpa <- as.factor(train[, 'pa'])

Now fit the RandomForest model

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
crf <- randomForest(train[, 2:ncol(train)], fpa)
crf
##
## Call:
##  randomForest(x = train[, 2:ncol(train)], y = fpa)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
##
##         OOB estimate of  error rate: 5.46%
## Confusion matrix:
##      0   1 class.error
## 0 3868 129  0.03227421
## 1  143 840  0.14547304

The Out-Of-Bag error rate is very small.

The variable importance plot shows which variables are most important in fitting the model. This is computed by randomizing each variable one by one, and then evaluating the decline in model prediction.

varImpPlot(crf)

image9

Now we use regression, rather than classification. First we tune a parameter.

trf <- tuneRF(train[, 2:ncol(train)], train[, "pa"])
## Warning in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, : The
## response has five or fewer unique values.  Are you sure you want to do
## regression?
## mtry = 6  OOB error = 0.04419892
## Searching left ...
## Warning in randomForest.default(x, y, mtry = mtryCur, ntree = ntreeTry, : The
## response has five or fewer unique values.  Are you sure you want to do
## regression?
## mtry = 3     OOB error = 0.04280804
## 0.03146878 0.05
## Searching right ...
## Warning in randomForest.default(x, y, mtry = mtryCur, ntree = ntreeTry, : The
## response has five or fewer unique values.  Are you sure you want to do
## regression?
## mtry = 12    OOB error = 0.04296318
## 0.02795879 0.05

image10

trf
##    mtry   OOBError
## 3     3 0.04280804
## 6     6 0.04419892
## 12   12 0.04296318
mt <- trf[which.min(trf[,2]), 1]
mt
## [1] 3

Question 2: What did tuneRF help us find? What does the values of mt represent?

rrf <- randomForest(train[, 2:ncol(train)], train[, "pa"], mtry=mt, ntree=250)
## Warning in randomForest.default(train[, 2:ncol(train)], train[, "pa"], mtry =
## mt, : The response has five or fewer unique values.  Are you sure you want to
## do regression?
rrf
##
## Call:
##  randomForest(x = train[, 2:ncol(train)], y = train[, "pa"], ntree = 250,      mtry = mt)
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 3
##
##           Mean of squared residuals: 0.04118506
##                     % Var explained: 74
plot(rrf)

image11

Question 3: What does ``plot(rrf)`` show us?

Predict

We can use the model to make predictions to any other place for which we have values for the predictor variables. Our climate data is global so we could find suitable areas for Bigfoot in Australia, but let’s stick to North America for now.

Regression

rp <- predict(wc, rrf, na.rm=TRUE)
plot(rp)

image12

Note that the regression predictions are well-behaved, in the sense that they are between 0 and 1. However, they are continuous within that range, and if you wanted presence/absence, you would need a threshold. To get the optimal threshold, you would normally have a hold out data set, but here I use the training data for simplicity.

library(predicts)
eva <- pa_evaluate(predict(rrf, test[test$pa==1, ]), predict(rrf, test[test$pa==0, ]))
eva
## @stats
##    np   na prevalence   auc   cor pcor   ODP
## 1 241 1003      0.194 0.978 0.853    0 0.806
##
## @thresholds
##   max_kappa max_spec_sens no_omission equal_prevalence equal_sens_spec
## 1     0.538         0.176       0.007            0.193           0.244
##
## @tr_stats
##     treshold kappa  CCR  TPR  TNR  FPR  FNR  PPP  NPP  MCR    OR
## 1          0     0 0.19    1    0    1    0 0.19  NaN 0.81   NaN
## 2          0  0.27 0.59    1 0.49 0.51    0 0.32    1 0.41   Inf
## 3          0  0.27 0.59    1 0.49 0.51    0 0.32    1 0.41   Inf
## 4        ...   ...  ...  ...  ...  ...  ...  ...  ...  ...   ...
## 569        1  0.13 0.82 0.08    1    0 0.92 0.95 0.82 0.18 90.68
## 570        1  0.13 0.82 0.08    1    0 0.92 0.95 0.82 0.18 90.68
## 571        1     0 0.81    0    1    0    1  NaN 0.81 0.19   NaN

We can make a ROC plot

plot(eva, "ROC")

image13

This suggests that the model is (very near) perfect in disinguising presence from background points. This is perhaps better illustrated with these plots:

par(mfrow=c(1,2))
plot(eva, "boxplot")
plot(eva, "density")

image14

To get a good threshold to determine presence/absence and plot the prediction, we can use the “max specificity + sensitivity” threshold.

tr <- eva@thresholds
tr
##   max_kappa max_spec_sens no_omission equal_prevalence equal_sens_spec
## 1 0.5381667     0.1763667  0.00717467         0.193259          0.2439
plot(rp > tr$max_spec_sens)

image15

Classification

We can also use the classification Random Forest model to make a prediction.

rc <- predict(wc, crf, na.rm=TRUE)
plot(rc)

image16

They are different because the classification used a threshold of 0.5, which is not necessarily appropriate.

You can get probabilities for the classes (in this case there are 2 classes, presence and absence, and I only plot presence)

rc2 <- predict(wc, crf, type="prob", na.rm=TRUE)
plot(rc2, 2)

image17

Extrapolation

Now, let’s see if our model is general enough to predict the distribution of the Eastern species.

eva2 <- pa_evaluate(predict(rrf, de[de$pa==1, ]), predict(rrf, de[de$pa==0, ]))
eva2
## @stats
##     np   na prevalence   auc   cor pcor   ODP
## 1 1866 5000      0.272 0.947 0.738    0 0.728
##
## @thresholds
##   max_kappa max_spec_sens no_omission equal_prevalence equal_sens_spec
## 1     0.134         0.107           0            0.272            0.07
##
## @tr_stats
##      treshold kappa  CCR  TPR  TNR  FPR  FNR  PPP  NPP  MCR    OR
## 1           0     0 0.27    1    0    1    0 0.27  NaN 0.73   NaN
## 2           0  0.36 0.65 0.98 0.52 0.48 0.02 0.43 0.98 0.35 45.33
## 3           0  0.37 0.66 0.98 0.54 0.46 0.02 0.44 0.98 0.34 47.01
## 4         ...   ...  ...  ...  ...  ...  ...  ...  ...  ...   ...
## 1249        1     0 0.73    0    1    0    1    0 0.73 0.27     0
## 1250        1     0 0.73    0    1    0    1    0 0.73 0.27     0
## 1251        1     0 0.73    0    1    0    1  NaN 0.73 0.27   NaN
par(mfrow=c(1,2))
plot(eva2, "ROC")
plot(eva2, "boxplot")

image18

By this measure, it is a terrible model – as we already saw on the map. So our model is really good in predicting the range of the West, but it cannot extrapolate at all to the East.

plot(rc)
points(bf[,1:2], cex=.25)

image19

Question 4: Why would it be that the model does not extrapolate well?

An important question in the biogeography of the Bigfoot would be if it can survive in other parts of the world (it has been spotted trying to get on commerical flights leaving North America).

Let’s see.

window(wc) <- NULL
pm <- predict(wc, rrf, na.rm=TRUE)
plot(pm)
lines(wrld)

image20

Question 5: What are some countries that should consider Bigfoot as a potential invasive species?

Climate change

We can also estimate range shifts due to climate change. We can use the same model, but now extrapolate in time (and space).

fut <- cmip6_world("CNRM-CM6-1", "585", "2061-2080", var="bio", res=10, path=".")
names(fut)
##  [1] "bio01" "bio02" "bio03" "bio04" "bio05" "bio06" "bio07" "bio08" "bio09"
## [10] "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16" "bio17" "bio18"
## [19] "bio19"
names(wc)
##  [1] "wc2.1_10m_bio_1"  "wc2.1_10m_bio_2"  "wc2.1_10m_bio_3"  "wc2.1_10m_bio_4"
##  [5] "wc2.1_10m_bio_5"  "wc2.1_10m_bio_6"  "wc2.1_10m_bio_7"  "wc2.1_10m_bio_8"
##  [9] "wc2.1_10m_bio_9"  "wc2.1_10m_bio_10" "wc2.1_10m_bio_11" "wc2.1_10m_bio_12"
## [13] "wc2.1_10m_bio_13" "wc2.1_10m_bio_14" "wc2.1_10m_bio_15" "wc2.1_10m_bio_16"
## [17] "wc2.1_10m_bio_17" "wc2.1_10m_bio_18" "wc2.1_10m_bio_19"
names(fut) <- names(wc)
window(fut) <- ext_bf
pfut <- predict(fut, rrf, na.rm=TRUE)
plot(pfut)

image21

Question 6: Make a map to show where conditions are improving for western bigfoot, and where they are not. Is the species headed toward extinction?

Further reading

More on Species distribution modeling with R.