4. Environmental data

4.1 Raster data

In species distribution modeling, predictor variables are typically organized as raster (grid) type files. Each predictor should be a ‘raster’ representing a variable of interest. Variables can include climatic, soil, terrain, vegetation, land use, and other variables. These data are typically stored in files in some kind of GIS format. Almost all relevant formats can be used (including ESRI grid, geoTiff, netCDF, IDRISI). Avoid ASCII files if you can, as they tend to considerably slow down processing speed. For any particular study the layers should all have the same spatial extent, resolution, origin, and projection. If necessary, use functions like crop, extend, aggregate, resample, and projectRaster from the raster package. See the Introduction to spatial data manipulation for more information about function you can use to prepare your predictor variable data. See the help files and the vignette of the raster package for more info on how to do this.

The set of predictor variables (rasters) can be used to make a RasterStack, which is a collection of `RasterLayer’ objects (see the Spatial Data tutorial for more info.

Here we make a list of files that are installed with the dismo package and then create a rasterStack from these, show the names of each layer, and finally plot them all.

First get the folder with our files. In this example it is the (“ex”) directory of the dismo package. You do not need such a complex statement to get your own files (you just type the string).

path <- file.path(system.file(package="dismo"), 'ex')

Now get the names of all the files with extension ”.grd” in this folder. The $ sign indicates that the files must end with the characters ‘grd’. By using full.names=TRUE, the full path names are retunred.

library(dismo)
files <- list.files(path, pattern='grd$', full.names=TRUE )
files
## [1] "C:/soft/R/R-3.3.1/library/dismo/ex/bio1.grd"
## [2] "C:/soft/R/R-3.3.1/library/dismo/ex/bio12.grd"
## [3] "C:/soft/R/R-3.3.1/library/dismo/ex/bio16.grd"
## [4] "C:/soft/R/R-3.3.1/library/dismo/ex/bio17.grd"
## [5] "C:/soft/R/R-3.3.1/library/dismo/ex/bio5.grd"
## [6] "C:/soft/R/R-3.3.1/library/dismo/ex/bio6.grd"
## [7] "C:/soft/R/R-3.3.1/library/dismo/ex/bio7.grd"
## [8] "C:/soft/R/R-3.3.1/library/dismo/ex/bio8.grd"
## [9] "C:/soft/R/R-3.3.1/library/dismo/ex/biome.grd"

Now create a RasterStack of predictor variables.

predictors <- stack(files)
predictors
## class       : RasterStack
## dimensions  : 192, 186, 35712, 9  (nrow, ncol, ncell, nlayers)
## resolution  : 0.5, 0.5  (x, y)
## extent      : -125, -32, -56, 40  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## names       : bio1, bio12, bio16, bio17, bio5, bio6, bio7, bio8, biome
## min values  :  -23,     0,     0,     0,   61, -212,   60,  -66,     1
## max values  :  289,  7682,  2458,  1496,  422,  242,  461,  323,    14
names(predictors)
## [1] "bio1"  "bio12" "bio16" "bio17" "bio5"  "bio6"  "bio7"  "bio8"  "biome"
plot(predictors)

We can also make a plot of a single layer in a RasterStack, and plot some additional data on top of it. First get the world boundaries and the bradypus data:

library(maptools)
## Checking rgeos availability: TRUE
data(wrld_simpl)
file <- paste(system.file(package="dismo"), "/ex/bradypus.csv", sep="")
bradypus <- read.table(file,  header=TRUE,  sep=',')
# we do not need the first column
bradypus  <- bradypus[,-1]

And now plot:

# first layer of the RasterStack
plot(predictors, 1)

# note the "add=TRUE" argument with plot
plot(wrld_simpl, add=TRUE)
# with the points function, "add" is implicit
points(bradypus, col='blue')

The example above uses data representing ‘bioclimatic variables’ from the WorldClim database (Hijmans et al., 2004) and ‘terrestrial biome’ data from the WWF (Olsen et al., 2001). You can go to these websites if you want higher resolution data. You can also use the getData function from the raster package to download WorldClim climate data.

Predictor variable selection can be important, particularly if the objective of a study is explanation. See, e.g., Austin and Smith (1987), Austin (2002), Mellert et al., (2011). The early applications of species modeling tended to focus on explanation (Elith and Leathwick 2009). Nowadays, the objective of SDM tends to be prediction. For prediction within the same geographic area, variable selection might arguably be relatively less important, but for many prediction tasks (e.g. to new times or places, see below) variable selection is critically important. In all cases it is important to use variables that are relevant to the ecology of the species (rather than with the first dataset that can be found on the web!). In some cases it can be useful to develop new, more ecologically relevant, predictor variables from existing data. For example, one could use land cover data and the focal function in the raster package to create a new variable that indicates how much forest area is available within x km of a grid cell, for a species that might have a home range of x.

4.2 Extracting values from rasters

We now have a set of predictor variables (rasters) and occurrence points. The next step is to extract the values of the predictors at the locations of the points. (This step can be skipped for the modeling methods that are implemented in the dismo package). This is a very straightforward thing to do using the ‘extract’ function from the raster package. In the example below we use that function first for the Bradypus occurrence points, then for 500 random background points. We combine these into a single data.frame in which the first column (variable ‘pb’) indicates whether this is a presence or a background point. ‘biome’ is categorical variable (called a ‘factor’ in R) and it is important to explicitly define it that way, so that it won’t be treated like any other numerical variable.

presvals <- extract(predictors, bradypus)
# setting random seed to always create the same
# random set of points for this example
set.seed(0)
backgr <- randomPoints(predictors, 500)
absvals <- extract(predictors, backgr)
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals)))
sdmdata <- data.frame(cbind(pb, rbind(presvals, absvals)))
sdmdata[,'biome'] = as.factor(sdmdata[,'biome'])
head(sdmdata)
##   pb bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 biome
## 1  1  263  1639   724    62  338  191  147  261     1
## 2  1  263  1639   724    62  338  191  147  261     1
## 3  1  253  3624  1547   373  329  150  179  271     1
## 4  1  243  1693   775   186  318  150  168  264     1
## 5  1  243  1693   775   186  318  150  168  264     1
## 6  1  252  2501  1081   280  326  154  172  270     1
tail(sdmdata)
##     pb bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 biome
## 611  0  253  2341   928   166  318  178  141  256     1
## 612  0  251  1191   528    74  329  160  169  271     9
## 613  0  175   161    67     7  335   35  300  179    13
## 614  0  263  2789   943   418  318  211  107  263     1
## 615  0  269  1859   840    32  351  184  168  269     1
## 616  0  117  1136   341   243  297  -67  365  201     4
summary(sdmdata)
##        pb              bio1           bio12            bio16
##  Min.   :0.0000   Min.   : -3.0   Min.   :   3.0   Min.   :   2.0
##  1st Qu.:0.0000   1st Qu.:177.0   1st Qu.: 752.2   1st Qu.: 327.5
##  Median :0.0000   Median :242.0   Median :1489.5   Median : 655.5
##  Mean   :0.1883   Mean   :214.3   Mean   :1574.5   Mean   : 643.2
##  3rd Qu.:0.0000   3rd Qu.:261.0   3rd Qu.:2250.8   3rd Qu.: 911.8
##  Max.   :1.0000   Max.   :286.0   Max.   :7682.0   Max.   :2458.0
##
##      bio17             bio5            bio6              bio7
##  Min.   :   0.0   Min.   : 67.0   Min.   :-207.00   Min.   : 68.0
##  1st Qu.:  37.0   1st Qu.:302.8   1st Qu.:  42.75   1st Qu.:117.0
##  Median :  98.5   Median :321.0   Median : 160.00   Median :159.5
##  Mean   : 159.2   Mean   :309.8   Mean   : 119.32   Mean   :190.4
##  3rd Qu.: 230.5   3rd Qu.:333.2   3rd Qu.: 201.25   3rd Qu.:251.5
##  Max.   :1496.0   Max.   :405.0   Max.   : 232.00   Max.   :443.0
##
##       bio8           biome
##  Min.   :-33.0   1      :280
##  1st Qu.:218.8   13     : 79
##  Median :250.5   7      : 63
##  Mean   :225.0   8      : 52
##  3rd Qu.:262.0   2      : 46
##  Max.   :316.0   (Other): 95
##                  NA's   :  1

There are alternative approaches possible here. For example, one could extract multiple points in a radius as a potential means for dealing with mismatch between location accuracy and grid cell size. If one would make 10 datasets that represent 10 equally valid “samples” of the environment in that radius, that could be then used to fit 10 models and explore the effect of uncertainty in location.

To visually investigate colinearity in the environmental data (at the presence and background points) you can use a pairs plot. See Dormann et al. (2013) for a discussion of methods to remove colinearity.

pairs(sdmdata[,2:5], cex=0.1)

A pairs plot of the values of the climate data at the Bradypus occurrence sites.

To use the sdmdata and presvals in the next chapter, we save it to disk.

saveRDS(sdmdata, "sdm.Rds")
saveRDS(presvals, "pvals.Rds")