2. Spatial data

2.1 Introduction

Spatial phenomena can generally be thought of as either discrete objects with clear boundaries or as continuous phenomena that can be observed everywhere but do not have natural boundaries. Discrete objects, or “spatial objects”, may refer to a river or road, a country or town, or a research site. Examples of continuous phenomena, or “spatial fields”, include elevation, temperature, and air quality.

Spatial objects are usually represented by vector data. Such data consists of a description of the “geometry” or “shape” of the locations, and normally also includes variables with additional information about the locations. For example, a vector data set may describe the borders of the countries of the world, and also store their names and the size of their population in 2015; or the roads in an area, and their type and names. These variables are often referred to as “attributes”. Spatial fields are usually represented by raster data. We discuss these two data types in turn.

2.2 Vector data

The main vector data types are points, lines and polygons. In all cases, the geometry of these data structures consists of sets of coordinate pairs (x, y). Points are the simplest case. Each point has one coordinate pair, and n associated variables. For example, a point might represent a place where a rat was trapped, and the attributes could include the date it was captured, the person who captured it, the species, size and sex of the animal, and information about the habitat. It is also possible to combine several points into a multi-point structure, with a single attribute record. For example, all the coffee shops in a town could be considered as a single geometry.

The geometry of lines is just a little bit more complex. First note that in this context, the term ‘line’ refers to a set of one or more polylines (connected series of line segments). For example, in spatial analysis, a river and all its tributaries could be considered as a single ‘line’ (but they could also be several lines, perhaps one for each tributary river). Lines are represented as ordered sets of coordinates (nodes). The actual line segments can be computed (and drawn on a map) by connecting the points. Thus, the representation of a line is very similar to that of a multi-point structure. The main difference is that the ordering of the points is important, because we need to know which points should be connected. A network (e.g. a road or river network), or spatial graph, is a special type of lines geometry where there is additional information about things like flow, connectivity, direction, and distance.
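
To illustrate why the ordering of the nodes matters, here is a minimal base R sketch that computes the length of a polyline by connecting consecutive nodes. The coordinates and the function name line_length are made up for this illustration, and the coordinates are assumed to be planar (not longitude/latitude).

```r
# Hypothetical nodes of one polyline, in planar units (not longitude/latitude)
xy <- cbind(x = c(0, 3, 3, 6), y = c(0, 4, 8, 12))

# Length of a polyline: the sum of the lengths of the segments
# between consecutive nodes
line_length <- function(xy) {
  d <- diff(xy)              # differences between consecutive rows
  sum(sqrt(rowSums(d^2)))    # Euclidean length of each segment, summed
}

# nodes in their original order
line_length(xy)

# the same nodes in a different order describe a different (longer) line
line_length(xy[c(1, 3, 2, 4), ])
```

The same four coordinate pairs treated as a multi-point structure would be identical in both cases; only for a line does the order change the result.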

A polygon refers to a set of closed polylines. The geometry is very similar to that of lines, but to close a polygon the last coordinate pair coincides with the first pair. A complication with polygons is that they can have holes, that is, a polygon entirely enclosed by another polygon that serves to remove parts of the enclosing polygon (for example, to show an island inside a lake). Also, valid polygons do not self-intersect (but it is OK for a line to self-cross). Again, multiple polygons can be considered as a single geometry. For example, the United States state of Hawaii consists of several islands. Each can be represented by a single polygon, but together they can be represented by a single (multi-) polygon of the Hawaiian islands.
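
The closing rule and the effect of a hole can be sketched in base R as follows. The lake and island coordinates are invented, the helper names close_ring and ring_area are hypothetical, and the area is computed with the shoelace formula under the assumption of planar coordinates.

```r
# Close a ring by repeating the first coordinate pair at the end, if needed
close_ring <- function(xy) {
  n <- nrow(xy)
  if (any(xy[1, ] != xy[n, ])) xy <- rbind(xy, xy[1, ])
  xy
}

# Planar area of a ring (shoelace formula); assumes the ring does not self-intersect
ring_area <- function(xy) {
  xy <- close_ring(xy)
  i <- 1:(nrow(xy) - 1)
  abs(sum(xy[i, 1] * xy[i + 1, 2] - xy[i + 1, 1] * xy[i, 2])) / 2
}

# Hypothetical lake (outer ring) with an island (hole) inside it
lake   <- cbind(x = c(0, 10, 10, 0), y = c(0, 0, 10, 10))
island <- cbind(x = c(4, 6, 6, 4),   y = c(4, 4, 6, 6))

# The hole removes part of the enclosing polygon
ring_area(lake) - ring_area(island)
```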

2.3 Raster data

Raster data is commonly used to represent continuous variables. A raster divides the world into a grid of equally sized rectangles (referred to as cells or, in the context of remote sensing, pixels) that all have a value (or a missing value) for each of the variables of interest. A raster cell value should normally represent the average (or majority) value for the area it covers. However, in some cases the values are actually estimates for the center of the cell (in essence becoming a regular set of points with an attribute).

In contrast to vector data, in raster data the geometry is not explicitly stored as coordinates. It is implicitly set by knowing the spatial extent and the number of rows and columns into which the area is divided. From the extent and number of rows and columns, the size of the raster cells (spatial resolution) can be computed. While raster cells can be thought of as a set of regular polygons, it would be very inefficient to represent the data that way, as coordinates for each cell would have to be stored explicitly. It would also dramatically increase processing time in most cases.
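
As a small sketch of how this implicit geometry works, the base R code below derives the resolution and the cell-center coordinates from an extent and the number of rows and columns. The extent and dimensions are invented for illustration, and counting rows from the top of the raster is an assumption (a common convention, but not one stated above).

```r
# Hypothetical extent and dimensions of a raster
xmin <- -120; xmax <- -110
ymin <- 35;   ymax <- 45
nrows <- 5;   ncols <- 10

# Spatial resolution follows from the extent and the number of rows and columns
xres <- (xmax - xmin) / ncols
yres <- (ymax - ymin) / nrows

# Coordinates of the cell centers; rows are counted from the top (an assumption)
xcenter <- xmin + (seq_len(ncols) - 0.5) * xres
ycenter <- ymax - (seq_len(nrows) - 0.5) * yres

# The cell values themselves can be held in a plain matrix
values <- matrix(seq_len(nrows * ncols), nrow = nrows, ncol = ncols, byrow = TRUE)

c(xres = xres, yres = yres)
```

Six numbers (four for the extent, two for the dimensions) describe the geometry of all cells, whereas an explicit polygon representation would need several coordinate pairs for every cell.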

Continuous surface data are sometimes stored as triangulated irregular networks (TINs); these are not discussed here.

2.4 Simple representation of spatial data

The basic data types in R are numbers, characters, logical (TRUE or FALSE) and factor values. Values of a single type can be combined in vectors and matrices, and variables of multiple types can be combined into a data.frame. We can represent (only very) basic spatial data with these data types. Let’s say we have the location (represented by longitude and latitude) of ten weather stations (named A to J) and their annual precipitation.

In the example below we make a very simple map. Note that a map is a special type of plot (like a scatter plot, barplot, etc.). A map is a plot of geospatial data that also has labels and other graphical objects such as a scale bar or legend. The spatial data itself should not be referred to as a map.

name <- LETTERS[1:10]
longitude <- c(-116.7, -120.4, -116.7, -113.5, -115.5,
               -120.8, -119.5, -113.7, -113.7, -110.7)
latitude <- c(45.3, 42.6, 38.9, 42.1, 35.7, 38.9,
              36.2, 39, 41.6, 36.9)
stations <- cbind(longitude, latitude)
# Simulated rainfall data
precip <- (runif(length(latitude))*10)^3

A map of point locations is not that different from a basic x-y scatter plot. Here I make a plot (a map in this case) that shows the location of the weather stations, and the size of the dots is proportional to the amount of precipitation. The point size is set with argument cex.

psize <- 1 + precip/500
plot(stations, cex=psize, pch=20, col='red', main='Precipitation')

# add names to plot
text(stations, name, pos=4)

# add a legend
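breaks <- c(100, 500, 1000, 2000)
# scale the legend symbols the same way as the points on the map
legend.psize <- 1 + breaks/500
legend("topright", legend=breaks, pch=20, pt.cex=legend.psize, col='red', bg='gray')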

Note that the data are represented by “longitude, latitude”, in that order. Do not use “latitude, longitude”, because on most maps latitude (North/South) is used for the vertical axis and longitude (East/West) for the horizontal axis. This is important to keep in mind, as it is a very common source of mistakes!

We can add multiple sets of points to the plot, and even draw lines and polygons:

lon <- c(-116.8, -114.2, -112.9, -111.9, -114.2, -115.4, -117.7)
lat <- c(41.3, 42.9, 42.4, 39.8, 37.6, 38.3, 37.6)
x <- cbind(lon, lat)

plot(stations, main='Precipitation')

polygon(x, col='blue', border='light blue')
lines(stations, lwd=3, col='red')
points(x, cex=2, pch=20)
points(stations, cex=psize, pch=20, col='red')

The above illustrates how numeric vectors representing locations can be used to draw simple maps. It also shows how points can be (and typically are) represented by pairs of numbers, and a line and a polygon by a number of these points. A difference with polygons is that they are “closed”, i.e. the first point coincides with the last point, but the polygon function took care of that for us.

There are cases where a simple approach like this may suffice, and you may come across it in older R code or packages. Likewise, raster data could be represented by a matrix or higher-order array. Such an approach may be practical particularly when dealing only with point data. For example, a spatial data set representing points and attributes could be made by combining geometry and attributes in a single ‘data.frame’.

wst <- data.frame(longitude, latitude, name, precip)
wst
##    longitude latitude name     precip
## 1     -116.7     45.3    A 721.003613
## 2     -120.4     42.6    B  18.716993
## 3     -116.7     38.9    C  51.530302
## 4     -113.5     42.1    D 187.988119
## 5     -115.5     35.7    E 749.127376
## 6     -120.8     38.9    F   8.203534
## 7     -119.5     36.2    G 725.093932
## 8     -113.7     39.0    H 843.038944
## 9     -113.7     41.6    I 288.539816
## 10    -110.7     36.9    J 248.993575

However, wst is a data.frame and R does not automatically understand the special meaning of the first two columns, or to what coordinate reference system they refer (longitude/latitude, or perhaps UTM zone 17S, or ....?).

Moreover, it is non-trivial to do some basic spatial operations. For example, the blue polygon drawn on the map above might represent a state, and a next question might be which of the 10 stations fall within that polygon. And how about any other operation on spatial data, including reading from and writing data to files? To facilitate such operations, a number of R packages have been developed that define new spatial data types that can be used for this type of specialized operation. The most important packages that define such spatial data structures are sp and raster. These data types are discussed in the next chapters.
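
To give a sense of what the point-in-polygon question involves when only plain vectors are available, here is a rough base R sketch using the ray casting technique. The function name point_in_polygon is hypothetical and is not taken from any package; the sketch treats longitude and latitude as planar coordinates and does not handle points that lie exactly on the boundary. It is meant as an illustration of the effort involved, not as a replacement for the spatial packages.

```r
# Station coordinates and polygon nodes, copied from the examples above
longitude <- c(-116.7, -120.4, -116.7, -113.5, -115.5,
               -120.8, -119.5, -113.7, -113.7, -110.7)
latitude <- c(45.3, 42.6, 38.9, 42.1, 35.7, 38.9,
              36.2, 39, 41.6, 36.9)
lon <- c(-116.8, -114.2, -112.9, -111.9, -114.2, -115.4, -117.7)
lat <- c(41.3, 42.9, 42.4, 39.8, 37.6, 38.3, 37.6)

# Ray casting: a horizontal ray starting at the point and running east
# crosses the polygon boundary an odd number of times if the point is inside.
# (hypothetical helper; px, py is the point, vx, vy are the polygon nodes)
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n                      # previous node; the last node connects to the first
  for (i in seq_len(n)) {
    # does the edge between nodes j and i straddle the latitude of the point?
    if ((vy[i] > py) != (vy[j] > py)) {
      # x coordinate where the edge crosses that latitude
      xcross <- vx[i] + (py - vy[i]) * (vx[j] - vx[i]) / (vy[j] - vy[i])
      if (px < xcross) inside <- !inside
    }
    j <- i
  }
  inside
}

inside <- mapply(point_in_polygon, longitude, latitude,
                 MoreArgs = list(vx = lon, vy = lat))

# names of the stations that fall within the polygon
LETTERS[1:10][inside]
```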