8. Remote sensing image analysis

Part I

Introduction

This page provides a short introduction to satellite data analysis with R. Before reading this you should first learn the basics of the raster package.

Getting satellite images for a specific project remains one of the most challenging steps in the workflow. You have to find the data most suitable for you particular objective. A few important properties to consider while searching the remotely sensed (satellite) data include:

  1. Spatial resolution (pixel size) *reference*
  2. Temporal resolution (return time; availability of histrotical images, and for a particular moment in time) *reference*
  3. Spectral resolution (parts of the electromagnetic spectrum (wavelengths) that are sensed; also known as “bands”) *reference*
  4. Radiometric resolution (sensitivity; ability to measure small differences)
  5. Quality (e.g. cloud-cover or artefacts in data (read about problems in Landsat ETM+)

There are numerous sources of remotely sensed data from satellites. Generally, the very high spatial resolution data is available as (costly) commercial products. Lower spatial resolution data is freely available from NASA, ESA and other missions. In this tutorial we’ll use freely available Landsat 8, Landsat 7, Landsat 5, Sentinel and MODIS data. Landsat program is the longest running Earth-observation satellite program with series of satellites launched over the last four decades.

You can access these data from several sources, including:

  1. http://earthexplorer.usgs.gov/
  2. https://lpdaacsvc.cr.usgs.gov/appeears/
  3. https://search.earthdata.nasa.gov/search
  4. https://lpdaac.usgs.gov/data_access/data_pool
  5. https://scihub.copernicus.eu/
  6. https://aws.amazon.com/public-data-sets/landsat/

This web site lists 15 sources of freely available remote sensing data.

It is possible to download some satellite data using R-packages. You can use the MODIS or MODISTools package to search, download and pre-process different MODIS products.

Data

You can download all the data required for completing this tutorial here. Save the data to your working directory in your local computer.

Landsat data used in this tutorial

For the first exercise, we will primarily use a spatial subset of Landsat 8 scene collected on June 14, 2017.The subset covers the area between Concord and Stockton, CA and has multiple land use land cover classes.
All Landsat image scenes have unique product ID and metadata. You can find the information on Landsat sensor, satellite, location on Earth (WRS path, WRS row) and Acquisition date from the product id. E.g. the product identifier of the image tile you are using is ‘LC08_044034_20170614’. Based on this guide, you can see that the Sensor-Satellite is OLI/TIRS combined Landsat 8, WRS Path 44, WRS Row 34 and collected on June 14, 2017. Landsat scenes are most commonly delivered as zipped file, which contains single image layers representing the different bands captured by the sensors at various wavelengths.

We will start by exploring and visualizing these data.

Basic image manipulation and visualization

In this section we describe how to access and explore attributes of remote sensing images with R. All the images used here are in a raster format. You also learn to plot the raster data in R and understand the difference between single- and multi-band rasters.

Read image and display basic image properties

All these products consist of sensors that measure reflectance of the sun’s radiation in different wavelengths (e.g. in the red, green, or blue wavelengths). Thus one image consists of multiple spectral layers. In remote sensing jargon, such layers are referred to as “bands” (shorthand for “bandwidth”” in the electromagnetic spectrum). From such multi-spectral data you can create a RasterBrick or RasterStack in R (as long as each layer has the same extent and resolution).

Another piece of jargon is “pixel”. This is the term that is used for grid cell in remote sensing literature.

library(raster)

#load a single Landsat band
# Blue band
b2 <- raster('data/rs/LC08_044034_20170614_B2.tif')

# Green band
b3 <- raster('data/rs/LC08_044034_20170614_B3.tif')

# Red band
b4 <- raster('data/rs/LC08_044034_20170614_B4.tif')

# Near Infrared (NIR) band
b5 <- raster('data/rs/LC08_044034_20170614_B5.tif')

# Now print the variables
b2
## class       : RasterLayer
## dimensions  : 1245, 1497, 1863765  (nrow, ncol, ncell)
## resolution  : 30, 30  (x, y)
## extent      : 594090, 639000, 4190190, 4227540  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : e:\bitbucket\rspatial-web\source\analysis\_R\data\rs\LC08_044034_20170614_B2.tif
## names       : LC08_044034_20170614_B2
## values      : 0.0748399, 0.7177562  (min, max)
# Also check b3,b4,b5

You can see the spatial resolution, extent, number of layers, coordinate reference system and more.

Image information and statistics

The below shows how you can access various properties from the Raster layer (this is the same for any raster data set).

# coordinate reference system (CRS)
crs(b2)
## CRS arguments:
##  +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84
## +towgs84=0,0,0

# Number of rows, columns, or cells
ncell(b2)
## [1] 1863765
dim(b2)
## [1] 1245 1497    1

# spatial resolution
res(b2)
## [1] 30 30

# Number of bands
nlayers(b2)
## [1] 1

# Do the bands have the same extent, number of rows and columns, projection, resolution, and origin
compareRaster(b2,b3)
## [1] TRUE

You can create RasterStack/RasterBrick (multiband image) using the raster layers (single band image).

landsatRGB <- stack(b4,b3,b2)
landsatFCC <- stack(b5,b4,b3)

# Check the properties of the RasterStack
landsatRGB
## class       : RasterStack
## dimensions  : 1245, 1497, 1863765, 3  (nrow, ncol, ncell, nlayers)
## resolution  : 30, 30  (x, y)
## extent      : 594090, 639000, 4190190, 4227540  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
## names       : LC08_044034_20170614_B4, LC08_044034_20170614_B3, LC08_044034_20170614_B2
## min values  :              0.02084067,              0.04259216,              0.07483990
## max values  :               0.7861769,               0.6924697,               0.7177562

Notice the order of the layers. These are suitable for plotting different color composites. You can learn more about color composites in remote sensing here and also in the section below.

You can also create the RasterStack/RasterBrick using layers from disk.

# first create a list of raster layers to use
raslist <- paste0("data/rs/LC08_044034_20170614_B", 1:11, ".tif")
raslist
##  [1] "data/rs/LC08_044034_20170614_B1.tif"
##  [2] "data/rs/LC08_044034_20170614_B2.tif"
##  [3] "data/rs/LC08_044034_20170614_B3.tif"
##  [4] "data/rs/LC08_044034_20170614_B4.tif"
##  [5] "data/rs/LC08_044034_20170614_B5.tif"
##  [6] "data/rs/LC08_044034_20170614_B6.tif"
##  [7] "data/rs/LC08_044034_20170614_B7.tif"
##  [8] "data/rs/LC08_044034_20170614_B8.tif"
##  [9] "data/rs/LC08_044034_20170614_B9.tif"
## [10] "data/rs/LC08_044034_20170614_B10.tif"
## [11] "data/rs/LC08_044034_20170614_B11.tif"

landsat <- stack(raslist)
landsat
## class       : RasterStack
## dimensions  : 1245, 1497, 1863765, 11  (nrow, ncol, ncell, nlayers)
## resolution  : 30, 30  (x, y)
## extent      : 594090, 639000, 4190190, 4227540  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
## names       : LC08_044034_20170614_B1, LC08_044034_20170614_B2, LC08_044034_20170614_B3, LC08_044034_20170614_B4, LC08_044034_20170614_B5, LC08_044034_20170614_B6, LC08_044034_20170614_B7, LC08_044034_20170614_B8, LC08_044034_20170614_B9, LC08_044034_20170614_B10, LC08_044034_20170614_B11
## min values  :            9.641791e-02,            7.483990e-02,            4.259216e-02,            2.084067e-02,            8.457669e-04,           -7.872183e-03,           -5.052945e-03,            3.931751e-02,           -4.337332e-04,             2.897978e+02,             2.885000e+02
## max values  :              0.73462820,              0.71775615,              0.69246972,              0.78617686,              1.01243150,              1.04320455,              1.11793602,              0.82673049,              0.03547901,             322.43139648,             317.99530029

This will create a RasterStack with 11 layers. The layers are Ultra Blue, Blue, Green, Red, Near Infrared (NIR), Shortwave Infrared (SWIR) 1, Shortwave Infrared (SWIR) 2, Panchromatic, Cirrus, Thermal Infrared (TIRS) 1, Thermal Infrared (TIRS) 2. There is no need for the last four layers and we will learn how to remove those in following sections.

Visualize single and multi-band imagery

You can plot individual layers of a multi-spectral image.

par(mfrow = c(2,2))
plot(b2, main = "Landsat Blue Wavelength", col = gray(0:100 / 100))
plot(b3, main = "Landsat Green Wavelength", col = gray(0:100 / 100))
plot(b4, main = "Landsat Red Wavelength", col = gray(0:100 / 100))
plot(b5, main = "Landsat NIR Wavelength", col = gray(0:100 / 100))

Check the legends. They represent the range of each layers and are scaled between 0 and 1. Notice the difference in shading and range of legends between the different bands. This is because different Earth surface features reflect the incident solar radiation differently. Each layer represent how much incident solar radiation is reflected for that particular wavelength. E.g. vegetation reflects more energy in NIR than other wavelengths and thus appears brighter in NIR wavelength. However water absorbs most of the incident energy in NIR regions and appears dark.

However we did not get much information from these grey-scale plots, but these are often combined to create more interesting plots. To combine three bands, we can use plotRGB. To make a “true/natural color” image (that is, something that looks like a normal photograph), we need to select the bands that we want to render in the red, green and blue regions. For this Landsat image, r = 3 (red), g = 2(green), b = 1(blue) will plot the true color composite (vegetation in green, water blue etc). You can also supply additional arguments to plotRGB to improve the visualization (e.g. a linear stretch of the values, using strecth = "lin").

plotRGB(landsatRGB, r = 1, g = 2, b = 3, axes = TRUE, stretch = "lin", main = "Landsat True Color Composite")

This RGB-plot has lot more information than the previous ones, although it has been created using the same bands. Another popular image visualization method in remote sensing is known “false color” image where red, green and blue bands are replaced by other bands. Selecting r = NIR, g = red, b = green will plot a “false color”” composite. This representation is popular as it makes it easy to see the vegetation (in red).

par(mfrow = c(1,2))
plotRGB(landsatRGB, r = 1, g = 2, b = 3, axes = TRUE, stretch = "lin", main = "Landsat True Color Composite")
plotRGB(landsatFCC, r = 1, g = 2, b = 3, axes = TRUE, stretch = "lin", main = "Landsat False Color Composite")

Note: Always check for package documentation (help(plotRGB)) for other arguments that can be added (like scale) to improve or modify the image.

Exercise Use the RasterStack landsat to create a true and false color composite (hint remember the position of the bands in the stack).

Subset and rename spectral bands

You can select specific layers (bands) using subset function, or via indexing.

# select first 3 bands only
landsatsub1 <- subset(landsat, 1:3)
# same
landsatsub2 <- landsat[[1:3]]

# Number of bands in orginal and new data
nlayers(landsat)
## [1] 11
nlayers(landsatsub1)
## [1] 3
nlayers(landsatsub2)
## [1] 3

As mentioned before, there is no need for the last four bands inlandsat. You can remove those using

landsat <- subset(landsat, 1:7)

Set the names of the bands using the following:

names(landsat)
## [1] "LC08_044034_20170614_B1" "LC08_044034_20170614_B2"
## [3] "LC08_044034_20170614_B3" "LC08_044034_20170614_B4"
## [5] "LC08_044034_20170614_B5" "LC08_044034_20170614_B6"
## [7] "LC08_044034_20170614_B7"
names(landsat) <- c('ultra-blue','blue','green','red','NIR','SWIR1','SWIR2')
names(landsat)
## [1] "ultra.blue" "blue"       "green"      "red"        "NIR"
## [6] "SWIR1"      "SWIR2"

Spatial subset or crop

Spatial subsetting can be used to limit analysis to a geographic subset of the image. Spatial subsets can be created with the crop function, using an extent object, or another spatial object from which an Extent can be extracted/created,.

# Using extent
extent(landsat)
## class       : Extent
## xmin        : 594090
## xmax        : 639000
## ymin        : 4190190
## ymax        : 4227540
e <- extent(624387, 635752, 4200047, 4210939)

# crop landsat by the extent
landsatcrop <- crop(landsat, e)

Exercise Interactive selection from the image is also possible. Use drawExtent and drawPoly to select an area of interest. Note: drawing will not work for plots generated within the markdown document. Please run the plot and draw commands from RStudio console.

Exercise Use the RasterStack landsatcrop to create a true and false color composite (hint remember the position of the bands in the stack).

Saving results to disk

At this stage we may want to save the raster to disk using the function writeRaster. Multiple file types are supported. Most commonly we will use GeoTiff format which saves the CRS, extent and resolution information so the data can be used in a different GIS program as well. While layer order is preserved, layer names are lost in GeoTiff format.

writeRaster(landsatcrop,filename = "cropped-landsat.tif", format = "GTiff", overwrite = TRUE)

Note: Check for package documentation (help(writeRaster)) for additional helpful arguments that can be added.

Relation between bands

A scatterplot matrix can be helpful in exploring relationships between raster layers. This can be done with the pairs() function of the raster package.

# Plot between ultra-blue and blue
pairs(landsatcrop[[1:2]], main = "Scatterplot between Ultra-blue and Blue")
# Plot between ultra-blue and blue
pairs(landsatcrop[[4:5]], main = "Scatterplot between Red and NIR")

The first plot reveals high correlations between the blue wavelength regions. Because of the high correlation, we can just use one of the Blue bands without losing much information.

This distribution of points in second plot (between NIR and Red) is unique due to its triangular shape. Vegetation reflects very highly in the NIR range than red and creates the upper corner close to NIR (y) axis. Water absorbs energy from all the bands and occupies the location close to origin. The furthest corner is created due to highly reflecting surface features like bright soil, concrete.

Extract raster values

Often we require value(s) of raster cell(s) for a geographic location/area. The extract function is used to get raster values at the locations of other spatial data. You can use points, lines, polygons or an Extent (rectangle) object. You can also use cell numbers to extract values.If using points, extract returns the values of a Raster* object for the cells in which a set of points fall.

# load the polygons with land use land cover information
samp <- readRDS('data/rs/samples.rds')

# generate 300 point samples from the polygons;
ptsamp <- spsample(samp, 300, type = 'random')
# add the class information to the point samples from polygons
ptsamp$class <- over(ptsamp, samp)$class

# extract values with points
df <- extract(landsat, ptsamp)

# To see some of the reflectance values
head(df)
##      ultra.blue       blue      green        red        NIR       SWIR1
## [1,]  0.1385764 0.11977421 0.10047328 0.07937237 0.05684016 0.040401835
## [2,]  0.1328511 0.11394056 0.09500829 0.09771910 0.13987754 0.246097729
## [3,]  0.1415908 0.12389463 0.10251180 0.08429519 0.06798699 0.051505294
## [4,]  0.1159357 0.09737211 0.08305907 0.05916061 0.01953948 0.005573411
## [5,]  0.1476847 0.14430158 0.15241231 0.19398521 0.29747289 0.382527143
## [6,]  0.1310945 0.13076924 0.14709912 0.20628142 0.34054217 0.341279507
##           SWIR2
## [1,] 0.03292001
## [2,] 0.23796532
## [3,] 0.04287409
## [4,] 0.00344814
## [5,] 0.26268786
## [6,] 0.19194669

Spectral profiles

A plot of the spectrum (all bands) for pixels representing a certain earth surface features (e.g. water) is known as a spectral profile. Such profiles demonstrate the differences in spectral properties of various earth surface features and constitute the basis for image analysis. Spectral values can be extracted from any multispectral data set using extract function. In the above example, we extracted values of Landsat data for the samples. These samples include: cropland, water, fallow, built and open. First we compute the mean reflectance values for each class and each band.

ms <- aggregate(df, list(ptsamp$class), mean)

# instead of the first column, we use rownames
rownames(ms) <- ms[,1]
ms <- ms[,-1]
ms
##          ultra.blue       blue      green        red        NIR      SWIR1
## built     0.1783854 0.16950935 0.16457388 0.17336592 0.21106328 0.21559484
## cropland  0.1133169 0.09157923 0.08522944 0.05855339 0.45845483 0.15963042
## fallow    0.1327631 0.11694913 0.10429801 0.11285206 0.17614686 0.23043178
## open      0.1393757 0.13847506 0.15429326 0.20952121 0.34989013 0.36128512
## water     0.1336519 0.11644859 0.09873222 0.07845756 0.04952641 0.03396385
##               SWIR2
## built    0.18320976
## cropland 0.07551479
## fallow   0.19381964
## open     0.21437596
## water    0.02759617

Now we will plot the mean spectra of these features.

# Create a vector of color for the land cover classes for use in plotting
mycolor <- c('darkred','yellow','burlywood','cyan','blue')

#transform ms from a data.frame to a matrix
ms <- as.matrix(ms)

# First create an empty plot
plot(0, ylim=c(0,0.6), xlim = c(1,7), type='n', xlab="Bands", ylab = "Reflectance")

# add the different classes
for (i in 1:nrow(ms)){
  lines(ms[i,], type = "l", lwd = 3, lty = 1, col = mycolor[i])
}

# Title
title(main="Spectral Profile from Landsat", font.main = 2)

# Legend
legend("topleft", rownames(ms),
       cex=0.8, col=mycolor, lty = 1, lwd =3, bty = "n")

The spectral profile shows (dis)similarity in the reflectance of different features on the earth’s surface (or above it). Crop shows similar spectral signatures as vegetation. Water shows relatively low reflection in all wavelengths. Built, fallow and open show relatively higher reflectance in the longer wavelength regions.

Basic mathematical operations

The raster package supports many mathematical operations. Math operations are generally performed per pixel. First we will learn about basic arithmetic operations on bands. First example is a custom math function that calculates the Normalized Difference Vegetation Index (NDVI). Learn more about [vegetation indices here] (http://www.un-spider.org/links-and-resources/data-sources/daotm/daotm-vegetation) and NDVI.

Compute vegetation indices

Let’s define a general function for ratio based vegetation index.

# i and k are the index of bands to be used for the indices computation

VI <- function(img, k, i) {
  bk <- img[[k]]
  bi <- img[[i]]
  vi <- (bk - bi) / (bk + bi)
  return(vi)
}

# For Landsat NIR = 5, red = 4.
ndvi <- VI(landsat,5, 4)
plot(ndvi, col = rev(terrain.colors(10)), main = 'Landsat-NDVI')

You can see the variation in greenness from the plot.

Exercise Adapt the function to compute indices which will highlight i) water and ii) built-up. Hint: Use the spectral profile plot to find the bands having maximum and minimum reflectance for these two classes.

Histogram

We can explore the distribution of values contained within our raster using the hist() function which produces a histogram. Histograms are often useful in identifying outliers and bad data values in our raster data.

# view histogram of data
hist(ndvi,
     main = "Distribution of NDVI values",
     xlab = "NDVI",
     ylab="Frequency",
     col = "wheat",
     xlim = c(-0.5, 1),
     breaks = 30,
     xaxt = 'n')
axis(side=1, at = seq(-0.5,1, 0.05), labels = seq(-0.5,1, 0.05))

image0 We will refer to this histogram for the following sub-section on thresholding.

Exercise Plot histogram of other vegetation indices you derived.

Thresholding

We can apply basic rules to get an estimate of spatial extent of different Earth surface features. Note that NDVI values are standardized and ranges between -1 to +1. Higher values indicate more green cover.

Pixels having NDVI values greater than 0.4 are definitely vegetation. Following operation masks all non-vegetation pixels.

veg <- calc(ndvi, function(x){x[x < 0.4] <- NA; return(x)})
plot(veg, main = 'Veg cover')

Let’s find the surface feature corresponding to the peak between 0.25 and 0.3 in the NDVI histogram.

land <- reclassify(ndvi, c(-Inf,0.25,NA,0.25,0.3,1,0.3,Inf,NA))
plot(land, main = 'What is it?')

My best guess is these are the open areas. You can plot land over original landsatFCC raster to find out more.

plotRGB(landsatRGB, r = 1, g = 2, b = 3, axes = TRUE, stretch = "lin", main = "Landsat False Color Composite")
plot(land, add = TRUE, legend = FALSE)

You can also create classes for different amount of vegetation presence.

vegc <- reclassify(veg, c(-Inf,0.25,1, 0.25,0.3,2, 0.3,0.4,3, 0.4,0.5,4, 0.5,Inf, 5))
plot(vegc,col = rev(terrain.colors(4)), main = 'NDVI based thresholding')

Exercise Is it possible to find water using thresholding of NDVI or any other indices?

Principal component analysis

Multi-spectral data are sometimes transformed to helps to reduce the dimensioanlity and noise in the data. The principal components transform is a generic data reduction method that can be used to create a few uncorrelated bands from a larger set of correlated bands.

You can calculate the same number of principal components as the number of input bands. The first principal component (PC) explains the largest percentage of variance and other PCs explain additional the variance in decreasing order.

set.seed(1)
sr <- sampleRandom(landsat, 10000)
plot(sr[,c(4,5)], main = "NIR-Red plot")

This is known as vegetation and soil-line plot (Same as the scatter plot in earlier section).

Exercise Can you guess the directions of ‘principal components’ from this scatter plot?

pca <- prcomp(sr, scale = TRUE)
pca
## Standard deviations (1, .., p=7):
## [1] 2.20737373 1.17428246 0.73633573 0.42388639 0.12500714 0.09389890
## [7] 0.04741454
##
## Rotation (n x k) = (7 x 7):
##                  PC1           PC2         PC3         PC4         PC5
## ultra.blue 0.3662344 -0.4524940727 -0.15415991  0.52595682 -0.21276947
## blue       0.4125282 -0.3336052918 -0.15782033  0.09509849 -0.21537031
## green      0.4372284 -0.1056685710 -0.26457625 -0.20443752  0.49716419
## red        0.4333279 -0.0003693985 -0.05074204 -0.67055674  0.01651199
## NIR        0.1762472  0.6825385571 -0.58744574  0.32192859  0.09371265
## SWIR1      0.3815340  0.4119308383  0.29523433 -0.08811728 -0.65630314
## SWIR2      0.3743258  0.1929889704  0.66820379  0.33388919  0.47051465
##                   PC6         PC7
## ultra.blue  0.1337062  0.54550658
## blue        0.0836529 -0.79447772
## green      -0.6553920  0.09570391
## red         0.5619375  0.20966468
## NIR         0.2090330 -0.04325163
## SWIR1      -0.3878540  0.09301774
## SWIR2       0.1889099 -0.08709876
screeplot(pca)
pci <- predict(landsat, pca, index = 1:2)
plot(pci[[1]])

The first principal component highlights the boundaries between land use classes or spatial details, which is the most common information among all wavelengths. it is difficult to understand what the second principal component is highlighting. Lets try thresholding again:

pc2 <- reclassify(pci[[2]], c(-Inf,0,1,0,Inf,NA))
par(mfrow = c(1,2))
plotRGB(landsatFCC, r = 1, g = 2, b = 3, axes = TRUE, stretch = "lin", main = "Landsat False Color Composite")
plotRGB(landsatFCC, r = 1, g = 2, b = 3, axes = TRUE, stretch = "lin", main = "Landsat False Color Composite")
plot(pc2, legend = FALSE, add = TRUE)

Exercise Can you guess the surface features dominating second principal component? (hint: Inspect the unmasked areas; there could be multiple features).

To learn more about the information contained in the vegetation and soil line plots read this paper by [Gitelson et al] (http://www.tandfonline.com/doi/abs/10.1080/01431160110107806#.V6hp_LgrKhd).
An extension of PCA in remote sensing is known as Tasseled-cap Transformation.

Part II

Image classification

We will explore two groups of classification methods: unsupervised and supervised. Various unsupervised and supervised classification algorithms exist, and the choice of algorithm can affect the results. We will explore one unsupervised (k-means) and two supervised (decision tree, random forest) algorithms.

For this example, we will follow the National Land Cover Database 2011 (NLCD 2011) classification scheme for a subset of the Central Valley regions. You will use cloud-free composite image from Landsat 5 with 6 bands.

landsat5 <- stack('data/rs/centralvalley-2011LT5.tif')
names(landsat5) <- c('blue','green','red','NIR','SWIR1','SWIR2')

Exercise 1 Make a 3-band False Color Composite plot of landsat5.

Unsupervised classification

In unsupervised classification, we don’t supply any training data. This particularly is useful when we don’t have prior knowledge of the study area. The algorithm groups pixels with similar spectral characteristics into unique clusters/classes/groups following some statistically determined conditions (e.g. minimizing mean square root error within each cluster). You have to re-label and combine these spectral clusters into information classes (for e.g. land-use land-cover). Unsupervised algorithms are often referred as clustering.

To get satisfactory results from unsupervised classification, a good practice is to start with large number of centers (more clusters) and merge/group/recode similar clusters by inspecting the original imagery.

Learn more about K-means and other unsupervised-supervised algorithms here.

We will perform unsupervised classification on a spatial subset of the ndvi layer.

Exercise 2 Please generate ndvi layer for landsat5 (hint last week’s lab instructions)

Now we will perform unsupervised kmeans clustering on the ndvi layer you generated in exercise 2. First we will crop the ndvi layer for smaller extent for faster processing (you can select any extent using the drawExtent() function). Please DO NOT use landsat5 RasterStack for the clustering.

kmeans classification

# Extent to crop ndvi layer
e <- extent(-121.807, -121.725, 38.004, 38.072)

# crop landsat by the extent
ndvi <- crop(ndvi, e)
ndvi
## class       : RasterLayer
## dimensions  : 252, 304, 76608  (nrow, ncol, ncell)
## resolution  : 0.0002694946, 0.0002694946  (x, y)
## extent      : -121.807, -121.725, 38.00413, 38.07204  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names       : layer
## values      : -0.3360085, 0.7756007  (min, max)

# convert the raster to vecor/matrix
nr <- getValues(ndvi)
str(nr)
##  num [1:76608] 0.245 0.236 0.272 0.277 0.277 ...

Please note that getValues converted the ndvi RasterLayer to array. Now we will perform the kmeans clustering on the array object and inspect the output.

# It is important to set the seed generator because `kmeans` initiates the centers in random locations
set.seed(99)

# We want to create 10 clusters, allow 500 iterations, start with 5 random sets using "Lloyd" method
kmncluster <- kmeans(na.omit(nr), centers = 10, iter.max = 500, nstart = 5, algorithm="Lloyd")

# kmeans returns an object of class "kmeans"
str(kmncluster)
## List of 9
##  $ cluster     : int [1:76608] 6 6 3 3 3 3 3 6 6 6 ...
##  $ centers     : num [1:10, 1] 0.38986 0.00498 0.29997 -0.14263 -0.20902 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:10] "1" "2" "3" "4" ...
##   .. ..$ : NULL
##  $ totss       : num 6459
##  $ withinss    : num [1:10] 5.18 6.13 4.91 6.11 5.75 ...
##  $ tot.withinss: num 55.8
##  $ betweenss   : num 6403
##  $ size        : int [1:10] 8736 4550 7156 8624 11672 6807 8932 5198 5040 9893
##  $ iter        : int 221
##  $ ifault      : NULL
##  - attr(*, "class")= chr "kmeans"

kmeans returns an object with 9 elements. The length of the cluster element within kmncluster is 76608 which same as length of nr created from the ndvi. The cell values of kmncluster$cluster range between 1 to 10 corresponding to the input number of cluster we provided in the kmeans function. kmncluster$cluster indicates the cluster label for corresponding pixel. We need to convert the kmncluster$cluster array back to RasterLayer of the same dimension as the ndvi.

# First create a copy of the ndvi layer
knr <- ndvi

# Now replace raster cell values with kmncluster$cluster array
knr[] <- kmncluster$cluster

# Alternative
values(knr) <- kmncluster$cluster
knr
## class       : RasterLayer
## dimensions  : 252, 304, 76608  (nrow, ncol, ncell)
## resolution  : 0.0002694946, 0.0002694946  (x, y)
## extent      : -121.807, -121.725, 38.00413, 38.07204  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names       : layer
## values      : 1, 10  (min, max)

We can see that knr is a RasterLayer but we do not know which cluster (1-10) belongs what LULC class. You can find that out by plotting them side-by-side with other reference layer and using unique color for each cluster.

# Define a color vector for 10 clusters (learn more about setting the color later)
mycolor <- c("#fef65b","#ff0000", "#daa520","#0000ff","#0000ff","#00ff00","#cbbeb5",
             "#c3ff5b", "#ff7373", "#00ff00", "#808080")

par(mfrow = c(1,2))
plot(ndvi, col = rev(terrain.colors(10)), main = 'Landsat-NDVI')
plot(knr, main = 'Unsupervised classification', col = mycolor )

While for other purposes it is usually better to define more classes (and possibly fuse classes later), a simple classification like this one could be useful, e.g., merge cluster 4 and 5 to construct a water mask for the year 2011.

Important These 10 clusters are not useful unless you re-label them with LULC information class. You can save the raster output of clustering and open it in GIS platform (e.g. QGIS) and assign classes to each clusters. Multiple clusters can have the same class! A tutorial can be found [here] (http://wiki.awf.forst.uni-goettingen.de/wiki/index.php/Unsupervised_classification_(Tutorial)). You can skip step 1 in the tutorial which provides guidelines for generating clusters.

You can change the colors in my mycolor. Learn more about selecting colors in R here and here.

Exercise 3 Plot 3-band RGB of landsat5 for the subset (extent e) and result of kmeans clustering side-by-side and make a table of land-use land-cover labels for the clusters. E.g. cluster 4 and 5 are water.

Supervised classification

In a supervised classification, we have prior knowledge about some of the land-cover types through, for example, fieldwork, reference GIS layers or interpretation of high resolution imagery (such as available on Google maps). Specific sites in the study area that represent homogeneous examples of these known land-cover types are identified. These areas are commonly referred to as training sites because the spectral properties of these sites are used to train the classification algorithm.

The following examples uses a Classification and Regression Trees (CART) classifier (Breiman et al. 1984) (further reading to predict land use land cover classes in the study area.

We will perform the following steps before the classification:

  • Read NLCD 2011 reference raster
  • Generate sample sites based on the NLCD 2011 reference raster
  • Extract layervalues from landsat5 for the sample sites
  • Split the sample sites into training and validation samples
  • Train the classifier using training samples
  • Classify landsat5 RasterStack using the trained model
  • Test the accuracy of the classified map using validation samples

Read NLCD 2011 reference raster

National Land Cover Database 2011 (NLCD 2011) is the most recent national land cover product created by the Multi-Resolution Land Characteristics (MRLC) Consortium covering entire USA. NLCD is a 30-m Landsat-based land cover database spanning 4 epochs (1992, 2001, 2006 and 2011). NLCD 2011 is based primarily on a decision-tree classification of circa 2011 Landsat satellite data.

You can find the classnames in NCLD 2011 (here)[https://www.mrlc.gov/nlcd11_leg.php]. It has two pairs of classvalues and classnames that correspond to the levels of land use and land cover classification system. These levels usually represent the level of complexity, level I being the simplest with broad land use land cover categories. Read this report by Anderson et al to learn more about this land use and land cover classification system for use with Remote Sensor data.

nlcd <- brick('data/rs/nlcd-L1.tif')
names(nlcd) <- c("nlcd2001","nlcd2011")

# Now we will supply the classnames and colors for plotting
nlcdclass <- c("Water", "Developed", "Barren", "Forest", "Shrubland", "Herbaceous", "Planted/Cultivated", "Wetlands")
classdf <- data.frame(classvalue1 = c(1,2,3,4,5,7,8,9), classnames1 = nlcdclass)

# Hex codes of colors
classcolor <- c("#5475A8", "#B50000", "#D2CDC0", "#38814E", "#AF963C", "#D1D182", "#FBF65D", "#C8E6F8")

# Now we ratify (RAT = "Raster Attribute Table") the ncld2011 (define RasterLayer as a categorical variable). This is helpful for plotting.
nlcd2011 <- nlcd[[2]]
nlcd2011 <- ratify(nlcd2011)
rat <- levels(nlcd2011)[[1]]

#
rat$landcover <- nlcdclass
levels(nlcd2011) <- rat

We did a lot of things here. Take a step back and read more about ratify.

Note There is no class with value 6.

Generate sample sites based on the NLCD 2011 reference raster

As we discussed in the class, training and/or validation data can come from a variety of sources. In this example we will generate the training and validation sample sites using the NLCD reference RasterLayer. Alternatively, you can use predefined sites that you may have collected from other sources. We will generate the sample sites following a stratified random sampling to ensure samples from each LULC class.

# Load the training sites locations
# Set the random number generator to reproduce the results
set.seed(99)

# Sampling
samp2011 <- sampleStratified(nlcd2011, size = 200, na.rm = TRUE, sp = TRUE)
samp2011
## class       : SpatialPointsDataFrame
## features    : 1600
## extent      : -121.9257, -121.4207, 37.85415, 38.18536  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## variables   : 2
## names       :    cell, nlcd2011
## min values  :     829,        1
## max values  : 2308569,        9
# Number of samples in each class
table(samp2011$nlcd2011)
##
##   1   2   3   4   5   7   8   9
## 200 200 200 200 200 200 200 200

You can see there are two variables in samp2011. The cell column contains cell numbers of nlcd2011 sampled. nlcd2011 column contains the class values (1-9). We will drop the cell column later.

Here nlcd has integer values between 1-9. You will often find classnames are provided as string labels (e.g. water, crop, vegetation). You will need to ‘relabel’ class names to integer or factors if only string labels are supplied before using them as response variable in the classification. There are several approaches that could be used to convert these classes to integer codes. We can make a function that will reclassify the character strings representing land cover classes into integers based on the existing factor levels.

Let’s plot the training sites over the nlcd2011 RasterLayer to visualize the distribution of sampling locations.

require(rasterVis)
## Loading required package: rasterVis
## Loading required package: lattice
## Loading required package: latticeExtra
## Loading required package: RColorBrewer
levelplot(nlcd2011, col.regions = classcolor, main = 'Distribution of Training Sites')+
  layer(sp.points(samp2011, pch = 3, cex = 0.5, col = 1))

rasterVis offers more advanced (trellis/lattice) plotting of Raster* objects. Please install the package if it is not available for your machine.

Extract layervalues from landsat5 for the sample sites

Once we have the sites, we can extract the band values from landsat5 RasterStack. These band values will be the predictor variables and classvalues (nlcd2011) will be the response variable.

# Extract the layer values for the locations
sampvals <- extract(landsat5, samp2011, df = TRUE)

# sampvals no longer has the spatial information. To keep the spatial information you use `sp = TRUE` argument in `extract` function.

# drop the ID column
sampvals <- sampvals[, -1]

# combine the class information with extracted values
sampdata <- data.frame(classvalue = samp2011@data$nlcd2011, sampvals)

Split the sample sites into training and validation samples

We will now split the sample sites into training and validation. Using the function dismo::kfold we can split the dataset into k groups. One of the groups will be used for validation while rest of them will be used for training.

library(dismo)
set.seed(99)

j <- kfold(sampdata, k = 5, by = sampdata$classvalue)

training2011 <- sampdata[j!= 1, ] # selected the rows where j equals 1
validation2011 <- sampdata[j == 1, ] # selected the rows where j equals [2:k]

# you can check the number of samples in each category
# uncomment and run the following
#table(samp2011$classnames)
#table(training2011$classnames)
#table(validation2011$classnames)

We have performed a 80-20 split for the training and validation samples. Number of validation samples are generally lower than training samples. You can try different partition ratio or use cross-validation.

Train the classifier using training samples

Now we will train the classification algorithm using training2011 dataset.

library('rpart')
#Please install the package if it is not available for your machine.

# Train the model
cart <- rpart(as.factor(classvalue)~., data = training2011, method = 'class', minsplit = 5)

# print(model.class)

# Much cleaner way is to plot the trained classification tree
plot(cart, uniform=TRUE, main="Classification Tree")
text(cart, cex = 0.8)

In the classification tree plot classvalues are printed at the leaf node. You can find the corresponding land use land cover names from the classdf dataframe.

See ?rpart.control to set different parameters for building the model.

You can print/plot more about the cart model created in the previous example. E.g. you can use plotcp(cart) to learn about the cost-complexity (cp argument in rpart).

Classify landsat5 RasterStack using the trained model

Once we have a trained classification model (cart), we can use that to classify the landsat5 RasterStack.

Important The names in the Raster object to be classified should exactly match those expected by the model. This will be the case if the same Raster object was used (via extract) to obtain the values to fit the model (previous example).

# Now predict the subset data based on the model; prediction for entire area takes longer time
pr2011 <- predict(landsat5, cart, type='class', progress = 'text')
##   |                                                                         |                                                                 |   0%  |                                                                         |================                                                 |  25%  |                                                                         |================================                                 |  50%  |                                                                         |=================================================                |  75%  |                                                                         |=================================================================| 100%
##

# Check the type of predicted variable
pr2011
## class       : RasterLayer
## dimensions  : 1230, 1877, 2308710  (nrow, ncol, ncell)
## resolution  : 0.0002694946, 0.0002694946  (x, y)
## extent      : -121.9258, -121.42, 37.85402, 38.1855  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names       : layer
## values      : 1, 9  (min, max)
## attributes  :
##        ID value
##  from:  1     1
##  to  :  8     9

Now plot the classification result using rasetrVis. See will set the classnames for the classvalues.

pr2011 <- ratify(pr2011)

rat <- levels(pr2011)[[1]]

rat$legend <- classdf$classnames

levels(pr2011) <- rat

levelplot(pr2011, maxpixels = 1e6,
          col.regions = classcolor,
          scales=list(draw=FALSE),
          main = "Decision Tree classification of Landsat 5")

Exercise 4 Plot nlcd2011 and pr2011 side-by-side and comment about the accuracy of the prediction (e.g. mixing between cultivated crops, pasture, grassland and shrubs).

You may need to select more samples and use additional predictor variables. The choice of classifier also plays an important role.

Test the accuracy of the classified map using validation samples

Accuracy assessment of the classified map is required to test the quality of the product. Two widely reported measures in remote sensing community are overall accuracy and kappa statistics value. You can perform the accuracy assessment using the independent samples (validation2011).

First we will predict the classes using predictor variable in the validation2011 and cart model

# predict (see the position of the arguments is different from raster predictions)
prclass <- predict(cart, validation2011[,2:ncol(validation2011)], type='class')

# create a dataframe using the reference and prediction
conmat <- data.frame(prediction = prclass, reference = validation2011$classvalue)

# create the confusion matrix
conmat <- table(conmat)

# change the name of the classes
colnames(conmat) <- classdf$classnames
rownames(conmat) <- classdf$classnames
print(conmat)
##                     reference
## prediction           Water Developed Barren Forest Shrubland Herbaceous
##   Water                 30         1      1      0         0          0
##   Developed              4        21     12      1         2          3
##   Barren                 0         4      6      0         0          0
##   Forest                 1         4      2     22         8          0
##   Shrubland              1         1      0     11        24          1
##   Herbaceous             1         7     18      4         3         34
##   Planted/Cultivated     0         2      0      1         3          1
##   Wetlands               3         0      1      1         0          1
##                     reference
## prediction           Planted/Cultivated Wetlands
##   Water                               0        0
##   Developed                           6        3
##   Barren                              0        0
##   Forest                             10        8
##   Shrubland                           6        5
##   Herbaceous                          6        4
##   Planted/Cultivated                 10        5
##   Wetlands                            2       15

Exercise 5 Comment on the miss-classification between different LULC classes.

Exercise 6 List few suggestions to improve the accuracy.

Compute overall accuracy and Kappa statistics

# number of instances
n <- sum(conmat)

# Total number of validation samples
print(n)
## [1] 320

# Total number of classes
nc <- nrow(conmat)
print(nc)
## [1] 8

# number of correctly classified instances per class
diag <- diag(conmat)

# number of instances per class
rowsums <- apply(conmat, 1, sum)

# number of predictions per class
colsums <- apply(conmat, 2, sum)

# distribution of instances over the actual classes
p <- rowsums / n

# distribution of instances over the predicted classes
q <- colsums / n

# Overall Accuracy
OA <- sum(diag) / n

# Calculate Kappa statistics
expAccuracy <- sum(p*q)
kappa <- (OA - expAccuracy) / (1 - expAccuracy)

print(OA)
## [1] 0.50625
print(kappa)
## [1] 0.4357143

Compute producer and use accuracy

# Producer accuracy
PA <- diag / colsums

# User accuracy
UA <- diag / rowsums

outAcc <- data.frame(producerAccuracy = PA, userAccuracy = UA)

print(outAcc)
##                    producerAccuracy userAccuracy
## Water                         0.750    0.9375000
## Developed                     0.525    0.4038462
## Barren                        0.150    0.6000000
## Forest                        0.550    0.4000000
## Shrubland                     0.600    0.4897959
## Herbaceous                    0.850    0.4415584
## Planted/Cultivated            0.250    0.4545455
## Wetlands                      0.375    0.6521739

Exercise 7 Please perform the classification using Random Forest classifiers from the package `randomForest <https://cran.r-project.org/web/packages/randomForest/index.html>`__. For help see this discussion.

Exercise 8 Plot the results of rpart and ‘rf` classifier side-by-side.

Exercise (optional) Please repeat the steps for the year 2001 using rpart. You will use cloud-free composite image from Landsat 7 with 6-bands. The reference data will be the National Land Cover Database 2001 (NLCD 2001) for the subset of the Central Valley regions.

Exercise(optional) You have been training your classifiers using 160 samples for each class. Now we want to see the effect of sample size on classification. Please repeat the steps with different subset of the sample size (e.g. 120, 100). Use the same holdout samples for accuracy assessment.

R packages

Here is a list of some R packages for analyzing remote sensing data:

-Raster visualization: rasterVis