Supervised Classification

Here we explore supervised classification for a simple land use land cover (LULC) mapping task. Various supervised classification algorithms exist, and the choice of algorithm can affect the results. Here we explore two related algorithms (CART and RandomForest).

In supervised classification, we have prior knowledge about some of the land-cover types through, for example, fieldwork, reference spatial data or interpretation of high resolution imagery (such as available on Google maps). Specific sites in the study area that represent homogeneous examples of these known land-cover types are identified. These areas are commonly referred to as training sites because the spectral properties of these sites are used to train the classification algorithm.

The following examples uses a Classification and Regression Trees (CART) classifier (Breiman et al. 1984) (further reading to predict land use land cover classes in the study area.

We will take the following steps:

  • Create sample sites used for classification

  • Extract cell values from Landsat data for the sample sites

  • Train the classifier using training samples

  • Classify the Landsat data using the trained model

  • Evaluate the accuracy of the model

Landsat data to classify

Here is our Landsat data.

library(terra)
# We read the 6 bands from the Landsat image we previously used
raslist <- paste0('data/rs/LC08_044034_20170614_B', 2:7, ".tif")
landsat <- rast(raslist)
names(landsat) <- c('blue', 'green', 'red', 'NIR', 'SWIR1', 'SWIR2')

Reference data

Training and/or validation data can come from a variety of sources. In this example, we use some training polygons we have already collected from other sources. We have already used this for making the spectral plots.There are 5 distinct classes – built,cropland,fallow,open and, water and we hope to find the pixels under this categroies based on our knowledge of training sites.

# load polygons with land use land cover information
samp <- readRDS("data/rs/lcsamples.rds")
# check the distribution of the polygons
plot(samp)
text(samp, samp$class)

image0

Next we generate random points within each polygons.

set.seed(1)
# generate point samples from the polygons
ptsamp <- spatSample(samp, 1000, method="random")
plot(ptsamp, "class")

image1

Alternatively, we can generate the training and validation sample sites using a reference land use land cover data. For example, the National Land Cover Database 2011 (NLCD 2011) is a land cover product for the United States. NLCD is a 30-m Landsat-based land cover database spanning 4 epochs (1992, 2001, 2006 and 2011). NLCD 2011 is based primarily on a decision-tree classification of circa 2011 Landsat data.

Detailes of the class mapped in NCLD 2011 can be found here (here)[https://www.mrlc.gov/nlcd11_leg.php]. It has two pairs of class values and names that correspond to the levels of land use and land cover classification system. These levels usually represent the level of complexity, level I being the simplest with broad land use land cover categories. Read this report by Anderson et al to learn more about this land use and land cover classification system.

nlcd <- rast('data/rs/nlcd-L1.tif')
names(nlcd) <- c("nlcd2001", "nlcd2011")
# The class names and colors for plotting
nlcdclass <- c("Water", "Developed", "Barren", "Forest", "Shrubland", "Herbaceous", "Cultivated", "Wetlands")
classdf <- data.frame(value = c(1,2,3,4,5,7,8,9), names = nlcdclass)
# Hex codes of colors
classcolor <- c("#5475A8", "#B50000", "#D2CDC0", "#38814E", "#AF963C", "#D1D182", "#FBF65D", "#C8E6F8")
nlcd2011 <- nlcd[[2]]
levels(nlcd2011) <- classdf
# plot the locations over the original nlcd raster
plot(nlcd2011, col=classcolor)
ptlonlat <- project(ptsamp, crs(nlcd2011))
points(ptlonlat)

image2

Generate sample sites from NLCD raster

# Set the random number generator to reproduce the results
set.seed(99)
# Sampling
samp2011 <- spatSample(nlcd2011, size = 500)
# Number of samples in each class
table(samp2011[,1])
##
##      Water  Developed     Barren     Forest  Shrubland Herbaceous Cultivated
##         43         58          3         25          7        106        231
##   Wetlands
##         27

Extract spectral values for the training sites

Once we have the training sites, we can extract the cell values from landsat layers. These band values will be the predictor variables and “class” from ptsamp will be the response variable.

# We use the x-y coordinates to extract the spectral values for the locations
xy <- as.matrix(geom(ptsamp)[,c('x','y')])
df <- extract(landsat, xy)[,-1]
# Quick check for the extracted values
head(df)
##        green        red        NIR       SWIR1       SWIR2
## 1 0.16633500 0.23427862 0.37628144 0.365156293 0.210380167
## 2 0.08446869 0.05525705 0.43509507 0.170433730 0.091083050
## 3 0.09722031 0.07436280 0.03638984 0.022814132 0.017912995
## 4 0.09967088 0.07987116 0.05991963 0.044348769 0.036758512
## 5 0.11767063 0.09841307 0.05887868 0.041507844 0.034264572
## 6 0.08416507 0.05955096 0.02077561 0.007850488 0.004901131
# combine lulc class information with extracted values
sampdata <- data.frame(class = ptsamp$class, df)

We often find classnames are provided as string labels (e.g. water, crop, vegetation) that need to be ‘relabelled’ to integer or factors if only string labels are supplied before using them as response variable in the classification. There are several approaches that could be used to convert these classes to integer codes. We can make a function that will reclassify the character strings representing land cover classes into integers based on the existing factor levels.

Train the classifier

Now we will train the classification algorithm using sampdata dataset.

library(rpart)
# Train the model
cartmodel <- rpart(as.factor(class)~., data = sampdata, method = 'class', minsplit = 5)

One of the primary reasons behind choosing cart model is to have a closer look at the classification model. Unlike other models, cart provides very simple way of inspecting and plotting the model structure.

# print trained model
print(cartmodel)
## n= 1000
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##  1) root 1000 566 water (0.069 0.091 0.16 0.25 0.43)
##    2) NIR>=0.08187716 566 316 open (0.12 0.16 0.28 0.44 0)
##      4) SWIR1< 0.2861418 286 148 fallow (0.18 0.32 0.48 0.017 0)
##        8) SWIR2< 0.1124984 91   0 cropland (0 1 0 0 0) *
##        9) SWIR2>=0.1124984 195  57 fallow (0.27 0 0.71 0.026 0)
##         18) green>=0.1252934 54   2 built (0.96 0 0 0.037 0) *
##         19) green< 0.1252934 141   3 fallow (0 0 0.98 0.021 0) *
##      5) SWIR1>=0.2861418 280  35 open (0.061 0 0.064 0.87 0)
##       10) SWIR2>=0.2630891 17   3 built (0.82 0 0.059 0.12 0) *
##       11) SWIR2< 0.2630891 263  20 open (0.011 0 0.065 0.92 0) *
##    3) NIR< 0.08187716 434   0 water (0 0 0 0 1) *
# Plot the trained classification tree
plot(cartmodel, uniform=TRUE, main="Classification Tree")
text(cartmodel, cex = 1)

image3

See ?rpart.control to set different parameters for building the model.

You can print/plot more about the cartmodel created in the previous example. E.g. you can use plotcp(cartmodel) to learn about the cost-complexity (cp argument in rpart).

Classify

Now that we have our trained classification model (cartmodel), we can use it to make predictions, that is, to classify all cells in the landsat5 SpatRaster.

Important The layer names in the SpatRaster should exactly match those that were used to train the model. This will be the case if the same SpatRaster object was used (via extract) to obtain the values to fit the model. Otherwise you need to specify the matching names.

classified <- predict(landsat, cartmodel, na.rm = TRUE)
classified
## class       : SpatRaster
## dimensions  : 1245, 1497, 5  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 594090, 639000, 4190190, 4227540  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 10N (EPSG:32610)
## source(s)   : memory
## names       :    built, cropland,    fallow,      open, water
## min values  : 0.000000,        0, 0.0000000, 0.0000000,     0
## max values  : 0.962963,        1, 0.9787234, 0.9239544,     1
plot(classified)

image4

Observe that there are 5 layers in the classified object, each of the layer representing the probability of a particular LULC class. Bewlow, we make a single layer SpatRaster showing, for each grid cell, the LULC class with the highest probability.

lulc <- which.max(classified)
lulc
## class       : SpatRaster
## dimensions  : 1245, 1497, 1  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 594090, 639000, 4190190, 4227540  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 10N (EPSG:32610)
## source(s)   : memory
## name        : which.max
## min value   :         1
## max value   :         5

To make a nice map, we make the raster categorical, using levels<-; and we provide custom colors.

cls <- c("built","cropland","fallow","open","water")
df <- data.frame(id = 1:5, class=cls)
levels(lulc) <- df
lulc
## class       : SpatRaster
## dimensions  : 1245, 1497, 1  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 594090, 639000, 4190190, 4227540  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 10N (EPSG:32610)
## source(s)   : memory
## categories  : class
## name        : class
## min value   : built
## max value   : water
mycolor <- c("darkred", "yellow", "burlywood", "cyan", "blue")
plot(lulc, col=mycolor)

image5

If you are not satisfied with the results, you can select more samples and use additional predictor variables to see if you can improve the classification. The choice of classifier (algorithm) also plays an important role. Next we show how to test the performance the classification model.

Model evaluation

This section discusses how to assess the accuracy of the model to get an idea of how accurate the classified map might be. Two widely used measures in remote sensing are “overall accuracy” and “kappa”. You can perform the accuracy assessment using the independent samples.

To evaluate any model, you can use k-fold cross-validation (you can also do single-fold). In this technique the data used to fit the model is split into k groups (typically 5 groups). In turn, one of the groups will be used for model testing, while the rest of the data is used for model training (fitting).

set.seed(99)
# number of folds
k <- 5
j <- sample(rep(1:k, each = round(nrow(sampdata))/k))
# or using `dismo` library to get equal number of samples
# library(dismo)
# j <- kfold(sampdata, k = 5, by=sampdata$classvalue)
table(j)
## j
##   1   2   3   4   5
## 200 200 200 200 200

Now we train and test the model five times, each time computing the predictions and storing that with the actual values in a list. Later we use the list to compute the final accuarcy.

x <- list()
for (k in 1:5) {
    train <- sampdata[j!= k, ]
    test <- sampdata[j == k, ]
    cart <- rpart(as.factor(class)~., data=train, method = 'class',
                  minsplit = 5)
    pclass <- predict(cart, test, na.rm = TRUE)
    # assign class to maximum probablity
    pclass <- apply(pclass, 1, which.max)
    # create a data.frame using the reference and prediction
    x[[k]] <- cbind(test$class, as.integer(pclass))
}

Now combine the five list elements into a single data.frame, using do.call and compute a confusion matrix.

y <- do.call(rbind, x)
y <- data.frame(y)
colnames(y) <- c('observed', 'predicted')
# confusion matrix
conmat <- table(y)
# change the name of the classes
colnames(conmat) <- classdf$classnames
rownames(conmat) <- classdf$classnames
print(conmat)
##         predicted
## observed [,1] [,2] [,3] [,4] [,5]
##     [1,]   65    0    1    3    0
##     [2,]    0   90    1    0    0
##     [3,]    1    1  136   18    0
##     [4,]    6    0    3  241    0
##     [5,]    0    0    0    0  434

Question 1:Comment on the miss-classification between different classes.

Question 2:Can you think of ways to to improve the accuracy.

Compute the overall accuracy and the “kappa” statistic.

Overall accuracy:

# number of total cases/samples
n <- sum(conmat)
n
## [1] 1000
# number of correctly classified cases per class
diag <- diag(conmat)
# Overall Accuracy
OA <- sum(diag) / n
OA
## [1] 0.966

Kappa:

# observed (true) cases per class
rowsums <- apply(conmat, 1, sum)
p <- rowsums / n
# predicted cases per class
colsums <- apply(conmat, 2, sum)
q <- colsums / n
expAccuracy <- sum(p*q)
kappa <- (OA - expAccuracy) / (1 - expAccuracy)
kappa
## [1] 0.9521732

Producer and user accuracy

# Producer accuracy
PA <- diag / colsums
# User accuracy
UA <- diag / rowsums
outAcc <- data.frame(producerAccuracy = PA, userAccuracy = UA)
outAcc
##   producerAccuracy userAccuracy
## 1        0.9027778    0.9420290
## 2        0.9890110    0.9890110
## 3        0.9645390    0.8717949
## 4        0.9198473    0.9640000
## 5        1.0000000    1.0000000

Question 3:Perform the classification using Random Forest classifiers from the ``randomForest`` package

Question 4:Plot the results of rpart and Random Forest classifier side-by-side.

Question 5 (optional):Repeat the steps for other years using Random Forest. For example you can use the cloud-free composite image data/centralvalley-2001LE7.tif. This data is collected by the Landsat 7 platform. You can use the National Land Cover Database 2001 (NLCD 2001) subset of the California Central Valley for generating training sites.

Question 6 (optional):We have trained the classifiers using unequal samples for each class. Investigate the effect of sample size on classification. Repeat the steps with different subsets, e.g. a sample size of 100, 50, 25 per class, and compare the results. Use the same holdout samples for model evaluation.