# 9. Satellite data analysis¶

By Ani Ghosh

## Introduction¶

This page provides a short introduction to satellite data analysis with R. Before reading this you should first learn the basics of the raster package.

Getting satellite images for a specific project remains a one of the most challenging steps in the workflow. You have to find the data most suitable for you particular objective. A few important properties to consider while searching the remotely sensed (satellite) data include:

1. Spatial resolution (pixel size)
2. Temporal resolution (return time; availability of histrotical images, and for a particular moment in time)
4. Quality (e.g. cloud-cover or artifacts in data (read about problems in Landsat ETM+)

There are numerous sources of remotely sensed data from satellites. Generally, the very high spatial resolution data is available as (costly) commericial products. Lower spatial resolution data is freely available from NASA and ESA. In this this tutorial we’ll use freely available Sentinel, Landsat 8 and MODIS data.

You can access these data from several sources, including:

This web site lists 15 sources of freely available remote sensing data.

You can also use the MODISTools package to access different satellite data.

## Basic image manipulation and visualization¶

In this section we describe how to access remote sensing images with R. All the images used here are in a raster format. Download the example data here.

### Read image and display basic image properties¶

All these products consist of sensors that measure reflectance of the sun’s radiation in different wavelengths (e.g. in the red, green, or blue wavelengths). Thus one image consists of multiple spectral layers. In remote sensing jargon, such layers are referred to as “bands” (shorthand for “bandwidth”” in the electromagnetic spectrum). From such multi-spectral data you can create a RasterBrick or RasterStack in R (as long as each layer has the same extent and resolution).

Another piece of jargon is “pixel”. This is the term that is used for grid cell in remote sensing literature.

library(raster)
# get Landsat data
r <- brick('rsdata/landsat8-2016march.tif')

# get Sentinel data
s <- brick('rsdata/sentinel.tif')
s
## class       : RasterBrick
## dimensions  : 1934, 1786, 3454124, 10  (nrow, ncol, ncell, nlayers)
## resolution  : 10, 10  (x, y)
## extent      : 251900.9, 269760.9, -382182.3, -362842.3  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=37 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : e:\bitbucket\rweb\source\analysis\_R\rsdata\sentinel.tif
## names       : sentinel.1, sentinel.2, sentinel.3, sentinel.4, sentinel.5, sentinel.6, sentinel.7, sentinel.8, sentinel.9, sentinel.10
## min values  :        576,        401,        235,        307,        464,        489,        471,        371,        149,          89
## max values  :       6398,       6279,       6372,       5482,       6586,       7242,      11990,       7095,       7738,        5612


The last argument prints the properties of RasterBrick s that was made from file sentinel.tif. You can see the spatial resolution, extent, number of layers, coordinate reference system and more.

### Image information and statistics¶

The below shows how you can access various properties from the RasterBrick (this is the same for any raster data set).

nlayers(s)
## [1] 10

# coordinate reference system (CRS)
crs(s)
## CRS arguments:
##  +proj=utm +zone=37 +datum=WGS84 +units=m +no_defs +ellps=WGS84
## +towgs84=0,0,0

# Number of rows, columns, or cells
ncell(s)
## [1] 3454124
dim(s)
## [1] 1934 1786   10

# spatial resolution
res(r)
## [1] 30 30
res(s)
## [1] 10 10


You can see the effect of differences in spatial resolution. Sentinel image has a 10m resolution whereas the resolution of Landsat is 30m.

### Visualize single and multi-band imagery¶

You can plot indivudal layers of a multi-spectral image, but they are often combined to create more interesting plots. To combine three bands, we can use plotRGB. To make a “true color” image (that is, something that looks like a normal photograph), we need to select the bands that we want to render in the red, green and blue regions. For this Landsat image, r = 3 (red), g = 2(green), b = 1(blue) will plot the true color composite (vegetation in green, water blue etc). You can also supply additional arguments to plotRGB to improve the visualization (e.g. a linear stretch of the values, using strecth = "lin").

plotRGB(r, r = 3, g = 2, b = 1, axes = TRUE, stretch = "lin",
main = "Landsat True Color Composite")


Selecting r = 4 (NIR), g = 3 (red), b = 2(green) will plot a “false color”” composite. This representation is popular as it makes it easy to see the vegetation (in red). You can find more about the visualization here.

plotRGB(r, r = 4, g = 3, b = 2, axes = TRUE, stretch = "lin",
main = "Landsat False Color Composite")


The same for the Sentinel image, using the layout function to both reprentations next to each other:

nf <- layout(matrix(c(1,0,2), 1, 3, byrow = TRUE), width = c(1,0.2,1), respect = TRUE)

plotRGB(s, r = 3, g = 2, b = 1, axes = TRUE, stretch = "lin",
main = "Senitnel True Color Composite")

plotRGB(s, r = 7, g = 3, b = 2, axes = TRUE, stretch = "lin",
main = "Sentinel False Color Composite")


### Subset and rename spectral bands¶

You can select specific layers (bands) using subset function, or via indexing.

# select first 3 bands only
rsub <- subset(r, 1:3)

# same result
rsub <- r[[1:3]]

# Check the number of bands in orginal and new data
nlayers(r)
## [1] 6
nlayers(rsub)
## [1] 3


Set the names of the bands using the following:

# For LANDSAT
names(r)
## [1] "landsat8.2016march.1" "landsat8.2016march.2" "landsat8.2016march.3"
## [4] "landsat8.2016march.4" "landsat8.2016march.5" "landsat8.2016march.6"
names(r) <- c('blue','green','red','NIR','SWIR1','SWIR2')
names(r)
## [1] "blue"  "green" "red"   "NIR"   "SWIR1" "SWIR2"

# For SENTINEL
names(s)
##  [1] "sentinel.1"  "sentinel.2"  "sentinel.3"  "sentinel.4"  "sentinel.5"
##  [6] "sentinel.6"  "sentinel.7"  "sentinel.8"  "sentinel.9"  "sentinel.10"
names(s) <- c('blue','green','red','rededge1','rededge2',
'rededge3','NIR','rededge4','SWIR1','SWIR2')


### Spatial subset¶

Spatial subsetting can be used to limit analysis to a geographic subset of the image. Spatial subsets can be created wtith the crop function, using an extent object, or another spatial object from which an Extent can be extracted/created,.

# Using extent
extent(r)
## class       : Extent
## xmin        : 251940
## xmax        : 269730
## ymin        : -382140
## ymax        : -362880
e <- extent(255000, 260000,  -375000,  -370000)

# crop LANDSAT
rr <- crop(r, e)

# crop SENTINEL
ss <- crop(s, e)

# Now plot the subsets side-by-side to compare the resolution
nf <- layout(matrix(c(1,0,2), 1, 3, byrow = TRUE), width = c(1,0.2,1), respect = TRUE)
plotRGB(rr, r = 4, g = 3, b = 2, axes = TRUE, stretch = "lin", main = "Landsat Subset")
plotRGB(ss, r = 7, g = 3, b = 2, axes = TRUE, stretch = "lin", main = "Sentinel Subset")


Question 1a:Interactive selection from the image is also possible. Use drawExtent and drawPoly to select an area of interest.

Question 1b:Show how you can use the data in the boundary.rds file to spatially subset the Landsat and Sentinel data and plot them.

### Extract raster values¶

Often we require value(s) of raster cell(s) for a geographic location/area. The extract function is used to get raster values at the locations of other spatial data. You can use points, lines, polygons or an Extent (rectangle) object. You can also use cell numbers to extract values.If using points, extract returns the values of a Raster* object for the cells in which a set of points fall.

Below, we extract values of Sentinel data for a set of samples. These land cover at the sample locations have been classified into different classes including cloud, forest, crop, fallow, built-up, open-soil, water and grassland.

# extract values with points
df <- extract(s, samp)

# To see some of the reflectance values
##      blue green  red rededge1 rededge2 rededge3  NIR rededge4 SWIR1 SWIR2
## [1,] 1531  1466 1399     1428     2236     2477 2626     2698  1831  1083
## [2,] 2735  2643 2737     2908     3597     3957 3979     4163  3667  2560
## [3,] 2487  2424 2505     2699     3583     3971 3929     4230  3371  2221
## [4,] 2262  2195 2176     2424     3416     3811 3733     4067  3254  2198
## [5,] 2299  2219 2194     2381     3268     3622 3632     3860  2988  1980
## [6,] 1962  2012 2078     2366     3553     4056 3810     4398  3109  1949


### Spectral profiles¶

A plot of the spectrum (all bands) for pixels representing a certain earth surface features (e.g. forest) is known as a spectral profile. Such profiles demonstrate the differences in spectral properties of various earth surface features and constitute the basis for image analysis. Spectral values can be extracted from any multispectral data set using extract function.

First we compute the mean reflectance values for each class and each band.

ms <- aggregate(df, list(samp$class), mean) head(ms) ## Group.1 blue green red rededge1 rededge2 rededge3 ## 1 built-up 1479.9531 1449.4844 1449.6406 1580.3281 2280.156 2597.688 ## 2 cloud 1928.1081 1855.0270 1823.8378 2057.6757 3023.108 3410.189 ## 3 crop 920.7736 892.0000 714.0189 1086.5849 2147.962 2530.245 ## 4 fallow 960.8980 850.9796 896.5918 991.9592 1257.714 1434.490 ## 5 forest 754.7959 656.6531 410.6735 705.5306 1906.429 2356.918 ## 6 grassland 855.8696 844.6667 596.0290 1010.8116 2338.812 2836.594 ## NIR rededge4 SWIR1 SWIR2 ## 1 2477.031 2786.531 2222.562 1470.4844 ## 2 3331.703 3642.703 2643.676 1663.2703 ## 3 2475.811 2815.019 1908.340 1048.3774 ## 4 1421.184 1631.245 1990.837 1553.7755 ## 5 2311.816 2580.857 1178.163 506.5102 ## 6 2797.203 3176.652 1872.058 899.0725 # instead of the first column, we use rownames rownames(ms) <- ms[,1] ms <- ms[,-1] ms ## blue green red rededge1 rededge2 rededge3 ## built-up 1479.9531 1449.4844 1449.6406 1580.3281 2280.1562 2597.688 ## cloud 1928.1081 1855.0270 1823.8378 2057.6757 3023.1081 3410.189 ## crop 920.7736 892.0000 714.0189 1086.5849 2147.9623 2530.245 ## fallow 960.8980 850.9796 896.5918 991.9592 1257.7143 1434.490 ## forest 754.7959 656.6531 410.6735 705.5306 1906.4286 2356.918 ## grassland 855.8696 844.6667 596.0290 1010.8116 2338.8116 2836.594 ## open-soil 1090.7714 963.8000 931.9429 1014.1714 1163.7429 1294.400 ## water 784.2245 609.4898 449.5306 593.8776 984.3469 1136.408 ## NIR rededge4 SWIR1 SWIR2 ## built-up 2477.0312 2786.531 2222.5625 1470.4844 ## cloud 3331.7027 3642.703 2643.6757 1663.2703 ## crop 2475.8113 2815.019 1908.3396 1048.3774 ## fallow 1421.1837 1631.245 1990.8367 1553.7755 ## forest 2311.8163 2580.857 1178.1633 506.5102 ## grassland 2797.2029 3176.652 1872.0580 899.0725 ## open-soil 1233.8857 1392.486 989.0571 640.1429 ## water 931.7551 1216.878 774.0816 407.4694  Now we will plot the mean spectra of these features. # Create a vector of color for the land cover classes for use in plotting mycolor <- c('cyan', 'darkgreen', 'yellow', 'burlywood', 'darkred', 'darkgray', 'blue', 'lightgreen') #transform ms from a data.frame to a matrix ms <- as.matrix(ms) # First create an empty plot plot(0, ylim=c(300, 4000), xlim = c(1,10), type='n', xlab="Bands", ylab = "Reflectance") # add the different classes for (i in 1:nrow(ms)){ lines(ms[i,], type = "l", lwd = 3, lty = 1, col = mycolor[i]) } # Title title(main="Spectral Profile from Sentinel", font.main = 2) # Legend legend("topleft", rownames(ms), cex=0.8, col=mycolor, lty = 1, lwd =3, bty = "n")  The spectral profile shows (dis)similarity in the reflectance of different features on the earth’s surface (or above it). Clouds are generally bright and highly reflective in all wavelengths. Crop and forest show similar spectral signatures; also compare built-up and open areas. Water shows relatively low reflection in all wavelengths. Question 2:Make a similar spectral plot for the Landsat data. ## Basic mathematical operations¶ The raster package supports many mathematical operations. Math operations are generally performed per pixel. First we will learn about basic arithmetic operations on bands. First example is a custom math function that calculates the Normalized Difference Vegetation Index (NDVI). Learn more about [vegetation indices here] (http://www.un-spider.org/links-and-resources/data-sources/daotm/daotm-vegetation) and NDVI. ### Compute vegetation indices¶ Let’s define a general function for ratio based vegetation index. # i and k are the index of bands to be used for the indices computation NDVI <- function(img, i, k) { bi <- img[[i]] bk <- img[[k]] vi <- (bk - bi) / (bk + bi) return(vi) } # For Sentinel NIR = 7, red = 3. ndvi <- NDVI(ss, 3, 7) plot(ndvi, col = rev(terrain.colors(30)), main = 'NDVI from Sentinel')  You can see the variation in greenness from the plot. Question 3:Repeat this steps for Landsat (For the current Landsat data, band 3 is red and band 4 is NIR) Question 4:Adapt the function to compute indices which will highlight i) water and ii) built-up water. Hint: Use the spectral profile plot to find the bands having maximum and minimum reflectance. ### Thresholding¶ We can apply basic rules to get an estimate of spatial extent of different Earth surface features. Note that NDVI values are standardized and ranges between -1 to +1. Higher values indicate more green cover. Pixels having NDVI values greater than 0.4 are definitely vegetation. Following operation masks all non-vegetation pixels. veg <- calc(ndvi, function(x){x[x < 0.4] <- NA; return(x)}) plot(veg, main = 'Veg cover')  You can also create classes for different density of vegetation, 0 being no-vegetation. vegc <- reclassify(veg, c(-Inf,0.2,0, 0.2,0.3,1, 0.3,0.4,2, 0.4,0.5,3, 0.5, Inf, 4)) plot(vegc, col = rev(terrain.colors(4)), main = 'NDVI based thresholding')  ### Principal component analysis¶ Multi-spectral data are sometimes transformed to helps to reduce the dimensioanlity and noise in the data. The principal components transform is a generic data reduction method that can be used tocreate a few uncorrelated bands from a larger set of correlated bands. You can calculate the same number of principal components as the number of input bands. The first principal component (PC) explains the largest percentage of variance and other PCs explain additional the variance in decreasing order. set.seed(1) sr <- sampleRandom(ss, 10000) plot(sr[,c(3,7)])  # This is known as vegetation and soil-line plot. Can you guess the directions of 'principal components' from this scatter plot? pca <- prcomp(sr, scale = TRUE) pca ## Standard deviations: ## [1] 2.46718321 1.73461211 0.75466268 0.35783219 0.29161083 0.20466101 ## [7] 0.16812151 0.16055789 0.13573994 0.08471236 ## ## Rotation: ## PC1 PC2 PC3 PC4 PC5 ## blue 0.3397870 -0.1927263 -0.517495944 -0.35140305 -0.20045639 ## green 0.3168024 -0.3034101 -0.401079517 0.10886749 0.21254058 ## red 0.3805442 -0.1486047 -0.186808662 -0.11675053 -0.11044006 ## rededge1 0.3210710 -0.3076848 0.126484619 0.74113763 0.16399764 ## rededge2 -0.2509983 -0.4427522 -0.010632905 0.23512424 -0.22176296 ## rededge3 -0.2921165 -0.3927223 -0.027903690 -0.05491684 -0.31540109 ## NIR -0.2754901 -0.3958718 -0.005442135 -0.31153923 0.78735095 ## rededge4 -0.2890068 -0.3964949 0.034365758 -0.08951118 -0.31666380 ## SWIR1 0.3058515 -0.2649567 0.594638570 -0.21607285 -0.07876173 ## SWIR2 0.3674010 -0.1402161 0.405881766 -0.30280752 -0.02222532 ## PC6 PC7 PC8 PC9 PC10 ## blue 0.41867617 0.43421072 -0.11210100 0.19775456 -0.02074192 ## green 0.16325548 -0.58882482 0.18143991 -0.42596218 0.04423700 ## red -0.80045853 -0.08672508 0.10484232 0.33600797 -0.01514088 ## rededge1 -0.00715952 0.45751819 0.01186256 -0.02109476 -0.01024757 ## rededge2 0.12631016 -0.38197140 -0.50925097 0.45632434 0.08348765 ## rededge3 -0.15079138 0.11509311 0.04094331 -0.35506110 -0.70244976 ## NIR -0.08342582 0.15620594 -0.02767537 0.13380250 -0.01880695 ## rededge4 -0.09264240 0.17990328 0.34191043 -0.21498489 0.66762908 ## SWIR1 0.30079729 -0.17599026 0.44257070 0.29166692 -0.16545966 ## SWIR2 -0.11206694 0.03186873 -0.60731367 -0.42733069 0.15301488 plot(pca) screeplot(pca)  pci <- predict(ss, pca, index = 1:2) spplot(pci, col.regions = rev(heat.colors(20)), cex=1, main = list(label="First 2 principal components from Sentinel"))  To learn more about the information contained in the vegetation and soil line plots read this paper by [Gitelson et al] (http://www.tandfonline.com/doi/abs/10.1080/01431160110107806#.V6hp_LgrKhd). An extension of PCA in remote sensing is known as Tasseled-cap Transformation. ## Image classification¶ We will explore two classification methods: unsupervised and supervised. Various unsupervised and supervised classification algorithms exist, and the choice of algoritm can affect the results. We will explore two k-means (unsupervised) and decision tree (supervised) algorithms. ### Unsupervised classification¶ In unsupervised classification, we don’t supply any training data. This particularly is useful when we don’t have prior knowledge of the study area. The algorithm groups pixels with similar spectral characteristics into unique clusters/classes/groups following some statistically determined conditions. You have to re-label and combine these spectral clusters into information classes (for e.g. land-use land-cover). Learn more about K-means and other un-supervised algorithms here. We will perform unsupervised classification of the ndvi layer generated previously. km <- kmeans(values(ss), centers=5, iter.max=500, nstart=3, algorithm="Lloyd") kmr <- setValues(ndvi, km$cluster)
plot(kmr, main = 'Unsupervised classification of Sentinel data')


To perform unsupervised classification, a good practice is to start with large number of centers (more clusters) and merge/group/recode similar clusters by inspecting the original imagery. Unsupervised algorithms are often referred as clustering.

### Supervised classification¶

In a supervised classification, we have prior knowledge about some of the land-cover types through, for example, fieldwork or interpretation of high resolution imagery (such as avaialble on Google maps). Specific sites in the study area that represent homogeneous examples of these known land-cover types are identified. These areas are commonly referred to as training sites because the spectral properties of these sites are used to train the classification algorithm. After that, the entire image is classified using the trained algorithm.

We will use a training site already collected through interpretation of high resolution image. The following example uses a Classification and Regression Trees (CART) classifier (Breiman et al. 1984) (further reading ) to predict different land cover classes around Arusha, Tanzania.

library(rpart)
# For training the model, we use the data used for plotting the spectra

df <- data.frame(samp$class, df) # Train the model model.class <- rpart(as.factor(samp.class)~., data = df, method = 'class') # Print the trained classification tree model.class ## n= 405 ## ## node), split, n, loss, yval, (yprob) ## * denotes terminal node ## ## 1) root 405 336 grassland (0.16 0.091 0.13 0.12 0.12 0.17 0.086 0.12) ## 2) green>=1155.5 101 37 built-up (0.63 0.37 0 0 0 0 0 0) ## 4) rededge1< 1759.5 73 11 built-up (0.85 0.15 0 0 0 0 0 0) * ## 5) rededge1>=1759.5 28 2 cloud (0.071 0.93 0 0 0 0 0 0) * ## 3) green< 1155.5 304 235 grassland (0 0 0.17 0.16 0.16 0.23 0.12 0.16) ## 6) green>=748 207 138 grassland (0 0 0.26 0.24 0.0048 0.33 0.17 0) ## 12) rededge3>=1940 123 54 grassland (0 0 0.42 0.0081 0.0081 0.56 0 0) ## 24) blue>=891 41 1 crop (0 0 0.98 0.024 0 0 0 0) * ## 25) blue< 891 82 13 grassland (0 0 0.15 0 0.012 0.84 0 0) ## 50) SWIR2< 744 7 1 crop (0 0 0.86 0 0.14 0 0 0) * ## 51) SWIR2>=744 75 6 grassland (0 0 0.08 0 0 0.92 0 0) * ## 13) rededge3< 1940 84 36 fallow (0 0 0.012 0.57 0 0 0.42 0) ## 26) SWIR1>=1288 51 3 fallow (0 0 0.02 0.94 0 0 0.039 0) * ## 27) SWIR1< 1288 33 0 open-soil (0 0 0 0 0 0 1 0) * ## 7) green< 748 97 48 water (0 0 0 0 0.49 0 0 0.51) ## 14) NIR>=1708 52 7 forest (0 0 0 0 0.87 0 0 0.13) ## 28) SWIR2< 654.5 45 0 forest (0 0 0 0 1 0 0 0) * ## 29) SWIR2>=654.5 7 0 water (0 0 0 0 0 0 0 1) * ## 15) NIR< 1708 45 3 water (0 0 0 0 0.067 0 0 0.93) * # Much cleaner way is to plot the trained classification tree plot(model.class, uniform=TRUE, main="Classification Tree") text(model.class, cex=.8)  # Now predict the subset data based on the model; prediction for entire area takes longer time pr <- predict(ss, model.class, type='class', progress = 'text') ## | | | 0% | |================ | 25% | |================================ | 50% | |================================================= | 75% | |=================================================================| 100% ## pr ## class : RasterLayer ## dimensions : 500, 500, 250000 (nrow, ncol, ncell) ## resolution : 10, 10 (x, y) ## extent : 255000.9, 260000.9, -375002.3, -370002.3 (xmin, xmax, ymin, ymax) ## coord. ref. : +proj=utm +zone=37 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 ## data source : in memory ## names : layer ## values : 1, 8 (min, max) ## attributes : ## ID value ## from: 1 built-up ## to : 8 water  Now plot the classification result library(rasterVis) pr <- ratify(pr) rat <- levels(pr)[[1]] rat$legend <- c("cloud","forest","crop","fallow","built-up","open-soil","water","grassland")
levels(pr) <- rat
levelplot(pr, maxpixels = 1e6,
col.regions = c("darkred","cyan","yellow","burlywood","darkgreen","lightgreen","darkgrey","blue"),
scales=list(draw=FALSE),
main = "Supervised Classification of Sentinel data")


If you know this region you will see that the result is not entirely satisfactory. For example, cropland is over-predicted and buildings with high reflection are wrongly labeled as cloud. One possible reason is that the training data is not adequate. Land cover classes have complex patterns with a lot of intermixing. So you have to carefully select a large number of samples. The choice of classifier also plays an important role.

Bonus Question:Repeat the supervised classification with a randomForest classifier.

Accuracy assessment of the classified map is required to test the quality of the product. Two widely reported measures in remote sensing community are overall accuracy and kappa statistics value. You can perform the accuracy assessment using hold-out samples.