# 6. Local regression¶

Regression models are typically “global”. That is, all date are used simultaneously to fit a single model. In some cases it can make sense to fit more flexible “local” models. Such models exist in a general regression framework (e.g. generalized additive models), where “local” refers to the values of the predictor values. In a spatial context local refers to location. Rather than fitting a single regression model, it is possible to fit several models, one for each location (out of possibly very many) locations. This technique is sometimes called “geographically weighted regression” (GWR). GWR is a data exploration technique that allows to understand changes in importance of different variables over space (which may indicate that the model used is misspecified and can be improved).

There are two examples here. One short example with California precipitation data, and than a more elaborate example with house price data.

## Get the data¶

```
# Make sure the data directory exists if not make it
dir.create('data', showWarnings = FALSE)
files <- c('precipitation.csv', 'counties.rds', 'houses1990.csv')
for (filename in files) {
localfile <- file.path('data', filename)
if (!file.exists(localfile)) {
download.file(paste0('https://biogeo.ucdavis.edu/data/rspatial/', filename), dest=localfile)
}
}
```

### California precipitation¶

Here is an example of GWR with California precipitation data. Get the data with the script above, or download it here: (precipitation data](https://biogeo.ucdavis.edu/data/rspatial/precipitation.csv) and counties.

```
cts <- readRDS('data/counties.rds')
## Warning in readRDS("data/counties.rds"): invalid or incomplete compressed
## data
## Error in readRDS("data/counties.rds"): error reading from connection
p <- read.csv('data/precipitation.csv')
head(p)
## ID NAME LAT LONG ALT JAN FEB MAR APR MAY JUN
## 1 ID741 DEATH VALLEY 36.47 -116.87 -59 7.4 9.5 7.5 3.4 1.7 1.0
## 2 ID743 THERMAL/FAA AIRPORT 33.63 -116.17 -34 9.2 6.9 7.9 1.8 1.6 0.4
## 3 ID744 BRAWLEY 2SW 32.96 -115.55 -31 11.3 8.3 7.6 2.0 0.8 0.1
## 4 ID753 IMPERIAL/FAA AIRPORT 32.83 -115.57 -18 10.6 7.0 6.1 2.5 0.2 0.0
## 5 ID754 NILAND 33.28 -115.51 -18 9.0 8.0 9.0 3.0 0.0 1.0
## 6 ID758 EL CENTRO/NAF 32.82 -115.67 -13 9.8 1.6 3.7 3.0 0.4 0.0
## JUL AUG SEP OCT NOV DEC
## 1 3.7 2.8 4.3 2.2 4.7 3.9
## 2 1.9 3.4 5.3 2.0 6.3 5.5
## 3 1.9 9.2 6.5 5.0 4.8 9.7
## 4 2.4 2.6 8.3 5.4 7.7 7.3
## 5 8.0 9.0 7.0 8.0 7.0 9.0
## 6 3.0 10.8 0.2 0.0 3.3 1.4
plot(cts)
## Error in plot(cts): object 'cts' not found
points(p[,c('LONG', 'LAT')], col='red', pch=20)
## Error in plot.xy(xy.coords(x, y), type = type, ...): plot.new has not been called yet
```

Compute annual average precipitation

```
p$pan <- rowSums(p[,6:17])
```

Global regression model

```
m <- lm(pan ~ ALT, data=p)
m
##
## Call:
## lm(formula = pan ~ ALT, data = p)
##
## Coefficients:
## (Intercept) ALT
## 523.60 0.17
```

Create `Spatial*`

objects with a planar crs.

```
alb <- CRS("+proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")
## Error in CRS("+proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"): could not find function "CRS"
sp <- p
coordinates(sp) = ~ LONG + LAT
## Error in coordinates(sp) = ~LONG + LAT: could not find function "coordinates<-"
crs(sp) <- "+proj=longlat +datum=NAD83"
## Error in crs(sp) <- "+proj=longlat +datum=NAD83": could not find function "crs<-"
spt <- spTransform(sp, alb)
## Error in spTransform(sp, alb): could not find function "spTransform"
ctst <- spTransform(cts, alb)
## Error in spTransform(cts, alb): could not find function "spTransform"
```

Get the optimal bandwidth

```
library( spgwr )
## Loading required package: sp
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source'))`
## NOTE: This package does not constitute approval of GWR
## as a method of spatial analysis; see example(gwr)
bw <- gwr.sel(pan ~ ALT, data=spt)
## Error in is(data, "Spatial"): object 'spt' not found
bw
## Error in eval(expr, envir, enclos): object 'bw' not found
```

Create a regular set of points to estimate parameters for.

```
r <- raster(ctst, res=10000)
## Error in raster(ctst, res = 10000): could not find function "raster"
r <- rasterize(ctst, r)
## Error in rasterize(ctst, r): could not find function "rasterize"
newpts <- rasterToPoints(r)
## Error in rasterToPoints(r): could not find function "rasterToPoints"
```

Run the `gwr`

function

```
g <- gwr(pan ~ ALT, data=spt, bandwidth=bw, fit.points=newpts[, 1:2])
## Error in is(data, "SpatialPolygonsDataFrame"): object 'spt' not found
g
## Error in eval(expr, envir, enclos): object 'g' not found
```

Link the results back to the raster

```
slope <- r
## Error in eval(expr, envir, enclos): object 'r' not found
intercept <- r
## Error in eval(expr, envir, enclos): object 'r' not found
slope[!is.na(slope)] <- g$SDF$ALT
## Error in eval(expr, envir, enclos): object 'g' not found
intercept[!is.na(intercept)] <- g$SDF$'(Intercept)'
## Error in eval(expr, envir, enclos): object 'g' not found
s <- stack(intercept, slope)
## Error in stack(intercept, slope): object 'intercept' not found
names(s) <- c('intercept', 'slope')
## Error in names(s) <- c("intercept", "slope"): object 's' not found
plot(s)
## Error in plot(s): object 's' not found
```

### California House Price Data¶

We will use house prices data from the 1990 census, taken from “Pace, R.K. and R. Barry, 1997. Sparse Spatial Autoregressions. Statistics and Probability Letters 33: 291-297.” You can download the data here

```
houses <- read.csv("data/houses1990.csv")
dim(houses)
## [1] 20640 9
head(houses)
## houseValue income houseAge rooms bedrooms population households latitude
## 1 452600 8.3252 41 880 129 322 126 37.88
## 2 358500 8.3014 21 7099 1106 2401 1138 37.86
## 3 352100 7.2574 52 1467 190 496 177 37.85
## 4 341300 5.6431 52 1274 235 558 219 37.85
## 5 342200 3.8462 52 1627 280 565 259 37.85
## 6 269700 4.0368 52 919 213 413 193 37.85
## longitude
## 1 -122.23
## 2 -122.22
## 3 -122.24
## 4 -122.25
## 5 -122.25
## 6 -122.25
```

Each record represents a census “blockgroup”. The longitude and latitude of the centroids of each block group are available. We can use that to make a map and we can also use these to link the data to other spatial data. For example to get county-membership of each block group. To do that, let’s first turn this into a SpatialPointsDataFrame to find out to which county each point belongs.

```
library(sp)
coordinates(houses) <- ~longitude+latitude
```

```
plot(houses, cex=0.5, pch=1, axes=TRUE)
```

Now get the county boundaries and assign CRS of the houses data matches that of the counties (because they are both in longitude/latitude!).

```
library(raster)
counties <- readRDS("data/counties.rds")
## Warning in readRDS("data/counties.rds"): invalid or incomplete compressed
## data
## Error in readRDS("data/counties.rds"): error reading from connection
crs(houses) <- crs(counties)
## Error in crs(counties): object 'counties' not found
```

Do a spatial query (points in polygon)

```
cnty <- over(houses, counties)
## Error in over(houses, counties): object 'counties' not found
head(cnty)
## Error in head(cnty): object 'cnty' not found
```

### Summarize¶

We can summarize the data by county. First combine the extracted county data with the original data.

```
hd <- cbind(data.frame(houses), cnty)
## Error in cbind(data.frame(houses), cnty): object 'cnty' not found
```

Compute the population by county

```
totpop <- tapply(hd$population, hd$NAME, sum)
## Error in tapply(hd$population, hd$NAME, sum): object 'hd' not found
totpop
## Error in eval(expr, envir, enclos): object 'totpop' not found
```

Income is harder because we have the median household income by blockgroup. But it can be approximated by first computing total income by blockgroup, summing that, and dividing that by the total number of households.

```
# total income
hd$suminc <- hd$income * hd$households
## Error in eval(expr, envir, enclos): object 'hd' not found
# now use aggregate (similar to tapply)
csum <- aggregate(hd[, c('suminc', 'households')], list(hd$NAME), sum)
## Error in aggregate(hd[, c("suminc", "households")], list(hd$NAME), sum): object 'hd' not found
# divide total income by number of housefholds
csum$income <- 10000 * csum$suminc / csum$households
## Error in eval(expr, envir, enclos): object 'csum' not found
# sort
csum <- csum[order(csum$income), ]
## Error in eval(expr, envir, enclos): object 'csum' not found
head(csum)
## Error in head(csum): object 'csum' not found
tail(csum)
## Error in tail(csum): object 'csum' not found
```

### Regression¶

Before we make a regression model, let’s first add some new variables that we might use, and then see if we can build a regression model with house price as dependent variable. The authors of the paper used a lot of log tranforms, so you can also try that.

```
hd$roomhead <- hd$rooms / hd$population
## Error in eval(expr, envir, enclos): object 'hd' not found
hd$bedroomhead <- hd$bedrooms / hd$population
## Error in eval(expr, envir, enclos): object 'hd' not found
hd$hhsize <- hd$population / hd$households
## Error in eval(expr, envir, enclos): object 'hd' not found
```

Ordinary least squares regression:

```
# OLS
m <- glm( houseValue ~ income + houseAge + roomhead + bedroomhead + population, data=hd)
## Error in is.data.frame(data): object 'hd' not found
summary(m)
##
## Call:
## lm(formula = pan ~ ALT, data = p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -638.4 -281.2 -115.7 187.4 1793.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 523.60251 26.50338 19.756 < 2e-16 ***
## ALT 0.16997 0.03505 4.849 1.7e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 425.4 on 454 degrees of freedom
## Multiple R-squared: 0.04925, Adjusted R-squared: 0.04715
## F-statistic: 23.52 on 1 and 454 DF, p-value: 1.704e-06
coefficients(m)
## (Intercept) ALT
## 523.6025073 0.1699685
```

### Geographicaly Weighted Regression¶

## By county¶

Of course we could make the model more complex, with e.g. squared income, and interactions. But let’s see if we can do Geographically Weighted regression. One approach could be to use counties.

First I remove records that were outside the county boundaries

```
hd2 <- hd[!is.na(hd$NAME), ]
## Error in eval(expr, envir, enclos): object 'hd' not found
```

Then I write a function to get what I want from the regression (the coefficients in this case)

```
regfun <- function(x) {
dat <- hd2[hd2$NAME == x, ]
m <- glm(houseValue~income+houseAge+roomhead+bedroomhead+population, data=dat)
coefficients(m)
}
```

And now run this for all counties using sapply:

```
countynames <- unique(hd2$NAME)
## Error in unique(hd2$NAME): object 'hd2' not found
res <- sapply(countynames, regfun)
## Error in lapply(X = X, FUN = FUN, ...): object 'countynames' not found
```

Plot of a single coefficient

```
dotchart(sort(res['income', ]), cex=0.65)
## Error in res["income", ]: object of type 'closure' is not subsettable
```

There clearly is variation in the coefficient (\(beta\)) for income. How does this look on a map?

First make a data.frame of the results

```
resdf <- data.frame(NAME=colnames(res), t(res))
## Error in t.default(res): argument is not a matrix
head(resdf)
## Error in head(resdf): object 'resdf' not found
```

Fix the counties object. There are too many counties because of the presence of islands. I first aggregate (‘dissolve’ in GIS-speak’) the counties such that a single county becomes a single (multi-)polygon.

```
dim(counties)
## Error in eval(expr, envir, enclos): object 'counties' not found
dcounties <- aggregate(counties, vars='NAME')
## Error in aggregate(counties, vars = "NAME"): object 'counties' not found
dim(dcounties)
## Error in eval(expr, envir, enclos): object 'dcounties' not found
```

Now we can merge this SpatialPolygonsDataFrame with data.frame with the regression results.

```
cnres <- merge(dcounties, resdf, by='NAME')
## Error in merge(dcounties, resdf, by = "NAME"): object 'dcounties' not found
spplot(cnres, 'income')
## Error in spplot(cnres, "income"): object 'cnres' not found
```

To show all parameters in a ‘conditioning plot’, we need to first scale the values to get similar ranges.

```
# a copy of the data
cnres2 <- cnres
## Error in eval(expr, envir, enclos): object 'cnres' not found
# scale all variables, except the first one (county name)
# assigning values to a "@data" slot is risky, but (I think) OK here
cnres2@data = data.frame(scale(data.frame(cnres)[, -1]))
## Error in data.frame(cnres): object 'cnres' not found
spplot(cnres2)
## Error in spplot(cnres2): object 'cnres2' not found
```

Is this just random noise, or is there spatial autocorrelation?

```
library(spdep)
## Loading required package: Matrix
nb <- poly2nb(cnres)
## Error in extends(class(pl), "SpatialPolygons"): object 'cnres' not found
plot(cnres)
## Error in plot(cnres): object 'cnres' not found
plot(nb, coordinates(cnres), add=T, col='red')
## Error in plot(nb, coordinates(cnres), add = T, col = "red"): object 'nb' not found
lw <- nb2listw(nb)
## Error in nb2listw(nb): object 'nb' not found
moran.test(cnres$income, lw)
## Error in moran.test(cnres$income, lw): object 'lw' not found
moran.test(cnres$roomhead, lw, na.action=na.omit)
## Error in moran.test(cnres$roomhead, lw, na.action = na.omit): object 'lw' not found
```

## By grid cell¶

An alternative approach would be to compute a model for grid cells. Let’s use the ‘Teale Albers’ projection (often used when mapping the entire state of California).

```
TA <- CRS("+proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000
+datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0")
countiesTA <- spTransform(counties, TA)
## Error in spTransform(counties, TA): object 'counties' not found
```

Create a RasteLayer using the extent of the counties, and setting an arbitrary resolution of 50 by 50 km cells

```
library(raster)
r <- raster(countiesTA)
## Error in raster(countiesTA): object 'countiesTA' not found
res(r) <- 50000
## Error in res(r) <- 50000: object 'r' not found
```

Get the xy coordinates for each raster cell:

```
xy <- xyFromCell(r, 1:ncell(r))
## Error in xyFromCell(r, 1:ncell(r)): object 'r' not found
```

For each cell, we need to select a number of observations, let’s say within 50 km of the center of each cell (thus the data that are used in different cells overlap). And let’s require at least 50 observations to do a regression.

First transform the houses data to Teale-Albers

```
housesTA <- spTransform(houses, TA)
## Error in spTransform(xSP, CRSobj, ...): No transformation possible from NA reference system
crds <- coordinates(housesTA)
## Error in coordinates(housesTA): object 'housesTA' not found
```

Set up a new regression function.

```
regfun2 <- function(d) {
m <- glm(houseValue~income+houseAge+roomhead+bedroomhead+population, data=d)
coefficients(m)
}
```

Run the model for al cells if there are at least 50 observations within a radius of 50 km.

```
res <- list()
for (i in 1:nrow(xy)) {
d <- sqrt((xy[i,1]-crds[,1])^2 + (xy[i,2]-crds[,2])^2)
j <- which(d < 50000)
if (length(j) > 49) {
d <- hd[j,]
res[[i]] <- regfun2(d)
} else {
res[[i]] <- NA
}
}
## Error in nrow(xy): object 'xy' not found
```

For each cell get the income coefficient:

```
inc <- sapply(res, function(x) x['income'])
```

Use these values in a RasterLayer

```
rinc <- setValues(r, inc)
## Error in setValues(r, inc): object 'r' not found
plot(rinc)
## Error in plot(rinc): object 'rinc' not found
plot(countiesTA, add=T)
## Error in plot(countiesTA, add = T): object 'countiesTA' not found
Moran(rinc)
## Error in Moran(rinc): object 'rinc' not found
```

So that was a lot of ‘home-brew-GWR’.

**Question 1**: *Can you comment on weaknesses (and perhaps strengths)
of the approaches I have shown?*

**Question 2**: *Can you do it the easier and more professional way for
these data, using the spgwr package?*

### spgwr package¶

Now use the spgwr package (and the the `gwr`

function) to fit the
model. You can do this with all data, as long as you supply and argument
`fit.points`

(to avoid estimating a model for each observation point.
You can use a raster similar to the one I used above (perhaps
disaggregate with a factor 2 first).

This is how you can get the points to use:

Create a RasterLayer with the correct extent

```
r <- raster(countiesTA)
## Error in raster(countiesTA): object 'countiesTA' not found
```

Set to a desired resolution. I choose 25 km

```
res(r) <- 25000
## Error in res(r) <- 25000: object 'r' not found
```

I only want cells inside of CA, so I add some more steps.

```
ca <- rasterize(countiesTA, r)
## Error in rasterize(countiesTA, r): object 'countiesTA' not found
```

Extract the coordinates that are not `NA`

.

```
fitpoints <- rasterToPoints(ca)
## Error in nlayers(x): object 'ca' not found
```

I don’t want the third column

```
fitpoints <- fitpoints[,-3]
## Error in eval(expr, envir, enclos): object 'fitpoints' not found
```

Now specificy the model

```
gwr.model <- ______
```

`gwr`

returns a list-like object that includes (as first element) a
`SpatialPointsDataFrame`

that has the model coeffients. Plot these
using `spplot`

, and after that, transfer them to a `RasterBrick`

object.

To extract the SpatialPointsDataFrame:

```
sp <- gwr.model$SDF
spplot(sp)
```

To reconnect these values to the raster structure (etc.)

```
cells <- cellFromXY(r, fitpoints)
dd <- as.matrix(data.frame(sp))
b <- brick(r, values=FALSE, nl=nrow(dd))
b[cells] <- dd
names(b) <- colnames(dd)
plot(b)
```

**Question 3**: *Briefly comment on the results and the differences (if
any) with the two home-brew examples.*