# Statistical models¶

There are many type of statistical models. Here we show how to make simple regression models with *R*. Other modeling approaches tend to use similar syntax.

The most common way to specify a regression model in *R* is by creating a formula. For example `y ~ x`

means `y`

is a function of `x`

. `y ~ a + b`

means that `y`

is a function of `a`

and `b`

.

Let’s use the cars data that come with *R*. This dataset has measurements on the distance needed to stop given the speed a car was driven when the driver stepped on the breaks. We use the `lm`

(linear model) function.

```
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
m <- lm(dist ~ speed, data=cars)
m
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
```

Note that the data is provided by data.frame `cars`

, and that the names in formula are column names in this `data.frame`

. The functions returned a model (`lm`

) object. When printed it shows the coefficients of the regression model (`dist = -17.579 + 3.932 * speed`

). `m`

has quite a bit more information, but that is not shown, by default.

There are several functions that can be used to extract this information.

```
summary(m)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
anova(m)
## Analysis of Variance Table
##
## Response: dist
## Df Sum Sq Mean Sq F value Pr(>F)
## speed 1 21186 21185.5 89.567 1.49e-12 ***
## Residuals 48 11354 236.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residuals(m)[1:10]
## 1 2 3 4 5 6 7
## 3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584 -3.744993
## 8 9 10
## 4.255007 12.255007 -8.677401
```

You can use `abline`

to draw a simple regression line like this.

```
plot(cars, col='blue', pch='*', cex=2)
abline(m, col='red', lwd=2)
```

More generally, you can use the `predict`

function to use the model to predict values of `y`

for any `x`

.

```
p <- predict(m, data.frame(speed=1:30))
p
## 1 2 3 4 5 6
## -13.646686 -9.714277 -5.781869 -1.849460 2.082949 6.015358
## 7 8 9 10 11 12
## 9.947766 13.880175 17.812584 21.744993 25.677401 29.609810
## 13 14 15 16 17 18
## 33.542219 37.474628 41.407036 45.339445 49.271854 53.204263
## 19 20 21 22 23 24
## 57.136672 61.069080 65.001489 68.933898 72.866307 76.798715
## 25 26 27 28 29 30
## 80.731124 84.663533 88.595942 92.528350 96.460759 100.393168
plot(1:30, p, xlab='speed', ylab='distance', type='l', lwd=2)
points(cars)
```

The `glm`

(generalized linear models) function can do what `lm`

can, but it is much more versatile. For example you can also use it for logistic regression. In logistic regression the response variable is normally binomial (0 or 1) or at least between 0 and 1. I create such a variable here (was the stopping distance above 40 or not?).

```
cars$above40 <- cars$dist > 40
```

Now we can use this variable in a `glm`

model. By stating that `family='binomial'`

we indicate that we want logistic regression. (The default is `family=gaussian`

which indicates standard (normal) regression.

```
mlog <- glm(above40 ~ speed, data=cars, family='binomial')
mlog
##
## Call: glm(formula = above40 ~ speed, family = "binomial", data = cars)
##
## Coefficients:
## (Intercept) speed
## -8.553 0.521
##
## Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
## Null Deviance: 68.59
## Residual Deviance: 36.37 AIC: 40.37
```

Because a logistic model operates with logistically transformed numbers, we need to tell the predict function that we want the predicted values on the original scale (`type='response'`

).

```
p <- predict(mlog, data.frame(speed=1:30), type='response')
```

```
plot(cars$speed, cars$above40)
lines(1:30, p)
```

For a more in-depth discussion of regression models with R go here.