Statistical models
There are many type of statistical models. Here we show how to make simple regression models with R. Other modeling approaches tend to use similar syntax.
The most common way to specify a regression model in R is by creating
a formula. For example y ~ x
means y
is a function of x
.
y ~ a + b
means that y
is a function of a
and b
.
Let’s use the cars data that come with R. This dataset has
measurements on the distance needed to stop given the speed a car was
driven when the driver stepped on the breaks. We use the lm
(linear
model) function.
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
m <- lm(dist ~ speed, data=cars)
m
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
Note that the data is provided by data.frame cars
, and that the
names in formula are column names in this data.frame
. The functions
returned a model (lm
) object. When printed it shows the coefficients
of the regression model (dist = -17.579 + 3.932 * speed
). m
has
quite a bit more information, but that is not shown, by default.
There are several functions that can be used to extract this information.
summary(m)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
anova(m)
## Analysis of Variance Table
##
## Response: dist
## Df Sum Sq Mean Sq F value Pr(>F)
## speed 1 21186 21185.5 89.567 1.49e-12 ***
## Residuals 48 11354 236.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residuals(m)[1:10]
## 1 2 3 4 5 6 7 8
## 3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584 -3.744993 4.255007
## 9 10
## 12.255007 -8.677401
You can use abline
to draw a simple regression line like this.
plot(cars, col='blue', pch='*', cex=2)
abline(m, col='red', lwd=2)
More generally, you can use the predict
function to use the model to
predict values of y
for any x
.
p <- predict(m, data.frame(speed=1:30))
p
## 1 2 3 4 5 6 7
## -13.646686 -9.714277 -5.781869 -1.849460 2.082949 6.015358 9.947766
## 8 9 10 11 12 13 14
## 13.880175 17.812584 21.744993 25.677401 29.609810 33.542219 37.474628
## 15 16 17 18 19 20 21
## 41.407036 45.339445 49.271854 53.204263 57.136672 61.069080 65.001489
## 22 23 24 25 26 27 28
## 68.933898 72.866307 76.798715 80.731124 84.663533 88.595942 92.528350
## 29 30
## 96.460759 100.393168
plot(1:30, p, xlab='speed', ylab='distance', type='l', lwd=2)
points(cars)
The glm
(generalized linear models) function can do what lm
can,
but it is much more versatile. For example you can also use it for
logistic regression. In logistic regression the response variable is
normally binomial (0 or 1) or at least between 0 and 1. I create such a
variable here (was the stopping distance above 40 or not?).
cars$above40 <- cars$dist > 40
Now we can use this variable in a glm
model. By stating that
family='binomial'
we indicate that we want logistic regression. (The
default is family=gaussian
which indicates standard (normal)
regression.
mlog <- glm(above40 ~ speed, data=cars, family='binomial')
mlog
##
## Call: glm(formula = above40 ~ speed, family = "binomial", data = cars)
##
## Coefficients:
## (Intercept) speed
## -8.553 0.521
##
## Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
## Null Deviance: 68.59
## Residual Deviance: 36.37 AIC: 40.37
Because a logistic model operates with logistically transformed numbers,
we need to tell the predict function that we want the predicted values
on the original scale (type='response'
).
p <- predict(mlog, data.frame(speed=1:30), type='response')
plot(cars$speed, cars$above40)
lines(1:30, p)