Algebra
Vectors and matrices can be used to compute new vectors (matrices) with simple and intuitive algebraic expressions.
Vector algebra
We have two vectors, a and b.
a <- 1:5
b <- 6:10
Multiplication works element by element. That is, a[1] * b[1], a[2] * b[2], etc.
d <- a * b
a
## [1] 1 2 3 4 5
b
## [1] 6 7 8 9 10
d
## [1] 6 14 24 36 50
The example above illustrates a special feature of R not found in most other programming languages: you do not need to ‘loop’ over the elements of an array (a vector in this case) to compute new values. It is important to use this feature as much as possible. In many other programming languages you would need to write a for-loop to achieve the above (for-loops do exist in R; they are very important and are discussed in a later chapter).
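For comparison, here is a minimal sketch of the explicit loop that the vectorized expression a * b replaces. The name d_loop is introduced only for this illustration.

```r
a <- 1:5
b <- 6:10

# pre-allocate the result, then fill it one element at a time
d_loop <- numeric(length(a))
for (i in seq_along(a)) {
  d_loop[i] <- a[i] * b[i]
}

# the loop and the vectorized expression give the same values
all(d_loop == a * b)
## [1] TRUE
```

The vectorized form is shorter and usually faster, which is why it is preferred whenever it applies.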
You can also multiply a vector with a single number.
a * 3
## [1] 3 6 9 12 15
In the examples above the computations used either vectors of the same length, or one of the vectors had length 1. You can use algebraic computations with vectors of different lengths, as the shorter ones will be “recycled”. R only issues a warning if the length of the longer vector is not a multiple of the length of the shorter object. This is a great feature when you need it, but it may also make you overlook errors when your data are not what you think they are.
a + c(1,10)
## Warning in a + c(1, 10): longer object length is not a multiple of shorter
## object length
## [1] 2 12 4 14 6
No warning here:
1:6 + c(0,10)
## [1] 1 12 3 14 5 16
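Recycling can be made explicit with rep, which may help to see what R does behind the scenes. The sketch below uses two illustrative names, short and recycled, that are not part of the examples above, and ends with a simple length check you could use as a guard.

```r
a <- 1:5
short <- c(1, 10)

# repeat the shorter vector until it has the length of the longer one
recycled <- rep(short, length.out = length(a))
recycled
## [1] 1 10 1 10 1

# same values as a + c(1, 10), but without the warning
a + recycled
## [1] 2 12 4 14 6

# a simple guard: is the longer length a multiple of the shorter one?
length(a) %% length(short) == 0
## [1] FALSE
```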
Logical comparisons
It is very common in computer programs to test for (in)equality or whether a value is greater or smaller than another value.
Recall that == is used to test for equality.
a <- 1:5
b <- 6:10
a == 2
## [1] FALSE TRUE FALSE FALSE FALSE
And inequality is evaluated with !=
a != 2
## [1] TRUE FALSE TRUE TRUE TRUE
“Less than or equal” is <=, and “greater than or equal” is >=. Use < and > for strict comparisons.
a < 3
## [1] TRUE TRUE FALSE FALSE FALSE
b >= 9
## [1] FALSE FALSE FALSE TRUE TRUE
& is Boolean “AND”, and | is Boolean “OR”.
a
## [1] 1 2 3 4 5
b
## [1] 6 7 8 9 10
b > 6 & b < 8
## [1] FALSE TRUE FALSE FALSE FALSE
# combining a and b
b > 9 | a <= 2
## [1] TRUE TRUE FALSE FALSE TRUE
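To see how the combined condition is evaluated, it can help to store each side separately. This is a small sketch; the names left and right are only for illustration.

```r
a <- 1:5
b <- 6:10

left <- b > 9     # FALSE FALSE FALSE FALSE TRUE
right <- a <= 2   # TRUE TRUE FALSE FALSE FALSE

# | compares the two logical vectors position by position
left | right
## [1] TRUE TRUE FALSE FALSE TRUE

# TRUE counts as 1, so sum() gives the number of elements that pass
sum(left | right)
## [1] 3
```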
Functions
There are many functions that allow us to do vectorized algebra. For example:
sqrt(a)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
exp(a)
## [1] 2.718282 7.389056 20.085537 54.598150 148.413159
Not all functions return a vector of the same length. The following functions return just one or two numbers:
min(a)
## [1] 1
max(a)
## [1] 5
range(a)
## [1] 1 5
sum(a)
## [1] 15
mean(a)
## [1] 3
median(a)
## [1] 3
prod(a)
## [1] 120
sd(a)
## [1] 1.581139
If you cannot guess what prod and sd do, look them up in the help files (e.g. ?sd).
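As a check on your reading of the help files, the sketch below computes both values by hand. The names prod_by_hand and sd_by_hand are illustrative only.

```r
a <- 1:5

# prod: multiply all elements together (1 * 2 * 3 * 4 * 5)
prod_by_hand <- 1
for (x in a) prod_by_hand <- prod_by_hand * x
prod_by_hand == prod(a)
## [1] TRUE

# sd: sample standard deviation, dividing by n - 1
sd_by_hand <- sqrt(sum((a - mean(a))^2) / (length(a) - 1))
all.equal(sd_by_hand, sd(a))
## [1] TRUE
```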
Random numbers
It is common to create a vector of random numbers in data analysis, and also to create example data to demonstrate how a procedure works. To get 10 numbers sampled from the uniform distribution between 0 and 1 you can do
r <- runif(10)
r
## [1] 0.31123423 0.53732622 0.60545101 0.54453739 0.45749611 0.47944866
## [7] 0.79744406 0.72818618 0.61347444 0.02660559
For Normally distributed numbers, use rnorm
r <- rnorm(10, mean=10, sd=2)
r
## [1] 4.407178 11.173913 11.259342 10.557532 11.605031 9.238570 9.787381
## [8] 9.034805 12.991576 7.408201
If you run the functions above, you will get different numbers than the
ones shown here. After all, they are random numbers! Modern data
analysis methods use a lot of randomization. This can make it a challenge
to exactly reproduce results. To allow for exact reproduction
of examples or real data analysis, we often want to ensure that we take
exactly the same random sample each time we run our code. To do that
we use set.seed. This function initializes the random number
generator (to a specific point in an infinite but static sequence of
numbers). This is illustrated below.
set.seed(12)
runif(2)
## [1] 0.06936092 0.81777520
runif(3)
## [1] 0.9426217 0.2693819 0.1693481
runif(4)
## [1] 0.03389562 0.17878500 0.64166537 0.02287774
set.seed(12)
runif(1)
## [1] 0.06936092
runif(2)
## [1] 0.8177752 0.9426217
set.seed(12)
runif(3)
## [1] 0.06936092 0.81777520 0.94262173
runif(5)
## [1] 0.26938188 0.16934812 0.03389562 0.17878500 0.64166537
Note that each time set.seed is called with the same number, the same
sequence of random numbers is generated, no matter how many numbers are
requested per call. This is a very important feature, as it
allows us to exactly reproduce results that involve random sampling. The
seed number is arbitrary; a different seed number will give a different
sequence.
set.seed(999)
runif(3)
## [1] 0.38907138 0.58306072 0.09466569
runif(5)
## [1] 0.85263123 0.78674676 0.11934226 0.60644699 0.08095691
The idea is that this will allow you to exactly reproduce results. By avoiding small amounts of variation between runs of your code, you can be sure that everything still works as before. You may wonder how to choose the value of the seed. You could take the date (e.g. “20210329”), but it should not really matter. If you notice that your data analysis gives materially different results based on your choice of seed, then you need to reconsider what you are doing, as your results are not stable (or you may need to run the analysis many times).
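One way to check whether results depend on the seed is to repeat the computation with several seeds and compare. The sketch below is only an illustration: analysis is a hypothetical stand-in for a real data analysis, and the seeds are arbitrary.

```r
# hypothetical "analysis": the mean of 1000 draws from a Normal distribution
analysis <- function(seed) {
  set.seed(seed)
  mean(rnorm(1000, mean = 10, sd = 2))
}

# run with several arbitrary seeds
seeds <- c(1, 12, 999, 20210329)
results <- sapply(seeds, analysis)

# spread of the results across seeds; a small range suggests stability
range(results)

# the same seed always gives the same result
analysis(12) == analysis(12)
## [1] TRUE
```

If the range across seeds is large relative to the effect you care about, that suggests the result is not stable.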
Matrices
Computation with matrices is also ‘vectorized’. For example, with matrix m you can do m * 5 to multiply all values of m by 5, or do m^2 or m * m to square the values of m.
# set up an example matrix
m <- matrix(1:6, ncol=3, nrow=2, byrow=TRUE)
m
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
m * 2
## [,1] [,2] [,3]
## [1,] 2 4 6
## [2,] 8 10 12
m^2
## [,1] [,2] [,3]
## [1,] 1 4 9
## [2,] 16 25 36
We can also do math with a matrix and a vector. Note, again, that computation with matrices in R is column-wise, and that shorter vectors are recycled.
m * 1:2
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 8 10 12
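The column-wise recycling can be reproduced by hand, which may make the result above easier to follow. This is a sketch; recycled and by_hand are illustrative names.

```r
m <- matrix(1:6, ncol = 3, nrow = 2, byrow = TRUE)

# a matrix is stored column by column
as.vector(m)
## [1] 1 4 2 5 3 6

# 1:2 is recycled along that column-wise order: 1 2 1 2 1 2
recycled <- rep(1:2, length.out = length(m))

# rebuild the matrix from the element-wise product
by_hand <- matrix(as.vector(m) * recycled, nrow = nrow(m))
all(by_hand == m * 1:2)
## [1] TRUE
```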
Can you predict the result of this multiplication?
m * 1:4
You can multiply two matrices.
m * m
## [,1] [,2] [,3]
## [1,] 1 4 9
## [2,] 16 25 36
Note that this is “cell by cell” multiplication. For ‘matrix multiplication’ in the mathematical sense, you need to use the %*% operator.
m %*% t(m)
## [,1] [,2]
## [1,] 14 32
## [2,] 32 77
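To connect the %*% result to the definition of matrix multiplication, the sketch below checks the dimensions and recomputes a single cell by hand.

```r
m <- matrix(1:6, ncol = 3, nrow = 2, byrow = TRUE)

# m is 2 x 3 and t(m) is 3 x 2, so the product is 2 x 2
dim(m %*% t(m))
## [1] 2 2

# element [1, 2] is the sum of products of row 1 of m and column 2 of t(m)
sum(m[1, ] * t(m)[, 2])
## [1] 32
```

Note that m %*% m would fail here, because the number of columns of the first matrix (3) must equal the number of rows of the second (2).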