Basic data structures
In the previous chapter we saw the most basic data types in R: vectors
of numeric, integer, character, factor and logical values. A vector is a
one-dimensional structure of a single type. In this chapter we look at
data structures that go beyond a single vector: the matrix, data.frame
and list.
Matrix
A vector is a one-dimensional array. A two-dimensional array can be represented with a matrix. Here is how you can create a matrix with two rows and three columns.
matrix(ncol=3, nrow=2)
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
The matrix above did not have any values: all values were
missing (NA). Let’s make a matrix with values 1 to 6.
matrix(1:6, ncol=3, nrow=2)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Note that by default the values are distributed column-wise. To go
row-wise you can use the byrow=TRUE argument.
matrix(1:6, ncol=3, nrow=2, byrow=TRUE)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
This can also be achieved by switching the number of columns and rows
and using the t (transpose) function.
m <- matrix(1:6, ncol=2, nrow=3)
t(m)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
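As a quick check, the two approaches can be compared directly. This is a minimal sketch; the names by_row and transposed are only illustrative and do not appear elsewhere in this chapter.

```r
# fill a 2 x 3 matrix row-wise
by_row <- matrix(1:6, ncol=3, nrow=2, byrow=TRUE)
# fill a 3 x 2 matrix column-wise, then transpose it
transposed <- t(matrix(1:6, ncol=2, nrow=3))
# TRUE if both matrices have the same values and dimensions
identical(by_row, transposed)
## [1] TRUE
```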
Matrices are often created by column-binding and/or row-binding vectors
(or other matrices), using the function cbind or rbind. These
are two of the most commonly used functions in R, so pay close
attention!
a <- c(1,2,3)
b <- 5:7
Column binding:
m1 <- cbind(a, b)
m1
## a b
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
Row binding:
m2 <- rbind(a, b)
m2
## [,1] [,2] [,3]
## a 1 2 3
## b 5 6 7
You can also use cbind and rbind to combine (append) matrices,
as long as the two objects have the same number of rows (for cbind)
or the same number of columns (for rbind).
m3 <- cbind(b, b, a)
m <- cbind(m1, m3)
m
## a b b b a
## [1,] 1 5 5 5 1
## [2,] 2 6 6 6 2
## [3,] 3 7 7 7 3
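To illustrate the requirement on matching dimensions, here is a small sketch. The matrix m4 and the use of tryCatch to catch the error are additions for illustration only.

```r
a <- c(1,2,3)
b <- 5:7
m1 <- cbind(a, b)          # 3 rows, 2 columns
m4 <- matrix(1:4, nrow=2)  # 2 rows, 2 columns (hypothetical example matrix)

# row-binding works because both objects have 2 columns
r <- rbind(m1, m4)
dim(r)
## [1] 5 2

# column-binding fails because the number of rows differs (3 versus 2);
# tryCatch turns the error into a simple message
res <- tryCatch(cbind(m1, m4), error = function(e) "error")
res
## [1] "error"
```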
We can get information about the properties of a matrix with functions
such as nrow, ncol, dim and length.
nrow(m)
## [1] 3
ncol(m)
## [1] 5
# dimensions of m (nrow, ncol)
dim(m)
## [1] 3 5
# number of cells, or nrow(m) * ncol(m)
length(m)
## [1] 15
Columns have (variable) names that can be changed.
# get the column names
colnames(m)
## [1] "a" "b" "b" "b" "a"
# set the column names
colnames(m) <- c('ID', 'X', 'Y', 'v1', 'v2')
m
## ID X Y v1 v2
## [1,] 1 5 5 5 1
## [2,] 2 6 6 6 2
## [3,] 3 7 7 7 3
Likewise there are row names, but these are less important.
rownames(m) <- paste0('row_', 1:nrow(m))
m
## ID X Y v1 v2
## row_1 1 5 5 5 1
## row_2 2 6 6 6 2
## row_3 3 7 7 7 3
A matrix can only store a single data type (numeric, character, …). If you try to mix character and numeric values, all values will become character values.
cbind(vchar=c('a','b'), vnumb=1:2)
## vchar vnumb
## [1,] "a" "1"
## [2,] "b" "2"
You can see that 1 and 2 are character values because they are quoted.
You could not use them in algebra without first converting them back to
numbers. Note that the column names were set by providing them to
cbind.
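Converting such values back to numbers could look like the sketch below. It assumes the column is selected by name with the [ , ] notation and converted with as.numeric; the variable name v is illustrative.

```r
m <- cbind(vchar=c('a','b'), vnumb=1:2)
# select the second column by name and convert it back to numbers
v <- as.numeric(m[, 'vnumb'])
# now arithmetic works again
v * 10
## [1] 10 20
```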
A matrix is a two-dimensional array. Higher-dimensional arrays can also
be created (see help(array)), but these data structures are not that
commonly used, so we do not discuss them here.
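For reference only, a minimal sketch of a three-dimensional array is given here. The name arr and the chosen dimensions are arbitrary.

```r
# a three-dimensional array: 2 rows, 3 columns, 2 layers
arr <- array(1:12, dim=c(2,3,2))
dim(arr)
## [1] 2 3 2
# as with a matrix, length is the number of cells
length(arr)
## [1] 12
```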
List
A list is a very flexible container to store data. Each element of a
list can contain any type of R object, e.g. a vector, matrix,
data.frame, another list, or more complex data types.
A simple list:
list(1:3)
## [[1]]
## [1] 1 2 3
It shows that the first element, [[1]], contains a vector of
1, 2, 3.
Here is one with two data types.
e <- list(c(2,5), 'abc')
e
## [[1]]
## [1] 2 5
##
## [[2]]
## [1] "abc"
List elements can be named.
names(e) <- c('first', 'last')
e
## $first
## [1] 2 5
##
## $last
## [1] "abc"
And a more complex list.
m <- matrix(1:6, ncol=3, nrow=2)
f <- list(e, m, 'abc')
f
## [[1]]
## [[1]]$first
## [1] 2 5
##
## [[1]]$last
## [1] "abc"
##
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## [[3]]
## [1] "abc"
Note that the first element of list f is itself a list of two
elements.
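The nesting can be inspected by extracting elements. The sketch below rebuilds the same list and assumes the [[ ]] and $ notation for selecting list elements by position and by name.

```r
e <- list(first=c(2,5), last='abc')
m <- matrix(1:6, ncol=3, nrow=2)
f <- list(e, m, 'abc')
# the first element of f is a list
class(f[[1]])
## [1] "list"
# reach into the nested list by name
f[[1]]$first
## [1] 2 5
# the second element is the matrix
dim(f[[2]])
## [1] 2 3
```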
Data frame
The data.frame is the workhorse for statistical data analysis in
R. It is a special type of list that requires that all elements have
the same length. That makes the data.frame rectangular like a matrix,
but unlike matrices a data.frame can have columns (variables) of
different data types. A data.frame is what you get when you read
spreadsheet-like data into R with functions like read.table or
read.csv. We’ll show that in a later chapter. We can also create a
data.frame with some simple code.
# four vectors
ID <- as.integer(1:4)
name <- c('Ana', 'Rob', 'Liu', 'Veronica')
sex <- as.factor(c('F','M','M','F'))
score <- c(10.2, 9, 13.5, 18)
d <- data.frame(ID, name, sex, score)
d
## ID name sex score
## 1 1 Ana F 10.2
## 2 2 Rob M 9.0
## 3 3 Liu M 13.5
## 4 4 Veronica F 18.0
d is a data.frame, but individual columns can be of any class. Note
that the length of a data.frame is defined as the number of variables
(columns), while the length of a matrix is defined as the number of
cells! This is because a matrix is a special kind of vector, while a
data.frame is a special kind of list.
class(d)
## [1] "data.frame"
length(d)
## [1] 4
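The difference in length can be seen by converting a data.frame to a matrix. This is a minimal sketch with a smaller, all-numeric data.frame so that the conversion with as.matrix keeps the values numeric; the name dm is illustrative.

```r
d <- data.frame(ID=1:4, score=c(10.2, 9, 13.5, 18))
# a data.frame counts its columns
length(d)
## [1] 2
# the same data as a matrix counts its cells (4 rows * 2 columns)
dm <- as.matrix(d)
length(dm)
## [1] 8
```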
Because a data.frame is a special kind of list, you can do with a
data.frame what you can do with a list.
is.list(d)
## [1] TRUE
names(d)
## [1] "ID" "name" "sex" "score"
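For example, a function can be applied to each column, just as to each element of a list. The sketch below assumes sapply for this and uses a reduced version of the data.frame; the name cls is illustrative.

```r
d <- data.frame(ID=1:4, sex=as.factor(c('F','M','M','F')),
                score=c(10.2, 9, 13.5, 18))
# apply class() to each column, as with any list
cls <- sapply(d, class)
cls[['sex']]
## [1] "factor"
# extract a single column by name, as with a named list
d$score
## [1] 10.2  9.0 13.5 18.0
```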
But in other ways, a data.frame is also similar to a matrix (which
normal lists are not).
nrow(d)
## [1] 4
dim(d)
## [1] 4 4
colnames(d)
## [1] "ID" "name" "sex" "score"
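The matrix-like behavior also extends to selecting values by row and column. The following is a small sketch, assuming the [row, column] notation; a plain list made from the same data has no dimensions.

```r
d <- data.frame(ID=1:4, score=c(10.2, 9, 13.5, 18))
# row 2, column 'score', using matrix-style indexing
d[2, 'score']
## [1] 9
# a plain list with the same elements has no dimensions
dim(as.list(d))
## NULL
```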