# Basic data types

This chapter introduces the basic data types that are used in *R*. We
mainly show how to create data of these types. There is much more on how
to manipulate data in the following chapters.

The most important basic (or “primitive”) data types are the “numeric” (for numbers) and “character” (for text) types. Additional types are the “integer”, which can be used to represent whole numbers; the “logical” for TRUE/FALSE, and the “factor” for categorical variables. These are all discussed below. In later chapters you will see how these basic types can be combined to represent more complex data types.

## Numeric and integer values

Let’s create a variable `a` that is a vector of one number.

```
a <- 7
```

To do this yourself, type the code in an R console. Or, if you use RStudio, use ‘File / New File / R Script’ and type it in the new script. Then press “Run” or “Ctrl-Enter” (Cmd-Enter on a Mac) to run the line (make sure your cursor is on the line that you want to run).

The “arrow” `<-` was used to **assign** the value `7` to variable `a`.
You can pronounce the above as “a *becomes* 7”.

It is also possible to use the `=` sign.

```
a = 7
```

Although you can use `=`, we use `<-` because the arrow clearly
indicates the assignment action, and because `=` is also used in
another context (to pass arguments to functions).
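
As a small illustration of that second context (a sketch; the variable name `s1` is ours and not part of the original text), `=` inside a function call only names an argument. It does not create a variable in your workspace:

```r
# Inside a function call, `=` binds a value to the argument named `times`
s1 <- rep(7, times = 3)
s1
## [1] 7 7 7
# The call did not create a variable called `times`
exists("times", inherits = FALSE)
## [1] FALSE
```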

The variable name `a` used above is entirely arbitrary. We could have
used `x`, `varib`, `fruit` or any other name that would help us
recognize it. There are a few restrictions: variable names cannot start
with a number, and they cannot contain spaces or “special” characters,
such as `*` (which is used for multiplication).

To check the value of `a`, we can ask *R* to `show` or `print` it.

```
show(a)
## [1] 7
print(a)
## [1] 7
```

This is also what happens if you simply type the variable name.

```
a
## [1] 7
```

In *R*, all basic values are stored as a *vector*, a one-dimensional
array of *n* values of a certain type. Even a single number is a vector
(of length 1). That is why *R* shows the value of `a` as `[1] 7`:
7 is the first element in vector `a`.

We can use the `class` function to find out what type of object `a`
is (what class it belongs to).

```
class(a)
## [1] "numeric"
```

*numeric* means that `a` is a vector of real (decimal) numbers. Its
value is equivalent to `7.000`, but trailing zeros are not printed by
default. In a few cases it can be useful, or even necessary, to use
integer (whole number) values. To create a vector with a single integer
you can either use the `as.integer` function, or append an `L` to
the number.

```
a <- as.integer(7)
class(a)
## [1] "integer"
a <- 7L
class(a)
## [1] "integer"
```
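
To make the distinction concrete, here is a short sketch showing that a whole number typed without the `L` suffix is still stored as numeric, and that combining an integer with a decimal number yields a numeric result:

```r
# Whole numbers typed without the L suffix are stored as "numeric"
class(7)
## [1] "numeric"
is.integer(7)
## [1] FALSE
is.integer(7L)
## [1] TRUE
# Mixing an integer with a decimal number gives a numeric result
class(7L + 0.5)
## [1] "numeric"
```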

There are several ways to create vectors that consist of multiple
numbers. For example, you can use the `c` (combine) function and spell
out the values:

```
b <- c(1.25, 2.9, 3.0)
b
## [1] 1.25 2.90 3.00
```

But if you want to create a regular sequence of whole numbers, it is
easier to use `:`.

```
d <- 5:9
d
## [1] 5 6 7 8 9
```

You can also use the `:` to create a sequence in descending order.

```
6:2
## [1] 6 5 4 3 2
```

The `seq` function provides more flexibility. For example, it allows
for step sizes other than one. In the second call below we go from 6
to 12, taking steps of 3. Try some variations!

```
seq(2,5,1)
## [1] 2 3 4 5
seq(from=6, to=12, by=3)
## [1] 6 9 12
```

To go in descending order the `by` argument needs to be negative.

```
seq(from=12, to=0, by=-4)
## [1] 12 8 4 0
```

You can also reverse the order of a sequence, after making the sequence,
by using the `rev` function.

```
s <- seq(from=0, to=12, by=4)
s
## [1] 0 4 8 12
r <- rev(s)
r
## [1] 12 8 4 0
```

We will discuss *functions* like `seq` in more detail later. But, in
essence, a *function* is a named procedure that performs a certain task.
In the examples above, the function name is `seq`, and the task is to
create a sequence of numbers.

The exact specification of the sequence is modified by the *arguments*
that are provided to `seq`, in this case: `from`, `to`, and `by`. If you
are unsure what a function does, or which arguments are available, then
read the function’s help page. You can get to the help page for `seq`
by typing `?seq` or `help(seq)`, and likewise for all other functions
in *R*.
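
For instance, the help page for `seq` lists a `length.out` argument that was not used above. A minimal sketch (the variable name `s5` is ours) that asks for a fixed number of values instead of a step size:

```r
# length.out asks for a number of values instead of a step size
s5 <- seq(from = 0, to = 1, length.out = 5)
s5
## [1] 0.00 0.25 0.50 0.75 1.00
```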

The `rep` (for repeat) function provides another way to create a
vector of numbers. You can repeat a single number, or a sequence of
numbers.

```
rep(9, times=5)
## [1] 9 9 9 9 9
rep(5:7, times=3)
## [1] 5 6 7 5 6 7 5 6 7
rep(5:7, each=3)
## [1] 5 5 5 6 6 6 7 7 7
```

## Character values

A character variable is used to represent letters, codes, or words. A character value is often referred to as a “string”.

```
x <- "Yi"
y <- "Wong"
class(x)
## [1] "character"
x
## [1] "Yi"
```

To distinguish a character value from a variable name, it needs to be
quoted. `"x"` is a character value, but `x` is a variable!
Double-quoted `"Yi"` is the same as single-quoted `'Yi'`, but you
cannot mix the two in one value: `"Yi'` is not valid. You can enclose
one type of quote inside a pair of the other type. For example, you can
do `"Yi's dog"` or `'Wong said "good bye" and left'`.
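
The quoting rules can be checked directly. In this sketch (the variable name `q1` is ours), `identical` is a base *R* function that tests whether two objects are exactly the same:

```r
# Single and double quotes create the same character value
identical("Yi", 'Yi')
## [1] TRUE
# An apostrophe can sit inside a pair of double quotes
q1 <- "Yi's dog"
q1
## [1] "Yi's dog"
```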

One of the most common mistakes for beginners is to forget the quotes.

```
Yi
## Error in eval(expr, envir, enclos): object 'Yi' not found
```

The error occurs because *R* tries to print the value of variable
`Yi`, but there is no such variable. So remember that any time you get
the error message `object 'something' not found`, the most likely
reason is that you forgot to quote a character value. If not, it
probably means that you have misspelled, or not yet created, the
variable that you are referring to.

Keep in mind that *R* is case-sensitive: `a` is not the same as `A`.
In most computing contexts, `a` and `A` are **entirely** different
and, for most intents and purposes, **unrelated** symbols.
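
A quick way to see this for yourself (the variable names `height` and `Height` are only illustrative):

```r
# Two names that differ only in case are two separate variables
height <- 10
Height <- 20
height
## [1] 10
Height
## [1] 20
```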

Now let’s create variable `countries` holding a character vector of
five elements.

```
countries <- c("China", "China", "Japan", "South Korea", "Japan")
class(countries)
## [1] "character"
countries
## [1] "China" "China" "Japan" "South Korea" "Japan"
```

The function `length` tells us how long the vector is (how many
elements it has).

```
length(countries)
## [1] 5
```

If you want to know the number of characters of each element of the
vector, you can use `nchar`.

```
nchar(countries)
## [1] 5 5 5 11 5
```

`nchar` returns a vector of integers with the same length as
`countries` (5). Each number is the number of characters of the
corresponding element of `countries`. This is an example of why we say
that most functions in *R* are *vectorized*. This means that you
normally do not need to tell *R* to compute things for each individual
element in a vector.
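
Vectorization applies to arithmetic as well as to functions like `nchar`. A brief sketch (the variable name `b2` is ours):

```r
# One expression processes every element; no loop is needed
b2 <- c(1.25, 2.9, 3.0)
b2 * 2
## [1] 2.5 5.8 6.0
nchar(c("a", "bb", "ccc"))
## [1] 1 2 3
```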

It is handy to know that `letters` (a constant value, like `pi`)
returns the alphabet (`LETTERS` returns them in uppercase), and
`toupper` and `tolower` can be used to change case.

```
z <- letters
z
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
up <- toupper(z)
up
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
```

Perhaps the most commonly used function for string manipulation is
`paste`. This function is used to concatenate strings. For example:

```
girl <- "Mary"
boy <- "John"
paste(girl, "talks to", boy)
## [1] "Mary talks to John"
```

By default, `paste` uses a space to separate the elements. You can
change that with the `sep` argument.

```
paste(girl, "likes", boy, sep = " ~ ")
## [1] "Mary ~ likes ~ John"
```

Sometimes you do not want any separator. You can then use `sep=""` or
the `paste0` function.
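
Both options give the same result. The sketch below redefines `girl` and `boy` so that it runs on its own:

```r
girl <- "Mary"
boy <- "John"
# An empty separator and paste0 are equivalent
paste(girl, boy, sep = "")
## [1] "MaryJohn"
paste0(girl, boy)
## [1] "MaryJohn"
# paste0 is vectorized too, which is handy for building labels
paste0("az", 20:22)
## [1] "az20" "az21" "az22"
```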

By using the “collapse” argument, we can concatenate all values of a vector into a single value.

```
paste(countries, collapse=" -- ")
## [1] "China -- China -- Japan -- South Korea -- Japan"
```

We’ll leave more advanced manipulation of strings for later, but here
are a few more important functions. To get a part of a string use
`substr`.

```
substr("Hello World", 1, 5)
## [1] "Hello"
substr("Hello World", 7, 11)
## [1] "World"
```

To replace characters in a string use `gsub` or `sub`.

```
gsub("l", "!!", "Hello World")
## [1] "He!!!!o Wor!!d"
gsub("Hello", "Bye bye", "Hello World")
## [1] "Bye bye World"
```
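
The difference between the two is that `sub` replaces only the first match, whereas `gsub` replaces all matches. A short comparison:

```r
# sub replaces only the first match
sub("l", "!!", "Hello World")
## [1] "He!!lo World"
# gsub replaces every match
gsub("l", "!!", "Hello World")
## [1] "He!!!!o Wor!!d"
```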

To find elements that fit a particular pattern use `grep` or
`grepl`. `grep` returns the index of the matching elements in a
vector. You can use the index to subset the original vector (we will see
more of this later).

```
d <- c("az20", "az21", "az22", "ba30", "ba31", "ab32")
i <- grep("ba", d)
i
## [1] 4 5
d[i]
## [1] "ba30" "ba31"
# or like this
grep("ba", d, value=TRUE)
## [1] "ba30" "ba31"
```

Above, also note the use of `#`. Everything from this character to the
end of the line is ignored by *R*, so it can be used to provide natural
language comments.

```
# Instead of the index, get logical values with grepl
i <- grepl("ba", d)
i
## [1] FALSE FALSE FALSE TRUE TRUE FALSE
# return the cases of d for which i is TRUE
d[i]
## [1] "ba30" "ba31"
```

Which elements of `d` include the character “2”?

```
grep("2", d)
## [1] 1 2 3 6
```

Which elements of `d` *end* with the character “2”? In a pattern, `$`
has a special meaning: it matches the end of the string.

```
grep("2$", d)
## [1] 3 6
```

Which elements of `d` *start* with the character “b”? In a pattern, `^`
has a special meaning: it matches the start of the string.

```
grep("^b", d)
## [1] 4 5
```

## Logical values

A logical (or Boolean) value is either `TRUE` or `FALSE`. They are
used very frequently in *R* and in computer programming in general.

```
z <- FALSE
z
## [1] FALSE
class(z)
## [1] "logical"
z <- c(TRUE, TRUE, FALSE)
z
## [1] TRUE TRUE FALSE
```

`TRUE` and `FALSE` can be abbreviated to `T` and `F`, but that
is bad practice. This is because it is possible to change the value of
`T` and `F` to something else, and that would be extraordinarily
confusing. In contrast, `TRUE` and `FALSE` are constants that cannot
be changed.
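
The sketch below demonstrates the problem. It temporarily overwrites `T`, restores it with `rm`, and then uses `tryCatch` (a base *R* function for handling errors, not covered in this chapter) to show that assigning to `TRUE` fails. The variable name `res` and the returned strings are ours:

```r
# T is an ordinary variable, so it can be overwritten
T <- 0
T
## [1] 0
# Remove our copy so that T refers to the built-in value again
rm(T)
T
## [1] TRUE
# Assigning to TRUE itself is an error; tryCatch captures it here
res <- tryCatch({TRUE <- 0; "assigned"}, error = function(e) "not allowed")
res
## [1] "not allowed"
```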

Logical values are often the result of a computation. For example, here
we ask if the values of `x` are larger than 3, which is `TRUE` for
values 4 and 5:

```
x <- 2:5
x > 3
## [1] FALSE FALSE TRUE TRUE
```

Likewise we can test for equality using two equal signs `==` (not a
single `=`, which would be an assignment!). `<=` means “smaller than or
equal to” and `>=` means “larger than or equal to”.

```
x == 3
## [1] FALSE TRUE FALSE FALSE
x <= 2
## [1] TRUE FALSE FALSE FALSE
```

Logical values can be treated as numerical values. `TRUE` is
equivalent to 1 and `FALSE` to 0.

```
y <- TRUE
y + 1
## [1] 2
```
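
A common use of this equivalence is counting. Because `TRUE` counts as 1, `sum` returns the number of `TRUE` values and `mean` returns their proportion:

```r
x <- 2:5
# How many values are larger than 3?
sum(x > 3)
## [1] 2
# What fraction of the values is larger than 3?
mean(x > 3)
## [1] 0.5
```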

However, if you go the other way, only zero is equivalent to `FALSE`,
while any number that is not zero is `TRUE`.

```
as.logical(0)
## [1] FALSE
as.logical(1)
## [1] TRUE
as.logical(2.5)
## [1] TRUE
```

## Factors

A `factor` is a nominal (categorical) variable with a set of known
possible values called `levels`. They can be created using the
`as.factor` function. In *R* you typically need to convert (cast) a
character variable to a factor to identify groups for use in statistical
tests and models.

```
f1 <- as.factor(countries)
f1
## [1] China China Japan South Korea Japan
## Levels: China Japan South Korea
```
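
To inspect the levels themselves, you can use the base *R* functions `levels` and `nlevels`. The sketch below recreates `countries` and `f1` so that it runs on its own:

```r
countries <- c("China", "China", "Japan", "South Korea", "Japan")
f1 <- as.factor(countries)
# The distinct categories, as a character vector
levels(f1)
## [1] "China"       "Japan"       "South Korea"
# The number of categories
nlevels(f1)
## [1] 3
```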

But numbers can also be used. For example, they may simply indicate group membership.

```
f2 <- c(5:7, 5:7, 5:7)
f2
## [1] 5 6 7 5 6 7 5 6 7
f2 <- as.factor(f2)
f2
## [1] 5 6 7 5 6 7 5 6 7
## Levels: 5 6 7
```

Dealing with factors can be tricky. For example `f2` created above is
not what it may seem. We see numbers 5, 6 and 7, but these are now just
labels to identify groups. They cannot be used in algebraic expressions.

We can convert factors to something else. Here we use `as.integer`. If
you want a number with decimal places, you can use `as.numeric` instead.

```
f2
## [1] 5 6 7 5 6 7 5 6 7
## Levels: 5 6 7
as.integer(f2)
## [1] 1 2 3 1 2 3 1 2 3
```

The result of `as.integer(f2)` may have been surprising. But it should
not be, as there is no direct link between a category with label `"5"`
and the number `5`. In this case, `"5"` is simply the label of the first
category and hence it gets converted to the integer 1. Nevertheless, we
can get the numbers back as there is an established link between the
character symbol `"5"` and the number `5`. So we first create
characters from the factor values, and then numbers from the characters.

```
fc2 <- as.character(f2)
fc2
## [1] "5" "6" "7" "5" "6" "7" "5" "6" "7"
as.integer(fc2)
## [1] 5 6 7 5 6 7 5 6 7
```

This is different from `as.integer(f2)`, which returned the indices of
the factor values. It has no way of knowing if you want factor level
`6` to represent the number 6.
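
An alternative idiom, described on the help page for `factor`, converts the level labels once and then indexes them by the factor. A minimal sketch (the variable name `n2` is ours):

```r
f2 <- as.factor(c(5:7, 5:7, 5:7))
# Convert the level labels to numbers, then index them by the factor
n2 <- as.numeric(levels(f2))[f2]
n2
## [1] 5 6 7 5 6 7 5 6 7
```

This gives the same values as going through `as.character` first.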

At this point it is OK if you are confused about factors and *why* you
might do such things as conversion from and to them.

## Missing values

All basic data types can have “missing values”. These are represented by
the symbol `NA` for “Not Available”. For example, we can have vector
`m`:

```
m <- c(2, NA, 5, 2, NA, 2)
m
## [1] 2 NA 5 2 NA 2
```

Note that `NA` is *not* quoted (it is a special symbol, it is not the
word “NA”).

Properly treating missing values is very important. The first question
to ask when they appear is whether they should be missing (or did you
make a mistake in the data manipulation?). If they should be missing,
the second question becomes how to treat them. Can they be ignored?
Should the records with `NA`s be removed?
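
As a starting point for answering these questions, the sketch below finds and counts the missing values with `is.na`, and shows that many summary functions, such as `mean`, return `NA` unless you ask them to ignore missing values with `na.rm=TRUE`:

```r
m <- c(2, NA, 5, 2, NA, 2)
# Which elements are missing?
is.na(m)
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
# How many are missing?
sum(is.na(m))
## [1] 2
# A single NA makes the mean unknown
mean(m)
## [1] NA
# Unless missing values are removed first
mean(m, na.rm = TRUE)
## [1] 2.75
```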

## Time

Representing time is a somewhat complex problem. There are different calendars, hours, days, months, and leap years to consider. As a basic introduction, here is a simple way to create date values.

```
d1 <- as.Date("2015-4-11")
d2 <- as.Date("2015-3-11")
class(d1)
## [1] "Date"
d1 - d2
## Time difference of 31 days
```

And there are more advanced classes as well that capture date and time.

```
as.POSIXlt(d1)
## [1] "2015-04-11 UTC"
as.POSIXct(d1)
## [1] "2015-04-11 UTC"
```
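
Dates are not always written as year-month-day. As a sketch (the variable name `d3` and the input text are ours), the `format` argument of `as.Date` describes how the text is laid out, and the `format` function turns a date back into text:

```r
# format= describes the layout of the text (here: day/month/year)
d3 <- as.Date("11/04/2015", format = "%d/%m/%Y")
d3
## [1] "2015-04-11"
# Adding a number to a Date moves it by that many days
d3 + 30
## [1] "2015-05-11"
# Extract the year as text
format(d3, "%Y")
## [1] "2015"
```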

See http://www.stat.berkeley.edu/~s133/dates.html for more info.