7. Data exploration

After reading data from a file (see the previous chapter), the next thing to do is to look at some summary statistics. This is in the first place just to check your data. Very often you discover some strange values that are not quite right. Careful inspection of your data after you read it is very important. It can help avoid a lot of trouble later on when you are trying to explain odd results!

Sometimes you can correct errors in R but in other cases you first need to fix the data file. Fixing a data file can be the way to go if you are dealing with your own primary data. However when you are working with a file provided by someone else it is generally better to not change the file, but to correct things via R code, if possible. That leaves an exact trail of what you have done, and allows you to apply the same corrections to a new version of the data.

Inspecting your data is also important to understand it better, and such “exploratory data analysis” can be an important step before going doing the specific analyses of interest. There are many general and specialized tools for this, and here we only cover some of the very basic tools.

Summary and table

Consider data.frame d

d <- data.frame(id=1:10,
        name=c('Bob', 'Bobby', '???', 'Bob', 'Bab', 'Jim', 'Jim', 'jim', '', 'Jim'),
        score1=c(8, 10, 7, 9, 2, 5, 1, 6, 3, 4),
        score2=c(3,4,5,-999,5,5,-999,2,3,4), stringsAsFactors=FALSE)
d
##    id  name score1 score2
## 1   1   Bob      8      3
## 2   2 Bobby     10      4
## 3   3   ???      7      5
## 4   4   Bob      9   -999
## 5   5   Bab      2      5
## 6   6   Jim      5      5
## 7   7   Jim      1   -999
## 8   8   jim      6      2
## 9   9            3      3
## 10 10   Jim      4      4

d is very small and we can simply look at the values in d to see if they look all-right. But with real data you may have 100s or 1000s of rows and many columns, making it very hard, if not impossible, to spot errors by eye-balling.

The summary function is an easy way to start, at least for numeric variables.

summary(d)
##        id            name               score1          score2
##  Min.   : 1.00   Length:10          Min.   : 1.00   Min.   :-999.00
##  1st Qu.: 3.25   Class :character   1st Qu.: 3.25   1st Qu.:   2.25
##  Median : 5.50   Mode  :character   Median : 5.50   Median :   3.50
##  Mean   : 5.50                      Mean   : 5.50   Mean   :-196.70
##  3rd Qu.: 7.75                      3rd Qu.: 7.75   3rd Qu.:   4.75
##  Max.   :10.00                      Max.   :10.00   Max.   :   5.00

The minimum value of variable score2 is -999. That was probably used in data entry to indicate a missing value. These should be changed to NA.

# which values in score2 are -999?
i <- d$score2 == -999
# set these to NA
d$score2[i] <- NA
summary(d)
##        id            name               score1          score2
##  Min.   : 1.00   Length:10          Min.   : 1.00   Min.   :2.000
##  1st Qu.: 3.25   Class :character   1st Qu.: 3.25   1st Qu.:3.000
##  Median : 5.50   Mode  :character   Median : 5.50   Median :4.000
##  Mean   : 5.50                      Mean   : 5.50   Mean   :3.875
##  3rd Qu.: 7.75                      3rd Qu.: 7.75   3rd Qu.:5.000
##  Max.   :10.00                      Max.   :10.00   Max.   :5.000
##                                                     NA's   :2

The two steps used above: i <- d$score2 == -999 and d$score2[i] <- NA are usually done in a single line: d$score2[d$score2 == -999] <- NA.

For character (and integer) variables it can be useful to use unique and table:

unique(d$name)
## [1] "Bob"   "Bobby" "???"   "Bab"   "Jim"   "jim"   ""
table(d$name)
##
##         ???   Bab   Bob Bobby   jim   Jim
##     1     1     1     2     1     1     3

Often you will discover slight variation in spelling that need to be corrected. In this case, let’s assume that ‘Bob’, ‘Bobby’, and ‘Bab’ should both be ‘Bob’. We replace ‘Bab’ and ‘Bobby’, with ‘Bob’.

d$name[d$name %in% c('Bab', 'Bobby')] <- 'Bob'
table(d$name)
##
##     ??? Bob jim Jim
##   1   1   4   1   3

‘jim’ should be “Jim”. It is easy enough to replace as done above. But what if there were many cases like that? It would be easy to make all character values lower- or uppercase with d$name <- toupper(d$name) but I want to keep the normal name capitalization of the first letter only. So let’s assure that all names start with a uppercase letter.

# get the first letters
first <- substr(d$name, 1, 1)
# get the remainder
remainder <- substr(d$name, 2, nchar(d$name))
# assure that the first letter is upper case
first <- toupper(first)
# combine
name <- paste0(first, remainder)
# assign back to the variable
d$name <- name
table(d$name)
##
##     ??? Bob Jim
##   1   1   4   4

The question marks in name should probably also be replaced with NA

d$name[d$name == '???'] <- NA
table(d$name)
##
##     Bob Jim
##   1   4   4

You can force table to also count the NA values:

table(d$name, useNA='ifany')
##
##       Bob  Jim <NA>
##    1    4    4    1

Note that there is one ‘empty’ value

d$name[9]
## [1] ""

That should also be a missing value in this case.

d$name[d$name == ''] <- NA
table(d$name, useNA='ifany')
##
##  Bob  Jim <NA>
##    4    4    2

You can also use table to make a contingency table of two variables.

table(d[ c('name', 'score2')])
##      score2
## name  2 3 4 5
##   Bob 0 1 1 1
##   Jim 1 0 1 1

quantile, range, and mean

Other useful functions include quantile, range, and mean

quantile(d$score1)
##    0%   25%   50%   75%  100%
##  1.00  3.25  5.50  7.75 10.00
range(d$score1)
## [1]  1 10
mean(d$score1)
## [1] 5.5

Note that in some functions you may need to use na.rm=TRUE if there are NA values. Otherwise, as soon as there is a single NA value in a computation, the results becomes NA. This is very common in R — so keep that in mind if all your results are NA.

quantile(d$score2)
## Error in quantile.default(d$score2): missing values and NaN's not allowed if 'na.rm' is FALSE
range(d$score2)
## [1] NA NA
quantile(d$score2, na.rm=TRUE)
##   0%  25%  50%  75% 100%
##    2    3    4    5    5
range(d$score2, na.rm=TRUE)
## [1] 2 5

In this data exploration phase it is very useful to make plots. We’ll discuss plotting in a later chapter, but here are four example plots. Note how par(mfrow=c(2,2)) sets up the canvas for two rows and columns, that is for four plots.

par(mfrow=c(2,2))
plot(d$score1, d$score2)
boxplot(d[, c('score1', 'score2')])
plot(sort(d$score1))
hist(d$score2)