11. Data preparation

A very large part of research work can consist of data gathering, cleaning, combining and formatting. You may spend much more time on data preparation than on analysis, modeling or visualization. R provides some tools to make this process easier; but doing this well also requires that you carefully consider your options. See Wickham (2014) for a discussion. Two very important functions in this context are reshape and merge. Sorting of data can also be helpful.

reshape

reshape allows you to rearrange data from a wide to a long form and vice versa. This can be a very important step to simplify data analysis. In the wide form, each variable is a column and each case (e.g. individual) is a singe row. This is the common ‘spreadsheet’ approach. In the long form there is column that indicates the variable name and a column that has its value. Other columns identify the cases, and these can be repeated many times. The long form can be much easier for use in data analysis than the wide form. In many cases you may want to go from one form to the other and back.

The function to go between wide and long form is called reshape. Unfortunately reshape is rather difficult to use. There is only one function to go from wide to long and vice versa, creating confusion about what arguments to use when. It is also poorly documented, and its error messages are bewildering. (There is a package called reshape that you can use instead. But in this Introduction I want to stick with base R idiom.)

There is a good example of how to use reshape here. And I will re-use that and expand on it.

wide to long

Get some example data on student’s performance in different subjects. Note how you can read a text on a web server (http address).

dd <- read.csv('http://www.ats.ucla.edu/stat/r/faq/hsb2.csv')
d <- dd[1:3, c(1:2, 8:9)]
d
##    id female write math
## 1  70      0    52   41
## 2 121      1    59   53
## 3  86      0    33   54

This is a “wide” form dataset. To go to a long form, you need to provide an argument varying that indicates the variables that are combined into one new variable. v.names is the name of this new variable.

wvars <- c("write", "math")
x <- reshape(d, varying=wvars, v.names="score", direction = "long")
x
##        id female time score
## 70.1   70      0    1    52
## 121.1 121      1    1    59
## 86.1   86      0    1    33
## 70.2   70      0    2    41
## 121.2 121      1    2    53
## 86.2   86      0    2    54

x has two new variables “time”, and “score”. As you can see, “score” has the values for “write” and “math” in the d. The “time” variable identifies which is which. “write” is identified with a 1, and “math” is identified with a 2. To have clear labels rather than such numbers, you can use the times argument; and add the timevar argument to rename “time” to something more meaningful. Note that the argument names are based on the idea that the data refer to different measurements over time. But this is not always the case.

x <- reshape(d, varying=wvars, v.names="score", times=wvars, timevar = "subject", direction = "long")
x
##            id female subject score
## 70.write   70      0   write    52
## 121.write 121      1   write    59
## 86.write   86      0   write    33
## 70.math    70      0    math    41
## 121.math  121      1    math    53
## 86.math    86      0    math    54

The row names also identify how the records were created. To get rid of the row names do

rownames(x) <- NULL
x
##    id female subject score
## 1  70      0   write    52
## 2 121      1   write    59
## 3  86      0   write    33
## 4  70      0    math    41
## 5 121      1    math    53
## 6  86      0    math    54

Variables “id” and “female” were unchanging, but they were duplicated because two variables (“write” and “math”) were combined into one (“subject”)

long to wide

To go from long to wide we need to use “idvar” and “timevar”. “idvar” identifies the variables that identify a single case (e.g. a single person, or other observational unit). In contrast, “timevar” indicates the variable that has the identifiers that become variables in the wide format. The remaining variable should have the values that match these new variables.

w <- reshape(x, idvar=c("id", "female"), timevar = "subject", direction = "wide")
w
##    id female score.write score.math
## 1  70      0          52         41
## 2 121      1          59         53
## 3  86      0          33         54

Note that w is identical to d, except for the last two column names that now have “score.” prepended to them. We can change that like this:

cn <- colnames(w)
cn <- gsub("score.", "", cn)
colnames(w) <- cn
w
##    id female write math
## 1  70      0    52   41
## 2 121      1    59   53
## 3  86      0    33   54

merge

A common situation is to have multiple data.frames with data for the same cases (e.g., individuals, fields, countries, ...). Such data.frames may need to be joined such that they can be analyzed. For example here we have a and b:

a <- dd[, 1:3]
# random sample of 100 records)
set.seed(1)
b <- dd[sample(nrow(dd), 100), c(1, 7:10)]

a has 200 records. It has a unique identifier for each student and information about their sex (female or not) and race (4 groups).

dim(a)
## [1] 200   3
head(a)
##    id female race
## 1  70      0    4
## 2 121      1    4
## 3  86      0    4
## 4 141      0    4
## 5 172      0    4
## 6 113      0    4
table(a$female)
##
##   0   1
##  91 109
table(a$race)
##
##   1   2   3   4
##  24  11  20 145

b has the same unique identifier as a (but only for 100 students) and it has the grades for four subjects.

dim(b)
## [1] 100   5
head(b)
##      id read write math science
## 54  183   63    59   49      55
## 75   56   55    45   46      58
## 114 125   68    65   58      59
## 179  64   50    52   45      58
## 40  153   39    31   40      39
## 176 135   63    60   65      54

Imagine we were interested in differences in reading by sex or race. We would need to combine a and b. That, fortunately, is very simple, because we have the merge function.

ab <- merge(a, b, by='id')
head(ab)
##   id female race read write math science
## 1  3      0    1   63    65   48      63
## 2  4      1    1   44    50   41      39
## 3  8      1    1   39    44   52      44
## 4 16      0    1   47    31   44      36
## 5 17      1    1   47    57   48      44
## 6 18      0    1   50    33   49      44

Always check the dimensions of the result

dim(ab)
## [1] 100   7

In this case we expected 100 records (the lower number of the two; a had 200 records, but b only 100). Sometimes you get fewer than expected, suggesting that the identifiers do not match. In other cases you might want to keep all records, and create missing values where these are not available. You can do that like this:

ab <- merge(a, b, by='id', all.x=TRUE)
dim(ab)
## [1] 200   7
head(ab)
##   id female race read write math science
## 1  1      1    1   NA    NA   NA      NA
## 2  2      1    1   NA    NA   NA      NA
## 3  3      0    1   63    65   48      63
## 4  4      1    1   44    50   41      39
## 5  5      0    1   NA    NA   NA      NA
## 6  6      1    1   NA    NA   NA      NA

Note that the “x” in all.x refers to the first argument, hence a in this case. In other cases you might need to say all.y=TRUE or all=TRUE. Consider these extreme cases (with no matching records):

merge(a[1:3,], b[1:3, ], by='id')
## [1] id      female  race    read    write   math    science
## <0 rows> (or 0-length row.names)
merge(a[1:3,], b[1:3, ], by='id', all.x=T)
##    id female race read write math science
## 1  70      0    4   NA    NA   NA      NA
## 2  86      0    4   NA    NA   NA      NA
## 3 121      1    4   NA    NA   NA      NA
merge(a[1:3,], b[1:3, ], by='id', all.y=T)
##    id female race read write math science
## 1  56     NA   NA   55    45   46      58
## 2 125     NA   NA   68    65   58      59
## 3 183     NA   NA   63    59   49      55
merge(a[1:3,], b[1:3, ], by='id', all=T)
##    id female race read write math science
## 1  56     NA   NA   55    45   46      58
## 2  70      0    4   NA    NA   NA      NA
## 3  86      0    4   NA    NA   NA      NA
## 4 121      1    4   NA    NA   NA      NA
## 5 125     NA   NA   68    65   58      59
## 6 183     NA   NA   63    59   49      55

Now that we have ab we can compute what we needed:

tapply(ab$read, ab$female, mean, na.rm=TRUE)
##        0        1
## 53.26087 51.16667
tapply(ab$read, ab$race, mean, na.rm=TRUE)
##        1        2        3        4
## 49.75000 52.60000 45.11111 53.33784

sort

It is often useful to sort data to make it easier to inspect it. R has a sort function but that is only for vectors. For matrices or data.frames you need to use the order function.

sort is straightforward:

x <- sample(10)
x
##  [1]  7  4  3 10  9  2  1  5  8  6
sort(x)
##  [1]  1  2  3  4  5  6  7  8  9 10

Now consider order:

i <- order(x)
i
##  [1]  7  6  3  2  8 10  1  9  5  4
x[i]
##  [1]  1  2  3  4  5  6  7  8  9 10

order returns a vector that allows you to sort. The first value of i is 7. This means that x[7] should be the lowest number in x. The next number is x[6] followed by x[3] and x[2] and so forth.

Consider data.frame x:

set.seed(0)
x <- a[sample(nrow(a), 10), ]
x
##      id female race
## 180  63      1    4
## 53   21      0    1
## 74  157      0    4
## 113 131      1    4
## 179  64      1    4
## 40  153      0    4
## 175  36      1    3
## 183  92      1    4
## 127 105      1    4
## 121  35      1    1

Here is how you can use order to sort it by one column (“id” in this case):

oid <- order(x$id)
y <- x[oid, ]
y
##      id female race
## 53   21      0    1
## 121  35      1    1
## 175  36      1    3
## 180  63      1    4
## 179  64      1    4
## 183  92      1    4
## 127 105      1    4
## 113 131      1    4
## 40  153      0    4
## 74  157      0    4

Or by multiple columns. In this case, we want to sort first by “race”, then by “female” and then by “id”:

oid <- order(x$race, x$female, x$id)
x[oid, ]
##      id female race
## 53   21      0    1
## 121  35      1    1
## 175  36      1    3
## 40  153      0    4
## 74  157      0    4
## 180  63      1    4
## 179  64      1    4
## 183  92      1    4
## 127 105      1    4
## 113 131      1    4