Data exploration
================

After reading data from a file (see the previous chapter), the next
thing to do is to look at some summary statistics. This is in the first
place just to check your data. Very often you discover some strange
values that are not quite right. Careful inspection of your data after
you read it is very important. It can help avoid a lot of trouble later
on when you are trying to explain odd results!

Sometimes you can correct errors in *R* but in other cases you first
need to fix the data file. Fixing a data file can be the way to go if
you are dealing with your own primary data. However when you are working
with a file provided by someone else it is generally better to not
change the file, but to correct things via *R* code, if possible. That
leaves an exact trail of what you have done, and allows you to apply the
same corrections to a new version of the data.

Inspecting your data is also important to understand it better, and such
“exploratory data analysis” can be an important step before doing the
specific analyses of interest. There are many general and specialized
tools for this, and here we only cover some of the basics.

Summary and table
-----------------

Consider ``data.frame`` ``d``

.. code:: r

   d <- data.frame(id=1:10, 
           name=c('Bob', 'Bobby', '???', 'Bob', 'Bab', 'Jim', 'Jim', 'jim', '', 'Jim'), 
           score1=c(8, 10, 7, 9, 2, 5, 1, 6, 3, 4), 
           score2=c(3,4,5,-999,5,5,-999,2,3,4), stringsAsFactors=FALSE)
   d
   ##    id  name score1 score2
   ## 1   1   Bob      8      3
   ## 2   2 Bobby     10      4
   ## 3   3   ???      7      5
   ## 4   4   Bob      9   -999
   ## 5   5   Bab      2      5
   ## 6   6   Jim      5      5
   ## 7   7   Jim      1   -999
   ## 8   8   jim      6      2
   ## 9   9            3      3
   ## 10 10   Jim      4      4

``d`` is very small and we can simply look at the values in ``d`` to see
if they look all-right. But with real data you may have 100s or 1000s of
rows and many columns, making it very hard, if not impossible, to spot
errors by eye-balling.

The summary function is an easy way to start, at least for numeric
variables.

.. code:: r

   summary(d)
   ##        id            name               score1          score2       
   ##  Min.   : 1.00   Length:10          Min.   : 1.00   Min.   :-999.00  
   ##  1st Qu.: 3.25   Class :character   1st Qu.: 3.25   1st Qu.:   2.25  
   ##  Median : 5.50   Mode  :character   Median : 5.50   Median :   3.50  
   ##  Mean   : 5.50                      Mean   : 5.50   Mean   :-196.70  
   ##  3rd Qu.: 7.75                      3rd Qu.: 7.75   3rd Qu.:   4.75  
   ##  Max.   :10.00                      Max.   :10.00   Max.   :   5.00

The minimum value of variable ``score2`` is -999. That was probably used
in data entry to indicate a missing value. These should be changed to
``NA``.

.. code:: r

   # which values in score2 are -999?
   i <- d$score2 == -999
   # set these to NA
   d$score2[i] <- NA
   summary(d)
   ##        id            name               score1          score2     
   ##  Min.   : 1.00   Length:10          Min.   : 1.00   Min.   :2.000  
   ##  1st Qu.: 3.25   Class :character   1st Qu.: 3.25   1st Qu.:3.000  
   ##  Median : 5.50   Mode  :character   Median : 5.50   Median :4.000  
   ##  Mean   : 5.50                      Mean   : 5.50   Mean   :3.875  
   ##  3rd Qu.: 7.75                      3rd Qu.: 7.75   3rd Qu.:5.000  
   ##  Max.   :10.00                      Max.   :10.00   Max.   :5.000  
   ##                                                     NA's   :2

The two steps used above: ``i <- d$score2 == -999`` and
``d$score2[i] <- NA`` are usually done in a single line:
``d$score2[d$score2 == -999] <- NA``.

For character (and integer) variables it can be useful to use ``unique``
and ``table``:

.. code:: r

   unique(d$name)
   ## [1] "Bob"   "Bobby" "???"   "Bab"   "Jim"   "jim"   ""
   table(d$name)
   ## 
   ##         ???   Bab   Bob Bobby   jim   Jim 
   ##     1     1     1     2     1     1     3

Often you will discover slight variations in spelling that need to be
corrected. In this case, let’s assume that “Bobby” and “Bab” should both
be “Bob”. We replace “Bab” and “Bobby” with “Bob”.

.. code:: r

   d$name[d$name %in% c("Bab", "Bobby")] <- "Bob"
   table(d$name)
   ## 
   ##     ??? Bob jim Jim 
   ##   1   1   4   1   3

“jim” should be “Jim”. It is easy enough to replace as done above. But
what if there were many cases like that? It would be easy to make all
character values lower- or uppercase with ``d$name <- toupper(d$name)``
but I want to keep the normal name capitalization of the first letter
only. So let’s assure that all names start with an uppercase letter.

.. code:: r

   # get the first letters
   first <- substr(d$name, 1, 1)
   # get the remainder
   remainder <- substr(d$name, 2, nchar(d$name))
   # assure that the first letter is upper case
   first <- toupper(first)
   # combine
   name <- paste0(first, remainder)
   # assign back to the variable
   d$name <- name
   table(d$name)
   ## 
   ##     ??? Bob Jim 
   ##   1   1   4   4

The question marks in ``name`` should probably also be replaced with
``NA``.

.. code:: r

   d$name[d$name == "???"] <- NA
   table(d$name)
   ## 
   ##     Bob Jim 
   ##   1   4   4

You can force ``table`` to also count the ``NA`` values:

.. code:: r

   table(d$name, useNA="ifany")
   ## 
   ##       Bob  Jim <NA> 
   ##    1    4    4    1

Note that there is one “empty” value.

.. code:: r

   d$name[9]
   ## [1] ""

That should also be a missing value in this case.

.. code:: r

   d$name[d$name == ""] <- NA
   table(d$name, useNA="ifany")
   ## 
   ##  Bob  Jim <NA> 
   ##    4    4    2

You can also use ``table`` to make a contingency table of two variables.

.. code:: r

   table(d[ c("name", "score2")])
   ##      score2
   ## name  2 3 4 5
   ##   Bob 0 1 1 1
   ##   Jim 1 0 1 1

Quantile, range, and mean
-------------------------

Other useful functions include ``quantile``, ``range``, and ``mean``.

.. code:: r

   quantile(d$score1)
   ##    0%   25%   50%   75%  100% 
   ##  1.00  3.25  5.50  7.75 10.00
   range(d$score1)
   ## [1]  1 10
   mean(d$score1)
   ## [1] 5.5

Note that in some functions you may need to use ``na.rm=TRUE`` if there
are ``NA`` values. Otherwise, as soon as there is a single ``NA`` value
in a computation, the results becomes ``NA``. This is very common in *R*
— so keep that in mind if all your results are ``NA``.

.. code:: r

   try(quantile(d$score2))
   ## Error in quantile.default(d$score2) : 
   ##   missing values and NaN's not allowed if 'na.rm' is FALSE
   range(d$score2)
   ## [1] NA NA
   quantile(d$score2, na.rm=TRUE)
   ##   0%  25%  50%  75% 100% 
   ##    2    3    4    5    5
   range(d$score2, na.rm=TRUE)
   ## [1] 2 5

In this data exploration phase it is very useful to make plots. We’ll
discuss plotting in a later chapter, but here are four example plots.
Note how ``par(mfrow=c(2,2))`` sets up the canvas for two rows and
columns, that is for four plots.

.. code:: r

   par(mfrow=c(2,2))
   plot(d$score1, d$score2)
   boxplot(d[, c("score1", "score2")])
   plot(sort(d$score1))
   hist(d$score2)

|image1|

.. |image1| image:: figures/plot1-1.png