# 6. Read and write files¶

In most cases, the first step in data analysis is to read in a file with data. This can be pretty complicated due to the variations in file format. Here we discuss reading matrix-like (data.frame/spreadsheet) data structures, which is the most common case and relatively painless.

Although it is possible to directly read Excel files, we do not discuss that here.

To read a file into R, you need to know its name. That is, you need to know the full path (directory) name and the name of the file itself. Wrong path names often create confusion. On Windows, it is easy to copy the path from the top bar in Windows Explorer. On a Mac such things are harder, and it may not be so so easy to get a path name. You can do “Select, Finder—View—Show Path Bar” or perhaps drag your file to a text editor (that may paste the file name!).

Below I try to assign a Windows style full path and file name to a variable f so that we can use it.

f <- "C:\projects\research\data\obs.csv"
#Error: '\p' is an unrecognised escape in character string starting ""C:\p"

Yikes, an error! The problem is the use of backslashes. In R (and elsewhere), the backslash is the “escape” symbol, which is followed by another symbol to indicate a special character. For example, "\t" represents a “tab” and "\n" is the symbol for a new line (hard return). This is illustrated below.

txt <- "Here is an example:\nA new line has started!\nAnd another one...\n"
message(txt)
## Here is an example:
## A new line has started!
## And another one...

So for path delimiters we need to use either the forward-slash "/" or an escaped back-slash "\\". Both of the following are OK.

f1 <- "C:/projects/research/data/obs.csv"
f2 <- "C:\\projects\\research\\data\\obs.csv"

This is just a name. A file with that name may, or may not actually exist.

file.exists(f1)
## [1] FALSE

To make what you see here reproducible, we’ll first create files from some example data:

d <- data.frame(id=1:10, name=letters[1:10], value=seq(10,28,2))
d
##    id name value
## 1   1    a    10
## 2   2    b    12
## 3   3    c    14
## 4   4    d    16
## 5   5    e    18
## 6   6    f    20
## 7   7    g    22
## 8   8    h    24
## 9   9    i    26
## 10 10    j    28

Now we write the values in data.frame d to disk. In this section, I show how to write to siple “text” files using two different (but related) functions: write.csv and write.table. It is also possile to read Excel files (with the read_excel function from the readxl package) and many other file types but that is not shown here.

write.csv(d, 'test.csv', row.names=FALSE)
write.table(d, 'test.dat', row.names=FALSE)

write.csv is a derived from write.table. The main difference between the two is that in write.csv the field separator is a comma (“csv” stands for “comma separated values”), while in write.table the default is a tab (the \t character).

Now we have two files on disk.

file.exists('test.csv')
## [1] TRUE
file.exists('test.dat')
## [1] TRUE

As we only specified a file name, but not a path, the files are in our working directory. We can use getwd(get working directory) to see where that is.

getwd()
## [1] "d:/bitbucket/rspatial-web/source/intr/_R"

The working directory is the ‘default’ location where R will look for files and write files. To get the full path name for one of the files we created, we can use file.path.

file.path(getwd(), 'test.csv')
## [1] "d:/bitbucket/rspatial-web/source/intr/_R/test.csv"

As this is a ‘.csv’ (comma separated values) file we can read it with read.csv.

##   id name value
## 1  1    a    10
## 2  2    b    12
## 3  3    c    14
## 4  4    d    16
## 5  5    e    18
## 6  6    f    20

Sometimes values are delimited by other characters. In that case you can try read.table. read.table has an argument sep (separator) to indicate the field delimiter. Here we do not use that, because the default (whitespace, tab) works fine.

##   V1   V2    V3
## 1 id name value
## 2  1    a    10
## 3  2    b    12
## 4  3    c    14
## 5  4    d    16
## 6  5    e    18

Note that read.table did not automatically recognize the header row with the variable names. Instead, the variable names have become the first records, and new dummy variable names V1, V2 and V3 were assigned. To fix that, we use the header argument:

##   id name value
## 1  1    a    10
## 2  2    b    12
## 3  3    c    14
## 4  4    d    16
## 5  5    e    18
## 6  6    f    20

read.table can also read csv files; you just need to tell it that the field delimiter is a comma.

In the examples above, I used stringsAsFactors=FALSE. This is not required, but it is helpful. Otherwise, all character variables are converted to factors, which in many cases is a nuisance. If read.table fails, there are other options, such as readLines.

class(d)
## [1] "character"
d
##  [1] "\"id\",\"name\",\"value\"" "1,\"a\",10"
##  [3] "2,\"b\",12"                "3,\"c\",14"
##  [5] "4,\"d\",16"                "5,\"e\",18"
##  [7] "6,\"f\",20"                "7,\"g\",22"
##  [9] "8,\"h\",24"                "9,\"i\",26"
## [11] "10,\"j\",28"

But this makes it more difficult to process the data. You may need to do something like:

x <- sapply(d, function(i){ unlist(strsplit(i, ',')) }, USE.NAMES=FALSE)
t(x)
##       [,1]     [,2]       [,3]
##  [1,] "\"id\"" "\"name\"" "\"value\""
##  [2,] "1"      "\"a\""    "10"
##  [3,] "2"      "\"b\""    "12"
##  [4,] "3"      "\"c\""    "14"
##  [5,] "4"      "\"d\""    "16"
##  [6,] "5"      "\"e\""    "18"
##  [7,] "6"      "\"f\""    "20"
##  [8,] "7"      "\"g\""    "22"
##  [9,] "8"      "\"h\""    "24"
## [10,] "9"      "\"i\""    "26"
## [11,] "10"     "\"j\""    "28"

And then some. (sapply is explained later on).

For some files, readLines is the way to go. For example, you may want to read ‘html’ files if you are scraping a website. We can read this very page:

#show some of the first lines:
webpage[150:155]
## [1] "</div>"
## [2] "          <div role=\"main\" class=\"document\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Article\">"
## [3] "           <div itemprop=\"articleBody\">"
## [4] "            "
## [5] "  <div class=\"section\" id=\"read-and-write-files\">"
Another relevant function in this context is list.files. That shows you which files are present in your working directory (or another path of choice) and perhaps all other subdirectories (when using the argument recursive=TRUE). By doing something like ff <- list.files(pattern = 'csv$') you can get a vector of all files with names that end with “csv” (that is what the$ sign is for).