6. Read and write files

In most cases, the first step in data analysis is to read in a file with data. This can be pretty complicated due to the variations in file format. Here we discuss reading matrix (data.frame / spreadsheet) like data structures which is the most common case, and relatively painless.

If you have an Excel file use “File / Save As” and select “CSV (Comma delimited) (*.csv)”. Although it is possible to directly read Excel files we do not discuss that here.

To read a file into R you need to know its name. That is, you need to know the full path (directory) name and the name of the file itself. Wrong path names often create confusion. On windows it is easy to copy the path from the top bar in Windows Explorer. On a Mac such things are harder, and it may not be so so easy to get a path name. You can do “Select, Finder—View—Show Path Bar” or perhaps drag your file to a text editor (that may paste the file name!).

Below I try to assign a windows type full path and file name to a variable f so that we can use it.

f <- "C:\projects\research\data\obs.csv"
#Error: '\p' is an unrecognised escape in character string starting ""C:\p"

Yikes, an error! The problem is the use of backslashes. In R (and elsewhere) The backslash is the “escape” symbol, that is followed by another symbol to indicate a special character. For example, "\t" represents a “tab” and "\n" is the symbol for a new line (hard return). This is illustrated below.

txt <- "Here is an example:\nA new line has started!\nAnd another one...\n"
message(txt)
## Here is an example:
## A new line has started!
## And another one...

So for path delimiters we need to use either the forward-slash "/" or an escaped back-slash "\\". Both of the following are OK.

f1 <- "C:/projects/research/data/obs.csv"
f2 <- "C:\\projects\\research\\data\\obs.csv"

Now this file may not actually exist:

file.exists(f1)
## [1] FALSE

To make what you see here reproducible, without the need to download a file, we’ll first create a file. First some example data:

d <- data.frame(id=1:10, name=letters[1:10], value=seq(10,28,2))
d
##    id name value
## 1   1    a    10
## 2   2    b    12
## 3   3    c    14
## 4   4    d    16
## 5   5    e    18
## 6   6    f    20
## 7   7    g    22
## 8   8    h    24
## 9   9    i    26
## 10 10    j    28

And now we write that to disk. Here I show how to do this with two different (but related) functions: write.csv and write.table.

write.csv(d, 'test.csv', row.names=FALSE)
write.table(d, 'test.dat', row.names=FALSE)

write.csv is a derived from write.table. The main difference between the two that in write.csv the field seperator is a comma (csv stands for “comma separated values”), while in the write.table the default is a tab (the \t character).

Now we have two files on disk.

file.exists('test.csv')
## [1] TRUE
file.exists('test.dat')
## [1] TRUE

As we only specified a file name, but not a path, the files are in our working directory. We can use getwd(get working directory) to see where that is.

getwd()
## [1] "e:/bitbucket/rweb/source/intr/_R"

The working directory is the ‘default’ location where R will look for files and write files. To get the full path name for one of the files we created we can use file.path.

file.path(getwd(), 'test.csv')
## [1] "e:/bitbucket/rweb/source/intr/_R/test.csv"

As this is a ‘.csv’ (comma separated values) file we can read it with read.csv.

d <- read.csv('test.csv', stringsAsFactors=FALSE)
head(d)
##   id name value
## 1  1    a    10
## 2  2    b    12
## 3  3    c    14
## 4  4    d    16
## 5  5    e    18
## 6  6    f    20

Sometimes values are delimited by other characters. In that case you can try read.table. read.table has an argument sep (separator) to indicate the field delimiter. Here we do not use that as the default (whitespace, tab) works fine.

d <- read.table('test.dat', stringsAsFactors=FALSE)
head(d)
##   V1   V2    V3
## 1 id name value
## 2  1    a    10
## 3  2    b    12
## 4  3    c    14
## 5  4    d    16
## 6  5    e    18

Note that read.table did not automatically recognize the header row with the variable names. In stead, the variable names have become the first records and new dummy variable names V1, V2 and V3 were assigned. To fix that we use the header argument:

d <- read.table('test.dat', header=TRUE, stringsAsFactors=FALSE)
head(d)
##   id name value
## 1  1    a    10
## 2  2    b    12
## 3  3    c    14
## 4  4    d    16
## 5  5    e    18
## 6  6    f    20

read.table can also read csv files, you just need to tell it that the field delimiter is a comma.

d <- read.table('test.dat', sep=',', stringsAsFactors=FALSE)

In the examples above I used stringsAsFactors=FALSE. This is not required, but it is helpful. Otherwise all character variables are converted to factors, which in many cases is a nuisance. If read.table fails, there are other options. For example readLines.

d <- readLines('test.csv')
class(d)
## [1] "character"
d
##  [1] "\"id\",\"name\",\"value\"" "1,\"a\",10"
##  [3] "2,\"b\",12"                "3,\"c\",14"
##  [5] "4,\"d\",16"                "5,\"e\",18"
##  [7] "6,\"f\",20"                "7,\"g\",22"
##  [9] "8,\"h\",24"                "9,\"i\",26"
## [11] "10,\"j\",28"

But this makes it more difficult to process the data. You may need to do something like:

x <- sapply(d, function(i){ unlist(strsplit(i, ',')) }, USE.NAMES=FALSE)
t(x)
##       [,1]     [,2]       [,3]
##  [1,] "\"id\"" "\"name\"" "\"value\""
##  [2,] "1"      "\"a\""    "10"
##  [3,] "2"      "\"b\""    "12"
##  [4,] "3"      "\"c\""    "14"
##  [5,] "4"      "\"d\""    "16"
##  [6,] "5"      "\"e\""    "18"
##  [7,] "6"      "\"f\""    "20"
##  [8,] "7"      "\"g\""    "22"
##  [9,] "8"      "\"h\""    "24"
## [10,] "9"      "\"i\""    "26"
## [11,] "10"     "\"j\""    "28"

And then some. (sapply is explained later on).

But for some files readLines is the way to go. For example to read ‘html’ files if you are scraping a website. We can read this very page:

webpage <- readLines("http://rspatial.org/intr/rst/6-files.html")
#show some of the first lines:
webpage[150:155]
## [1] "  <div class=\"section\" id=\"read-and-write-files\">"
## [2] "<h1>6. Read and write files<a class=\"headerlink\" href=\"#read-and-write-files\" title=\"Permalink to this headline\">¶</a></h1>"
## [3] "<p>In most cases, the first step in data analysis is to read in a file with"
## [4] "data. This can be pretty complicated due to the variations in file"
## [5] "format. Here we discuss reading matrix (data.frame / spreadsheet) like"
## [6] "data structures which is the most common case, and relatively painless.</p>"

Another relevant function in this context is list.files. That shows you which files are present in your working directory (or another path of choice) and perhaps all other subdirectories (when using the argument recursive=TRUE). By doing something like ff <- list.files(pattern = 'csv$') you can get a vector of all files with a name that end on “csv” (that is what the $ sign is for).

Before writing files with, e.g. write.csv, write.table, or writeLines you may want to use dir.exists to assure that the path you are using exists, and perhaps use dir.create if it does not.

filenames need to be the full file name, including the path, unless you are reading from or writing to the working directory. You can set the working directory with setwd. For example: setwd("C:/projects/research/).