Read and write files

In most cases, the first step in data analysis is to read values from a file. This can be pretty complicated due to the variations in file format. Here we discuss reading matrix-like (data.frame/spreadsheet) data structures, which is the most common case and relatively painless.

Paths

To read a file into R, you need to know its name. That is, you need to know the full path (directory) name and the name of the file itself. Wrong path names often create confusion. On Windows, it is easy to copy the path from the top bar in Windows Explorer. On a Mac you can select the file and type Command + Option + C to copy the path to the clipboard.

Below I try to assign a Windows style full path and file name to a variable f so that we can use it.

f <- "C:\projects\research\data\obs.csv"
## Error: '\p' is an unrecognized escape in character string starting ""C:\p"

Yikes, an error! The problem is the use of back-slashes (\). In R (and elsewhere), the backslash is the “escape” symbol, which is followed by another symbol to indicate a special character. For example, "\t" represents a “tab” and "\n" is the symbol for a new line (hard return). This is illustrated below.

txt <- "Here is an example:\nA new line has started!\nAnd\tone\twith\ttabs\t...\t!\n"
message(txt)
## Here is an example:
## A new line has started!
## And  one with    tabs    ... !

So for path delimiters we need to use either the forward-slash "/" or an escaped back-slash "\\". Both of the following are OK.

f1 <- "C:/projects/research/data/obs.csv"
f2 <- "C:\\projects\\research\\data\\obs.csv"

The values of f1 and f2 are just names. A file with that name may, or may not, actually exist. We can find out with the file.exists function.

file.exists(f1)
## [1] FALSE

Note that in path names . and .. have a special meaning. The single dot refers to the current directory (the “working directory”). The double dots refers to one directory level higher. So if you set your working directory in R, like this

setwd("d:/rspatial/intro")
getwd()

Then this filename "./data/input.csv" is equivalent to "d:/rspatial/intro/data/input.csv". This filename "../scratch/test.dat" refers to the same file as“d:/rspatial/scratch/test.dat”. And“../../root.txt”is the same file as“d:/root.txt”`.

Files

To make what you see here reproducible, we’ll first create files from some example data.

d <- data.frame(id=1:10, name=letters[1:10], value=seq(10,28,2))
d
##    id name value
## 1   1    a    10
## 2   2    b    12
## 3   3    c    14
## 4   4    d    16
## 5   5    e    18
## 6   6    f    20
## 7   7    g    22
## 8   8    h    24
## 9   9    i    26
## 10 10    j    28

Now we write the values in data.frame d to disk. In this section, I show how to write to simple “text” files using two different (but related) functions: write.csv and write.table. It is also possible to read Excel files (with the read_excel function from the readxl package) and many other file types but that is not shown here.

write.csv(d, "test.csv", row.names=FALSE)
write.table(d, "test.dat", row.names=FALSE)

write.csv is a derived from write.table. The main difference between the two is that in write.csv the field separator is a comma (“csv” stands for “comma separated values”), while in write.table the default is a tab (the \t character).

Now we have two files on disk.

file.exists("test.csv")
## [1] TRUE
file.exists("test.dat")
## [1] TRUE

As we only specified a file name, but not a path, the files are in our working directory. We can use getwd(get working directory) to see where that is.

getwd()
## [1] "C:/github/rspatial/web/base/source/intr/_R"

The working directory is the ‘default’ location where R will look for files and write files. To get the full path name for one of the files we created, we can use file.path.

file.path(getwd(), "test.csv")
## [1] "C:/github/rspatial/web/base/source/intr/_R/test.csv"

As this is a “.csv” (comma separated values) file we can read it with read.csv.

d <- read.csv("test.csv", stringsAsFactors=FALSE)
head(d)
##   id name value
## 1  1    a    10
## 2  2    b    12
## 3  3    c    14
## 4  4    d    16
## 5  5    e    18
## 6  6    f    20

Sometimes values are delimited by other characters. In that case you can try read.table. read.table has an argument sep (separator) to indicate the field delimiter. Here we do not use that, because the default (whitespace, tab) works fine.

d <- read.table("test.dat", stringsAsFactors=FALSE)
head(d)
##   V1   V2    V3
## 1 id name value
## 2  1    a    10
## 3  2    b    12
## 4  3    c    14
## 5  4    d    16
## 6  5    e    18

Note that read.table did not automatically recognize the header row with the variable names. Instead, the variable names have become the first records, and new dummy variable names V1, V2 and V3 were assigned. To fix that, we use the header argument:

d <- read.table("test.dat", header=TRUE, stringsAsFactors=FALSE)
head(d)
##   id name value
## 1  1    a    10
## 2  2    b    12
## 3  3    c    14
## 4  4    d    16
## 5  5    e    18
## 6  6    f    20

read.table can also read csv files; you just need to tell it that the field delimiter is a comma.

d <- read.table("test.dat", sep=",", stringsAsFactors=FALSE)

In the examples above, I used stringsAsFactors=FALSE. This is not required, but it is helpful. Otherwise, all character variables are converted to factors, which in many cases is a nuisance. If read.table fails, there are other options, such as readLines.

d <- readLines("test.csv")
class(d)
## [1] "character"
d
##  [1] "\"id\",\"name\",\"value\"" "1,\"a\",10"
##  [3] "2,\"b\",12"                "3,\"c\",14"
##  [5] "4,\"d\",16"                "5,\"e\",18"
##  [7] "6,\"f\",20"                "7,\"g\",22"
##  [9] "8,\"h\",24"                "9,\"i\",26"
## [11] "10,\"j\",28"

But this makes it more difficult to process the data. You may need to do something like:

x <- sapply(d, function(i){ unlist(strsplit(i, ",")) }, USE.NAMES=FALSE)
t(x)
##       [,1]     [,2]       [,3]
##  [1,] "\"id\"" "\"name\"" "\"value\""
##  [2,] "1"      "\"a\""    "10"
##  [3,] "2"      "\"b\""    "12"
##  [4,] "3"      "\"c\""    "14"
##  [5,] "4"      "\"d\""    "16"
##  [6,] "5"      "\"e\""    "18"
##  [7,] "6"      "\"f\""    "20"
##  [8,] "7"      "\"g\""    "22"
##  [9,] "8"      "\"h\""    "24"
## [10,] "9"      "\"i\""    "26"
## [11,] "10"     "\"j\""    "28"

And then some. (sapply is explained later on).

For some files, readLines is the way to go. For example, you may want to read ‘html’ files if you are scraping a website. We can read this very page:

webpage <- readLines("http://rspatial.org/intr/6-files.html", warn=FALSE)
#show some of the first lines:
webpage[150:155]
## [1] "here.</p>"
## [2] "<div class=\"highlight-r notranslate\"><div class=\"highlight\"><pre><span></span><span class=\"nf\">write.csv</span><span class=\"p\">(</span><span class=\"n\">d</span><span class=\"p\">,</span> <span class=\"s\">&#39;test.csv&#39;</span><span class=\"p\">,</span> <span class=\"n\">row.names</span><span class=\"o\">=</span><span class=\"kc\">FALSE</span><span class=\"p\">)</span>"
## [3] "<span class=\"nf\">write.table</span><span class=\"p\">(</span><span class=\"n\">d</span><span class=\"p\">,</span> <span class=\"s\">&#39;test.dat&#39;</span><span class=\"p\">,</span> <span class=\"n\">row.names</span><span class=\"o\">=</span><span class=\"kc\">FALSE</span><span class=\"p\">)</span>"
## [4] "</pre></div>"
## [5] "</div>"
## [6] "<p><code class=\"docutils literal notranslate\"><span class=\"pre\">write.csv</span></code> is a derived from <code class=\"docutils literal notranslate\"><span class=\"pre\">write.table</span></code>. The main difference"

Another relevant function in this context is list.files. That shows you which files are present in your working directory (or another path of choice) and perhaps all other subdirectories (when using the argument recursive=TRUE). By doing something like ff <- list.files(pattern = 'csv$') you can get a vector of all files with names that end with “csv” (that is what the $ sign is for).

Before writing files (e.g. with write.csv, write.table, or writeLines), you may want to use dir.exists to assure that the path you are using exists and perhaps use dir.create if it does not.

Filenames need to be the full file name, including the path, unless you are reading from or writing to the working directory. You can set the working directory with setwd. For example: setwd("C:/projects/research/).

Excel files

It it is possible to directly read Excel files with the readxl package. You may first need to install the package with

install.packages("readxl")

The package comes with example files

f <- readxl_example("datasets.xlsx")
## Error in readxl_example("datasets.xlsx"): could not find function "readxl_example"
basename(f)
## Error in basename(f): object 'f' not found

And we can read that file like this

x <- read_excel(f)
## Error in read_excel(f): could not find function "read_excel"
head(x)
##      [,1]        [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]
## [1,] "\"id\""    "1"     "2"     "3"     "4"     "5"     "6"     "7"
## [2,] "\"name\""  "\"a\"" "\"b\"" "\"c\"" "\"d\"" "\"e\"" "\"f\"" "\"g\""
## [3,] "\"value\"" "10"    "12"    "14"    "16"    "18"    "20"    "22"
##      [,9]    [,10]   [,11]
## [1,] "8"     "9"     "10"
## [2,] "\"h\"" "\"i\"" "\"j\""
## [3,] "24"    "26"    "28"

This returns a tibble which is similar to the data.frame. We can turn it into a data.frame with

x <- as.data.frame(x)
class(x)
## [1] "data.frame"
head(x)
##        V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11
## 1    "id"   1   2   3   4   5   6   7   8   9  10
## 2  "name" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
## 3 "value"  10  12  14  16  18  20  22  24  26  28

See ?read_excel for options such as selecting a specific “sheet”