Read and write files¶
In most cases, the first step in data analysis is to read values from a file. This can be pretty complicated due to the variations in file format. Here we discuss reading matrix-like (data.frame/spreadsheet) data structures, which is the most common case and relatively painless.
Paths¶
To read a file into R, you need to know its name. That is, you need to know the full path (directory) name and the name of the file itself. Wrong path names often create confusion. On Windows, it is easy to copy the path from the top bar in Windows Explorer. On a Mac you can select the file and type Command + Option + C to copy the path to the clipboard.
Below I try to assign a Windows style full path and file name to a
variable f
so that we can use it.
f <- "C:\projects\research\data\obs.csv"
## Error: '\p' is an unrecognized escape in character string (<text>:1:10)
Yikes, an error! The problem is the use of back-slashes (\
). In R
(and elsewhere), the backslash is the “escape” symbol, which is followed
by another symbol to indicate a special character. For example, "\t"
represents a “tab” and "\n"
is the symbol for a new line (hard
return). This is illustrated below.
txt <- "Here is an example:\nA new line has started!\nAnd\tone\twith\ttabs\t...\t!\n"
message(txt)
## Here is an example:
## A new line has started!
## And one with tabs ... !
So for path delimiters we need to use either the forward-slash "/"
or an escaped back-slash "\\"
. Both of the following are OK.
f1 <- "C:/projects/research/data/obs.csv"
f2 <- "C:\\projects\\research\\data\\obs.csv"
The values of f1
and f2
are just names. A file with that name
may, or may not, actually exist. We can find out with the
file.exists
function.
file.exists(f1)
## [1] FALSE
Note that in path names .
and ..
have a special meaning. The
single dot refers to the current directory (the “working directory”).
The double dots refers to one directory level higher. So if you set your
working directory in R, like this
setwd("d:/rspatial/intro")
getwd()
Then this filename "./data/input.csv"
is equivalent to
"d:/rspatial/intro/data/input.csv"
. This filename
"../scratch/test.dat"
refers to the same file as
"d:/rspatial/scratch/test.dat"
. And "../../root.txt"
is the same
file as "d:/root.txt"
.
Files¶
To make what you see here reproducible, we’ll first create files from some example data.
d <- data.frame(id=1:10, name=letters[1:10], value=seq(10,28,2))
d
## id name value
## 1 1 a 10
## 2 2 b 12
## 3 3 c 14
## 4 4 d 16
## 5 5 e 18
## 6 6 f 20
## 7 7 g 22
## 8 8 h 24
## 9 9 i 26
## 10 10 j 28
Now we write the values in data.frame d
to disk. In this section, I
show how to write to simple “text” files using two different (but
related) functions: write.csv
and write.table
. It is also
possible to read Excel files (with the read_excel
function from the
readxl
package) and many other file types but that is not shown
here.
write.csv(d, "test.csv", row.names=FALSE)
write.table(d, "test.dat", row.names=FALSE)
write.csv
is a derived from write.table
. The main difference
between the two is that in write.csv
the field separator is a comma
(“csv” stands for “comma separated values”), while in write.table
the default is a tab (the \t
character).
Now we have two files on disk.
file.exists("test.csv")
## [1] TRUE
file.exists("test.dat")
## [1] TRUE
As we only specified a file name, but not a path, the files are in our
working directory. We can use getwd
(get working directory) to see
where that is.
getwd()
## [1] "C:/github/rspatial/web/terra/source/intr/_R"
The working directory is the ‘default’ location where R will look for
files and write files. To get the full path name for one of the files we
created, we can use file.path
.
file.path(getwd(), "test.csv")
## [1] "C:/github/rspatial/web/terra/source/intr/_R/test.csv"
As this is a “.csv” (comma separated values) file we can read it with
read.csv
.
d <- read.csv("test.csv", stringsAsFactors=FALSE)
head(d)
## id name value
## 1 1 a 10
## 2 2 b 12
## 3 3 c 14
## 4 4 d 16
## 5 5 e 18
## 6 6 f 20
Sometimes values are delimited by other characters. In that case you can
try read.table
. read.table
has an argument sep
(separator)
to indicate the field delimiter. Here we do not use that, because the
default (whitespace, tab) works fine.
d <- read.table("test.dat", stringsAsFactors=FALSE)
head(d)
## V1 V2 V3
## 1 id name value
## 2 1 a 10
## 3 2 b 12
## 4 3 c 14
## 5 4 d 16
## 6 5 e 18
Note that read.table
did not automatically recognize the header row
with the variable names. Instead, the variable names have become the
first records, and new dummy variable names V1
, V2
and V3
were assigned. To fix that, we use the header
argument:
d <- read.table("test.dat", header=TRUE, stringsAsFactors=FALSE)
head(d)
## id name value
## 1 1 a 10
## 2 2 b 12
## 3 3 c 14
## 4 4 d 16
## 5 5 e 18
## 6 6 f 20
read.table
can also read csv files; you just need to tell it that
the field delimiter is a comma.
d <- read.table("test.dat", sep=",", stringsAsFactors=FALSE)
In the examples above, I used stringsAsFactors=FALSE
. This is not
required, but it is helpful. Otherwise, all character variables are
converted to factors, which in many cases is a nuisance. If
read.table
fails, there are other options, such as readLines
.
d <- readLines("test.csv")
class(d)
## [1] "character"
d
## [1] "\"id\",\"name\",\"value\"" "1,\"a\",10"
## [3] "2,\"b\",12" "3,\"c\",14"
## [5] "4,\"d\",16" "5,\"e\",18"
## [7] "6,\"f\",20" "7,\"g\",22"
## [9] "8,\"h\",24" "9,\"i\",26"
## [11] "10,\"j\",28"
But this makes it more difficult to process the data. You may need to do something like:
x <- sapply(d, function(i){ unlist(strsplit(i, ",")) }, USE.NAMES=FALSE)
t(x)
## [,1] [,2] [,3]
## [1,] "\"id\"" "\"name\"" "\"value\""
## [2,] "1" "\"a\"" "10"
## [3,] "2" "\"b\"" "12"
## [4,] "3" "\"c\"" "14"
## [5,] "4" "\"d\"" "16"
## [6,] "5" "\"e\"" "18"
## [7,] "6" "\"f\"" "20"
## [8,] "7" "\"g\"" "22"
## [9,] "8" "\"h\"" "24"
## [10,] "9" "\"i\"" "26"
## [11,] "10" "\"j\"" "28"
And then some. (sapply
is explained later on).
For some files, readLines is the way to go. For example, you may want to read ‘html’ files if you are scraping a website. We can read this very page:
webpage <- readLines("https://rspatial.org/intr/6-files.html", warn=FALSE)
#show some lines:
webpage[293:295]
## [1] "<p>For some files, readLines is the way to go. For example, you may want to"
## [2] "read ‘html’ files if you are scraping a website. We can read this very"
## [3] "page:</p>"
Another relevant function in this context is list.files
. That shows
you which files are present in your working directory (or another path
of choice) and perhaps all other subdirectories (when using the argument
recursive=TRUE
). By doing something like
ff <- list.files(pattern = 'csv$')
you can get a vector of all files
with names that end with “csv” (that is what the $
sign is for).
Before writing files (e.g. with write.csv
, write.table
, or
writeLines
), you may want to use dir.exists
to assure that the
path you are using exists and perhaps use dir.create
if it does not.
Filenames need to be the full file name, including the path, unless
you are reading from or writing to the working directory. You can set
the working directory with setwd
. For example:
setwd("C:/projects/research/)
.
Excel files¶
It it is possible to directly read Excel files with the readxl
package. You may first need to install the package with
install.packages("readxl")
The package comes with example files
f <- readxl::readxl_example("datasets.xlsx")
basename(f)
## [1] "datasets.xlsx"
And we can read that file like this
x <- readxl::read_excel(f)
head(x)
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
This returns a tibble
which is similar to the data.frame
. We can
turn it into a data.frame
with
x <- as.data.frame(x)
class(x)
## [1] "data.frame"
head(x)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
See ?read_excel
for options such as selecting a specific “sheet”