All statistical work begins with data, and most data are stuck inside files and databases. Data are arriving from multiple sources at an alarming rate, and analysts and organizations are seeking ways to leverage these new sources of information. Consequently, analysts need to understand how to get data from these sources. Module 2 will cover the process of importing data, scraping data from the web, and exporting data. First, we will cover the basics of importing tabular and spreadsheet data (i.e., .txt, .xls, .csv files). Then, we will cover how to acquire data sets from other statistical software (i.e., Stata, SPSS, or SAS) and databases. As modern data analysis often includes scraping data files stored online, we will also cover the fundamentals of web scraping using R. Lastly, we will cover the equally important process of getting data out of R, in other words, exporting data.
The learning objectives of this module are as follows:
The first step in any data preprocessing task is to
“GET” the data. Data can come from many sources, but
two of the most common formats are text and
Excel files. In addition to text and Excel files, there are other ways
that data can be stored and exchanged. Commercial statistical software
such as SPSS, SAS, Stata, and Minitab often have the option to store
data in a specific format for that software. In addition, analysts
commonly use databases to store large quantities of data. R has good
support to work with these additional options. In this section, we will
cover how to import data into R by reading data from text files, Excel
spreadsheets, commercial statistical software data files and databases.
Moreover, we will cover how to load data from saved R object files for
holding or transferring data that has been processed in R. In addition
to the commonly used base R functions to perform data importing, we will
also cover functions from the popular readr, readxl, and foreign packages.
Text files are a popular way to hold and exchange tabular data as
almost any data application supports exporting data to the CSV (or other
text file) formats. Text file formats use delimiters to separate the
different elements of a record, and each record sits on its own line in
the text file. Therefore, importing different kinds of text files can
follow a consistent process once you have identified the delimiter.
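To make the role of the delimiter concrete, here is a minimal, self-contained sketch: it writes a tiny semicolon-separated file to a temporary location (the file name and values are made up for illustration) and reads it back. The same one-call workflow applies to any delimiter once it has been identified.

```r
# Write a toy semicolon-separated file to a temporary location
tmp <- file.path(tempdir(), "toy.csv")
writeLines(c("id;score", "1;4.5", "2;3.9"), tmp)

# Once the delimiter is identified, reading is a single call
toy <- read.table(tmp, sep = ";", header = TRUE)
toy
```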
There are two main groups of functions that we can use to read in
text files:
Base R functions: The base R functions are
built in and available as soon as you install R. Therefore, in order to
use base R functions, you do not need to
install or load any packages before using them.
readr
package functions: Compared
to the equivalent base functions, readr
functions are
around 10× faster. In order to use readr
package functions,
you need to install and load the readr
package using the
following commands:
#install.packages("readr")
library(readr)
read.table()
is a multi-purpose function in base R for
importing data. The functions read.csv()
and
read.delim()
are special cases of read.table()
in which the defaults have been adjusted for efficiency. To illustrate
these functions, let’s work with a CSV (comma-separated values) file
called iris.csv which is in our data repository. Before running any
command note that we need to save this data set into our working
directory, or we need to explicitly define the location of this data
set.
In the first example, let’s assume that we have already downloaded
iris.csv data and saved it in our working directory. Then, the following
command will read iris.csv data and store it in the iris1
object in R as a data frame:
# The following command assumes that the iris.csv file is in the working directory
iris1 <- read.csv("iris.csv")
Note that the iris1 object has appeared in your Environment pane
(probably located on the top-right of your RStudio window). If you click
the arrow next to it, it will expand to show you the variables it
contains, and clicking on it will open it to view, the same as using the
View()
function:
View(iris1)
Now you can also observe the initial few rows of the
iris1
object using head()
function as
follows:
head(iris1)
## X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 1 5.1 3.5 1.4 0.2 setosa
## 2 2 4.9 3.0 1.4 0.2 setosa
## 3 3 4.7 3.2 1.3 0.2 setosa
## 4 4 4.6 3.1 1.5 0.2 setosa
## 5 5 5.0 3.6 1.4 0.2 setosa
## 6 6 5.4 3.9 1.7 0.4 setosa
If you wish to observe more rows, you can use the n
argument within the head()
function:
head(iris1, n = 15)
## X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 1 5.1 3.5 1.4 0.2 setosa
## 2 2 4.9 3.0 1.4 0.2 setosa
## 3 3 4.7 3.2 1.3 0.2 setosa
## 4 4 4.6 3.1 1.5 0.2 setosa
## 5 5 5.0 3.6 1.4 0.2 setosa
## 6 6 5.4 3.9 1.7 0.4 setosa
## 7 7 4.6 3.4 1.4 0.3 setosa
## 8 8 5.0 3.4 1.5 0.2 setosa
## 9 9 4.4 2.9 1.4 0.2 setosa
## 10 10 4.9 3.1 1.5 0.1 setosa
## 11 11 5.4 3.7 1.5 0.2 setosa
## 12 12 4.8 3.4 1.6 0.2 setosa
## 13 13 4.8 3.0 1.4 0.1 setosa
## 14 14 4.3 3.0 1.1 0.1 setosa
## 15 15 5.8 4.0 1.2 0.2 setosa
You can also observe the final few rows by using the
tail()
function:
tail(iris1)
## X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 145 6.7 3.3 5.7 2.5 virginica
## 146 146 6.7 3.0 5.2 2.3 virginica
## 147 147 6.3 2.5 5.0 1.9 virginica
## 148 148 6.5 3.0 5.2 2.0 virginica
## 149 149 6.2 3.4 5.4 2.3 virginica
## 150 150 5.9 3.0 5.1 1.8 virginica
Note that when we used the View()
function, it opened a
new tab in RStudio, whereas the head()
and
tail()
functions printed the output beneath the syntax.
Using View()
will not print the output in a report and will
therefore not be available for readers of your report. Bear this in mind
when providing outputs in your assignments.
In the second example, let’s assume that the iris.csv data is located in another file path, for example on the desktop under a data folder: “~/Desktop/data/iris.csv” (on Windows, something like “C:/Users/yourname/Desktop/data/iris.csv”). Now we need to provide a direct path to our .csv file depending on where it is located:
# The following command assumes that the iris.csv file is in the "~/Desktop/data/iris.csv" path
iris2 <- read.csv(file="~/Desktop/data/iris.csv")
Another suggested option is to set the working directory where the
data is located. To illustrate, assume that the iris.csv is located on
your desktop under the data folder and you want to set this directory as the
working directory. The setwd()
function will set the
working directory to the folder “data”:
# Set the working directory to "~/Desktop/data"
setwd("~/Desktop/data")
Remember that you must use forward slashes / or double backslashes \\ when specifying file paths in R. The Windows-style single backslash will not work.
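For instance (the path below is hypothetical), the two accepted spellings describe the same location; replacing each backslash with a forward slash shows they match:

```r
# Two spellings of the same (hypothetical) Windows path
p1 <- "C:/Users/yourname/Desktop/data"      # forward slashes
p2 <- "C:\\Users\\yourname\\Desktop\\data"  # doubled backslashes

# Normalise the backslash spelling to forward slashes and compare
identical(gsub("\\\\", "/", p2), p1)
```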
After that you can read the iris.csv data using:
iris3 <- read.csv("iris.csv")
Let’s check the header of the iris3
object:
head(iris3)
## X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 1 5.1 3.5 1.4 0.2 setosa
## 2 2 4.9 3.0 1.4 0.2 setosa
## 3 3 4.7 3.2 1.3 0.2 setosa
## 4 4 4.6 3.1 1.5 0.2 setosa
## 5 5 5.0 3.6 1.4 0.2 setosa
## 6 6 5.4 3.9 1.7 0.4 setosa
You can also compactly display the structure of an R object using the
str()
function.
str(iris3)
## 'data.frame': 150 obs. of 6 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr "setosa" "setosa" "setosa" "setosa" ...
Note that when we assess the structure of the iris data set that we
read in, Species
is a character variable.
However, we may want to read in Species
as a factor
variable rather than a character. We can take care of this by changing
the stringsAsFactors
argument. The default is
stringsAsFactors = FALSE
; however, setting it to TRUE
will read in the variable as a factor.
iris4 <- read.csv("iris.csv", stringsAsFactors = TRUE)
str(iris4)
## 'data.frame': 150 obs. of 6 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Now, you can see that the variable Species
is a factor
variable with three levels.
As previously stated, read.csv
is just a wrapper for
read.table
but with adjusted default arguments. Therefore,
we can use read.table
to read in this same data. The two
arguments we need to be aware of are the field separator
(sep
) and the argument indicating whether the file contains
the names of the variables as its first line (header
). In
read.table
the defaults are sep = ""
and
header = FALSE
whereas in read.csv
the
defaults are sep = ","
and header = TRUE
.
Therefore, we can also use the read.table
function to
read the iris.csv data. The extra thing we need to specify is the
separator and the header arguments. As the data is comma separated and
the first line contains the names of the variables, we will use
sep = ","
and header = TRUE
options:
# provides same results as read.csv above
iris5 <- read.table("iris.csv", sep=",", header = TRUE)
Sometimes the file extension is .csv, but the
data is not comma separated; rather, a semicolon (;) or some other symbol
is used as the separator. In that case, we can still use the
read.csv()
function, but we have to specify
the separator.
Let’s look at an example with a semicolon-separated file named iris_semicolon.csv, which is located in our data repository. After downloading and saving this data file in our working directory, we can use:
iris6 <- read.csv("iris_semicolon.csv", sep=";")
Similarly, if the values are tab separated (.txt file), we can use
read.csv()
with sep= "\t"
. Alternatively, we
can use read.table()
. The following is an example:
iris7 <- read.csv("iris_tab.txt",sep="\t")
# provides same results as read.csv above
iris8 <- read.table("iris_tab.txt",header=TRUE)
Notice that here when we used read.table()
, we had to
specify that the variable names are present, using the argument
header=TRUE
.
readr
package functions

Compared to the equivalent base functions, readr
functions are around 10× faster. This will make a remarkable difference
in reading time if you have a very large data set. They bring
consistency to importing functions, and they produce tibbles (a modern
take on the data frame) which are easier to view for large data sets. The
default settings of the readr
functions remove the hassles of
stringsAsFactors
, and they are more flexible in column
specification.
The read_csv()
function is equivalent to base R’s
read.csv()
function (note the distinction between these two
function names!). One difference between read_csv()
and base R’s read.csv()
is that
read_csv()
maintains the full variable names, whereas
read.csv()
replaces any spaces in variable names with ‘.’
Prior to R 4.0.0
, read.csv()
set
stringsAsFactors = TRUE
by default, whereas
read_csv()
has never converted strings to factors. Since
R 4.0.0
, read.csv()
also defaults to
stringsAsFactors = FALSE
, so there is no longer
any difference in this regard. However, if you are working with an older
version of R, read.csv()
will automatically convert
character strings to factors when reading csv files.
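A small self-contained check of this default (the two-row file below is made up and written to a temporary folder): on R ≥ 4.0.0 the character column stays character unless you opt in.

```r
# Write a tiny one-column csv to a temporary location
tmp <- file.path(tempdir(), "species.csv")
writeLines(c("species", "setosa", "virginica"), tmp)

# Default behaviour (R >= 4.0.0): character
class(read.csv(tmp)$species)

# Opting in: factor
class(read.csv(tmp, stringsAsFactors = TRUE)$species)
```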
Let’s read the iris.csv
file using read_csv
function. Note that the readr
package needs to be installed
and loaded before using this function.
#install.packages("readr")
library(readr)
iris9 <- read_csv("iris.csv")
When we use read_csv
function, “Parsed with column
specification” information will be reported. Note that this is not a
warning, and nothing is wrong with your code. You can use this
information to check the variable types in your data set. If you want to
adjust the variable types, you can use additional arguments inside
this function. For more information on read_csv
function
and its arguments type help(read_csv)
.
There is also another function from readr
package called
read_tsv()
. This function reads data sets with tab
separated values.
iris9_ <- read_tsv("iris_tab.txt")
spec(iris9_)
The good news is that RStudio has a built-in “Import Dataset” dialog box on the upper-right “Environment” pane. You can also use this dialog box to import a wide range of file types including csv, Excel, SPSS, SAS and Stata data files. The process of importing a csv data set into RStudio is explained briefly in the R Bootcamp notes (taken from Dr. James Baglin’s).
More information on readr
package can be found here: https://cran.r-project.org/web/packages/readr/readr.pdf
Excel is the most commonly used spreadsheet software. Therefore, it’s
important to be able to efficiently import and export data from Excel.
Often, users prefer to export the Excel data file as a .csv file and
then import into R using read.csv
or read_csv
.
However, this is not an efficient way to import Excel files. In this
section, you will learn how to import data directly from Excel using
different packages, the xlsx
, openxlsx
and
readxl
packages.
xlsx
Package

If the dataset is stored in the .xls or .xlsx format, we have to use
certain R packages to import those files; one of these packages is
xlsx
, which provides the read.xlsx()
function. However, Java and macOS (version ≥ 10.12) do not play well
together due to some legacy issues. This has an adverse impact on the
rJava package, on which a lot of R packages, including the
xlsx
package, depend.
Unfortunately, for macOS users, installing the xlsx
package can
be problematic as it requires the rJava
package, and the
rJava
package installation may fail. In order to avoid this
issue, we can use alternative packages.
openxlsx
Package

The openxlsx
package, developed by Philipp Schauberger,
package, developed by Philipp Schauberger,
Alexander Walker and Luca Braglia, can be used to import data from an
Excel file into a data.frame
. This package does not rely on
Java (Java errors can occasionally be a problem for some users wishing
to use some other packages). It is capable of reading dates and times,
characters as characters (rather than factors), and can drop blank rows.
It also provides the option for formatting outputs, allowing R data
frames to be exported to attractive Excel workbooks for stakeholders
such as management and clients, who may not be comfortable using R.
#install.packages("openxlsx")
library(openxlsx)
# read in xlsx worksheet via openxlsx package using sheet name
iris10<- read.xlsx("iris.xlsx", sheet = "iris")
# read in xlsx worksheet starting from the third row
iris11<- read.xlsx("iris.xlsx", sheet = "iris", startRow = 3)
readxl
Package

The readxl
package was developed by Hadley Wickham and the RStudio
was developed by Hadley Wickham and the RStudio
team who also developed the readr
package. This package
works with both .xls and .xlsx formats. Unlike xlsx
package, the readxl
package has no external dependencies
(like Java or Perl), so you can use it to read Excel data on any
platform. Moreover, readxl
has the ability to load dates
and times, it automatically drops blank columns, reads in character
variables as characters, and returns outputs as tibbles, which
are more convenient for viewing large data sets.
To read in Excel data with readxl
you can use the
read_excel()
function. Here are some examples:
#install.packages("readxl")
library(readxl)
# read in xlsx worksheet using a sheet index or name
iris13<- read_excel("iris.xlsx", sheet = "iris")
# read in xlsx worksheet and change variable names by skipping the first row
# and using col_names to set the new names
iris14<- read_excel("iris.xlsx", sheet = "iris", skip = 1, col_names = paste("Var", 1:6))
More information on read_excel
function and
readxl
package can be found here: https://cran.r-project.org/web/packages/readxl/readxl.pdf.
The foreign
package provides functions that help you
read data files from other statistical software such as SPSS, SAS,
Stata, and others into R. To import an SPSS data file (.sav) into R, we
need to call the foreign
library and then use the
read.spss()
function. Similarly, if we want to import a
STATA data file, the corresponding function will be
read.dta()
. Here is an example of importing an SPSS data
file:
#install.packages("foreign")
library(foreign)
# read in spss data file and store it as data frame
iris_spss <- read.spss("iris.sav", to.data.frame = TRUE)
Note that we set the to.data.frame = TRUE
option in
order to have a data frame format, otherwise, the defaults
(to.data.frame = FALSE
) will read in the data as a
list.
More information on foreign
package can be found here:
https://cran.r-project.org/web/packages/foreign/foreign.pdf
Remember that you can also use the “Import Dataset”
dialog box on the upper-right “Environment” pane to
import SPSS, SAS and Stata data files instead of using the
foreign
package.
A data set can be stored in many formats, and large-scale data sets are generally stored in database software. Commonly, large organizations and companies keep their data in relational databases. Therefore, we may need to import and process large-scale data sets in R.
One of the best approaches for working with data from a database is
to export the data to a text file and then import the text file into R.
According to Adler (2010), you can import data
into R at a much faster rate from text files than from database
connections, especially when dealing with very large data sets (1 GB or
more). This is considered the best approach if you plan
to import a large amount of data once and then analyse it. However, if you
need to produce regular reports or to repeat an analysis many times,
then it might be better to import data into R directly through a
database connection.
There are several packages for connecting directly to a database
from R. The packages you need to install depend on the database(s)
you want to connect to and the connection method you want to use.
There are two sets of database interfaces available in R:
RODBC
: The RODBC
package allows R to
fetch data from Open DataBase Connectivity (ODBC) connections. ODBC
provides a standard interface for different programs to connect to
databases. Before using RODBC, you need to i) install the RODBC package
in R, ii) install the ODBC drivers for your platform, iii) configure an
ODBC connection to your database. Adler
(2010) provides a comprehensive list of where to find ODBC
drivers for different databases and operating systems.
DBI
: The DBI
package allows R to
connect to databases using native database drivers or JDBC drivers. This
package provides a common database abstraction for R software. You must
install additional packages to use the native drivers for each
database.
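As a taste of the DBI workflow, here is a minimal sketch using the RSQLite native driver with an in-memory database (the table name cars_tbl is made up for illustration; both the DBI and RSQLite packages must be installed):

```r
library(DBI)

# Connect to a throwaway in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write a built-in data frame to the database, then query it with SQL
dbWriteTable(con, "cars_tbl", mtcars)
result <- dbGetQuery(con, "SELECT * FROM cars_tbl WHERE mpg > 30")

dbDisconnect(con)
```

The same dbConnect()/dbGetQuery() pattern applies to MySQL, PostgreSQL and the other backends; only the driver object passed to dbConnect() changes.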
The process of creating a connection is involved and beyond the scope of
this course. Here is a list of additional resources for learning about
data importing from these specific databases:
MySQL: https://cran.r-project.org/web/packages/RMySQL/index.html
Oracle: https://cran.r-project.org/web/packages/ROracle/index.html
PostgreSQL: https://cran.r-project.org/web/packages/RPostgreSQL/index.html
SQLite: https://cran.r-project.org/web/packages/RSQLite/index.html
Open Database Connectivity databases: https://cran.rstudio.com/web/packages/RODBC
Also, the R data import/export manual https://cran.r-project.org/doc/manuals/R-data.html is a comprehensive source for configuring database connections and importing data from databases.
As a result of the rapid growth of the World Wide Web, a vast amount of
information is now being stored online, in both structured and
unstructured forms. Collecting data from the web is not an easy process
as there are many technologies used to distribute web content (e.g.,
HTML, XML, JSON). Therefore, dealing with more advanced web scraping
requires familiarity with accessing data stored via these technologies
from R.
In this section, I will provide an introduction to some of the fundamental tools required to perform basic web scraping. This includes importing spreadsheet data files stored online and scraping HTML table data. In order to advance your knowledge in web scraping, I highly recommend getting copies of “XML and Web Technologies for Data Sciences with R” (by Nolan and Temple Lang (2014)) and “Automated Data Collection with R” (by Munzert et al. (2014)).
The most basic form of getting data from online is to import tabular
(i.e. .txt , .csv) or Excel files that are being hosted online.
Importing tabular data is especially common for the many types of
government data available online.
To illustrate, we will use the “Credit Licensee Dataset”, which is
available online at https://data.gov.au/data/dataset/fa0b0d71-b8b8-4af8-bc59-0b000ce0d5e4/resource/35953a01-a9a8-4609-8566-c9fa7de465d3/download/credit_lic_202302.csv.
This .csv file lists Australian credit licensees and their licence
details.
We can use read.csv
or read.table
functions
to read online data depending upon the format of the data file. In fact,
reading an online .csv or .txt file is just like reading local tabular
data. The only difference is that we need to provide the URL of the data
instead of the file name, as follows:
# the url for the online csv file
url <- "https://data.gov.au/data/dataset/fa0b0d71-b8b8-4af8-bc59-0b000ce0d5e4/resource/35953a01-a9a8-4609-8566-c9fa7de465d3/download/credit_lic_202302.csv"
Next, as the online data is a .csv file, we can read this data file
using read.csv
function.
# use read.csv to import
CreditL_data <- read.csv(url)
# display first six rows and two variables in the data
CreditL_data[1:6,1:2]
## REGISTER_NAME CRED_LIC_NUM
## 1 Credit Licence 219612
## 2 Credit Licence 222213
## 3 Credit Licence 222640
## 4 Credit Licence 222835
## 5 Credit Licence 223246
## 6 Credit Licence 224543
Importing Excel spreadsheets hosted online can be performed just as
easily. Recall that there is no base R function for importing Excel
data; however, several packages exist to import .xls and .xlsx files.
One package that works smoothly with pulling Excel data from URLs is
openxlsx
. With openxlsx
we can use
read.xlsx()
to import Excel files hosted online.
To illustrate, we will use income data set from the Australian Bureau of Statistics (ABS) website available at https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/personal-income-australia/2014-15-2018-19/6524055002_DO001.xlsx.
First, we need to install and load the openxlsx
package,
then we can read the online Excel file using the read.xlsx()
function.
# first install and load the openxlsx package
#install.packages("openxlsx")
library(openxlsx)
# the url for the online Excel file
income.url <- "https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/personal-income-australia/2014-15-2018-19/6524055002_DO001.xlsx"
# use read.xlsx to import
income <- read.xlsx(income.url)
# display the first six rows in the data
head(income)
##
## 1 Personal Income in Australia Table 1. Total Income
## 2 Released at 11:30 am (Canberra time) 17/12/2021
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## Australian.Bureau.of.Statistics
## 1 <NA>
## 2 <NA>
## 3 Contents
## 4 Tables
## 5 1.1000000000000001
## 6 1.2
## X3
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 Total Income by Greater Capital City Statistical Area (2014-15 to 2018-19)
## 6 Total Income by Statistical Area Level 4 (2014-15 to 2018-19)
By default, read.xlsx()
will read the data from the
first sheet (first worksheet) and first line of the xlsx file. As you
have seen in the head(income)
output given above, the first
sheet includes the dataset information, not the actual data. Therefore,
it is important to check the original xlsx file (download and open it
using Excel) and note the location (i.e. Sheet name and the line number
where the data starts) of the data that you want to import.
For this example, the actual dataset is in the second sheet
(i.e. Table 1.1) and starts from the fifth line (the first four lines
are for the title of the data). We can easily specify the sheet name and
skip the first four lines by using sheet
and
startRow
arguments as follows:
# this time use sheet and startRow arguments to locate the data properly
# Use sheet = 2 and startRow= 5 as the data is in second sheet and starts from fifth line (skip first four lines)
income1 <- read.xlsx(income.url, sheet = 2)
income2 <- read.xlsx(income.url, sheet = 2, startRow = 5)
# display the first six rows in the data sets
head(income1)
## Australian.Bureau.of.Statistics
## 1 Personal Income in Australia Table 1. Total Income
## 2 Released at 11:30 am (Canberra time) 17/12/2021
## 3 Table 1.1
## 4 <NA>
## 5 GCCSA
## 6 Australia
## X2
## 1 <NA>
## 2 <NA>
## 3 Total Income by Greater Capital City Statistical Area (2014-15 to 2018-19)
## 4 <NA>
## 5 GCCSA NAME
## 6 <NA>
## X3 X4 X5 X6 X7
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 Earners (persons) <NA> <NA> <NA> <NA>
## 5 2014-15 2015-16 2016-17 2017-18 2018-19
## 6 13102895 13358252 13678024 14069082 14425037
## X8 X9 X10 X11 X12 X13
## 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 Median age of earners (years) <NA> <NA> <NA> <NA> Sum ($)
## 5 2014-15 2015-16 2016-17 2017-18 2018-19 2014-15
## 6 42 42 42 42 42 799748049060
## X14 X15 X16 X17 X18 X19
## 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> Median ($) <NA>
## 5 2015-16 2016-17 2017-18 2018-19 2014-15 2015-16
## 6 827875051087 856159475083 903888674402 951373381486 46854 47692
## X20 X21 X22 X23 X24 X25 X26 X27
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> Mean ($) <NA> <NA> <NA> <NA>
## 5 2016-17 2017-18 2018-19 2014-15 2015-16 2016-17 2017-18 2018-19
## 6 48360 49805 51389 61036 61975 62594 64246 65953
head(income2)
## X1 X2 Earners.(persons) X4 X5 X6
## 1 GCCSA GCCSA NAME 2014-15 2015-16 2016-17 2017-18
## 2 Australia <NA> 13102895 13358252 13678024 14069082
## 3 New South Wales <NA> 4091347 4191542 4344997 4466941
## 4 1GSYD Greater Sydney 2712735 2788215 2909201 2996160
## 5 1RNSW Rest of NSW 1376020 1388501 1435100 1470472
## 6 Victoria <NA> 3260682 3352403 3474938 3603810
## X7 Median.age.of.earners.(years) X9 X10 X11 X12
## 1 2018-19 2014-15 2015-16 2016-17 2017-18 2018-19
## 2 14425037 42 42 42 42 42
## 3 4569650 42 42 42 42 42
## 4 3072150 41 41 40 40 40
## 5 1497125 45 45 45 45 45
## 6 3714224 42 42 42 41 41
## Sum.($) X14 X15 X16 X17 Median.($)
## 1 2014-15 2015-16 2016-17 2017-18 2018-19 2014-15
## 2 799748049060 827875051087 856159475083 903888674402 951373381486 46854
## 3 256928457330 270324902216 283275046025 300176988652 314465961966 46879
## 4 183938528088 194328537211 204516965671 217598375558 228536177866 49571
## 5 72836696235 75127561974 78728246008 82559714374 85917025853 42445
## 6 192443671450 202318292276 213668870028 228631116478 242783096575 45930
## X19 X20 X21 X22 Mean.($) X24 X25 X26 X27
## 1 2015-16 2016-17 2017-18 2018-19 2014-15 2015-16 2016-17 2017-18 2018-19
## 2 47692 48360 49805 51389 61036 61975 62594 64246 65953
## 3 48085 48700 50153 51818 62798 64493 65196 67200 68816
## 4 50724 51191 52665 54406 67806 69696 70300 72626 74390
## 5 43557 44339 45798 47312 52933 54107 54859 56145 57388
## 6 46984 47709 49266 51027 59019 60350 61489 63442 65366
Sometimes, web pages contain several HTML tables and we may want to
read the data from one of those tables. The simplest approach to scraping
HTML table data directly into R is by using the rvest
package. Recall that HTML tables are contained within
<table>
tags; therefore, to extract the tables, we
need to use the html_nodes()
function to select the
<table>
nodes.
To illustrate, I will use the example from the help page for
rvest
, which loads all tables from the U.S. Social Security
webpage: https://www.ssa.gov/oact/babynames/numberUSbirths.html
First, we will install and load the rvest
package:
# first install and load the rvest package
#install.packages("rvest")
library(rvest)
We will use read_html
to read the webpage at the given
URL. Then, html_nodes()
will capture all of the table nodes that exist on
the webpage.
births <- read_html("https://www.ssa.gov/oact/babynames/numberUSbirths.html")
In this example, using the length()
function, we can see
that html_nodes()
captured one HTML table.
length(html_nodes(births, "table"))
## [1] 1
In this example the webpage includes only one table, and this first
table is where our data is located; thus, we
will select the first element of the html_nodes
output.
# select the first element of the html_nodes
births_data<- html_table(html_nodes(births, "table")[[1]])
# view the header of the births_data
head(births_data)
## # A tibble: 6 × 4
## `Year of birth` Male Female Total
## <int> <chr> <chr> <chr>
## 1 1880 118,399 97,605 216,004
## 2 1881 108,276 98,855 207,131
## 3 1882 122,031 115,694 237,725
## 4 1883 112,475 120,060 232,535
## 5 1884 122,738 137,585 260,323
## 6 1885 115,945 141,947 257,892
However, in some cases the webpage can include a few additional tables used to format other parts of the page (e.g. a table of contents, table of figures, advertisements). If this is the case, you need to scan the HTML source of the webpage and select the table(s) in which the data are located.
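A self-contained sketch of that situation: the HTML fragment below (made up for illustration) contains a decorative table followed by the data table, so we index the html_nodes() result to pick the second one.

```r
library(rvest)

# A made-up page with two tables: a decorative one and the data table
page <- read_html(
  "<html><body>
     <table><tr><td>advertisement</td></tr></table>
     <table><tr><th>x</th><th>y</th></tr>
            <tr><td>1</td><td>2</td></tr></table>
   </body></html>")

# Select the second <table> node and parse it into a data frame
my_data <- html_table(html_nodes(page, "table")[[2]])
```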
Exporting data out of R is equally important as importing data into
R. In this section, we will cover how to export data to text files,
Excel files and save to R data objects. In addition to the commonly used
base R functions to export data, we will also cover functions from the
popular readr
and xlsx
packages.
Similar to the previous examples provided in the importing text files
section, in this section I will introduce the base R and
readr
package functions to export data to text files.
write.table()
is a multi-purpose function in base R
for exporting data. The function write.csv()
is a special
case of write.table()
in which the defaults have been
adjusted for efficiency. To illustrate, let’s create a data frame and
export it to a CSV file in our working directory.
# create a data frame and assign it to an object named df
df <- data.frame(cost = c(10, 25, 40),
                 color = c("blue", "red", "green"),
                 suv = c(TRUE, TRUE, FALSE),
                 row.names = c("car1", "car2", "car3"))
df
## cost color suv
## car1 10 blue TRUE
## car2 25 red TRUE
## car3 40 green FALSE
To export df
to a CSV file we will use
write.csv()
.
# write to a csv file in our working directory
write.csv(df, file = "cars_csv")
This function has additional arguments which will allow you to exclude row/column names, specify what to use for missing values, add or remove quotations around character strings, etc.
# write to a csv file without row names
write.csv(df, file = "cars_csv", row.names = FALSE)
In addition to CSV files, we can also write to other text formats
using write.table()
by specifying the sep
argument.
# write to a tab delimited text file
write.table(df, file = "cars_txt", sep="\t")
readr
Package

The readr
package functions write_csv
and
write_delim
are twice as fast as the base R functions and they
are very similar in usage.
Let’s use the same example to illustrate the usage of
write_csv
and write_delim
:
# load the library
library(readr)
# write to a csv file in the working directory
write_csv(df, file = "cars_csv2")
# write to a csv file without column names
write_csv(df, file = "export_csv2", col_names = FALSE)
# write to a txt file in the working directory
write_delim(df, file = "export_txt2")
The xlsx
and openxlsx
packages can be used
for exporting data to Excel. However, the readxl
package, which I demonstrated in the importing data section, does not have a
function to export to Excel, so I will skip it here.
We will use the write.xlsx()
function in the
openxlsx
package to export the data to a xlsx file.
# load the library
library(openxlsx)
# write to a .xlsx file in the working directory
write.xlsx(df, file = "cars.xlsx")
For the full functionality of the openxlsx
package
please refer to the package
documentation.
Sometimes we may need to save data or other R objects outside of the workspace, or we may want to store, share, or transfer them between computers. Basically, we can use the .rda or .RData file types when we want to save several, or all, objects and functions that exist in the global environment. On the other hand, if we only want to save a single R object such as a data frame, function, or statistical model result, it is best to use the .rds file type. We can still use .rda or .RData to save a single object, but the benefit of .rds is that it saves only a representation of the object and not its name, whereas .rda and .RData save both the object and its name. As a result, with .rds the saved object can be loaded into a named object within R that is different from the name it had when originally saved.
To illustrate let’s create two objects named x and y and save them to
a .RData file using save()
function.
# generate random numbers from uniform and normal distribution and assign them to objects named x and y, respectively.
x <- runif(10)
y <- rnorm(10, 0, 1)
# Save both objects in .RData format in the working directory
save(x, y, file = "xy.RData")
Also, the save.image()
function will save your entire
current workspace as .RData.
# save all objects in the global environment
save.image()
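Objects saved with save() come back under their original names when you call load(). Here is a self-contained round trip (it uses a temporary file so nothing in your working directory is touched):

```r
# Create two objects and save them to a temporary .RData file
x <- runif(10)
y <- rnorm(10, 0, 1)
f <- file.path(tempdir(), "xy.RData")
save(x, y, file = f)

# Remove them, then restore: load() re-creates x and y by name
rm(x, y)
load(f)
exists("x") && exists("y")   # TRUE
```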
The following example illustrates how a single object can be
saved using saveRDS()
:
# save a single object to file
saveRDS(x, "x.rds")
# restore it under a different name
x2 <- readRDS("x.rds")
# check if x and x2 are identical
identical(x, x2)
## [1] TRUE
For further details on differences between rds and RData formats please read the article (thanks to Shane Smith for finding this) in http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata.
R data import/export manual https://cran.r-project.org/doc/manuals/R-data.html (R Team (2000)) is a comprehensive source for all types of data importing and exporting tasks in R.