readr
packagexlsx
package (or openxlsx
alternatively)readxl
packageforeign
packageread.csv
and gdata
)rvest
)readr
packagexlsx
package getwd
(get working directory) function.getwd()
1. Use the setwd()
function and specify the path to the desired folder.
1. Use the setwd()
function and specify the path to the desired folder.
setwd("~/Desktop/Week2") #for Mac userssetwd("./Desktop/Week2") #for Windows users
1. Use the setwd()
function and specify the path to the desired folder.
setwd("~/Desktop/Week2") #for Mac userssetwd("./Desktop/Week2") #for Windows users
To check:
getwd()
1. Use the setwd()
function and specify the path to the desired folder.
setwd("~/Desktop/Week2") #for Mac userssetwd("./Desktop/Week2") #for Windows users
To check:
getwd()
Use the forward slash / in R while specifying the file path.
The Windows format of single backslash will not work.
2. Use Session or Tools depending on your system and Change Working Directory. menu (Session -> Set Working Directory on a Mac).
2. Use Session or Tools depending on your system and Change Working Directory. menu (Session -> Set Working Directory on a Mac).
3. From the Files pane, use the More -> Set As Working Directory menu.
Text files are a popular way to hold and exchange tabular data as almost any data application supports exporting data to the CSV (or other text file) formats.
Text file formats use delimiters to separate the different elements in a line, and each line of data is in its own line in the text file.
Text files are a popular way to hold and exchange tabular data as almost any data application supports exporting data to the CSV (or other text file) formats.
Text file formats use delimiters to separate the different elements in a line, and each line of data is in its own line in the text file.
Text files are a popular way to hold and exchange tabular data as almost any data application supports exporting data to the CSV (or other text file) formats.
Text file formats use delimiters to separate the different elements in a line, and each line of data is in its own line in the text file.
Base R functions are the built-in functions that are already available when you download R and RStudio.
readr
package functions: The readr
functions are around 10× faster. You need to install and load the readr
package using the following commands:
Text files are a popular way to hold and exchange tabular data as almost any data application supports exporting data to the CSV (or other text file) formats.
Text file formats use delimiters to separate the different elements in a line, and each line of data is in its own line in the text file.
Base R functions are the built-in functions that are already available when you download R and RStudio.
readr
package functions: The readr
functions are around 10× faster. You need to install and load the readr
package using the following commands:
install.packages("readr")library(readr)
For the demonstration, we will use the "iris.csv" data available in our Data Repository click on here and save iris.csv data in your working directory.
The following command will read iris.csv data and store it in the iris
object in R as a data frame:
# iris.csv file is located in the working directoryiris <- read.csv( "iris.csv" )
For the demonstration, we will use the "iris.csv" data available in our Data Repository click on here and save iris.csv data in your working directory.
The following command will read iris.csv data and store it in the iris
object in R as a data frame:
# iris.csv file is located in the working directoryiris <- read.csv( "iris.csv" )
#iris.csv file is located in the "~/Desktop/data/iris.csv" pathiris <- read.csv( file="~/Desktop/data/iris.csv" )
readr
package functionsreadr
functions are around 10× faster. This will make a remarkable difference in reading time if you have a very large data set.
read_csv()
function is equivalent to base R ’s read.csv()
function (note the distinction between these two function names!)
install.packages("readr")library(readr)
iris <- read_csv("iris.csv")
Import the city.csv using BaseR's read.csv
and readr's read_csv
.
Did you spot any differences between imported datasets? Hint: Have a look at the variable names in both datasets.
What are the differences?
Press P to reveal the answer!
city1 <- read.csv( file="../data/city.csv" )head(city1)
## X X2016.Rank X2015.Rank X2014.Rank X2012.Rank X2010.Rank City## 1 1 1 1 1 1 1 Vienna## 2 2 2 2 2 2 2 Zürich## 3 3 3 3 3 3 4 Auckland## 4 4 4 4 4 4 7 Munich## 5 5 5 5 5 5 4 Vancouver## 6 6 6 6 6 6 6 Düsseldorf## Country Index.2010## 1 Austria 108.6## 2 Switzerland 108.0## 3 New Zealand 107.4## 4 Germany 107.0## 5 Canada 107.4## 6 Germany 107.2
library(readr)city2 <- read_csv( file="../data/city.csv" )
## New names:## Rows: 51 Columns: 9## ── Column specification## ──────────────────────────────────────────────────────── Delimiter: "," chr## (2): City, Country dbl (7): ...1, 2016 Rank, 2015 Rank, 2014 Rank, 2012 Rank,## 2010 Rank, Index ...## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ## Specify the column types or set `show_col_types = FALSE` to quiet this message.## • `` -> `...1`
head(city2)
## # A tibble: 6 × 9## ...1 `2016 Rank` `2015 Rank` `2014 Rank` `2012 Rank` `2010 Rank` City ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 1 1 1 1 1 1 Vienna ## 2 2 2 2 2 2 2 Zürich ## 3 3 3 3 3 3 4 Auckland ## 4 4 4 4 4 4 7 Munich ## 5 5 5 5 5 5 4 Vancouver ## 6 6 6 6 6 6 6 Düsseldorf## # ℹ 2 more variables: Country <chr>, `Index 2010` <dbl>
When read_csv()
is used, parsed with column specification
is given in the screen. It is useful to see the specs of the variables that are read in R.
read_csv()
maintains the full variable name whereas, read.csv
eliminates any spaces in variable names and fills it with ‘.’
read.csv
puts an X
prefix when the variable names starts with numeric characters and when variable names are missing.
read_csv()
automatically reads character variables as characters where else BaseR alternative reads them as factors (refer to Module 3).
readr
package functionsRStudio has the built in "Import Dataset" dialog box on the upper-right Environment pane.
You can also use this dialog box to import a wide range of file types including csv, Excel, SPSS, SAS and Stata data files.
xlsx
Packagexlsx
. However, we have:xlsx
Packagexlsx
. However, we have:The xlsx
package has external dependencies (i.e., rJava).
Often installation and loading this package would be problematic.
Do not waste time to fix Rjava problem, instead use alternative packages with the same functionality
I recommend using readxl
or openxlsx
packages instead.
readxl
Packagereadxl
was developed by Hadley Wickham and the RStudio team who also developed the readr
package.
This package works with both .xls and .xlsx formats.
Unlike xlsx
package, the readxl
package has no external dependencies (like Java or Perl).
install.packages("readxl")library(readxl)
# read in xlsx worksheet using a sheet index or nameiris<- read_excel("../data/iris.xlsx", sheet = "iris")
The foreign
package provides functions that help you read data files from other statistical software such as SPSS, SAS, Stata, and others into R.
Here is an example of importing an SPSS data file called iris.sav:
install.packages("foreign")library(foreign)
# read in spss data file and store it as data frame iris_spss <- read.spss("../data/iris.sav", to.data.frame = TRUE)
Large-scale data sets are generally stored in database software.
Commonly, large organisations and companies keep their data in relational databases. Therefore, we may need to import and process large-scale data sets in R.
Large-scale data sets are generally stored in database software.
Commonly, large organisations and companies keep their data in relational databases. Therefore, we may need to import and process large-scale data sets in R.
One of the best approaches for working with data from a database is to export the data to a text file and then import the text file into R.
Large-scale data sets are generally stored in database software.
Commonly, large organisations and companies keep their data in relational databases. Therefore, we may need to import and process large-scale data sets in R.
One of the best approaches for working with data from a database is to export the data to a text file and then import the text file into R.
According to Adler (2010), importing data into R at a much faster rate from text files than you can from database connections, especially when dealing with very large data sets (1 GB or more).
Vast amount of information is now being stored online, both in structured and unstructured forms.
The most basic form of getting data from online is to import tabular (i.e. . txt , .csv) files that are being hosted online.
Importing tabular data is common for the government data available online like Credit Licensee Dataset
# the url for the online csv fileurl <- "https://data.gov.au/data/dataset/fa0b0d71-b8b8-4af8-bc59-0b000ce0d5e4/resource/35953a01-a9a8-4609-8566-c9fa7de465d3/download/credit_lic_202302.csv"
read.csv
function.# use read.csv to importCreditL_data <- read.csv(url)# display first six rows and two variables in the dataCreditL_data[1:6,1:2]
## REGISTER_NAME CRED_LIC_NUM## 1 Credit Licence 219612## 2 Credit Licence 222213## 3 Credit Licence 222640## 4 Credit Licence 222835## 5 Credit Licence 223246## 6 Credit Licence 224543
The simplest approach to scraping HTML table data directly into R is by using the rvest
package.
HTML tables are contained within <table>
tags; therefore, to extract the tables, we need to use the html_nodes()
function to select the <table>
nodes.
We will use the example from the help page for rvest
, which loads all tables from the U.S. Social Security webpage: https://www.ssa.gov/oact/babynames/numberUSbirths.html
First, we will install and load the rvest
package:
# first install and load the rvest packageinstall.packages("rvest")library(rvest)
read_html
to locate the URL of the HTML table. When we use read_html
, all table nodes that exist on the webpage will be captured. births <- read_html("https://www.ssa.gov/oact/babynames/numberUSbirths.html")
length
function we can see that the html_nodes
captures only one HTML table.length(html_nodes(births, "table"))
## [1] 1
Note that for this example we had only one table in the website, there can be cases where the website includes additional tables used to format other parts of the page (i.e. table of contents, table of figures, advertisements, etc.).
Best way to check this is right click on the website -> View Page Source, then find the table tags in the HTML document.
html_nodes
.# select the second element of the html_nodesbirths_data<- html_table(html_nodes(births, "table")[[1]])# view the header of the births_datahead(births_data)
## # A tibble: 6 × 4## `Year of birth` Male Female Total ## <int> <chr> <chr> <chr> ## 1 1880 118,399 97,605 216,004## 2 1881 108,276 98,855 207,131## 3 1882 122,031 115,694 237,725## 4 1883 112,475 120,060 232,535## 5 1884 122,738 137,585 260,323## 6 1885 115,945 141,947 257,892
Exporting data out of R is equally important as importing data into R.
First we look at the base R and readr
package functions to export data to text files.
# create a data frame and assign it to an object named dfdf <- data.frame (cost = c(10, 25, 40), color = c ("blue", "red", "green"), suv = c (TRUE, TRUE, FALSE), row.names = c ("car1", "car2", "car3"))
df
to a CSV file we will use write.csv()
. # write to a csv file in our working directorywrite.csv(df, file = "cars_csv")
readr
functionsreadr
package functions, write_csv
and write_delim
are twice as fast as base R functions and they are very similar in usage. # load the librarylibrary(readr)# write to a csv file in the working directorywrite_csv(df, file = "cars_csv2")
Sometimes we may need to save data or other R objects outside of the workspace or may want to store, share, or transfer between computers.
We can use the .rda or .RData file types when we want to save several, or all, objects and functions that exist in the global environment.
On the other hand, if we only want to save a single R object such as a data frame, function, or statistical model results, it is best to use the .rds file type.
To illustrate let's create two objects named x and y and save them to a .RData file using save()
function.
# generate random numbers from uniform and normal distribution and assign them to objects named x and y, respectively.x <- runif(10)y <- rnorm(10, 0, 1)# Save both objects in .RData format in the working directorysave(x, y, file = "xy.RData")
save.image()
function will save your all current workspace as .RData.# save all objects in the global environmentsave.image()
saveRDS()
# save a single object to filesaveRDS(x, "x.rds")# restore it under a different namex2 <- readRDS("x.rds")
Understand how to get data from tabular and spreadsheet files.
Understand how to get data from statistical software and databases.
Know how to scrape data files stored online.
Know how to export to tabular and spreadsheet files.
Know how to save R objects.
Know how to get further help: Use R data import/export manual https://cran.r-project.org/doc/manuals/R-data.html.
readr
packagexlsx
package (or openxlsx
alternatively)readxl
packageforeign
packageread.csv
and gdata
)rvest
)readr
packagexlsx
package Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |