+ - 0:00:00
Notes for current slide
Notes for next slide

Module 2 Demonstration

Get: Importing, Scraping and Exporting Data with R

1 / 26

Module 2: Outline

  • Reading Data from Text Files
    • Base R functions
    • readr package
  • Reading Data from Excel files
    • xlsx package (or openxlsx alternatively)
    • readxl package
  • Importing Data from statistical software
    • foreign package
  • Reading from Databases
    • export and then import
    • connect directly to a database from R
  • Scraping Data from Web
    • Importing tabular and Excel files stored online (read.csv and gdata)
    • Scraping HTML Table Data (using rvest)
  • Exporting Data to text files
    • Base R functions
    • readr package
  • Exporting data to Excel files
    • xlsx package
    • export to a csv file then save as xlsx
  • Saving Data as an R object File
2 / 26

Preliminaries: Setting The Working Directory

  • R is always pointed at a directory on your computer. You can find out which directory by running the getwd (get working directory) function.
getwd()
  • The location of working directory will be the place that R stores everything.
3 / 26

Preliminaries: Changing The Working Directory

  • There are a number of ways to change the current working directory:
4 / 26

Preliminaries: Changing The Working Directory

  • There are a number of ways to change the current working directory:

1. Use the setwd() function and specify the path to the desired folder.

4 / 26

Preliminaries: Changing The Working Directory

  • There are a number of ways to change the current working directory:

1. Use the setwd() function and specify the path to the desired folder.

  • Create a folder on desktop and name it "Week2" and set this as our working directory.
setwd("~/Desktop/Week2") #for Mac users
setwd("./Desktop/Week2") #for Windows users
4 / 26

Preliminaries: Changing The Working Directory

  • There are a number of ways to change the current working directory:

1. Use the setwd() function and specify the path to the desired folder.

  • Create a folder on desktop and name it "Week2" and set this as our working directory.
setwd("~/Desktop/Week2") #for Mac users
setwd("./Desktop/Week2") #for Windows users

To check:

getwd()
4 / 26

Preliminaries: Changing The Working Directory

  • There are a number of ways to change the current working directory:

1. Use the setwd() function and specify the path to the desired folder.

  • Create a folder on desktop and name it "Week2" and set this as our working directory.
setwd("~/Desktop/Week2") #for Mac users
setwd("./Desktop/Week2") #for Windows users

To check:

getwd()
  • Use the forward slash / in R while specifying the file path.

  • The Windows format of single backslash will not work.

4 / 26

Preliminaries: Changing The Working Directory Cont.

5 / 26

Preliminaries: Changing The Working Directory Cont.

2. Use Session or Tools depending on your system and Change Working Directory. menu (Session -> Set Working Directory on a Mac).

5 / 26

Preliminaries: Changing The Working Directory Cont.

2. Use Session or Tools depending on your system and Change Working Directory. menu (Session -> Set Working Directory on a Mac).

3. From the Files pane, use the More -> Set As Working Directory menu.

5 / 26

Reading Data from Text Files

  • Text files are a popular way to hold and exchange tabular data as almost any data application supports exporting data to the CSV (or other text file) formats.
6 / 26

Reading Data from Text Files

  • Text files are a popular way to hold and exchange tabular data as almost any data application supports exporting data to the CSV (or other text file) formats.

  • Text file formats use delimiters to separate the different elements in a line, and each line of data is in its own line in the text file.

6 / 26

Reading Data from Text Files

  • Text files are a popular way to hold and exchange tabular data as almost any data application supports exporting data to the CSV (or other text file) formats.

  • Text file formats use delimiters to separate the different elements in a line, and each line of data is in its own line in the text file.

  • Base R functions are the built-in functions that are already available when you download R and RStudio.
6 / 26

Reading Data from Text Files

  • Text files are a popular way to hold and exchange tabular data as almost any data application supports exporting data to the CSV (or other text file) formats.

  • Text file formats use delimiters to separate the different elements in a line, and each line of data is in its own line in the text file.

  • Base R functions are the built-in functions that are already available when you download R and RStudio.

  • readr package functions: The readr functions are around 10× faster. You need to install and load the readr package using the following commands:

6 / 26

Reading Data from Text Files

  • Text files are a popular way to hold and exchange tabular data as almost any data application supports exporting data to the CSV (or other text file) formats.

  • Text file formats use delimiters to separate the different elements in a line, and each line of data is in its own line in the text file.

  • Base R functions are the built-in functions that are already available when you download R and RStudio.

  • readr package functions: The readr functions are around 10× faster. You need to install and load the readr package using the following commands:

install.packages("readr")
library(readr)
6 / 26

Reading Data from Text Files: Base R functions

# iris.csv file is located in the working directory
iris <- read.csv( "iris.csv" )
7 / 26

Reading Data from Text Files: Base R functions

# iris.csv file is located in the working directory
iris <- read.csv( "iris.csv" )
  • If the data file is under a different folder you need to specify the path to the data file explicitly:
#iris.csv file is located in the "~/Desktop/data/iris.csv" path
iris <- read.csv( file="~/Desktop/data/iris.csv" )
7 / 26

Reading Data from Text Files: readr package functions

  • readr functions are around 10× faster. This will make a remarkable difference in reading time if you have a very large data set.

  • read_csv() function is equivalent to base R ’s read.csv() function (note the distinction between these two function names!)

install.packages("readr")
library(readr)
iris <- read_csv("iris.csv")
8 / 26

Your turn! Activity

  • Import the city.csv using BaseR's read.csv and readr's read_csv.

  • Did you spot any differences between imported datasets? Hint: Have a look at the variable names in both datasets.

  • What are the differences?

  • Press P to reveal the answer!

9 / 26
city1 <- read.csv( file="../data/city.csv" )
head(city1)
## X X2016.Rank X2015.Rank X2014.Rank X2012.Rank X2010.Rank City
## 1 1 1 1 1 1 1 Vienna
## 2 2 2 2 2 2 2 Zürich
## 3 3 3 3 3 3 4 Auckland
## 4 4 4 4 4 4 7 Munich
## 5 5 5 5 5 5 4 Vancouver
## 6 6 6 6 6 6 6 Düsseldorf
## Country Index.2010
## 1 Austria 108.6
## 2 Switzerland 108.0
## 3 New Zealand 107.4
## 4 Germany 107.0
## 5 Canada 107.4
## 6 Germany 107.2
library(readr)
city2 <- read_csv( file="../data/city.csv" )
## New names:
## Rows: 51 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): City, Country dbl (7): ...1, 2016 Rank, 2015 Rank, 2014 Rank, 2012 Rank,
## 2010 Rank, Index ...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
head(city2)
## # A tibble: 6 × 9
## ...1 `2016 Rank` `2015 Rank` `2014 Rank` `2012 Rank` `2010 Rank` City
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 1 1 1 1 1 Vienna
## 2 2 2 2 2 2 2 Zürich
## 3 3 3 3 3 3 4 Auckland
## 4 4 4 4 4 4 7 Munich
## 5 5 5 5 5 5 4 Vancouver
## 6 6 6 6 6 6 6 Düsseldorf
## # ℹ 2 more variables: Country <chr>, `Index 2010` <dbl>
  • When read_csv() is used, parsed with column specification is given in the screen. It is useful to see the specs of the variables that are read in R.

  • read_csv() maintains the full variable name whereas, read.csv eliminates any spaces in variable names and fills it with ‘.’

  • read.csv puts an X prefix when the variable names starts with numeric characters and when variable names are missing.

  • read_csv() automatically reads character variables as characters where else BaseR alternative reads them as factors (refer to Module 3).

Reading Data from Text Files: readr package functions

  • RStudio has the built in "Import Dataset" dialog box on the upper-right Environment pane.

  • You can also use this dialog box to import a wide range of file types including csv, Excel, SPSS, SAS and Stata data files.

10 / 26

Reading Data from Excel files: xlsx Package

  • Excel is the most commonly used spreadsheet software. If the dataset is stored in the .xls or .xlsx format, we have to use certain R packages to import those files; one of the packages is xlsx. However, we have:
11 / 26

Reading Data from Excel files: xlsx Package

  • Excel is the most commonly used spreadsheet software. If the dataset is stored in the .xls or .xlsx format, we have to use certain R packages to import those files; one of the packages is xlsx. However, we have:
  • The xlsx package has external dependencies (i.e., rJava).

  • Often installation and loading this package would be problematic.

  • Do not waste time to fix Rjava problem, instead use alternative packages with the same functionality

  • I recommend using readxl or openxlsx packages instead.

11 / 26

Reading Data from Excel files: readxl Package

  • readxl was developed by Hadley Wickham and the RStudio team who also developed the readr package.

  • This package works with both .xls and .xlsx formats.

  • Unlike xlsx package, the readxl package has no external dependencies (like Java or Perl).

install.packages("readxl")
library(readxl)
# read in xlsx worksheet using a sheet index or name
iris<- read_excel("../data/iris.xlsx", sheet = "iris")
12 / 26

Importing Data from statistical software

  • The foreign package provides functions that help you read data files from other statistical software such as SPSS, SAS, Stata, and others into R.

  • Here is an example of importing an SPSS data file called iris.sav:

install.packages("foreign")
library(foreign)
# read in spss data file and store it as data frame
iris_spss <- read.spss("../data/iris.sav", to.data.frame = TRUE)
13 / 26

Reading from Databases

  • Large-scale data sets are generally stored in database software.
14 / 26

Reading from Databases

  • Large-scale data sets are generally stored in database software.

  • Commonly, large organisations and companies keep their data in relational databases. Therefore, we may need to import and process large-scale data sets in R.

14 / 26

Reading from Databases

  • Large-scale data sets are generally stored in database software.

  • Commonly, large organisations and companies keep their data in relational databases. Therefore, we may need to import and process large-scale data sets in R.

  • One of the best approaches for working with data from a database is to export the data to a text file and then import the text file into R.

14 / 26

Reading from Databases

  • Large-scale data sets are generally stored in database software.

  • Commonly, large organisations and companies keep their data in relational databases. Therefore, we may need to import and process large-scale data sets in R.

  • One of the best approaches for working with data from a database is to export the data to a text file and then import the text file into R.

  • According to Adler (2010), importing data into R at a much faster rate from text files than you can from database connections, especially when dealing with very large data sets (1 GB or more).

14 / 26

Reading from Databases Cont.

  • This approach is considered to be the best approach if you plan to import a large amount of data once and then analyse.
  • However, if you need to produce regular reports or to repeat an analysis many times, then it might be better to import data into R directly through a database connection.
  • Database connection is an advanced topic, for more information refer to the Module 2 notes
15 / 26

Importing Tabular files Stored Online

  • Vast amount of information is now being stored online, both in structured and unstructured forms.

  • The most basic form of getting data from online is to import tabular (i.e. . txt , .csv) files that are being hosted online.

  • Importing tabular data is common for the government data available online like Credit Licensee Dataset

  • Reading online .csv or .txt file is just like reading tabular data. The only difference is, we need to provide the URL of the data instead of the file name as follows:
# the url for the online csv file
url <- "https://data.gov.au/data/dataset/fa0b0d71-b8b8-4af8-bc59-0b000ce0d5e4/resource/35953a01-a9a8-4609-8566-c9fa7de465d3/download/credit_lic_202302.csv"
16 / 26
  • Next, as the online data is a .csv file, we can read this data file using read.csv function.
# use read.csv to import
CreditL_data <- read.csv(url)
# display first six rows and two variables in the data
CreditL_data[1:6,1:2]
## REGISTER_NAME CRED_LIC_NUM
## 1 Credit Licence 219612
## 2 Credit Licence 222213
## 3 Credit Licence 222640
## 4 Credit Licence 222835
## 5 Credit Licence 223246
## 6 Credit Licence 224543
17 / 26

Scraping HTML Table Data

  • Web pages contain several HTML tables and we may want to read the data from that HTML table.
  • The simplest approach to scraping HTML table data directly into R is by using the rvest package.

  • HTML tables are contained within <table> tags; therefore, to extract the tables, we need to use the html_nodes() function to select the <table> nodes.

# first install and load the rvest package
install.packages("rvest")
library(rvest)
18 / 26
  • We will use read_html to locate the URL of the HTML table. When we use read_html, all table nodes that exist on the webpage will be captured.
births <- read_html("https://www.ssa.gov/oact/babynames/numberUSbirths.html")
  • In this example, using the length function we can see that the html_nodes captures only one HTML table.
length(html_nodes(births, "table"))
## [1] 1
  • Note that for this example we had only one table in the website, there can be cases where the website includes additional tables used to format other parts of the page (i.e. table of contents, table of figures, advertisements, etc.).

  • Best way to check this is right click on the website -> View Page Source, then find the table tags in the HTML document.

19 / 26
  • The first table on the webpage is the place where our data is located, thus, we will select the first element of the html_nodes.
# select the second element of the html_nodes
births_data<- html_table(html_nodes(births, "table")[[1]])
# view the header of the births_data
head(births_data)
## # A tibble: 6 × 4
## `Year of birth` Male Female Total
## <int> <chr> <chr> <chr>
## 1 1880 118,399 97,605 216,004
## 2 1881 108,276 98,855 207,131
## 3 1882 122,031 115,694 237,725
## 4 1883 112,475 120,060 232,535
## 5 1884 122,738 137,585 260,323
## 6 1885 115,945 141,947 257,892
20 / 26

Exporting Data to text files : Base R functions

  • Exporting data out of R is equally important as importing data into R.

  • First we look at the base R and readr package functions to export data to text files.

# create a data frame and assign it to an object named df
df <- data.frame (cost = c(10, 25, 40),
color = c ("blue", "red", "green"),
suv = c (TRUE, TRUE, FALSE),
row.names = c ("car1", "car2", "car3"))
  • To export df to a CSV file we will use write.csv().
# write to a csv file in our working directory
write.csv(df, file = "cars_csv")
21 / 26

Exporting Data to text files : readr functions

  • The readr package functions, write_csv and write_delim are twice as fast as base R functions and they are very similar in usage.
# load the library
library(readr)
# write to a csv file in the working directory
write_csv(df, file = "cars_csv2")
22 / 26

Saving Data as an R object File

  • Sometimes we may need to save data or other R objects outside of the workspace or may want to store, share, or transfer between computers.

  • We can use the .rda or .RData file types when we want to save several, or all, objects and functions that exist in the global environment.

  • On the other hand, if we only want to save a single R object such as a data frame, function, or statistical model results, it is best to use the .rds file type.

To illustrate let's create two objects named x and y and save them to a .RData file using save() function.

# generate random numbers from uniform and normal distribution and assign them to objects named x and y, respectively.
x <- runif(10)
y <- rnorm(10, 0, 1)
# Save both objects in .RData format in the working directory
save(x, y, file = "xy.RData")
23 / 26

Saving Data as an R object File Cont.

  • Also, the save.image() function will save your all current workspace as .RData.
# save all objects in the global environment
save.image()
  • The following example will illustrate how a single object will be saved using saveRDS()
# save a single object to file
saveRDS(x, "x.rds")
# restore it under a different name
x2 <- readRDS("x.rds")
  • For further details on differences between rds and RData formats please read this article.
24 / 26

What do you need to know by Week 2

  • Understand how to get data from tabular and spreadsheet files.

  • Understand how to get data from statistical software and databases.

  • Know how to scrape data files stored online.

  • Know how to export to tabular and spreadsheet files.

  • Know how to save R objects.

  • Know how to get further help: Use R data import/export manual https://cran.r-project.org/doc/manuals/R-data.html.

  • Practice!
25 / 26

Worksheet questions:

  • Complete the following worksheet:

Module 2 Worksheet

  • Once completed, feel free to work on your Assessments.


Return to Course Website

26 / 26

Module 2: Outline

  • Reading Data from Text Files
    • Base R functions
    • readr package
  • Reading Data from Excel files
    • xlsx package (or openxlsx alternatively)
    • readxl package
  • Importing Data from statistical software
    • foreign package
  • Reading from Databases
    • export and then import
    • connect directly to a database from R
  • Scraping Data from Web
    • Importing tabular and Excel files stored online (read.csv and gdata)
    • Scraping HTML Table Data (using rvest)
  • Exporting Data to text files
    • Base R functions
    • readr package
  • Exporting data to Excel files
    • xlsx package
    • export to a csv file then save as xlsx
  • Saving Data as an R object File
2 / 26
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow