class: center, middle, inverse, title-slide .title[ # Module 2 Demonstration ] .subtitle[ ## Get: Importing, Scraping and Exporting Data with R ] --- # Module 2: Outline .pull-left[ * Reading Data from Text Files - Base R functions - `readr` package * Reading Data from Excel files - `xlsx` package (or `openxlsx` alternatively) - `readxl` package * Importing Data from statistical software - `foreign` package * Reading from Databases - export and then import - connect directly to a database from R * Scraping Data from Web - Importing tabular and Excel files stored online (`read.csv` and `gdata`) - Scraping HTML Table Data (using `rvest`) ] .pull-right[ * Exporting Data to text files - Base R functions - `readr` package * Exporting data to Excel files + `xlsx` package + export to a csv file then save as xlsx * Saving Data as an R object File ] ??? --- # Preliminaries: Setting The Working Directory - R is always pointed at a directory on your computer. You can find out which directory by running the `getwd` (get working directory) function. ```r getwd() ``` <!-- - Remember that your working directory may be different from my path. --> - The location of working directory will be the place that R stores everything. --- # Preliminaries: Changing The Working Directory - There are a number of ways to change the current working directory: -- **1. Use the `setwd()` function and specify the path to the desired folder.** -- - Create a folder on desktop and name it "**Week2**" and set this as our working directory. ```r setwd("~/Desktop/Week2") #for Mac users setwd("./Desktop/Week2") #for Windows users ``` -- To check: ```r getwd() ``` -- - Use the forward slash */* in R while specifying the file path. - The Windows format of single backslash will not work. --- # Preliminaries: Changing The Working Directory Cont. -- **2. Use Session or Tools depending on your system and Change Working Directory.** **menu (Session -> Set Working Directory on a Mac).** <!-- This will also change directory location of the Files pane. --> -- **3. From the Files pane, use the More -> Set As Working Directory menu.** --- # Reading Data from Text Files - Text files are a popular way to hold and exchange tabular data as almost any data application supports exporting data to the **CSV** (or other text file) formats. -- - Text file formats use **delimiters** to separate the different elements in a line, and each line of data is in its own line in the text file. -- + **Base R functions** are the built-in functions that are already available when you download R and RStudio. -- + **`readr` package functions**: The `readr` functions are around 10× faster. You need to install and load the `readr` package using the following commands: -- ```r install.packages("readr") library(readr) ``` --- # Reading Data from Text Files: Base R functions - For the demonstration, we will use the "iris.csv" data available in our Data Repository [click on here and save iris.csv data in your working directory](../data/iris.csv). - The following command will read iris.csv data and store it in the `iris` object in R as a data frame: ```r # iris.csv file is located in the working directory iris <- read.csv( "iris.csv" ) ``` -- - If the data file is under a different folder you need to specify the path to the data file explicitly: ```r #iris.csv file is located in the "~/Desktop/data/iris.csv" path iris <- read.csv( file="~/Desktop/data/iris.csv" ) ``` --- # Reading Data from Text Files: `readr` package functions - `readr` functions are around 10× faster. This will make a remarkable difference in reading time if you have a very large data set. - `read_csv()` function is equivalent to base R ’s `read.csv()` function (note the distinction between these two function names!) ```r install.packages("readr") library(readr) ``` ```r iris <- read_csv("iris.csv") ``` --- # Your turn! Activity <!-- - Work in small groups --> - Import the [city.csv](../data/city.csv) using BaseR's `read.csv` and readr's `read_csv`. - Did you spot any differences between imported datasets? Hint: Have a look at the variable names in both datasets. - What are the differences? - Press P to reveal the answer! ??? ```r city1 <- read.csv( file="../data/city.csv" ) head(city1) ``` ``` ## X X2016.Rank X2015.Rank X2014.Rank X2012.Rank X2010.Rank City ## 1 1 1 1 1 1 1 Vienna ## 2 2 2 2 2 2 2 Zürich ## 3 3 3 3 3 3 4 Auckland ## 4 4 4 4 4 4 7 Munich ## 5 5 5 5 5 5 4 Vancouver ## 6 6 6 6 6 6 6 Düsseldorf ## Country Index.2010 ## 1 Austria 108.6 ## 2 Switzerland 108.0 ## 3 New Zealand 107.4 ## 4 Germany 107.0 ## 5 Canada 107.4 ## 6 Germany 107.2 ``` ```r library(readr) city2 <- read_csv( file="../data/city.csv" ) ``` ``` ## New names: ## Rows: 51 Columns: 9 ## ── Column specification ## ──────────────────────────────────────────────────────── Delimiter: "," chr ## (2): City, Country dbl (7): ...1, 2016 Rank, 2015 Rank, 2014 Rank, 2012 Rank, ## 2010 Rank, Index ... ## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ ## Specify the column types or set `show_col_types = FALSE` to quiet this message. ## • `` -> `...1` ``` ```r head(city2) ``` ``` ## # A tibble: 6 × 9 ## ...1 `2016 Rank` `2015 Rank` `2014 Rank` `2012 Rank` `2010 Rank` City ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 1 1 1 1 1 1 Vienna ## 2 2 2 2 2 2 2 Zürich ## 3 3 3 3 3 3 4 Auckland ## 4 4 4 4 4 4 7 Munich ## 5 5 5 5 5 5 4 Vancouver ## 6 6 6 6 6 6 6 Düsseldorf ## # ℹ 2 more variables: Country <chr>, `Index 2010` <dbl> ``` + When `read_csv()` is used, `parsed with column specification` is given in the screen. It is useful to see the specs of the variables that are read in R. + `read_csv()` maintains the full variable name whereas, `read.csv` eliminates any spaces in variable names and fills it with ‘.’ + `read.csv` puts an `X` prefix when the variable names starts with numeric characters and when variable names are missing. + `read_csv()` automatically reads character variables as characters where else BaseR alternative reads them as factors (refer to Module 3). --- # Reading Data from Text Files: `readr` package functions - RStudio has the built in "**Import Dataset**" dialog box on the upper-right **Environment** pane. - You can also use this dialog box to import a wide range of file types including csv, Excel, SPSS, SAS and Stata data files. <!-- - The following slides (taken from Dr. James Baglin's [R Bootcamp notes](https://astral-theory-157510.appspot.com/secured/RBootcamp_Course_02.html#importing_and_exporting_data)) will briefly explain the process of importing a csv data set into RStudio. --> <!-- --- --> <!-- <iframe width="800" height="600" align = center src="https://docs.google.com/presentation/d/1RgCNC-vwqBQGL74J0MNnIb1QYaG0vl42TepYISUsbYI/embed?slide=id.p&usp=embed_facebook" frameborder="0" allowfullscreen style="display:block;margin:5px auto;text-align:center;"></iframe> --> --- # Reading Data from Excel files: `xlsx` Package - Excel is the most commonly used spreadsheet software. If the dataset is stored in the .xls or .xlsx format, we have to use certain R packages to import those files; one of the packages is `xlsx`. However, we have: <!-- - We will use the iris.xlsx data available [here](../data/iris.xlsx) but first install and load the `xlsx` package: --> -- - The `xlsx` package has external dependencies (i.e., rJava). - Often installation and loading this package would be problematic. - Do not waste time to fix Rjava problem, instead use alternative packages with the same functionality - I recommend using `readxl` or `openxlsx` packages instead. --- # Reading Data from Excel files: `readxl` Package - `readxl` was developed by Hadley Wickham and the RStudio team who also developed the `readr` package. - This package works with both .xls and .xlsx formats. - Unlike `xlsx` package, the `readxl` package has no external dependencies (like Java or Perl). ```r install.packages("readxl") library(readxl) ``` ```r # read in xlsx worksheet using a sheet index or name iris<- read_excel("../data/iris.xlsx", sheet = "iris") ``` - Help on arguments: Use [package documentation](https://cran.r-project.org/web/packages/readxl/readxl.pdf). --- # Importing Data from statistical software - The `foreign` package provides functions that help you read data files from other statistical software such as SPSS, SAS, Stata, and others into R. - Here is an example of importing an SPSS data file called [iris.sav](../data/iris.sav): ```r install.packages("foreign") library(foreign) ``` ```r # read in spss data file and store it as data frame iris_spss <- read.spss("../data/iris.sav", to.data.frame = TRUE) ``` --- # Reading from Databases - Large-scale data sets are generally stored in database software. -- - Commonly, large organisations and companies keep their data in relational databases. Therefore, we may need to import and process large-scale data sets in R. -- - One of the best approaches for working with data from a database is to export the data to a text file and then import the text file into R. -- - According to Adler (2010), importing data into R at a much faster rate from text files than you can from database connections, especially when dealing with very large data sets (1 GB or more). <!-- - Also have a look at this post on Slack (previous semester): --> <!-- <center><img src="../images/fast_readr.png" width="500px" /></center> --> --- # Reading from Databases Cont. - This approach is considered to be the best approach if you plan to import a large amount of data once and then analyse. - However, if you need to produce regular reports or to repeat an analysis many times, then it might be better to import data into R directly through a database connection. - Database connection is an advanced topic, for more information refer to the [Module 2 notes](../Module_02.html) --- # Importing Tabular files Stored Online - Vast amount of information is now being stored online, both in structured and unstructured forms. - The most basic form of getting data from online is to import tabular (i.e. . txt , .csv) files that are being hosted online. - Importing tabular data is common for the government data available online like [Credit Licensee Dataset](https://data.gov.au/data/dataset/fa0b0d71-b8b8-4af8-bc59-0b000ce0d5e4/resource/35953a01-a9a8-4609-8566-c9fa7de465d3/download/credit_lic_202302.csv) - Reading online .csv or .txt file is just like reading tabular data. The only difference is, we need to provide the URL of the data instead of the file name as follows: ```r # the url for the online csv file url <- "https://data.gov.au/data/dataset/fa0b0d71-b8b8-4af8-bc59-0b000ce0d5e4/resource/35953a01-a9a8-4609-8566-c9fa7de465d3/download/credit_lic_202302.csv" ``` --- - Next, as the online data is a .csv file, we can read this data file using `read.csv` function. ```r # use read.csv to import CreditL_data <- read.csv(url) # display first six rows and two variables in the data CreditL_data[1:6,1:2] ``` ``` ## REGISTER_NAME CRED_LIC_NUM ## 1 Credit Licence 219612 ## 2 Credit Licence 222213 ## 3 Credit Licence 222640 ## 4 Credit Licence 222835 ## 5 Credit Licence 223246 ## 6 Credit Licence 224543 ``` --- # Scraping HTML Table Data - Web pages contain several HTML tables and we may want to read the data from that HTML table. - The simplest approach to scraping HTML table data directly into R is by using the `rvest` package. - HTML tables are contained within `<table>` tags; therefore, to extract the tables, we need to use the `html_nodes()` function to select the `<table>` nodes. - We will use the example from the help page for `rvest`, which loads all tables from the U.S. Social Security webpage: [https://www.ssa.gov/oact/babynames/numberUSbirths.html](https://www.ssa.gov/oact/babynames/numberUSbirths.html) - First, we will install and load the `rvest` package: ```r # first install and load the rvest package install.packages("rvest") library(rvest) ``` --- - We will use `read_html` to locate the URL of the HTML table. When we use `read_html`, all table nodes that exist on the webpage will be captured. ```r births <- read_html("https://www.ssa.gov/oact/babynames/numberUSbirths.html") ``` - In this example, using the `length` function we can see that the `html_nodes` captures only one HTML table. ```r length(html_nodes(births, "table")) ``` ``` ## [1] 1 ``` - Note that for this example we had only one table in the website, there can be cases where the website includes additional tables used to format other parts of the page (i.e. table of contents, table of figures, advertisements, etc.). - Best way to check this is right click on the website -> View Page Source, then find the table tags in the HTML document. --- - The first table on the webpage is the place where our data is located, thus, we will select the first element of the `html_nodes`. ```r # select the second element of the html_nodes births_data<- html_table(html_nodes(births, "table")[[1]]) # view the header of the births_data head(births_data) ``` ``` ## # A tibble: 6 × 4 ## `Year of birth` Male Female Total ## <int> <chr> <chr> <chr> ## 1 1880 118,399 97,605 216,004 ## 2 1881 108,276 98,855 207,131 ## 3 1882 122,031 115,694 237,725 ## 4 1883 112,475 120,060 232,535 ## 5 1884 122,738 137,585 260,323 ## 6 1885 115,945 141,947 257,892 ``` --- # Exporting Data to text files : Base R functions - Exporting data out of R is equally important as importing data into R. - First we look at the base R and `readr` package functions to export data to text files. ```r # create a data frame and assign it to an object named df df <- data.frame (cost = c(10, 25, 40), color = c ("blue", "red", "green"), suv = c (TRUE, TRUE, FALSE), row.names = c ("car1", "car2", "car3")) ``` - To export `df` to a CSV file we will use `write.csv()`. ```r # write to a csv file in our working directory write.csv(df, file = "cars_csv") ``` --- # Exporting Data to text files : `readr` functions - The `readr` package functions, `write_csv` and `write_delim` are twice as fast as base R functions and they are very similar in usage. ```r # load the library library(readr) # write to a csv file in the working directory write_csv(df, file = "cars_csv2") ``` --- # Saving Data as an R object File - Sometimes we may need to save data or other R objects outside of the workspace or may want to store, share, or transfer between computers. - We can use the .rda or .RData file types when we want to save several, or all, objects and functions that exist in the global environment. - On the other hand, if we only want to save a single R object such as a data frame, function, or statistical model results, it is best to use the .rds file type. To illustrate let's create two objects named x and y and save them to a .RData file using `save()` function. ```r # generate random numbers from uniform and normal distribution and assign them to objects named x and y, respectively. x <- runif(10) y <- rnorm(10, 0, 1) # Save both objects in .RData format in the working directory save(x, y, file = "xy.RData") ``` --- # Saving Data as an R object File Cont. - Also, the `save.image()` function will save your all current workspace as .RData. ```r # save all objects in the global environment save.image() ``` - The following example will illustrate how a single object will be saved using `saveRDS()` ```r # save a single object to file saveRDS(x, "x.rds") # restore it under a different name x2 <- readRDS("x.rds") ``` - For further details on differences between rds and RData formats please read [this article](http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata). --- # What do you need to know by Week 2 - Understand how to get data from tabular and spreadsheet files. - Understand how to get data from statistical software and databases. - Know how to scrape data files stored online. - Know how to export to tabular and spreadsheet files. - Know how to save R objects. - Know how to get further help: Use R data import/export manual [https://cran.r-project.org/doc/manuals/R-data.html](https://cran.r-project.org/doc/manuals/R-data.html). <!-- - RStudio's "[Data Import Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf)" is a compact resource for all importing functions available in the `readr` package. --> <!-- - For extra study/practice you can complete "Importing Data in R (Part 1)" DataCamp course. --> - Practice! --- # Worksheet questions: <center><img src="../images/giphy.gif" width="300px" /></center> - Complete the following worksheet: [Module 2 Worksheet](../worksheets/Week_02_Worksheet.html) - Once completed, feel free to work on your Assessments. <br> <br> <br> [Return to Course Website](../index.html)