Module 4.2 Demonstration

class: center, middle, inverse, title-slide

.title[
# Module 4.2 Demonstration
]
.subtitle[
## Tidy & Manipulate: Part II - Manipulate
]

---

# Recall: Tidy Data Principles

Previous week:

- The framework of "**Tidy Data Principles**" provides a standard and consistent way of storing data that makes transformation, visualization, and modeling easier.

- Each variable must have its own column.

- Each observation must have its own row.

- Each value must have its own cell.

???

---

# Recall: `tidyr`

- `tidyr` is a one such package which was built for the sole purpose of simplifying the process of creating tidy data.

- Following tidy principles makes manipulation,  transformation, visualization, and modeling easier.

- **tidyverse** is a set of packages that work in harmony because they share common data representations and API design.

???
haven, for importing spss sas .. data
ggplot2 for visualization
purr, for functional programming
modelr, for simple modeling within a pipeline
broom, for turning models into tidy data
knitr, for report generation
rmarkdown, turn your analysis into anything

---

# The grammar of Data Manipulation: dplyr

- There are many data manipulation packages/functions in R.

- Most of them lack consistent coding and the ability to easily flow together.

- This leads to difficult-to-read nested functions and/or choppy code.

- The `dplyr` package is regarded as the "**Grammar of Data Manipulation**" in R.

- It provides a consistent set of verbs that help you solve the most common data manipulation challenges.

- Remember: `dplyr` functions work with pipes %>% and expect **tidy data** .

---

# The grammar of Data Manipulation: dplyr

- There are six fundamental functions of data manipulation that `dplyr` provides:

* **`select()`** pick/select variables
      * **`filter()`** pick/filter observations based on values
      * **`arrange()`** sort variables
      * **`mutate()`** create new variables
      * **`summarise()`** summarise data by functions of choice
      * **`group_by()`** + **`summarise()`**

- There are also functions to join and merge data sets:

* **Mutating joins** 
      * **Filtering joins** 
      * **Set operations** 
      * **Merging data sets**

- The full list of capabilities can be found in the [dplyr reference manual](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf).

- I highly recommend going through it as there are many great functions provided by `dplyr` that will not cover here.

---

# select() : select variables

- Often we only assess specific variables. The `select()` function allows us to select variables.

- In addition to the existing functions like `:` and `c()`, there are a number of special functions that can work inside select.

Functions | Usage
----------|-------
`-` | Select everything but
`:` | Select range
`contains()` | Select columns whose name contains a character string
`starts_with()` | Select columns whose name starts with a character string
`ends_with()` | Select columns whose name ends with a string
`everything()` | Select every column
`matches()` | Select columns whose name matches a regular expression
`num_range()` | Select columns named ...
`one_of()` | Select columns whose names are in a group of names

---

# Activity:

- Import the [CustomerData.csv](../data/CustomerData.csv) data set and answer the questions.

---

# Import Customer Data

- The [CustomerData.csv](../data/CustomerData.csv) data set includes some characteristics of 5000 customers. Header of this data set is as follows.

``` r
customer <- read.csv("../data/CustomerData.csv")
head(customer[, 1:7], 5)
```

```
##        CustomerID Region TownSize Gender Age EducationYears  JobCategory
## 1 3964-QJWTRG-NPN      1        2 Female  20             15 Professional
## 2 0648-AIPJSP-UVM      5        5   Male  22             17        Sales
## 3 5195-TLUDJE-HVO      3        4 Female  67             14        Sales
## 4 4459-VLPQUH-3OL      4        3   Male  23             16        Sales
## 5 8158-SMTQFB-CNO      2        2   Male  26             16        Sales
```

---

# Activity: select()

Which of the following can be used to:

- Q1. select all variables between `CustomerID` and `Gender`.

- Q2. select all variables other than those between `CustomerID` and `Gender`.

- Q3. select CustomerID and all variables that contain the word "Card".

???

Q1. Select all variables between CustomerID and Gender.

``` r
select(customer, CustomerID:Gender)
#customer %>% select(CustomerID:Gender)
```

Q2. Select all variables except for those between CustomerID and Gender.

``` r
select(customer, -(CustomerID:Gender))
```

Q3. Select CustomerID and all variables that contain the word "Card".

``` r
select(customer, CustomerID, contains("Card"))
```

---

# filter(): filter observations based on values

- `filter()` identifies or selects observations in which a particular variable matches a specific value/condition.

- The condition(s) can be any kind of logical comparison and Boolean operators, such as:

.pull-left[

Symbol | Usage
-------|------
`<` | Less than
`>` | Greater than
`==` | Equal to
`<=` | Less than or equal to
`>=` | Greater than or equal to
`!=` | Not equal to
`%in%` | Group membership
`is.na` | Is NA]

.pull-right[

Symbol | Usage
-------|------
`!is.na` | Is not NA
`&, I` | Boolean AND, OR
`xor` | exclusive or
`!` | not
`any` | any true
`all` | all true
]

---

# Activity: filter()

Which of the following can be used to:

- Q4. filter for female customers only.

- Q5. filter for female customers that are greater than 45 years old AND live in region 3.

- Q6. filter for female customers that are greater than 45 years old OR live in region 3.

???

Q4: Filter for female customers only.

``` r
filter(customer, Gender == "Female")
```

Q5: Filter for female customers that are greater than 45 years old and live in region 3.

``` r
filter(customer, Gender == "Female", Age > 45, Region == 3)
```

Q6: Filter for female customers that are greater than 45 years old or live in region 3.

``` r
filter(customer, Gender == "Female", Age > 45 | Region == 3)
```

---

# arrange(): order data by variables

- `arrange()` orders the data by variables in ascending (default) or descending order.

- For a descending order, use `desc()` within the `arrange()` function.

---

# Activity: arrange()

Which of the following can be used to:

- Q7: select the variables `CustomerID`, `Region`, `Gender`, `Age`, `HHIncome`, `CardSpendMonth` and save this as `sub_cust`.

- Q8: order `sub_cust` data by `Age` and `CardSpendMonth` (ascending order).

- Q9: order `sub_cust` data by `Age` (oldest to youngest) and `CardSpendMonth` (least to most).

???

Q7: Select variables.

``` r
sub_cust <- select(customer, CustomerID, Region, Gender, Age, HHIncome, CardSpendMonth)
```

Q8: Order sub_cust data by Age and CardSpendMonth (ascending order).

``` r
arrange(sub_cust, Age, CardSpendMonth)
```

Q9: Order sub_cust data by Age (oldest to youngest) and CardSpendMonth (least to most).

``` r
arrange(sub_cust, desc(Age), CardSpendMonth)
```

---

# mutate(): create new variables

- `mutate()` adds new variables while preserving the existing variables.

- `transmute()` creates a new variable and then drops the other variables.

- Here is the list of some useful functions used inside the `mutate()`.

---

# Activity: mutate()

Which of the following can be used to:

- Q10: create a ratio variable that computes the ratio of `CardSpendMonth` to `HHIncome` using `sub_cust` data.

- Q11: create two variables: ratio1 = CardSpendMonth / HHIncome and ratio2 = CardSpendMonth / Age.

???

Q10: Create a ratio variable that computes the ratio of CardSpendMonth to HHIncome.

``` r
mutate(sub_cust, ratio = CardSpendMonth / HHIncome)
```

Q11: Create 2 variables.

``` r
mutate(sub_cust, ratio1 = CardSpendMonth / HHIncome, ratio2 = CardSpendMonth / Age)
```

---

# summarise(): summarise data by functions of choice

-  `summarise()` (or `summarize()` ) performs the majority of summary statistics.

- All functions in this list takes a vector of values and returns a single summary value.

---

# group_by() + summarise() function

- If we want to take the summary statistics grouped by a variable, then we need to use another function called `group_by()`.

- `group_by()` along with `summarise()` functions  will allow us to take and compare summary statistics grouped by another (usually factor) variable.

---

# Activity: group_by() + summarise()

Which of the following can be used to:

- Q12: compute the average `CardSpendMonth` across all customers in our sub_cust data.

- Q13: compute the average `CardSpendMonth` for each `Gender`.

- Q14: compute the average `CardSpendMonth` for each `Gender` and `Region`.

???

Q12: Avg spend across all customers.

``` r
summarize(sub_cust, Avg_spend = mean(CardSpendMonth, na.rm = TRUE)) 
```

Q13:  Compute the average CardSpendMonth for each gender.

``` r
by_gender <- group_by(sub_cust, Gender)
summarize(by_gender, Avg_spend = mean(CardSpendMonth, na.rm = TRUE))
```

Q14 : Compute the average CardSpendMonth for each gender and region.

``` r
by_gdr_rgn <- group_by(sub_cust, Gender, Region)
avg_gdr_rgn <- summarize(by_gdr_rgn, Avg_spend = mean(CardSpendMonth, na.rm = TRUE))
arrange(avg_gdr_rgn, desc(Avg_spend))
```

---

# Relational data sets

- Often we have separate data frames that can have common and differing variables for similar observations (relational data sets).

- Take a look at these toy data sets (i.e. band members of the Beatles and Rolling Stones):

.pull-left[

``` r
band_members
```

```
## # A tibble: 3 × 2
## name band 
## <chr> <chr> 
## 1 Mick Stones 
## 2 John Beatles
## 3 Paul Beatles
```
]

.pull-right[

``` r
band_instruments
```

```
## # A tibble: 3 × 2
## name plays 
## <chr> <chr> 
## 1 John guitar
## 2 Paul bass 
## 3 Keith guitar
```
]

---

# Joining data sets

- `dplyr`  offers three sets of joining functions to provide alternative ways to join data frames.

* Mutating joins: add new variables to one data frame from matching observations in another.

* Filtering joins:  filter observations from one data frame based on whether or not they match an observation in the other table.

* Set operations: treat observations as if they were set elements.
    
    * Merging data sets: merge data frames by row and column.

---

# Mutating joins

.pull-left[
<center><img src="../images/mutatingjoin.png" width="140%"></center>

]

.pull-right[

- `inner_join()`: only retains rows in both datasets.

- `left_join()`: prioritizes left dataset.

- `right_join()`: prioritizes right dataset.

- `full_join()`: retains all rows.
]

- Each mutating join takes an argument `by` that controls which variables are used to match observations in the two data sets.

- `dplyr` will use all variables that appear in both tables, a natural join (`NULL`: default value/key).

---

# Mutating joins

---

# Example: Mutating joins

.pull-left[

``` r
band_members
```

```
## # A tibble: 3 × 2
## name band 
## <chr> <chr> 
## 1 Mick Stones 
## 2 John Beatles
## 3 Paul Beatles
```
]

.pull-right[

``` r
band_instruments
```

```
## # A tibble: 3 × 2
## name plays 
## <chr> <chr> 
## 1 John guitar
## 2 Paul bass 
## 3 Keith guitar
```
]

- **PRESS P**

???

``` r
band_members %>% inner_join(band_instruments) #only retains rows in both datasets
```

```
## Joining with `by = join_by(name)`
```

```
## # A tibble: 2 × 3
## name band plays 
## <chr> <chr> <chr> 
## 1 John Beatles guitar
## 2 Paul Beatles bass
```

``` r
band_members %>% left_join(band_instruments) #prioritizes left dataset
```

```
## Joining with `by = join_by(name)`
```

```
## # A tibble: 3 × 3
## name band plays 
## <chr> <chr> <chr> 
## 1 Mick Stones <NA> 
## 2 John Beatles guitar
## 3 Paul Beatles bass
```

``` r
band_members %>% right_join(band_instruments) #prioritizes right dataset
```

```
## Joining with `by = join_by(name)`
```

```
## # A tibble: 3 × 3
## name band plays 
## <chr> <chr> <chr> 
## 1 John Beatles guitar
## 2 Paul Beatles bass 
## 3 Keith <NA> guitar
```

``` r
band_members %>% full_join(band_instruments) #retains all rows
```

```
## Joining with `by = join_by(name)`
```

```
## # A tibble: 4 × 3
## name band plays 
## <chr> <chr> <chr> 
## 1 Mick Stones <NA> 
## 2 John Beatles guitar
## 3 Paul Beatles bass 
## 4 Keith <NA> guitar
```

---

# Filtering joins

- `semi_join(x, y)`: keeps all observations in x that have a match in y. 
  - `anti_join(x, y)`: drops all observations in x that have a match in y.

---

# Example: Filtering joins

.pull-left[

``` r
band_members
```

```
## # A tibble: 3 × 2
## name band 
## <chr> <chr> 
## 1 Mick Stones 
## 2 John Beatles
## 3 Paul Beatles
```
]

.pull-right[

``` r
band_instruments
```

```
## # A tibble: 3 × 2
## name plays 
## <chr> <chr> 
## 1 John guitar
## 2 Paul bass 
## 3 Keith guitar
```
]

- **PRESS P**

???

``` r
band_members %>% semi_join(band_instruments) #keeps all observations in members that have a match in instruments
```

```
## Joining with `by = join_by(name)`
```

```
## # A tibble: 2 × 2
## name band 
## <chr> <chr> 
## 1 John Beatles
## 2 Paul Beatles
```

``` r
band_members %>% anti_join(band_instruments) #drops all observations in members that have a match in instruments
```

```
## Joining with `by = join_by(name)`
```

```
## # A tibble: 1 × 2
## name band 
## <chr> <chr> 
## 1 Mick Stones
```

---

# Set operations

- `intersect(y, z)`: return only observations in both y and z.
  - `union(y, z)`: return unique observations in y and z.
  - `setdiff(y, z)`: return observations in y, but not in z.

---

# Merging data sets

- `bind_rows(y, z)`: Append z to y as new rows.
  - `bind_cols(y, z)`: Append z to y as new columns.

---

# Functions to Remember for Week 5 (dplyr)

- Practice!

---

# Worksheet questions:

- Complete the following worksheet:

[Module 4.2 Worksheet](../worksheets/Week_05_Worksheet.html)

- Once completed, feel free to work on your Assessments.

[Return to Course Website](../index.html)