Module 5 Demonstration

class: center, middle, inverse, title-slide

.title[
# Module 5 Demonstration
]
.subtitle[
## Scan: Missing Values
]

---

# Recall: Grammar of Data Manipulation

Previous week:

- The "**Grammar of Data Manipulation**" in R, `dplyr` package.

- It provides a consistent set of verbs that help you solve the most common data manipulation challenges.

* **`select()`** pick/select variables
      * **`filter()`** pick/filter observations based on values
      * **`arrange()`** sort variables
      * **`mutate()`** create new variables
      * **`summarise()`** summarise data by functions of choice
      * **`group_by()`** + **`summarise()`**
      * Functions for joining and merging data sets

---

# Scan: Missing Values

- Dealing missing values is an unavoidable task in the data preprocessing.

- It is important to know how R handles missing values and how they are represented.

- Next, we will learn how to:

- identify 
    - recode 
    - exclude
    - impute 
    
missing values and special values.

---

# Missing Data

---

# Missing Data

- In R, a numeric missing value --> `NA`

- Character missing value --> `<NA>`.

- Some other values for missing values -->  `99`, `.`, `..`, just space, or `NULL`.

- Be careful about different representations of the missing values while importing the data from other software.

---

# Why NA's are problematic?

- Any operation involving `NA` yields `NA` as the result.

```r
5 + NA
```

```
## [1] NA
```

```r
mean( c( 5, 7, NA ) )
```

```
## [1] NA
```

```r
NA < 3
```

```
## [1] NA
```

```r
NA == NA
```

```
## [1] NA
```

---

# Identifying Missing Data

- To identify existence of missing values use `is.na()` function:

- `is.na()` returns a logical vector with `TRUE` in the element locations that contain  `NA`. 
    
    - `is.na()` will work on vectors, lists, matrices, and data frames.

- To identify the **location** of NA's use `which()`.

- To find **the number of NAs** in a vector use the `sum()`.

- To find  **the total missing values in each column**  use `colSums()`.

- To find  **the total missing values in each row**  use `rowSums()`.

- To find complete cases use `complete.cases()` (opposite of `is.na`). It returns a logical vector with TRUE in the locations that are complete (no missing).

---

# Example: Customer Data

- The [CustomerData.csv](../data/CustomerData.csv) data set includes some characteristics of 5000 customers. There are 59 variables in this data. The head of this data set is

```r
customer <- read_csv("../data/CustomerData.csv")
```

```r
head(customer[, 1:7], 5)
```

```
## # A tibble: 5 × 7
##   CustomerID      Region TownSize Gender   Age EducationYears JobCategory 
##   <chr>            <dbl>    <dbl> <chr>  <dbl>          <dbl> <chr>       
## 1 3964-QJWTRG-NPN      1        2 Female    20             15 Professional
## 2 0648-AIPJSP-UVM      5        5 Male      22             17 Sales       
## 3 5195-TLUDJE-HVO      3        4 Female    67             14 Sales       
## 4 4459-VLPQUH-3OL      4        3 Male      23             16 Sales       
## 5 8158-SMTQFB-CNO      2        2 Male      26             16 Sales
```

---

# Example: Identify missing values

- Identify missing value(s) for `EducationYears`:

```r
# Returns a TRUE value for the missing values

is.na(customer$EducationYears) %>% head()
```

```
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
```

```r
# Returns the location of missing values

which(is.na(customer$EducationYears))
```

```
## integer(0)
```

```r
# Returns the total number of missing values in Education Years

sum(is.na(customer$EducationYears))
```

```
## [1] 0
```

---

# Class Activity: Your turn!

- Task 1: Identify the location(s) of missing value(s) for `JobCategory`.

- Task 2: Find the total number of missing values in `JobCategory`.

- Task 3: Find the total number of missing values in the whole data set.

- Task 4: Find the total number of missing values in each column (variable) in the data set.

- Press P to reveal the answers.

???

```r
# Task 1:
which(is.na(customer$JobCategory))
```

```
##  [1]   34   55   92  163  196  257  319 2031 2057 2083 2113 2149 3200 3237 3263
```

```r
# Task 2:
sum(is.na(customer$JobCategory))
```

```
## [1] 15
```

```r
# Task 3:
sum(is.na(customer))
```

```
## [1] 131
```

```r
# Task 4:
colSums(is.na(customer))
```

```
##          CustomerID              Region            TownSize              Gender 
##                   0                   0                   2                  33 
##                 Age      EducationYears         JobCategory         UnionMember 
##                   0                   0                  15                   0 
##    EmploymentLength             Retired            HHIncome   DebtToIncomeRatio 
##                   0                   0                   0                   0 
##          CreditDebt           OtherDebt         LoanDefault       MaritalStatus 
##                   0                   0                   0                   0 
##       HouseholdSize          NumberPets          NumberCats          NumberDogs 
##                   8                   6                   7                   8 
##         NumberBirds           HomeOwner           CarsOwned        CarOwnership 
##                  34                  13                   0                   0 
##            CarBrand            CarValue         CommuteTime   PoliticalPartyMem 
##                   0                   0                   2                   0 
##               Votes          CreditCard          CardTenure    CardItemsMonthly 
##                   0                   0                   0                   0 
##      CardSpendMonth     ActiveLifestyle       PhoneCoTenure      VoiceLastMonth 
##                   0                   0                   0                   0 
##     VoiceOverTenure     EquipmentRental  EquipmentLastMonth EquipmentOverTenure 
##                   3                   0                   0                   0 
##         CallingCard        WirelessData       DataLastMonth      DataOverTenure 
##                   0                   0                   0                   0 
##           Multiline                  VM               Pager            Internet 
##                   0                   0                   0                   0 
##            CallerID            CallWait         CallForward     ThreeWayCalling 
##                   0                   0                   0                   0 
##            EBilling     TVWatchingHours              OwnsPC    OwnsMobileDevice 
##                   0                   0                   0                   0 
##      OwnsGameSystem             OwnsFax      NewsSubscriber 
##                   0                   0                   0
```

---

# Recode Missing Data

- Often missing value is due to a data entry/measurement error and you are able to reassess/remeasure the missing information.

- If so, you may know what to recode the missing information.

- We can use regular subsetting and assignment operations to recode missing values; or recode specific indicators that represent missing values.

---

# Example: Recode Missing Data

```r
# create vector with missing data

x <- c(1:4, "NULL", 6:7, "NULL")
x
```

```
## [1] "1"    "2"    "3"    "4"    "NULL" "6"    "7"    "NULL"
```

```r
# recode/replace NULL values with NA

x[which(x == "NULL")] <- NA

x
```

```
## [1] "1" "2" "3" "4" NA  "6" "7" NA
```

```r
# convert it to a numeric vector

x <- as.numeric(x)

x
```

```
## [1]  1  2  3  4 NA  6  7 NA
```

---

# Class Activity: Your turn!

Create a data frame with some missing data.

```r
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("credit", "","cash", "amex"), 
                 col3 = c(TRUE, FALSE, TRUE, TRUE), 
                 col4 = c(2.5, 4.2, 3.2, NA))
```

- Task 1: Find the total number of missing values in each column.

- Task 2: Assume that empty string in `col2` represents a missing value, so replace it with a `NA`.

- Task 3: Identify the location(s) of missing value(s) for `col2`.

???

```r
# Task 1:
colSums(is.na(df))
```

```
## col1 col2 col3 col4 
##    1    0    0    1
```

```r
# Task 2:
df$col2[which(df$col2 == "")] <- NA

# Task 3:
which(is.na(df$col2))
```

```
## [1] 2
```

---

# Excluding Missing Data

- A common method of handling missing values is to omit the NA values from the analysis.

- However, this may be dangerous, because the pattern of missing values may in fact be systematic, and simply deleting records with missing values would lead to a biased subset of the data.

- **One recommendation:** If the amount of missing data is relatively small (up to 5% of the data set), then leaving out the few values with missing features would be the best strategy.

---

# Excluding Missing Data Cont.

- When this is the case, we can exclude missing values in a couple of different ways.

- Exclude missing values from mathematical operations, use `na.rm = TRUE` argument.

- Another option is to use `na.omit()` or `complete.cases()` functions.

- If you do not exclude these values, most functions will return an `NA`.

---

# Example: Exclude missing data

```r
# A data frame with some missing values

df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA,"is", "text"), 
                 col3 = c(TRUE, FALSE, TRUE, TRUE), 
                 col4 = c(2.5, 4.2, 3.2, NA))
```

- Find the mean in `col4`.

```r
mean(df$col4) 
```

```
## [1] NA
```

---

# Example: Exclude missing data Cont.

- Find the mean by excluding missing values.

```r
# Option 1

mean(df$col4, na.rm = TRUE) 
```

```
## [1] 3.3
```

```r
# Option 2

mean(na.omit(df$col4))
```

```
## [1] 3.3
```

```r
# Option 3

mean(df$col4[complete.cases(df$col4)])
```

```
## [1] 3.3
```

---

# Class Activity: Your turn!

```r
# A data frame with some missing values

df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA,"is", "text"), 
                 col3 = c(TRUE, FALSE, TRUE, TRUE), 
                 col4 = c(2.5, 4.2, 3.2, NA))
```

- Calculate mean of `col4`.

???

```r
mean(df$col4,na.rm=TRUE)
```

```
## [1] 3.3
```

---

# Basic Missing Value Imputation Techniques

- An alternative to excluding missing data is imputation.

- **Imputation** is the process of estimating or deriving values for missing data.

- There is a huge literature on imputation methods and it goes beyond the scope of this course to discuss all of them.

- We will look at some basic missing value imputation techniques.

---

# Replace missing value(s) with mean, median or mode

- Replacing the missing value with the mean, median (for numerical variables) or the mode (for categorical variables) is a crude way of treating missing values.

- We will use `Hmisc` package to impute values for missing values:

- `impute()` function.
    
    - use `fun` argument to specify what to impute (i.e., mean, median, mode).

```r
#install.packages("Hmisc")
library(Hmisc)
```

---

# Example: Basic Imputation using `Hmisc`

Consider the following data frame with missing values:

```r
x <- data.frame( no = c(1:6),
                 x1 = c(15000 , 20000, 23000, NA, 18000, 21000), 
                 x2 = c(4, NA, 4, 5, 7, 8),
                 x3 = factor(c(NA, "False", "False", "False", "True", "True"))
                 )
```

```r
# mean imputation (for numerical variables)

x1 <- impute(x$x1, fun = mean)

# median imputation (for numerical variables)

x2 <- impute(x$x2, fun = median)

# mode imputation (for categorical/factor variables)

x3 <- impute(x$x3, fun= mode)
```

---

# Class Activity: Your Turn!

- Use the [weight_height_NA.csv](../data/weight_height_NA.csv) data. The column weight has missing value information represented with `-`.

```r
weight_height_NA <- read_csv("../data/weight_height_NA.csv")
```

-  Task 1: Recode missing values `-` to `NA`.

-  Task 2: Find the total missing values for `weight`.

-  Task 3: Impute the missing values using mean.

???

```r
#Task 1:

weight_height_NA$weight[which(weight_height_NA$weight == "-")] <- NA

head(weight_height_NA$weight)
```

```
## [1] "45.19" "61.83" "69.32" "64.48" "65.37" "55.86"
```

```r
weight_height_NA$weight<-as.numeric(weight_height_NA$weight)

#Task 2:
sum(is.na(weight_height_NA$weight))
```

```
## [1] 11
```

```r
# Task 3:

boxplot(weight_height_NA$weight)
```

```r
weight_height_NA$weight <- impute(weight_height_NA$weight, fun = mean)
```

---

# More Complex Approaches to Missing Value Imputation

- There are more complex methods to impute missing values (like predictive models).

- There are many  different predictive models and algorithms to **predict and impute** the missing values.

- Regression analysis, multiple imputation methods, random forests, `$k$` nearest neighbors, last observation carried forward / next observation carried backward, etc. are some of these techniques.

- In R, there are many different packages (e.g., `mice`, `missForest`, `impute` etc. ) that can be used to predict and impute the missing data.

- In this course we will not cover these techniques. For details refer to ["Statistical analysis with missing data"](https://www.wiley.com/en-au/Statistical+Analysis+with+Missing+Data%2C+2nd+Edition-p-9781119013563)

---

# Special values

- There are a few special values that are used in R.

- These are `-Inf`, `Inf` and `NaN`.

- If a computation results in a very big number, R will return `Inf` for positive infinity and `-Inf` for a negative infinity.

```r
 3 ^ 1024
```

```
## [1] Inf
```

```r
-3 ^ 4000
```

```
## [1] -Inf
```

```r
12 / 0
```

```
## [1] Inf
```

---

# Special values Cont.

- Sometimes, a computation will produce a result that makes little sense. In these cases, R will often return `NaN` (meaning "not a number"):

```r
0/0
```

```
## [1] NaN
```

```r
Inf - Inf
```

```
## [1] NaN
```

---

# Identifying Special Values

- `is.finite`, `is.infinite`, or `is.nan` functions will generate logical values (TRUE or FALSE) and they can be used to identify special values.

```r
m <- c( 2, 0/0, NA, 1/0, -Inf, Inf, (Inf*2) )
m
```

```
## [1]    2  NaN   NA  Inf -Inf  Inf  Inf
```

```r
is.finite(m) # check finite values 
```

```
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

```r
is.infinite(m) # check infinite (-inf or +inf) values
```

```
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
```

```r
is.nan(m) # check not a number (NaN) values
```

```
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
```

---

# Class Activity: Your turn!

- Create a data frame containing special values and NA's.

```r
df <- data.frame(col1 = c( 2, 0/0, NA, 1/0, -Inf, Inf),
                 col2 = c( NA, Inf/0, 2/0, NaN, -Inf, 4))

df
```

```
##   col1 col2
## 1    2   NA
## 2  NaN  Inf
## 3   NA  Inf
## 4  Inf  NaN
## 5 -Inf -Inf
## 6  Inf    4
```

- Find the special values in `df`.

???

```r
is.infinite(df)

# Error in is.infinite(df) : default method not implemented for type 'list'
```

Here, you can make use of `apply` family functions:

```r
# check whether data has infinite (-inf or +inf) values using sapply or apply

sapply(df, is.infinite)
```

```
##       col1  col2
## [1,] FALSE FALSE
## [2,] FALSE  TRUE
## [3,] FALSE  TRUE
## [4,]  TRUE FALSE
## [5,]  TRUE  TRUE
## [6,]  TRUE FALSE
```

Another way is to write a simple function to check every numerical column in a data for infinite values or `NaN`'s.

```r
# Check every numerical column whether they have infinite or NaN values using a function called is.special

is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
```

Then, apply `is.special` function to each column of `df` using `sapply`. `is.special` checks the data frame for numerical special values if the type is numeric.

```r
# apply this function to the data frame (list) using sapply.

sapply(df, is.special)
```

```
##       col1  col2
## [1,] FALSE FALSE
## [2,]  TRUE  TRUE
## [3,] FALSE  TRUE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE
## [6,]  TRUE FALSE
```

---

# Remark: Useful `apply` family functions

- These functions will apply a specified function to a given data object (e.g. vectors, lists, matrices, data frames).

- Most common forms of apply functions are:

- `apply()` for matrices and data frames
  - `lapply()` for lists (output as list)
  - `sapply()` for lists (output simplified)
  - `tapply()` for vectors
  - ... many others

- There is a very useful and [compherensive tutorial on `apply` family functions](https://www.datacamp.com/community/tutorials/r-tutorial-apply-family) in DataCamp.

- In  `swirl()`, "R Programming Course" Lesson 10 and 11 cover the apply family functions.

---

# Functions to Remember for Week 6

- identify : `is.na`, `is.finite`, `is.infinite`, `is.nan`.

- identify location: `which()`.

- identify number of NAs: `sum()`, `colSums()`, `rowSums()`.

- recode : `which()`.

- exclude : `na.omit()`, `na.rm=TRUE`, `complete.cases()`.

- impute: Hmisc `impute()`.

- Practice!

---

# Worksheet questions:

- Working in small groups, complete the following worksheet:

[Module 5 Worksheet](../worksheets/Week_06_Worksheet.html)

- Once completed, feel free to work on your Assessments.

[Return to Course Website](../index.html)