class: center, middle, inverse, title-slide .title[ # Module 5 Demonstration ] .subtitle[ ## Scan: Missing Values ] --- # Recall: Grammar of Data Manipulation Previous week: - The "**Grammar of Data Manipulation**" in R, `dplyr` package. - It provides a consistent set of verbs that help you solve the most common data manipulation challenges. * **`select()`** pick/select variables * **`filter()`** pick/filter observations based on values * **`arrange()`** sort variables * **`mutate()`** create new variables * **`summarise()`** summarise data by functions of choice * **`group_by()`** + **`summarise()`** * Functions for joining and merging data sets --- # Scan: Missing Values - Dealing missing values is an unavoidable task in the data preprocessing. - It is important to know how R handles missing values and how they are represented. - Next, we will learn how to: - identify - recode - exclude - impute missing values and special values. --- # Missing Data <center><img src="images/missingblock.png" width="60%"></center> --- # Missing Data - In R, a numeric missing value --> `NA` - Character missing value --> `<NA>`. - Some other values for missing values --> `99`, `.`, `..`, just space, or `NULL`. - Be careful about different representations of the missing values while importing the data from other software. --- # Why NA's are problematic? - Any operation involving `NA` yields `NA` as the result. ```r 5 + NA ``` ``` ## [1] NA ``` ```r mean( c( 5, 7, NA ) ) ``` ``` ## [1] NA ``` ```r NA < 3 ``` ``` ## [1] NA ``` ```r NA == NA ``` ``` ## [1] NA ``` --- # Identifying Missing Data - To identify existence of missing values use `is.na()` function: - `is.na()` returns a logical vector with `TRUE` in the element locations that contain `NA`. - `is.na()` will work on vectors, lists, matrices, and data frames. - To identify the **location** of NA's use `which()`. - To find **the number of NAs** in a vector use the `sum()`. - To find **the total missing values in each column** use `colSums()`. - To find **the total missing values in each row** use `rowSums()`. - To find complete cases use `complete.cases()` (opposite of `is.na`). It returns a logical vector with TRUE in the locations that are complete (no missing). --- # Example: Customer Data - The [CustomerData.csv](../data/CustomerData.csv) data set includes some characteristics of 5000 customers. There are 59 variables in this data. The head of this data set is ```r customer <- read_csv("../data/CustomerData.csv") ``` ```r head(customer[, 1:7], 5) ``` ``` ## # A tibble: 5 × 7 ## CustomerID Region TownSize Gender Age EducationYears JobCategory ## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> ## 1 3964-QJWTRG-NPN 1 2 Female 20 15 Professional ## 2 0648-AIPJSP-UVM 5 5 Male 22 17 Sales ## 3 5195-TLUDJE-HVO 3 4 Female 67 14 Sales ## 4 4459-VLPQUH-3OL 4 3 Male 23 16 Sales ## 5 8158-SMTQFB-CNO 2 2 Male 26 16 Sales ``` --- # Example: Identify missing values - Identify missing value(s) for `EducationYears`: ```r # Returns a TRUE value for the missing values is.na(customer$EducationYears) %>% head() ``` ``` ## [1] FALSE FALSE FALSE FALSE FALSE FALSE ``` ```r # Returns the location of missing values which(is.na(customer$EducationYears)) ``` ``` ## integer(0) ``` ```r # Returns the total number of missing values in Education Years sum(is.na(customer$EducationYears)) ``` ``` ## [1] 0 ``` --- # Class Activity: Your turn! - Task 1: Identify the location(s) of missing value(s) for `JobCategory`. - Task 2: Find the total number of missing values in `JobCategory`. - Task 3: Find the total number of missing values in the whole data set. - Task 4: Find the total number of missing values in each column (variable) in the data set. - Press P to reveal the answers. ??? ```r # Task 1: which(is.na(customer$JobCategory)) ``` ``` ## [1] 34 55 92 163 196 257 319 2031 2057 2083 2113 2149 3200 3237 3263 ``` ```r # Task 2: sum(is.na(customer$JobCategory)) ``` ``` ## [1] 15 ``` ```r # Task 3: sum(is.na(customer)) ``` ``` ## [1] 131 ``` ```r # Task 4: colSums(is.na(customer)) ``` ``` ## CustomerID Region TownSize Gender ## 0 0 2 33 ## Age EducationYears JobCategory UnionMember ## 0 0 15 0 ## EmploymentLength Retired HHIncome DebtToIncomeRatio ## 0 0 0 0 ## CreditDebt OtherDebt LoanDefault MaritalStatus ## 0 0 0 0 ## HouseholdSize NumberPets NumberCats NumberDogs ## 8 6 7 8 ## NumberBirds HomeOwner CarsOwned CarOwnership ## 34 13 0 0 ## CarBrand CarValue CommuteTime PoliticalPartyMem ## 0 0 2 0 ## Votes CreditCard CardTenure CardItemsMonthly ## 0 0 0 0 ## CardSpendMonth ActiveLifestyle PhoneCoTenure VoiceLastMonth ## 0 0 0 0 ## VoiceOverTenure EquipmentRental EquipmentLastMonth EquipmentOverTenure ## 3 0 0 0 ## CallingCard WirelessData DataLastMonth DataOverTenure ## 0 0 0 0 ## Multiline VM Pager Internet ## 0 0 0 0 ## CallerID CallWait CallForward ThreeWayCalling ## 0 0 0 0 ## EBilling TVWatchingHours OwnsPC OwnsMobileDevice ## 0 0 0 0 ## OwnsGameSystem OwnsFax NewsSubscriber ## 0 0 0 ``` --- # Recode Missing Data - Often missing value is due to a data entry/measurement error and you are able to reassess/remeasure the missing information. - If so, you may know what to recode the missing information. - We can use regular subsetting and assignment operations to recode missing values; or recode specific indicators that represent missing values. --- # Example: Recode Missing Data ```r # create vector with missing data x <- c(1:4, "NULL", 6:7, "NULL") x ``` ``` ## [1] "1" "2" "3" "4" "NULL" "6" "7" "NULL" ``` ```r # recode/replace NULL values with NA x[which(x == "NULL")] <- NA x ``` ``` ## [1] "1" "2" "3" "4" NA "6" "7" NA ``` ```r # convert it to a numeric vector x <- as.numeric(x) x ``` ``` ## [1] 1 2 3 4 NA 6 7 NA ``` --- # Class Activity: Your turn! Create a data frame with some missing data. ```r df <- data.frame(col1 = c(1:3, NA), col2 = c("credit", "","cash", "amex"), col3 = c(TRUE, FALSE, TRUE, TRUE), col4 = c(2.5, 4.2, 3.2, NA)) ``` - Task 1: Find the total number of missing values in each column. - Task 2: Assume that empty string in `col2` represents a missing value, so replace it with a `NA`. - Task 3: Identify the location(s) of missing value(s) for `col2`. ??? ```r # Task 1: colSums(is.na(df)) ``` ``` ## col1 col2 col3 col4 ## 1 0 0 1 ``` ```r # Task 2: df$col2[which(df$col2 == "")] <- NA # Task 3: which(is.na(df$col2)) ``` ``` ## [1] 2 ``` --- # Excluding Missing Data - A common method of handling missing values is to omit the NA values from the analysis. - However, this may be dangerous, because the pattern of missing values may in fact be systematic, and simply deleting records with missing values would lead to a biased subset of the data. - **One recommendation:** If the amount of missing data is relatively small (up to 5% of the data set), then leaving out the few values with missing features would be the best strategy. --- # Excluding Missing Data Cont. - When this is the case, we can exclude missing values in a couple of different ways. - Exclude missing values from mathematical operations, use `na.rm = TRUE` argument. - Another option is to use `na.omit()` or `complete.cases()` functions. - If you do not exclude these values, most functions will return an `NA`. --- # Example: Exclude missing data ```r # A data frame with some missing values df <- data.frame(col1 = c(1:3, NA), col2 = c("this", NA,"is", "text"), col3 = c(TRUE, FALSE, TRUE, TRUE), col4 = c(2.5, 4.2, 3.2, NA)) ``` - Find the mean in `col4`. ```r mean(df$col4) ``` ``` ## [1] NA ``` --- # Example: Exclude missing data Cont. - Find the mean by excluding missing values. ```r # Option 1 mean(df$col4, na.rm = TRUE) ``` ``` ## [1] 3.3 ``` ```r # Option 2 mean(na.omit(df$col4)) ``` ``` ## [1] 3.3 ``` ```r # Option 3 mean(df$col4[complete.cases(df$col4)]) ``` ``` ## [1] 3.3 ``` --- # Class Activity: Your turn! ```r # A data frame with some missing values df <- data.frame(col1 = c(1:3, NA), col2 = c("this", NA,"is", "text"), col3 = c(TRUE, FALSE, TRUE, TRUE), col4 = c(2.5, 4.2, 3.2, NA)) ``` - Calculate mean of `col4`. ??? ```r mean(df$col4,na.rm=TRUE) ``` ``` ## [1] 3.3 ``` --- # Basic Missing Value Imputation Techniques - An alternative to excluding missing data is imputation. - **Imputation** is the process of estimating or deriving values for missing data. - There is a huge literature on imputation methods and it goes beyond the scope of this course to discuss all of them. - We will look at some basic missing value imputation techniques. --- # Replace missing value(s) with mean, median or mode - Replacing the missing value with the mean, median (for numerical variables) or the mode (for categorical variables) is a crude way of treating missing values. - We will use `Hmisc` package to impute values for missing values: - `impute()` function. - use `fun` argument to specify what to impute (i.e., mean, median, mode). ```r #install.packages("Hmisc") library(Hmisc) ``` --- # Example: Basic Imputation using `Hmisc` Consider the following data frame with missing values: ```r x <- data.frame( no = c(1:6), x1 = c(15000 , 20000, 23000, NA, 18000, 21000), x2 = c(4, NA, 4, 5, 7, 8), x3 = factor(c(NA, "False", "False", "False", "True", "True")) ) ``` ```r # mean imputation (for numerical variables) x1 <- impute(x$x1, fun = mean) # median imputation (for numerical variables) x2 <- impute(x$x2, fun = median) # mode imputation (for categorical/factor variables) x3 <- impute(x$x3, fun= mode) ``` --- # Class Activity: Your Turn! - Use the [weight_height_NA.csv](../data/weight_height_NA.csv) data. The column weight has missing value information represented with `-`. ```r weight_height_NA <- read_csv("../data/weight_height_NA.csv") ``` - Task 1: Recode missing values `-` to `NA`. - Task 2: Find the total missing values for `weight`. - Task 3: Impute the missing values using mean. ??? ```r #Task 1: weight_height_NA$weight[which(weight_height_NA$weight == "-")] <- NA head(weight_height_NA$weight) ``` ``` ## [1] "45.19" "61.83" "69.32" "64.48" "65.37" "55.86" ``` ```r weight_height_NA$weight<-as.numeric(weight_height_NA$weight) #Task 2: sum(is.na(weight_height_NA$weight)) ``` ``` ## [1] 11 ``` ```r # Task 3: boxplot(weight_height_NA$weight) ``` <img src="Module_05_Demo_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> ```r weight_height_NA$weight <- impute(weight_height_NA$weight, fun = mean) ``` --- # More Complex Approaches to Missing Value Imputation - There are more complex methods to impute missing values (like predictive models). - There are many different predictive models and algorithms to **predict and impute** the missing values. - Regression analysis, multiple imputation methods, random forests, `\(k\)` nearest neighbors, last observation carried forward / next observation carried backward, etc. are some of these techniques. - In R, there are many different packages (e.g., `mice`, `missForest`, `impute` etc. ) that can be used to predict and impute the missing data. - In this course we will not cover these techniques. For details refer to ["Statistical analysis with missing data"](https://www.wiley.com/en-au/Statistical+Analysis+with+Missing+Data%2C+2nd+Edition-p-9781119013563) --- # Special values - There are a few special values that are used in R. - These are `-Inf`, `Inf` and `NaN`. - If a computation results in a very big number, R will return `Inf` for positive infinity and `-Inf` for a negative infinity. ```r 3 ^ 1024 ``` ``` ## [1] Inf ``` ```r -3 ^ 4000 ``` ``` ## [1] -Inf ``` ```r 12 / 0 ``` ``` ## [1] Inf ``` --- # Special values Cont. - Sometimes, a computation will produce a result that makes little sense. In these cases, R will often return `NaN` (meaning "not a number"): ```r 0/0 ``` ``` ## [1] NaN ``` ```r Inf - Inf ``` ``` ## [1] NaN ``` --- # Identifying Special Values - `is.finite`, `is.infinite`, or `is.nan` functions will generate logical values (TRUE or FALSE) and they can be used to identify special values. ```r m <- c( 2, 0/0, NA, 1/0, -Inf, Inf, (Inf*2) ) m ``` ``` ## [1] 2 NaN NA Inf -Inf Inf Inf ``` ```r is.finite(m) # check finite values ``` ``` ## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE ``` ```r is.infinite(m) # check infinite (-inf or +inf) values ``` ``` ## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE ``` ```r is.nan(m) # check not a number (NaN) values ``` ``` ## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE ``` --- # Class Activity: Your turn! - Create a data frame containing special values and NA's. ```r df <- data.frame(col1 = c( 2, 0/0, NA, 1/0, -Inf, Inf), col2 = c( NA, Inf/0, 2/0, NaN, -Inf, 4)) df ``` ``` ## col1 col2 ## 1 2 NA ## 2 NaN Inf ## 3 NA Inf ## 4 Inf NaN ## 5 -Inf -Inf ## 6 Inf 4 ``` - Find the special values in `df`. ??? ```r is.infinite(df) # Error in is.infinite(df) : default method not implemented for type 'list' ``` Here, you can make use of `apply` family functions: ```r # check whether data has infinite (-inf or +inf) values using sapply or apply sapply(df, is.infinite) ``` ``` ## col1 col2 ## [1,] FALSE FALSE ## [2,] FALSE TRUE ## [3,] FALSE TRUE ## [4,] TRUE FALSE ## [5,] TRUE TRUE ## [6,] TRUE FALSE ``` Another way is to write a simple function to check every numerical column in a data for infinite values or `NaN`'s. ```r # Check every numerical column whether they have infinite or NaN values using a function called is.special is.special <- function(x){ if (is.numeric(x)) (is.infinite(x) | is.nan(x)) } ``` Then, apply `is.special` function to each column of `df` using `sapply`. `is.special` checks the data frame for numerical special values if the type is numeric. ```r # apply this function to the data frame (list) using sapply. sapply(df, is.special) ``` ``` ## col1 col2 ## [1,] FALSE FALSE ## [2,] TRUE TRUE ## [3,] FALSE TRUE ## [4,] TRUE TRUE ## [5,] TRUE TRUE ## [6,] TRUE FALSE ``` --- # Remark: Useful `apply` family functions - These functions will apply a specified function to a given data object (e.g. vectors, lists, matrices, data frames). - Most common forms of apply functions are: - `apply()` for matrices and data frames - `lapply()` for lists (output as list) - `sapply()` for lists (output simplified) - `tapply()` for vectors - ... many others - There is a very useful and [compherensive tutorial on `apply` family functions](https://www.datacamp.com/community/tutorials/r-tutorial-apply-family) in DataCamp. - In `swirl()`, "R Programming Course" Lesson 10 and 11 cover the apply family functions. --- # Functions to Remember for Week 6 - identify : `is.na`, `is.finite`, `is.infinite`, `is.nan`. - identify location: `which()`. - identify number of NAs: `sum()`, `colSums()`, `rowSums()`. - recode : `which()`. - exclude : `na.omit()`, `na.rm=TRUE`, `complete.cases()`. - impute: Hmisc `impute()`. <!-- - For extra study complete assignment in DataCamp. --> - Practice! --- # Worksheet questions: <center><img src="../images/giphy.gif" width="300px" /></center> - Working in small groups, complete the following worksheet: [Module 5 Worksheet](../worksheets/Week_06_Worksheet.html) - Once completed, feel free to work on your Assessments. <br> <br> <br> [Return to Course Website](../index.html)