Work in small groups.
Have a look at the four data examples with visualizations.
Inspect the variables visually for possible univariate and multivariate outliers.
df1 <- c(34, 30, 30, 29, 67, 29, 27, 30, 31, 31, 28)
Age = 67 is a possible (univariate) outlier
anscombe1 <- anscombe[, c(3, 7)]
anscombe1
##    x3    y3
## 1  10  7.46
## 2   8  6.77
## 3  13 12.74
## 4   9  7.11
## 5  11  7.81
## 6  14  8.84
## 7   6  6.08
## 8   4  5.39
## 9  12  8.15
## 10  7  6.42
## 11  5  5.73
anscombe2 <- anscombe[, c(4, 8)]
anscombe2
##    x4    y4
## 1   8  6.58
## 2   8  5.76
## 3   8  7.71
## 4   8  8.84
## 5   8  8.47
## 6   8  7.04
## 7   8  5.25
## 8  19 12.50
## 9   8  5.56
## 10  8  7.91
## 11  8  6.89
However, when we look at the bivariate (two-dimensional) distribution of Height and Weight (using a scatter plot), we can see one observation whose weight is 45.19 kg and height is 185.09 cm (on the upper-left side of the scatter plot).
This observation is far from most of the other weight and height combinations and will therefore be seen as a multivariate outlier.
Some common causes of outliers (taken from: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/)
Outliers can drastically change the results of the data analysis and statistical modeling. Some unfavorable impacts of outliers are:
There are many methods developed for outlier detection.
The majority of them deal with numerical data.
I will introduce the most basic descriptive, graphical and distance-based methods to detect outliers, along with their application using R packages.
A box plot is a graphical display for describing the distribution of the data using the median, the first (Q1) and third quartiles (Q3), and the inter-quartile range (IQR = Q3−Q1).
Below is an illustration of a typical box plot (taken from Dr. James Baglin's Intro to Stats website)
"Tukey’s method of outlier detection" is used to detect outliers in the box plots.
This method is nonparametric, so it is mainly used to detect outliers in non-symmetric or non-normal data distributions.
Outliers are defined as the values in the data set that fall beyond the range of Q1−1.5×IQR to Q3+1.5×IQR.
These limits are called "outlier fences" and any values outside the outlier fences are presented using an "o" or "*" on the box plot.
The boxplot() function uses this technique to detect possible outliers.
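As a minimal sketch of how the outlier fences work, the code below applies Tukey's rule by hand to the ages from the first data example. The variable names (`age`, `lower_fence`, `upper_fence`, `tukey_outliers`) are chosen here for illustration only; `quantile()` and `boxplot.stats()` are base R functions. Note that `boxplot()` computes its quartiles as hinges, which can differ slightly from `quantile()` for some sample sizes, although they agree for this vector.

```r
# Ages from the first data example
age <- c(34, 30, 30, 29, 67, 29, 27, 30, 31, 31, 28)

# Quartiles and inter-quartile range
q1 <- quantile(age, 0.25, names = FALSE)
q3 <- quantile(age, 0.75, names = FALSE)
iqr <- q3 - q1

# Tukey's outlier fences
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values strictly outside the fences are flagged
is_outlier <- age < lower_fence | age > upper_fence
tukey_outliers <- age[is_outlier]
tukey_outliers        # flagged values
which(is_outlier)     # their positions in the vector

# boxplot.stats() reports the points that boxplot() would draw as outliers
boxplot.stats(age)$out
```

Here the fences are 26 and 34, so only the age of 67 is flagged; the value 34 sits exactly on the upper fence and is not counted as an outlier.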
The cars data set (from the datasets package) includes the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s. The first rows of this data set are as follows.
library(datasets)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
Task 1: Identify the possible outliers for speed using the box plot and Tukey’s method.
Task 2: Identify the possible outliers for dist using the box plot and Tukey’s method.
Task 3: Identify the location of the outlier in dist.
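# Task 1:
library(dplyr)  # provides the %>% pipe used below
# Note: speed in the cars data set is recorded in mph
cars$speed %>% boxplot(main = "Box Plot of speed", ylab = "Speed (mph)", col = "grey")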
# Task 2:
# Note: dist in the cars data set is recorded in feet
cars$dist %>% boxplot(main = "Box Plot of stopping distance", ylab = "Stopping distance (ft)", col = "grey")
# Task 3:
# Option 1: store the box plot object so that its components can be inspected
cars_outliers <- cars$dist %>% boxplot(main = "Box Plot of stopping distance", ylab = "Stopping distance (ft)", col = "grey")
# The `out` component reports the values of any data points that lie beyond the outlier fences.
cars_outliers$out
## [1] 120
# Then we can find the location of that value using basic subsetting
cars[cars$dist == 120, ]
##    speed dist
## 49    24  120
# Other options
# The car package labels the outliers on the box plot
library(car)
Boxplot(cars$dist, id=TRUE)
## [1] 49
There are also distance based methods to detect univariate outliers. One of them is to use the z-scores (i.e., normal scores) method.
In this method, a standardized score (z-score) is calculated for each observation using the following equation:
$$z_i = \frac{X_i - \bar{X}}{S}$$
Here $X_i$ denotes the value of the $i$-th observation, and $\bar{X}$ and $S$ are the sample mean and standard deviation, respectively.
An observation is regarded as an outlier based on its z-score, if the absolute value of its z-score is greater than 3.
Note that this method assumes that the underlying data are normally distributed. Therefore, if the distribution is not approximately normal, this method shouldn't be used.
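To make the formula concrete, here is a small sketch that computes the z-scores directly in base R for the ages from the first data example. The names `age` and `z_age` are illustrative; the calculation is just the equation above applied with `mean()` and `sd()`.

```r
# Ages from the first data example
age <- c(34, 30, 30, 29, 67, 29, 27, 30, 31, 31, 28)

# z-score: distance from the sample mean in units of the sample standard deviation
z_age <- (age - mean(age)) / sd(age)
round(z_age, 2)

# Positions of observations with |z| > 3 (empty if there are none)
which(abs(z_age) > 3)

# Largest absolute z-score and the observation it belongs to
max(abs(z_age))
which.max(abs(z_age))
```

In this small sample the largest absolute z-score belongs to the age of 67 and falls just short of 3, so the rule flags nothing. One likely reason is that the extreme value itself inflates the mean and the standard deviation, which is worth keeping in mind when applying the cut-off of 3 to small data sets.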
The outliers package provides a number of useful functions to systematically extract outliers. Among those, the scores() function will calculate the z-scores for a given vector.
Use the scores() function in the outliers package to identify the outliers with the z-score approach.
Task 1: Identify the possible outliers for speed. If there exists an outlier, identify its location.
Task 2: Identify the possible outliers for dist. If there exists an outlier, identify its location.
# Task 1:
library(outliers)
z_speed <- cars$speed %>% scores(type = "z")
which(abs(z_speed) > 3)
## integer(0)
# Task 2:
z_dist <- cars$dist %>% scores(type = "z")
which(abs(z_dist) > 3)
## integer(0)
boxplot(cars$dist)
Scatter plots are used to visualise the relationship between two quantitative variables (x, y).
They are also very useful tools to detect obvious outliers for the two dimensional data (i.e., for two continuous variables).
The plot() function will be used to get the scatter plot and detect outliers visually.
plot(x, y)
When we have one factor (categorical) variable and one continuous variable, a bivariate box plot can be used to detect outliers for the continuous variable in each level of the factor variable.
To get a bivariate box plot with one factor variable (x) and a numerical variable (y), we can again use the boxplot() function with the following generic arguments:
boxplot(y ~ x)
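A brief sketch of the formula interface, using the built-in iris data rather than the cars data from the exercises. The object names `bp` and `flagged` are illustrative. The list returned by `boxplot()` contains `out` (the flagged values), `group` (the index of the box each flagged value belongs to) and `names` (the factor levels), which can be combined to see which level each possible outlier comes from.

```r
# iris ships with R in the datasets package
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              main = "Sepal length by species",
              ylab = "Sepal length (cm)")

# One row per flagged point: its value and the factor level it belongs to
flagged <- data.frame(value = bp$out,
                      group = bp$names[bp$group])
flagged
```

Because the fences are computed separately within each level, a value can be flagged in one group even though it would look unremarkable in the pooled data.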
Use the cars data set (from the datasets package).
library(datasets)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
Task 1: Inspect the data for possible multivariate outliers using a scatter plot.
Task 2: Split the speed variable into two groups and name the new variable speed_group. The first group should include speed ≤ 15 mph and the second should include speed > 15 mph.
Task 3: Inspect the possible outliers in dist grouped by speed_group.
# Task 1:
plot(cars$speed, cars$dist)
# Task 2:
cars <- mutate(cars, speed_group = ifelse(speed > 15, 2, 1))
cars$speed_group <- factor(cars$speed_group, labels = c("=<15", ">15"))
# Task 3:
boxplot(cars$dist ~ cars$speed_group)
When there are more than two variables, a single scatter plot can no longer be used. For such cases, multivariate distance-based methods of outlier detection can be used.
The Mahalanobis distance is the most commonly used distance metric to detect outliers in the multivariate setting.
This distance is simply an extension of the univariate z-score, which also accounts for the correlation structure between all the variables.
Under multivariate normality, the squared Mahalanobis distance approximately follows a Chi-square distribution with n (number of variables) degrees of freedom; therefore, any observation whose distance is greater than the critical chi-square value is treated as an outlier.
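The idea can be sketched in base R with `mahalanobis()` and `qchisq()`. This is a classical (non-robust) version based on the sample mean and covariance, so it may flag fewer observations than a robust procedure would. The 0.975 quantile used as the cut-off is an assumption made here for illustration, and the names `X`, `d2`, `cutoff` and `flagged` are illustrative.

```r
# Two numeric variables from the built-in cars data
X <- datasets::cars[, c("speed", "dist")]

# Squared Mahalanobis distance of each row from the sample centroid,
# using the sample covariance matrix
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Chi-square critical value with df = number of variables
# (0.975 is an assumed cut-off level)
cutoff <- qchisq(0.975, df = ncol(X))

# Rows whose squared distance exceeds the critical value
flagged <- which(d2 > cutoff)
flagged
round(d2[flagged], 3)
```

With this classical estimate, observation 49 (speed 24, dist 120) exceeds the cut-off.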
We will use the mvn() function from the MVN package to get these distances, as it also provides the useful Mahalanobis distance vs. Chi-square quantile plot (QQ plot).
The mvn() function lets us define the multivariate outlier detection method using the multivariateOutlierMethod argument.
results <- mvn(data = .., multivariateOutlierMethod = "quan", showOutliers = TRUE)
With the multivariateOutlierMethod = "quan" argument, it detects the multivariate outliers using the chi-square distribution critical value approach mentioned before.
The showOutliers = TRUE argument will depict any multivariate outliers and show them on the QQ plot.
Use the cars data set (from the datasets package).
library(datasets)
head(cars)
##   speed dist speed_group
## 1     4    2        =<15
## 2     4   10        =<15
## 3     7    4        =<15
## 4     7   22        =<15
## 5     8   16        =<15
## 6     9   10        =<15
Task 1: Inspect the data for possible multivariate outliers using the Mahalanobis distance vs. Chi-square quantile distribution plot.
Task 2: Find the locations of the multivariate outliers.
# Task 1:
library(MVN)
results <- mvn(data = cars[, 1:2], multivariateOutlierMethod = "quan", showOutliers = TRUE)
# Task 2:
results$multivariateOutliers
##    Observation Mahalanobis Distance Outlier
## 49          49               36.565    TRUE
## 23          23               24.263    TRUE
## 35          35               16.337    TRUE
## 48          48               11.718    TRUE
## 47          47               11.113    TRUE
## 34          34               10.288    TRUE
## 22          22                7.887    TRUE
Most of the ways to deal with outliers are similar to the methods used for missing values, such as:
There are also other approaches specific to dealing with outliers like:
*Transforming and binning will be covered in the next module (Module 7: Transform).
which().
Like imputation of missing values, we can also impute outliers.
We can use mean or median imputation methods to replace outliers.
Before imputing values, always check whether the outlier is a result of data entry/processing error.
If the outlier is due to a data entry/processing error, then go with imputing.
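# Define a function to cap the values outside the limits
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  # Values below the lower fence are replaced by the 5th percentile
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  # Values above the upper fence are replaced by the 95th percentile
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}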
Use the x3-y3 pair in the anscombe data set:
anscombe1 <- anscombe[, c(3, 7)]
anscombe1
##    x3    y3
## 1  10  7.46
## 2   8  6.77
## 3  13 12.74
## 4   9  7.11
## 5  11  7.81
## 6  14  8.84
## 7   6  6.08
## 8   4  5.39
## 9  12  8.15
## 10  7  6.42
## 11  5  5.73
Task 1: For y3, cap the outlier using the function given before.
Task 2: For y3, replace the outlier with the median of y3.
# Task 1:
task1 <- anscombe1$y3 %>% cap()
# Task 2: Similar to task 1, we can write our own function to replace the outlier with the median:
replace_median <- function(x) {
  quantiles <- quantile(x, c(0.25, 0.5, 0.75))
  x[x < quantiles[1] - 1.5 * IQR(x)] <- quantiles[2]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[2]
  x
}
task2 <- anscombe1$y3 %>% replace_median()
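To see what the two approaches do to y3, the following self-contained sketch locates the value outside Tukey's fences and shows its replacement under capping and under median imputation. It does not call the functions above; the names `q`, `is_out`, `capped` and `median_replaced` are illustrative. Only the upper side is capped here, because y3 has no value below the lower fence.

```r
y3 <- datasets::anscombe$y3

# 25th, 50th, 75th and 95th percentiles
q <- quantile(y3, c(0.25, 0.50, 0.75, 0.95), names = FALSE)

# Tukey's outlier fences
lower_fence <- q[1] - 1.5 * IQR(y3)
upper_fence <- q[3] + 1.5 * IQR(y3)
is_out <- y3 < lower_fence | y3 > upper_fence
which(is_out)

# Capping: replace values above the upper fence with the 95th percentile
capped <- y3
capped[y3 > upper_fence] <- q[4]

# Median imputation: replace any flagged value with the median
median_replaced <- replace(y3, is_out, q[2])

# Compare original and replaced values for the flagged observation only
cbind(y3, capped, median_replaced)[is_out, , drop = FALSE]
```

For y3 the third observation (12.74) is the only flagged value; capping moves it to the 95th percentile (10.79), whereas median imputation moves it to 7.11.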
Transforming variables can also eliminate outliers.
Taking the natural logarithm of the values reduces the variation caused by outliers.
Binning is also a form of variable transformation.
Transforming and binning will be covered in detail in the next module.
We don’t always remove, impute, cap or transform suggested outliers in the data.
For some applications (e.g., anomaly detection or fraud detection), outliers can provide valuable information or insight, so analysts may choose to keep those values for further investigation.
In such cases you may choose to leave (and investigate further) those values, as they can tell you an interesting story about your data.
boxplot(), plot()
outliers package: scores()
MVN package: mvn()
Practice!