The following packages will be required or may come in handy.
library(readr)
library(dplyr)
library(readxl)
library(gdata)
library(rvest)
library(tidyr)
library(knitr)
library(deductive)
library(validate)
library(Hmisc)
library(stringr)
The following exercises (Exercises 1-3) are based on the Private Consumption Data, pr.RDS, located at http://www.oecd-ilibrary.org/economics/data/main-economic-indicators_mei-data-en. It contains 44 observations of countries’ seasonally adjusted private consumption, which is one of the main economic indicators. The variables are self-explanatory; however, you are expected to check the data types and apply suitable transformations if necessary.
Identify the NAs in the full data frame and in each column, and use print() to see the results. Find the location of the NAs in each column using the appropriate function. Count the NAs in each column using sum(), then use colSums() for the same task.
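The functions named above can be tried on a small example first. This is only a sketch: the data frame `toy` and its column names are made up for illustration and do not come from pr.RDS.

```r
# Hypothetical toy data; it only stands in for the private consumption data
toy <- data.frame(
  country = c("A", "B", "C"),
  q1 = c(1.2, NA, 0.8),
  q2 = c(NA, NA, 1.1),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE wherever a value is missing
na_mask <- is.na(toy)
print(na_mask)

# Row positions of the NAs in a single column
na_rows_q1 <- which(is.na(toy$q1))
print(na_rows_q1)

# NA count for the whole data frame
total_na <- sum(is.na(toy))

# NA count per column, first with sum() applied to each column ...
na_by_col_sum <- sapply(toy, function(col) sum(is.na(col)))

# ... then with a single colSums() call on the logical matrix
na_by_col <- colSums(is.na(toy))
print(na_by_col)
```

Both per-column approaches give the same counts; colSums() simply avoids writing the loop over columns yourself.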
Create two data frames using complete.cases(), one with the complete cases and one with the incomplete cases obtained by subsetting with the ! operator. Use na.omit() to get the complete cases as well. Find the country that has NAs in every column using rowSums() nested with is.na(), then remove that country from the data frame using the same method with the condition reversed.
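A minimal sketch of these steps on made-up data follows. The data frame `toy` and the choice to ignore the `country` column when testing for "NA in every column" are assumptions for illustration.

```r
# Hypothetical toy data; country D has no values at all
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  q1 = c(1.2, NA, 0.8, NA),
  q2 = c(0.9, 1.0, 1.1, NA),
  stringsAsFactors = FALSE
)

# Rows without any NA, and rows with at least one NA
complete_rows   <- toy[complete.cases(toy), ]
incomplete_rows <- toy[!complete.cases(toy), ]

# na.omit() drops every row that contains an NA
omitted <- na.omit(toy)

# Count NAs per row over the value columns only (assumption: the
# country name itself is never missing, so it is left out of the check)
value_cols <- setdiff(names(toy), "country")
all_na <- rowSums(is.na(toy[value_cols])) == length(value_cols)
print(toy$country[all_na])

# Reverse the condition to keep everything except that country
toy_clean <- toy[!all_na, ]
```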
Recode the missing values with the mean of the quarterly values for 2017. To do this, compute the row-wise mean of the 2017 quarterly columns using rowMeans() with the na.rm=TRUE argument, then recode the NA values using the ifelse() function. When you have finished recoding, check for NaN values in the data frame; refer to the is.notanumber() function from the lecture notes, given below (with na.rm=TRUE, rowMeans() returns NaN for a row whose values are all missing).
is.notanumber <- function(x){ if (is.numeric(x)) is.nan(x) }
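The recoding and the NaN check can be sketched on a small example. The data frame `toy` and the column names `Q1.2017` and `Q2.2017` are hypothetical; the real column names in pr.RDS may differ.

```r
# Hypothetical toy data; country C has no 2017 values at all
toy <- data.frame(
  country = c("A", "B", "C"),
  Q1.2017 = c(1.0, NA, NA),
  Q2.2017 = c(3.0, 2.0, NA),
  stringsAsFactors = FALSE
)
q_cols <- c("Q1.2017", "Q2.2017")

# Row-wise mean of the quarterly columns, ignoring NAs.
# A row with no data at all gets NaN rather than NA.
row_mean <- rowMeans(toy[q_cols], na.rm = TRUE)

# Replace each NA with the mean of its own row
for (col in q_cols) {
  toy[[col]] <- ifelse(is.na(toy[[col]]), row_mean, toy[[col]])
}

# Check the numeric columns for NaN values
nan_flags <- sapply(toy[q_cols], is.nan)
print(nan_flags)
```

Here is.nan() is applied only to the numeric columns; the is.notanumber() wrapper above serves the same purpose when a data frame also contains non-numeric columns.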
The following exercises (Exercises 4-6) are based on the Population by Country Time Series Data, popbycountry.csv, sourced from the U.S. Department of Energy and located at https://openei.org/doe-opendata/dataset/population-by-country-1980-2010. It contains 232 observations of countries’ population in millions per year from 1980 to 2010. The variables are self-explanatory; however, you are expected to check the data types and apply suitable transformations if necessary.
Investigate the NA values in the popbycountry.csv data set. Have you noticed the -- values? Replace them with NA. Remove any country that has NAs in every column. Now identify the remaining NA values. Don’t forget to check the data types using str() or typeof(), and make appropriate adjustments.
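One way to handle a text placeholder such as -- is sketched below on made-up data. The data frame `toy` and the column names `Country`, `X1980` and `X1981` are assumptions; check the real names with str() after reading popbycountry.csv.

```r
# Hypothetical toy data: the "--" placeholder forces the year columns to text
toy <- data.frame(
  Country = c("A", "B", "C"),
  X1980 = c("1.5", "--", NA),
  X1981 = c("1.6", "2.1", "--"),
  stringsAsFactors = FALSE
)
year_cols <- setdiff(names(toy), "Country")

# Replace the placeholder with NA, then convert each year column to numeric
toy[year_cols] <- lapply(toy[year_cols], function(col) {
  col[col %in% "--"] <- NA
  as.numeric(col)
})
str(toy)

# Drop countries whose year columns are all NA
no_data <- rowSums(is.na(toy[year_cols])) == length(year_cols)
toy_clean <- toy[!no_data, ]
```

An alternative is to declare the placeholder when reading the file, for example with the na.strings argument of read.csv(), so that the columns are read as numeric from the start.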
Use the str_detect() function from the stringr package with dplyr::filter() to keep the rows whose country name contains Germany, and save the result as a data frame. Replace the NA values in the Germany row with the column sums of Germany, East and Germany, West.
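The filtering and summing steps could look roughly like this. The data frame `toy`, its column names and its numbers are invented for illustration and are not the real population figures.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data with made-up values
toy <- data.frame(
  Country = c("France", "Germany", "Germany, East", "Germany, West"),
  X1980 = c(5, NA, 1, 10),
  X1981 = c(6, NA, 2, 20),
  stringsAsFactors = FALSE
)

# Keep only the rows whose country name contains "Germany"
germany <- toy %>% filter(str_detect(Country, "Germany"))

year_cols <- setdiff(names(germany), "Country")

# Column sums over the East and West rows
parts  <- germany$Country %in% c("Germany, East", "Germany, West")
totals <- colSums(germany[parts, year_cols])

# Fill the NAs in the "Germany" row with those sums
whole <- germany$Country == "Germany"
for (col in year_cols) {
  germany[[col]] <- ifelse(whole & is.na(germany[[col]]),
                           totals[[col]],
                           germany[[col]])
}
print(germany)
```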
Repeat Exercise 5 on the original data set, this time doing the calculation without subsetting Germany. Then remove Germany, East, Germany, West and the countries that have no data for any year.
Data Challenge: Use the data frame you created in Exercise 5 and impute the NAs with the mean values using the impute() function from the Hmisc package. Use is.imputed() to see which values were imputed once you finish. Work with the transpose of the data frame, obtained simply with t(). The challenge is writing a loop, since there are many columns with NAs. Once you take the transpose, don’t forget to check the type of the columns! Make sure you understand the difference between [ ] and [[ ]]. Here is an example to get you started:
df <- data.frame(var1 = c(1, 3, NA, 10),
                 var2 = c(NA, 2, 5, 9))
vars <- c("var1", "var2")
for (i in vars) {
  # replaces every value with the mean of the column
  df[[i]] <- mean(df[[i]], na.rm = TRUE)
}
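The two points raised in the challenge, the difference between [ ] and [[ ]] and the column types after t(), can be seen in a short base R sketch. The data frame `demo` and its columns are hypothetical and chosen only to show the behaviour.

```r
# Hypothetical data frame with one text column and two numeric columns
demo <- data.frame(
  name = c("a", "b"),
  var1 = c(1, NA),
  var2 = c(NA, 2),
  stringsAsFactors = FALSE
)

# Single brackets keep the data frame wrapper; double brackets extract the vector
one_col_df  <- demo["var1"]
one_col_vec <- demo[["var1"]]

# mean() needs the vector, so use [[ ]] inside a loop over column names
mean_from_vec <- mean(demo[["var1"]], na.rm = TRUE)

# t() returns a matrix. With a text column included, every cell becomes text.
mixed_t <- t(demo)

# Transposing only the numeric columns keeps the values numeric
numeric_t <- t(demo[c("var1", "var2")])
colnames(numeric_t) <- demo$name
numeric_t_df <- as.data.frame(numeric_t)
str(numeric_t_df)
```

Checking with str() after the transpose shows whether the columns still need as.numeric() before any means are computed.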
The bonus exercise is based on randomly sampled bank marketing data (revisited from Week 1), banksim.csv, which has been manipulated for the purpose of the task. The original data are located at the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The data set contains the following variables:
age: Numerical variable
marital: Categorical variable with three levels (married, single, divorced, where widowed is counted as divorced)
education: Categorical variable with three levels (primary, secondary, tertiary)
job: Categorical variable containing type of jobs
balance: Numerical variable, balance in the bank account
day: Numerical variable, last contacted day of the month
month: Categorical variable, last contacted month
duration: Numerical variable, duration of the contact time
If you have finished the above tasks, work through the weekly list of tasks posted on the Canvas announcement page.