Module 4.1 Demonstration
Tidy & Manipulate: Part I - Tidy
1 / 45

Recall

Now that you have imported your data and you understand the basics of managing your data structure.
The next thing you probably want to do is jump into exploratory data analysis (i.e. basic descriptive statistics or data visualizations).
However, prior to that, it is important to make sure your data frame is properly prepared for analysis.
This may require you to do some basic manipulation and ensure your data is in a "tidy" format.

2 / 45

%>% (Pipe) Operator

The tidyr package makes use of the pipe operator %>% developed by Stefan Milton Bache in the R package magrittr.
Operator %>% moves or "pipes" the result forward into the next function call/expression.
$f (x)$ is the same as $x$ %>% $f ()$

For instance, regular code chunks work from inside out:

finally_last_step(
  and_then_third(
    then_second(
      do_first(data)
    )
  )
)

Piping uses intuitive ordering:

data %>%
  do_first() %>% then_second() %>% and_then_third() %>% finally_last_step()

3 / 45

What is Tidy data?

Hadley Wickham's tidy data publication

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

4 / 45

Why Tidy data ?Placing variables in columns takes advantage of R's vectorized nature. One can extract variables in a simple, standard way. 
5 / 45

Why Tidy data ?

Placing variables in columns takes advantage of R's vectorized nature. One can extract variables in a simple, standard way.
Have a look at the following illustration. Which would you rather work with?

5 / 45

Why Tidy data ?

Placing variables in columns takes advantage of R's vectorized nature. One can extract variables in a simple, standard way.
Have a look at the following illustration. Which would you rather work with?
Provides a standard and consistent way of storing data that makes transformation, visualization, and modeling easier.

5 / 45

Activity1. States of the United States of America

The following data frame includes some characteristics of 50 states of the United States of America for 1977.

##             Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama           3615   3624        2.1    69.05   15.1    41.3    20
## Alaska             365   6315        1.5    69.31   11.3    66.7   152
## Arizona           2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas          2110   3378        1.9    70.66   10.1    39.9    65
## California       21198   5114        1.1    71.71   10.3    62.6    20
## Colorado          2541   4884        0.7    72.06    6.8    63.9   166
## Connecticut       3100   5348        1.1    72.48    3.1    56.0   139
## Delaware           579   4809        0.9    70.06    6.2    54.6   103

Is this Tidy? Press P to get the answer!

6 / 45

Population: population Income: per capita income Illiteracy: illiteracy (percent of the population) Life Exp: life expectancy in years Murder: murder and non-negligent manslaughter rate per 100,000 population HS Grad: percent high-school graduates Frost: mean number of days with minimum temperature below freezing Area: land area in square miles

Tidy as one variable per column and one observation per row. This way, we can take the required variable/column using $ or [ ] very easily using df$Life_Exp and calculate its mean using mean(df$Life_Exp).

Activity2. Province population in Canada

The following data frame includes province population in different provinces of Canada.

##    source destination migrants distance  pops66  pops71 popd66 popd71
## 1     PEI        NFLD      255      924  108535  111641 493396 522104
## 2      NS        NFLD     2380      952  756039  788960 493396 522104
## 3      NB        NFLD     1140     1119  616788  534557 493396 522104
## 4     QUE        NFLD     2145     1641 5780845 6027764 493396 522104
## 5     ONT        NFLD     6295     1996 6960870 7703106 493396 522104
## 6     MAN        NFLD      215     3159  963066  988247 493396 522104
## 7    SASK        NFLD      185     3542  955344  926242 493396 522104
## 8    ALTA        NFLD      425     4059 1463203 1627874 493396 522104
## 9      BC        NFLD      425     4838 1873674 2184621 493396 522104
## 10   NFLD         PEI      340      924  493396  522104 108535 111641

Is this data Tidy? Press P to get the answer!

7 / 45

source: Province of Canada pops66: 1966 population of source province. pops71: 1971 population of source province.

Untidy as two variables given in columns (population 1966 and 1971) are not variables but values of another variable.

Activity3. WHO TB rates

The following data frame includes Tuberculosis rates (TB) of different countries:

## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

Is this data Tidy? Press P to get the answer!

8 / 45

country: country name year: year rate: TB rate given in (cases/population)

Untidy. Tuberculosis rates (TB) cannot be evaluated in this format. TB rate column consists of two variables namely: TB cases and population.

Tidy data Cont.

The principles of tidy data might seem so obvious that you wonder if you’ll ever encounter a data set that isn’t tidy. Unfortunately, however, most real data is untidy. There are two main reasons:

Data is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.
Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.

9 / 45

Long vs. wide format data

A single data set can be rearranged in many different ways.
One of the ways is called "long format". In this layout, the data set is arranged in such a way that a single subject's information is stored in multiple rows.
In the wide format, a single subject's information is stored in multiple columns.
The main difference between a wide layout and a long layout is that the wide layout contains all the measured information in different columns.

10 / 45

Long vs. wide & tidy data

Tidy data is typically in a long format:

Tidy data follows the rules of long format: each variable is a column, and each observation is a row.
Long format is the most common format used in tidy data.
Wide format is often used for summarizing or comparing data but is not as suitable for analysis and is not considered tidy data.

11 / 45

The `tidyr` package

tidyr is a one such package which was built for the sole purpose of simplifying the process of creating tidy data.
There are four fundamental functions of data tidying that tidyr provides:

pivot_longer() OR gather() makes "wide" data longer.
pivot_wider() OR spread() makes "long" data wider.
separate() splits a single column into multiple columns.
unite() combines multiple columns into a single column.

12 / 45

Tidying data with `tidyr`

Source

13 / 45

`pivot_longer()` (`gather()`) function

Objective: Reshaping wide format to long format.

14 / 45

`pivot_longer()` function arguments

Function:       pivot_longer(data, names_to = " ", values_to = " ", cols = )
Arguments:
        data:           data frame
        names_to:       name of the column to create from the data stored in the column names of data
        values_to:      name of the column to create from the data stored in cell values
        cols:           columns that should be pivot into longer format.

15 / 45

`gather()` function arguments

Function:       gather(data, key, value,...)
Arguments:
        data:           data frame
        key:            column name representing new variable
        value:          column name representing variable values
        ...:            names of columns to gather (or not gather)

16 / 45

`gather()` function

17 / 45

`gather()` function

18 / 45

`gather()` function

19 / 45

`gather()` function

20 / 45

Example

Consider the following data set:

##    Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
## 1      1 2006    15    16    19    17
## 2      1 2007    12    13    27    23
## 3      1 2008    22    22    24    20
## 4      1 2009    10    14    20    16
## 5      2 2006    12    13    25    18
## 6      2 2007    16    14    21    19
## 7      2 2008    13    11    29    15
## 8      2 2009    23    20    26    20
## 9      3 2006    11    12    22    16
## 10     3 2007    13    11    27    21
## 11     3 2008    17    12    23    19
## 12     3 2009    14     9    31    24

Wide or long format data ?

21 / 45

Example

Consider the following data set:

##    Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
## 1      1 2006    15    16    19    17
## 2      1 2007    12    13    27    23
## 3      1 2008    22    22    24    20
## 4      1 2009    10    14    20    16
## 5      2 2006    12    13    25    18
## 6      2 2007    16    14    21    19
## 7      2 2008    13    11    29    15
## 8      2 2009    23    20    26    20
## 9      3 2006    11    12    22    16
## 10     3 2007    13    11    27    21
## 11     3 2008    17    12    23    19
## 12     3 2009    14     9    31    24

Wide or long format data ?
This data is considered wide since the time variable (represented as quarters) is structured such that each quarter represents a variable.

21 / 45

Example Cont.

We need to gather each quarter within one column/variable and also gather values associated with each quarter in a second column.


# long_DF <- DF %>% gather(Quarter, Revenue, Qtr.1:Qtr.4) 
long_DF <- DF %>% pivot_longer(cols = Qtr.1:Qtr.4,names_to = "Quarter",values_to = "Revenue")
head(long_DF, 24)  # note, for brevity, I only show the data for the first two years 
##    Group Year Quarter Revenue
## 1      1 2006   Qtr.1      15
## 2      1 2007   Qtr.1      12
## 3      1 2008   Qtr.1      22
## 4      1 2009   Qtr.1      10
## 5      2 2006   Qtr.1      12
## 6      2 2007   Qtr.1      16
## 7      2 2008   Qtr.1      13
## 8      2 2009   Qtr.1      23
## 9      3 2006   Qtr.1      11
## 10     3 2007   Qtr.1      13
## ..   ...  ...     ...     ...

22 / 45

Example Cont.

##    Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
## 1      1 2006    15    16    19    17
## 2      1 2007    12    13    27    23
## 3      1 2008    22    22    24    20
...
...

Note that all of these produce the same result:

DF %>% pivot_longer(cols = Qtr.1:Qtr.4,names_to = "Quarter",values_to = "Revenue")
DF %>%pivot_longer(cols = -c(Group, Year),names_to = "Quarter",values_to = "Revenue")
DF %>% pivot_longer(cols = 3:6,names_to = "Quarter",values_to = "Revenue")
DF %>% pivot_longer(cols = c(Qtr.1, Qtr.2, Qtr.3, Qtr.4),names_to = "Quarter",values_to = "Revenue")

23 / 45

`pivot_wider()` (`spread()`) function

Objective: Reshaping long format to wide format.

This function is a complement to pivot_wider()/spread().

24 / 45

`pivot_wider()` function arguments

There are times when we are required to turn long formatted data into wide formatted data.
When multiple variables are stored in rows, the pivot_wider() function generates columns from rows. This function spreads a key-value pair across multiple columns.

Function:       pivot_wider(data, names_from = " ", values_from = " ")
Arguments:
        data:           data frame
        names_from:      column that contains variable names
        values_from:     column that contains values from multiple variables

25 / 45

`spread()` function arguments

Function:       spread(data, key, value)
Arguments:
        data:           data frame
        key:            column values to convert to multiple columns
        value:          single column values to convert to multiple columns' values

26 / 45

`spread()` function

27 / 45

`spread()` function

28 / 45

`spread()` function

29 / 45

Example

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

Suppose you want to calculate tuberculosis rate from (rate = cases/population).
To achieve this, cases and population needs to be separately given in columns.

30 / 45

Example Cont.

#spread(table2, key = type, value = count)
pivot_wider(table2,names_from = type,values_from = count)

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

31 / 45

`separate( )` function

Objective: Splitting a single variable into two.

Many times multiple variables are stored in one column and you want to split them according to a separator character.

Function:       separate(data, col, into, sep = " ", remove = TRUE, convert = FALSE)
Same as:        data %>% separate(col, into, sep = " ", remove = TRUE, convert = FALSE)
Arguments:
        data:           data frame
        col:            column name representing current variable
        into:           names of variables representing new variables
        sep:            how to separate current variable (char, num, or symbol)
        remove:         if TRUE, remove input column from output data frame
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

32 / 45

`separate( )` function

33 / 45

`separate( )` function

34 / 45

`separate( )` function

35 / 45

`separate( )` function

36 / 45

Example `seperate()` function

## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

The rate column contains both cases and population variables, and we need to split it into two variables.

table3 %>%separate(rate, into = c("cases", "population"), sep = "/")

## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

37 / 45

`unite( )` function

Objective: Merging two variables into one.

There may be a time in which we would like to combine the values of two variables. The unite() function combine multiple columns into a single column.

Function:       unite(data, col, ..., sep = " ", remove = TRUE)
Same as:        data %>% unite(col, ..., sep = " ", remove = TRUE)
Arguments:
        data:           data frame
        col:            column name of new "merged" column
        ...:            names of columns to merge
        sep:            separator to use between merged values
        remove:         if TRUE, remove input column from output data frame

38 / 45

`unite( )` function

39 / 45

`unite( )` function

40 / 45

`unite( )` function

41 / 45

`unite( )` function

42 / 45

Example `unite()` function

## # A tibble: 5 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272

Assume that we want to combine the century and year variables into one variable called new_year.

table5 %>%unite(new_year, century, year, sep="")

## # A tibble: 6 × 3
##   country     new_year rate             
##   <chr>       <chr>    <chr>            
## 1 Afghanistan 1999     745/19987071     
## 2 Afghanistan 2000     2666/20595360    
## 3 Brazil      1999     37737/172006362  
## 4 Brazil      2000     80488/174504898  
## 5 China       1999     212258/1272915272
## 6 China       2000     213766/1280428583

43 / 45

What do you need to know by Week 4

Distinguish tidy vs untidy data sets.
Distinguish wide vs long format data.
Understand tidy data principles.
Use tidyr package functions.

Practice!

44 / 45

Worksheet questions:

Complete the following worksheet:

Module 4.1 Worksheet

Once completed, feel free to work on your Assessments.

Return to Course Website

45 / 45

Recall

Now that you have imported your data and you understand the basics of managing your data structure.

The next thing you probably want to do is jump into exploratory data analysis (i.e. basic descriptive statistics or data visualizations).

However, prior to that, it is important to make sure your data frame is properly prepared for analysis.

This may require you to do some basic manipulation and ensure your data is in a "tidy" format.

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Module 4.1 Demonstration

Tidy & Manipulate: Part I - Tidy

Recall

%>% (Pipe) Operator

What is Tidy data?

Why Tidy data ?

Why Tidy data ?

Why Tidy data ?

Activity1. States of the United States of America

Activity2. Province population in Canada

Activity3. WHO TB rates

Tidy data Cont.

Long vs. wide format data

Long vs. wide & tidy data

The tidyr package

Tidying data with tidyr

pivot_longer() (gather()) function

pivot_longer() function arguments

gather() function arguments

gather() function

gather() function

gather() function

gather() function

Example

Example

Example Cont.

Example Cont.

pivot_wider() (spread()) function

pivot_wider() function arguments

spread() function arguments

spread() function

spread() function

spread() function

Example

Example Cont.

separate( ) function

separate( ) function

separate( ) function

separate( ) function

separate( ) function

Example seperate() function

unite( ) function

unite( ) function

unite( ) function

unite( ) function

unite( ) function

Example unite() function

What do you need to know by Week 4

Worksheet questions:

Recall

Help

The `tidyr` package

Tidying data with `tidyr`

`pivot_longer()` (`gather()`) function

`pivot_longer()` function arguments

`gather()` function arguments

`gather()` function

`gather()` function

`gather()` function

`gather()` function

`pivot_wider()` (`spread()`) function

`pivot_wider()` function arguments

`spread()` function arguments

`spread()` function

`spread()` function

`spread()` function

`separate( )` function

`separate( )` function

`separate( )` function

`separate( )` function

`separate( )` function

Example `seperate()` function

`unite( )` function

`unite( )` function

`unite( )` function

`unite( )` function

`unite( )` function

Example `unite()` function