tidyr
packagepivot_longer()
functionpivot_wider()
functionseparate()
functionunite()
functiondplyr
packageselect()
functionfilter()
functionarrange()
functionmutate()
functionsummarise()
(or summarize()
)
functiongroup_by()
+ summarise()
functionIn this module, I will present Hadley Wickham’s “Tidy
Data” principles (Hadley Wickham et al.
(2014)) and discuss the main benefits of following these
principles. We will identify most common problems with messy data sets
and explore the powerful tidyr
package to tidy messy data
sets. Lastly, we will cover the “Grammar of Data
Manipulation” - the powerful dplyr
package using
examples.
In preparation of this section, I heavily used our recommended textbooks (Boehmke (2016), Hadley Wickham, Çetinkaya-Rundel, and Grolemund (2023)), R Studio’s Data wrangling with R and RStudio webinar, tidyr and dplyr reference manuals (H. Wickham (2024), H. Wickham et al. (2023)).
The learning objectives of this module are as follows:
tidyr
package tools.dplyr
package functions.“Happy families are all alike; every unhappy family is unhappy in
its own way.” —Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” —Hadley Wickham
Hadley Wickham wrote a stellar article called “Tidy
Data” in Journal of Statistical Software to provide a standard way
to organise data values within a dataset. In his paper, Wickham
developed the framework of “Tidy Data Principles” to
provide a standard and consistent way of storing data that makes
transformation, visualization, and modelling easier. Along with the tidy
data principles, he also developed the tidyr
package, which
provides a bunch of tools to help tidy up the messy data sets.
In this section, I will give you a practical introduction to tidy
data and the accompanying tools in the tidyr
package. If
you’d like to learn more about the underlying theory, you might enjoy
the Tidy
Data paper published in the Journal of Statistical Software.
Once you’ve imported and understand the structure of your data, it is
a good idea to tidy it. Tidying your data means storing it in a
consistent form that matches the semantics of the data set with the way
it is stored.
In brief, there are three interrelated rules which make a dataset tidy (Hadley Wickham, Çetinkaya-Rundel, and Grolemund (2023)). In tidy data:
The following illustration taken from Hadley Wickham, Çetinkaya-Rundel, and Grolemund (2023) shows these three rules visually:
To demonstrate these rules, we will use a simple data set:
Student Name | Math | English |
---|---|---|
Anna | 86 | 90 |
John | 43 | 75 |
Catherine | 80 | 82 |
In this simple data, there are in fact three variables illustrated in the following table:
First variable is “Student Name”, the second is “Subject” that represents whether the subject is Maths or English, and the third one is the “Grade” information inside the data matrix.
When we arrange each variable in columns and each student in a row then we will get the tidy version of the same data as follows:
No | Student Name | Subject | Grade |
---|---|---|---|
1 | Anna | Math | 86 |
2 | John | Math | 43 |
3 | Catherine | Math | 80 |
4 | Anna | English | 90 |
5 | John | English | 75 |
6 | Catherine | English | 82 |
You can see that in this format, each variable forms a column and each student forms a row:
The main advantage of using tidy principles is it allows R’s vectorized nature to shine. One can extract variables in a simple, standard way. Have a look at the following illustration. Which would you rather work with?
Tidy data is important because the consistent structure lets you focus on questions about the data, not fighting to get the data into the right form for different functions.
Real data sets can, and often do, violate the three principles of tidy data. This section describes most common problems with messy datasets:
In the example above, the column names 2011, 2012, and 2013 represent values of the year variable, and each row represents three observations, not one.
In the example above, date variable actually stores three new variable information, namely; year, month, and day.
In this example, the year, month, and day variables are given separately in the original data but assume that we need to combine these three columns into a single variable called date for the time series analysis.
tidyr
packageMost messy datasets can be tidied with a small set of tools. The
tidyr
package is a very useful package that reshapes the
layout of data sets. In the next section you will be introduced the
tidyr
package and its functions with examples.
We will use the subset of the data contained in the World Health
Organization Global Tuberculosis Report (also given in
tidyr
package documentation) to illustrate the functions in
the tidyr
package. Before loading this dataset, we need to
install and load the package using:
# install the tidyr package
#install.packages("tidyr")
# load the tidyr package
library(tidyr)
The following example shows the same data organized in four different
ways (table1
, table2
, table3
,
table4a
, table4b
). Each dataset shows the same
values of four variables, country
, year
,
population
, and cases, but each dataset organizes the
values in a different way as follows:
# load the example data organized in four different ways
table1
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
pivot_longer()
functionThe function pivot_longer()
is an updated approach to
gather()
function. When column names are values instead of
variables, we need to gather or in other words, we need to transform
data from wide to long format.
To illustrate gather()
function, let’s have a look at
the data given in table4a
:
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
To tidy a dataset like this, we need to gather those columns into a
new pair of variables using gather()
function. To describe
that operation, we need three parameters:
The set of columns that represent values, not variables. In this example, those are the columns 1999 and 2000.
The name of the variable whose values form the column names. The
argument name key
stands for that variable. For this
example, the key
argument is year
.
The name of the variable whose values are spread over the cells.
The argument name value
stands for that, in this example
value
argument is the number of cases.
table4a %>% gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
OR using the pivot_longer
function:
table4a %>% pivot_longer(names_to = "year", values_to = "cases", cols = 2:3)
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
Here, the arguments names_to
is a string specifying the
name of the column to create from the data stored in the column names of
data, values_to
is a string specifying the name of the
column to create from the data stored in cell values, and
cols
refers to those columns that should be pivot into
longer format.
Note that in the R code below, I used the pipe (%>%
)
operator to take the data first, then use the gather function. The
tidyr
package functions can also be used along with the
pipe operator %>%
which is developed by Stefan Milton
Bache in the R package magrittr
. Remember that the
functions in tidyr
can be used without the pipe operator.
For more information on the pipe operator, its pros and cons please
refer to Dr. James Baglin’s R
Bootcamp Course 1.
pivot_wider()
functionThe function pivot_wider()
is an updated approach to
spread()
function. When multiple variables are stored in
rows, the spread()
function generates columns from rows. In
other words, it transforms data from long to wide format. The
spread()
function is the opposite of gather()
function.
Let’s look at table2
and assume that we are required to
turn long formatted data into wide formatted data by generating columns
from cases.
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
To tidy this up, we first analyse the representation in a similar way
to gather()
. This time, however, we only need two
parameters:
The column that contains variable names, the key column. Here, it’s type.
The column that contains values from multiple variables, the value column. Here, it’s count.
Once we’ve figured that out, we can use spread()
:
spread(table2, key = "type", value = "count")
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
OR equivalently, we have the following using the function
pivot_wider()
with the arguments names_from
and values_from
:
pivot_wider(table2, names_from = "type", values_from = "count")
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Now, cases
and population
are separate
variables given in columns, therefore, generating a new variable from
these two variables is super easy! Let’s calculate the Tuberculosis rate
(rate = cases/population) using:
rate = spread(table2, key = type, value = count)$cases / spread(table2, key = type, value = count)$population
rate
## [1] 0.0000372741 0.0001294466 0.0002193930 0.0004612363 0.0001667495
## [6] 0.0001669488
separate()
functionThe separate()
and extract()
functions both
can be used when multiple variables are stored in one column and you
want to split them according to the separator character. Take
table3
for example:
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The rate column contains both cases and population variables, and we need to split it into two variables.
table3 %>% separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
unite()
functionunite()
is the inverse of separate()
function. One can use it to combine multiple columns into a single
column.
Now let’s look at this data:
table5
## # A tibble: 6 × 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
In this data, assume that we want to combine the century
and year
variables into one variable called
new_year
. We can use unite()
for this
purpose:
table5 %>% unite(new_year, century, year)
## # A tibble: 6 × 3
## country new_year rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
In this case we also need to use the sep
argument. The
default will place an underscore (_)
between the values
from different columns. Here we don’t want any separator, so we use
sep=""
as follows:
table5 %>% unite(new_year, century, year, sep="")
## # A tibble: 6 × 3
## country new_year rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
dplyr
packageAlthough there are many data manipulation packages/functions in R,
most of them lack consistent coding and the ability to easily flow
together. This leads to difficult-to-read nested functions and/or choppy
code. Hadley Wickham developed the very popular dplyr
package to make these data processing tasks more efficient along with a
syntax that is consistent and easier to remember and read.
The dplyr
package is regarded as the “Grammar of
Data Manipulation” in R and it originates from the popular
plyr
package, also developed by Hadley Wickham. The
plyr
package covers data manipulation for a range of data
structures (i.e., data frames, lists, arrays) whereas dplyr
is focused on data frames. In this section, I will focus on
dplyr
. We will cover primary functions inside
dplyr
for data manipulation. The full list of capabilities
can be found in the dplyr
reference manual. I highly recommend going through it as there are
many great functions provided by dplyr
that I will not
cover here.
I will use the nycflights13
package and the data sets to
explore the basic data manipulation verbs of dplyr
. First,
we need to install and load the dplyr
and
nycflights13
packages using:
# install the dplyr package
#install.packages("dplyr")
# load the dplyr package
library(dplyr)
# install the nycflights13 package for the data set
#install.packages("nycflights13")
# load the nycflights13 package
library(nycflights13)
The nycflights13
package includes five data frames
containing information on airlines, airports, flights, weather, and
planes that departed from New York City in 2013. The data comes from the
US
Bureau of Transportation Statistics. Let’s look at the
nycflights13::flights
data set:
# View the flights data set under the nycflights13 package
nycflights13::flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
You might notice that this data frame prints differently from other
data frames you might have used in the past: it only shows the first few
rows and all the columns that fit on one screen. It prints differently
because it’s a tibble. Tibbles are a modern take on data frames. They
are slightly tweaked to work better with tidyr
and
dplyr
(and many others). For now, you don’t need to worry
about the differences (you may refer to here
to learn more on tibbles.
select()
functionWhen working with a large data frame, often we want to only assess
specific variables. The select()
function allows us to
select and/or rename variables.
In addition to the existing functions like :
and
c()
, there are a number of special functions that can work
inside select. Some of them are given in the following table.
Functions | Usage |
---|---|
- |
Select everything but |
: |
Select range |
contains() |
Select columns whose name contains a character string |
ends_with() |
Select columns whose name ends with a string |
everything() |
Select every column |
matches() |
Select columns whose name matches a regular expression |
num_range() |
Select columns named x1, x2, x3, x4, x5 |
one_of() |
Select columns whose names are in a group of names |
starts_with() |
Select columns whose name starts with a character string |
To illustrate we will use the flights
data. Let’s select
year, month and day columns using:
# Select columns: year, month and day
dplyr::select(flights, year, month, day)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ℹ 336,766 more rows
Like tidyr
, dplyr
can also work with the
%>%
operator. Therefore, we can use the following code
to do the same selection:
# Select columns by name using the pipe operator
flights %>% select(year, month, day)
Here are other examples of using select()
:
# Select all columns between year and day (inclusive)
flights %>% select(year:day)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ℹ 336,766 more rows
# Select all columns except those from year to day (inclusive)
flights %>% select( -(year:day) )
## # A tibble: 336,776 × 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 517 515 2 830 819 11 UA
## 2 533 529 4 850 830 20 UA
## 3 542 540 2 923 850 33 AA
## 4 544 545 -1 1004 1022 -18 B6
## 5 554 600 -6 812 837 -25 DL
## 6 554 558 -4 740 728 12 UA
## 7 555 600 -5 913 854 19 B6
## 8 557 600 -3 709 723 -14 EV
## 9 557 600 -3 838 846 -8 B6
## 10 558 600 -2 753 745 8 AA
## # ℹ 336,766 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
For more information on available functions in select
,
type ?select
.
filter()
functionThe filter()
function allows us to identify or select
observations in which a particular variable matches a specific
value/condition. The condition in the filter()
function can
be any kind of logical comparison and Boolean operators, such as:
Symbol | Usage |
---|---|
< |
Less than |
> |
Greater than |
== |
Equal to |
<= |
Less than or equal to |
>= |
Greater than or equal to |
!= |
Not equal to |
%in% |
Group membership |
is.na |
Is NA |
!is.na |
Is not NA |
& |
Boolean and |
| |
Boolean or |
xor |
exactly or |
! |
not |
any |
any true |
all |
all true |
For example, we can select all flights on January 1st using the following:
# Filter the flights on January 1st
flights %>% filter( month == 1, day == 1)
## # A tibble: 842 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 832 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
When you run that line of code, dplyr
executes the
filtering operation and returns a new data frame. dplyr
functions never modify their inputs, so if you want to save the result,
you’ll need to use the assignment operator, <-
:
# Filter the flights on January 1st and save this result
jan1 <- flights %>% filter( month == 1, day == 1)
The following code finds all flights that departed in November or December:
# Filter the flights departing in November or December
x <- flights %>% filter( month == 11 | month == 12)
If we want to find flights that aren’t delayed (on arrival or departure) by more than two hours, we can use either of the following two filters:
# Filter the flights that aren't delayed (on arrival or departure) by more than two hours
flights %>% filter( arr_delay <= 120, dep_delay <= 120 )
## # A tibble: 316,050 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 316,040 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# gives the same result as above
flights %>% filter( ! (arr_delay > 120 | dep_delay > 120) )
For more information on available functions in filter
,
type ?filter
.
arrange()
functionThe arrange()
function allows us to order data by
variables in ascending or descending order.
Let’s order the flights
data in an ascending order using
year, month and day.
# Order the data set according to three variables
flights %>% arrange( year, month, day )
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
We can also apply a descending argument to rank-order from highest to
lowest. The following shows the same data but in descending order by
applying desc()
within the arrange()
function.
# Order the data set according to departure time in a descending order
flights %>% arrange( desc(dep_time) )
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 30 2400 2359 1 327 337
## 2 2013 11 27 2400 2359 1 515 445
## 3 2013 12 5 2400 2359 1 427 440
## 4 2013 12 9 2400 2359 1 432 440
## 5 2013 12 9 2400 2250 70 59 2356
## 6 2013 12 13 2400 2359 1 432 440
## 7 2013 12 19 2400 2359 1 434 440
## 8 2013 12 29 2400 1700 420 302 2025
## 9 2013 2 7 2400 2359 1 432 436
## 10 2013 2 7 2400 2359 1 443 444
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Note that the missing values will always be sorted at the end.
mutate()
functionThe mutate()
function allows us to add new variables
while preserving the existing variables. Here is the list of some useful
functions used inside the mutate()
.
Functions | Usage |
---|---|
pmin() , pmax() |
Element wise min and max |
cummin() , cummax() |
Cumulative min and max |
cumsum() , cumprod() |
Cumulative sum and product |
between() |
Are values between a and b? |
cume_dist() |
Cumulative distribution of values |
cumall() , cumany() |
Cumulative all and any |
cummean() |
Cumulative mean |
lead() , lag() |
Copy with values one position |
ntile() |
Bin vector into n buckets |
dense_rank() , min_rank() ,
percent_rank() , row_number() |
Various ranking methods |
# Select specific variables from the data and store them in a new data frame
flights_sub<- flights %>% select(arr_delay, dep_delay, air_time)
# Create new variables "gain", "hours", and "gain_per_hour"
mutate(flights_sub, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours)
## # A tibble: 336,776 × 6
## arr_delay dep_delay air_time gain hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11 2 227 9 3.78 2.38
## 2 20 4 227 16 3.78 4.23
## 3 33 2 160 31 2.67 11.6
## 4 -18 -1 183 -17 3.05 -5.57
## 5 -25 -6 116 -19 1.93 -9.83
## 6 12 -4 150 16 2.5 6.4
## 7 19 -5 158 24 2.63 9.11
## 8 -14 -3 53 -11 0.883 -12.5
## 9 -8 -3 140 -5 2.33 -2.14
## 10 8 -2 138 10 2.3 4.35
## # ℹ 336,766 more rows
Note that the new variables will appear at the end of the
flights
data frame.
An alternative to mutate()
is transmute()
which creates a new variable and then drops the other variables.
Essentially, it allows you to create a new data frame with only the new
variables created.
transmute(flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours)
## # A tibble: 336,776 × 3
## gain hours gain_per_hour
## <dbl> <dbl> <dbl>
## 1 9 3.78 2.38
## 2 16 3.78 4.23
## 3 31 2.67 11.6
## 4 -17 3.05 -5.57
## 5 -19 1.93 -9.83
## 6 16 2.5 6.4
## 7 24 2.63 9.11
## 8 -11 0.883 -12.5
## 9 -5 2.33 -2.14
## 10 10 2.3 4.35
## # ℹ 336,766 more rows
summarise()
(or summarize()
)
functionThe summarise()
(a.k.a. summarize()
)
function allows us to perform the majority of summary statistics when
performing exploratory data analysis. Here is the list of some useful
functions that can be used inside summary()
.
Functions | Usage |
---|---|
min() , max() |
Minimum and maximum values |
mean() |
Mean value |
median() |
Median value |
sum() |
Sum of values |
var() , sd() |
Variance and standard deviation of a vector |
first() |
First value in a vector |
last() |
Last value in a vector |
nth() |
Nth value in a vector |
n() |
The number of values in a vector |
n_distinct() |
The number of distinct values in a vector |
All functions in this list takes a vector of values and returns a single summary value. We can get the average delay using:
# Take the average of departure delay
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
The na.rm = TRUE
argument in mean()
function will allow us to ignore the missing values while computing the
average. We will revisit na.rm = TRUE
argument in the next
section (in Module 5).
group_by()
+ summarise()
functionIf we want to take the summary statistics grouped by a variable, then
we need to use another function called group_by()
.
group_by()
along with summarise()
functions
will allow us to take and compare summary statistics grouped by a factor
variable.
For example, if we applied the same code to a data frame grouped by destination, we can get the average delay for each destination.
# Group by destination and use summarise to calculate the mean delay
flights %>% group_by(dest) %>% summarise(mean_delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 105 × 2
## dest mean_delay
## <chr> <dbl>
## 1 ABQ 13.7
## 2 ACK 6.46
## 3 ALB 23.6
## 4 ANC 12.9
## 5 ATL 12.5
## 6 AUS 13.0
## 7 AVL 8.19
## 8 BDL 17.7
## 9 BGR 19.5
## 10 BHM 29.7
## # ℹ 95 more rows
Often, we have separate data frames that can have common and differing variables for similar observations. These types of data sets are referred as relational data sets.
We will revisit the nycflights13
package. The
nycflights13
package contains the following data sets:
airlines
includes the names of airline companies and
their abbreviated code:airlines
## # A tibble: 16 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
airports
gives information about each airport,
identified by the airport code (faa
):airports
## # A tibble: 1,458 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # ℹ 1,448 more rows
planes
gives information about each plane, identified
by its tail number (tailnum
):planes
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## # ℹ 3,312 more rows
weather
gives the weather conditions at each NYC
airport for each hour:weather
## # A tibble: 26,115 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
Therefore, for nycflights13
:
flights
connects to planes
via a single
variable, tailnum
.flights
connects to airlines
through the
carrier
variable.flights
connects to airports
in two ways:
via the origin
and dest
variables.flights
connects to weather
via
origin
(the location), and year
,
month
, day
, and hour
(the
time).The following illustration (adapted from Hadley Wickham, Çetinkaya-Rundel, and Grolemund
(2023)) shows the relationship between flights
,
airlines
, airports
and weather
data sets, and the key variables connecting them.
nycflights13
(taken from Hadley
Wickham, Çetinkaya-Rundel, and Grolemund (2023))
The dplyr
package offers three sets of joining functions
to provide alternative ways to join data frames. These are:
Mutating joins: This group of functions add new variables to one data frame from matching observations in another.
Filtering joins: This group of functions filter observations from one data frame based on whether or not they match an observation in the other table.
Set operations: This group of functions treat observations as if they were set elements.
The first set of functions to combine data sets is called
“mutating joins”. left_join()
,
right_join()
, inner_join()
, and
full_join()
functions are in this group. The mutating join
functions allow you to combine variables from two tables and add
variables to the right (like mutate
).
Note that, mutating join functions add variables to the right.
Therefore, if you have a lot of variables already in the data, the new
variables won’t get printed out. As flights
data set has
many variables, I will first create a narrower data set named
flights2
to easily show you what’s going on in the
examples.
# Create a new data set named flights2 including year - day, hour, origin, destination, tailnum and carrier variables
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 × 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ℹ 336,766 more rows
Imagine you want to add the full airline name (from airlines) to the
flights2
data. You can combine the airlines
and flights2
data frames using left_join()
.
Remember that we will need a key variable and the key variable will be
the carrier
variable to join these two data sets.
# joining flights2 and airlines using the carrier name.
flights2 %>% left_join(airlines, by = "carrier")
## # A tibble: 336,776 × 9
## year month day hour origin dest tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 LGA IAH N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 JFK MIA N619AA AA American Airlines Inc.
## 4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
## 5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 EWR ORD N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
## 8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
## 10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines Inc.
## # ℹ 336,766 more rows
Each mutating join takes an argument by
that controls
which variables are used to match observations in the two data sets.
There are a few ways to specify it:
NULL
: The default value. dplyr
will use
all variables that appear in both tables, a natural join. For example,
the flights
and weather
data sets match on
their common variables: year
, month
,
day
, hour
and origin
.# joining flights2 and weather using the default key = NULL.
flights2 %>% left_join(weather)
## Joining with `by = join_by(year, month, day, hour, origin)`
## # A tibble: 336,776 × 18
## year month day hour origin dest tailnum carrier temp dewp humid
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
## 7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
## 8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
## 9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
## # ℹ 336,766 more rows
## # ℹ 7 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
by = "x"
. For example,
flights
and planes
have tailnum
in common.# joining flights2 and planes using the tailnum.
flights2 %>% left_join(planes, by = "tailnum")
## # A tibble: 336,776 × 16
## year.x month day hour origin dest tailnum carrier year.y type
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixed wing mult…
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixed wing mult…
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixed wing mult…
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixed wing mult…
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixed wing mult…
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixed wing mult…
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixed wing mult…
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixed wing mult…
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixed wing mult…
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
## # ℹ 336,766 more rows
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## # seats <int>, speed <int>, engine <chr>
by = c("a" = "b")
. This will
match variable a
in table x
to variable
b
in table y
. This is useful when the key
variables in both data sets are not given the same name. For example,
flights data set has the destination airport code (dest
)
and the airports data set has the faa
code. Essentially
these two are equivalent. Therefore, we can use the following to join
these two data sets:flights2 %>% left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 × 15
## year month day hour origin dest tailnum carrier name lat lon alt
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Georg… 30.0 -95.3 97
## 2 2013 1 1 5 LGA IAH N24211 UA Georg… 30.0 -95.3 97
## 3 2013 1 1 5 JFK MIA N619AA AA Miami… 25.8 -80.3 8
## 4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
## 5 2013 1 1 6 LGA ATL N668DN DL Harts… 33.6 -84.4 1026
## 6 2013 1 1 5 EWR ORD N39463 UA Chica… 42.0 -87.9 668
## 7 2013 1 1 6 EWR FLL N516JB B6 Fort … 26.1 -80.2 9
## 8 2013 1 1 6 LGA IAD N829AS EV Washi… 38.9 -77.5 313
## 9 2013 1 1 6 JFK MCO N593JB B6 Orlan… 28.4 -81.3 96
## 10 2013 1 1 6 LGA ORD N3ALAA AA Chica… 42.0 -87.9 668
## # ℹ 336,766 more rows
## # ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
To help you learn how different types of xxx_join()
functions work, I’m going to use Hadley Wickham’s visual representation
(Hadley Wickham, Çetinkaya-Rundel, and Grolemund
(2023)):
Filtering joins match observations in the same way as mutating joins,
but affect the observations, not the variables. There are two types of
filtering joins: semi_join()
and
anti_join()
.
semi_join(x, y)
: keeps all observations in x that
have a match in y.
anti_join(x, y)
: drops all observations in x that
have a match in y.
Anti-joins are useful for diagnosing join mismatches. For example,
when connecting flights
and planes
, you might
be interested to know that there are many flights that don’t have a
match in planes:
flights %>% anti_join(planes, by = "tailnum") %>% count(tailnum, sort = TRUE)
## # A tibble: 722 × 2
## tailnum n
## <chr> <int>
## 1 <NA> 2512
## 2 N725MQ 575
## 3 N722MQ 513
## 4 N723MQ 507
## 5 N713MQ 483
## 6 N735MQ 396
## 7 N0EGMQ 371
## 8 N534MQ 364
## 9 N542MQ 363
## 10 N531MQ 349
## # ℹ 712 more rows
Here is a visual representation of these two filtering joins.
Suppose you have flights2
(X) and weather
(Y) data sets. Recall, these data sets match on their common variables:
year
, month
, day
,
hour
and origin
.
If you want to add/combine columns to/from data set X from/to data set Y, you need to use mutating joins.
LEFT_JOIN: Let’s say, you want to see the weather of
flights departing at 6 in the morning. You will need to use
left_join()
to ADD the weather
columns to the
flights2
data set.
flights2_h6 <- flights2 %>% filter(hour == 6)
# joining flights2 and weather using the default key = NULL.
flights2_h6 %>% left_join(weather)
## Joining with `by = join_by(year, month, day, hour, origin)`
## # A tibble: 25,951 × 18
## year month day hour origin dest tailnum carrier temp dewp humid
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
## 2 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
## 3 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
## 4 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
## 5 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
## 6 2013 1 1 6 JFK PBI N793JB B6 37.9 27.0 64.3
## 7 2013 1 1 6 JFK TPA N657JB B6 37.9 27.0 64.3
## 8 2013 1 1 6 JFK LAX N29129 UA 37.9 27.0 64.3
## 9 2013 1 1 6 EWR SFO N53441 UA 37.9 28.0 67.2
## 10 2013 1 1 6 LGA DFW N3DUAA AA 39.9 25.0 54.8
## # ℹ 25,941 more rows
## # ℹ 7 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
RIGHT_JOIN: Let’s say, you want to see the flights
that has weather visibility of 5. You will need to use
right_join()
to ADD the flights2
columns to
the weather
data set.
weather_v5 <- weather %>% filter(visib == 5)
flights2 %>% right_join(weather_v5) %>% select(visib, everything())
## Joining with `by = join_by(year, month, day, hour, origin)`
## # A tibble: 4,916 × 18
## visib year month day hour origin dest tailnum carrier temp dewp humid
## <dbl> <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 5 2013 1 9 7 LGA BOS N947UW US 37.0 33.1 85.4
## 2 5 2013 1 9 7 LGA DCA N712US US 37.0 33.1 85.4
## 3 5 2013 1 9 7 LGA MIA N953DL DL 37.0 33.1 85.4
## 4 5 2013 1 9 7 LGA ORD N466AA AA 37.0 33.1 85.4
## 5 5 2013 1 9 7 LGA MIA N3CDAA AA 37.0 33.1 85.4
## 6 5 2013 1 9 7 LGA XNA N730MQ MQ 37.0 33.1 85.4
## 7 5 2013 1 9 7 LGA ORD N438UA UA 37.0 33.1 85.4
## 8 5 2013 1 9 7 LGA MCO N506JB B6 37.0 33.1 85.4
## 9 5 2013 1 9 7 LGA MKE N979AT FL 37.0 33.1 85.4
## 10 5 2013 1 9 7 LGA RSW N593JB B6 37.0 33.1 85.4
## # ℹ 4,906 more rows
## # ℹ 6 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, time_hour <dttm>
INNER_JOIN: Let’s say, you want to see the flights
that departs at 6 in the morning and has weather visibility of 5. You
will need to use inner_join()
to COMBINE the
flights2_h6
and weather_v5
columns into a
single data set with hour = 6 and visib = 5.
flights2_h6 %>% inner_join(weather_v5) %>% select(hour, visib, everything())
## Joining with `by = join_by(year, month, day, hour, origin)`
## # A tibble: 497 × 18
## hour visib year month day origin dest tailnum carrier temp dewp humid
## <dbl> <dbl> <int> <int> <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 6 5 2013 1 29 JFK FLL N588JB B6 37.0 35.1 92.5
## 2 6 5 2013 1 29 JFK IAD N830AS EV 37.0 35.1 92.5
## 3 6 5 2013 1 29 JFK RSW N516JB B6 37.0 35.1 92.5
## 4 6 5 2013 1 29 JFK PHL N949UW US 37.0 35.1 92.5
## 5 6 5 2013 1 29 JFK LAX N67134 UA 37.0 35.1 92.5
## 6 6 5 2013 1 29 JFK SFO N532UA UA 37.0 35.1 92.5
## 7 6 5 2013 1 29 JFK PBI N768JB B6 37.0 35.1 92.5
## 8 6 5 2013 1 29 JFK SJU N793JB B6 37.0 35.1 92.5
## 9 6 5 2013 1 29 JFK ATL N197PQ 9E 37.0 35.1 92.5
## 10 6 5 2013 1 29 JFK BOS N775JB B6 37.0 35.1 92.5
## # ℹ 487 more rows
## # ℹ 6 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, time_hour <dttm>
FULL_JOIN: Let’s say, you want to see the flights
that departs at 6 in the morning or has weather visibility of 5. You
will need to use full_join()
to COMBINE the
flights2_h6
and weather_v5
columns into a
single data set with hour = 6 OR visib = 5.
flights2_h6 %>% full_join(weather_v5) %>% select(hour, visib, everything())
## Joining with `by = join_by(year, month, day, hour, origin)`
## # A tibble: 26,315 × 18
## hour visib year month day origin dest tailnum carrier temp dewp humid
## <dbl> <dbl> <int> <int> <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 6 NA 2013 1 1 LGA ATL N668DN DL NA NA NA
## 2 6 NA 2013 1 1 EWR FLL N516JB B6 NA NA NA
## 3 6 NA 2013 1 1 LGA IAD N829AS EV NA NA NA
## 4 6 NA 2013 1 1 JFK MCO N593JB B6 NA NA NA
## 5 6 NA 2013 1 1 LGA ORD N3ALAA AA NA NA NA
## 6 6 NA 2013 1 1 JFK PBI N793JB B6 NA NA NA
## 7 6 NA 2013 1 1 JFK TPA N657JB B6 NA NA NA
## 8 6 NA 2013 1 1 JFK LAX N29129 UA NA NA NA
## 9 6 NA 2013 1 1 EWR SFO N53441 UA NA NA NA
## 10 6 NA 2013 1 1 LGA DFW N3DUAA AA NA NA NA
## # ℹ 26,305 more rows
## # ℹ 6 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, time_hour <dttm>
Meanwhile, if you want to filter records in data set X using data set Y retaining only columns of data set X, you need to use filtering joins.
SEMI_JOIN: Let’s say, you want to see flights with
weather visibility of 5. You will need to use semi_join()
to FILTER the flights2
data set using the
weather_v5
data set.
flights2 %>% semi_join(weather_v5)
## Joining with `by = join_by(year, month, day, hour, origin)`
## # A tibble: 4,809 × 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 9 7 LGA BOS N947UW US
## 2 2013 1 9 7 LGA DCA N712US US
## 3 2013 1 9 7 LGA MIA N953DL DL
## 4 2013 1 9 7 LGA ORD N466AA AA
## 5 2013 1 9 7 LGA MIA N3CDAA AA
## 6 2013 1 9 7 LGA XNA N730MQ MQ
## 7 2013 1 9 7 LGA ORD N438UA UA
## 8 2013 1 9 7 LGA MCO N506JB B6
## 9 2013 1 9 7 LGA MKE N979AT FL
## 10 2013 1 9 7 LGA RSW N593JB B6
## # ℹ 4,799 more rows
ANTI_JOIN: Let’s say, you want to see flights with
weather visibility not equal to 5. You will need to use
anti_join()
to FILTER the flights2
data set
using the weather_v5
dataset.
flights2 %>% anti_join(weather_v5)
## Joining with `by = join_by(year, month, day, hour, origin)`
## # A tibble: 331,967 × 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ℹ 331,957 more rows
Set operations expect the x and y inputs to have the same variables
and treat the observations like sets. There are three types of set
operations:
intersect(x, y)
: return only observations in both x
and y.
union(x, y)
: return unique observations in x and
y.
setdiff(x, y)
: return observations in x, but not in
y.
Often you may just need to merge data frames by row and column. The
bind_rows()
and bind_cols()
bind the multiple
data frames by row and column, respectively.
bind_rows(x, y)
: Append y to x as new rows.
bind_cols(x, y)
: Append y to x as new
columns.
You can refer to the tidyr
package manual (H. Wickham (2024)) and
the Tidy
Data paper for a detailed information on tidy data principles and
tidyr
package.
Our recommended textbooks (Boehmke
(2016) and Hadley Wickham,
Çetinkaya-Rundel, and Grolemund (2023)), R
Studio’s Data wrangling with R and RStudio webinar, and dplyr
reference manual (H. Wickham (2024), H. Wickham et al. (2023)) are great resources to
excel your knowledge in Data Manipulation with dplyr
.
If you are interested in improving your coding style, I would
recommend using the tidyverse
style
guide. Note that a good coding style is like correct punctuation,
you can manage without it, but a good coding style makes things easier
to read for you and for the others who are gonna use your codes!