Data types in general
Data types in R
R’s data structures
Converting Data Types/Structures
is. functions
as. functions
Let’s think about your daily coffee consumption:
You could say "I drink coffee every day"; then it is a nominal variable.
If you say "I drink 4 cups every day", then it is discrete.
If you say "I drink 80 grams of coffee", then it is continuous.
R is a programming language and has its own definitions of data types and structures.
Technically, R classifies all the different types of data into four classes: logical, numeric, character and factor.
Useful functions in R:
class() to check the class of an object.
typeof() to check whether a numeric object is integer or double.
levels() to see the levels of a factor object.
Logical class consists of TRUE or FALSE (binary) values.
A logical value is often created via comparison between variables.
x <- 10
y <- (x > 0)
y
## [1] TRUE
class(y)
## [1] "logical"
Numeric (integer or double): Quantitative values are numeric in R.
Numeric class can be integer or double.
Integer values can be seen as discrete values (e.g., 2), whereas double values are floating-point numbers (e.g., 2.16).
To create a double numeric variable:
var1 <- c(4, 7.5, 14.5)
To create an integer variable, place an L directly after each number:
var2 <- c(4L, 7L, 14L)
class(var1)
## [1] "numeric"
class(var2)
## [1] "integer"
To check whether a numeric object is integer or double, use typeof():
typeof(var1)
## [1] "double"
typeof(var2)
## [1] "integer"
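As a small sketch of the distinction above (the object names a and b are illustrative and not part of the slides), a number typed without the L suffix is stored as a double even when it looks like a whole number:

```r
a <- 4    # no L suffix: stored as a double
b <- 4L   # L suffix: stored as an integer

typeof(a)       # "double"
typeof(b)       # "integer"
is.integer(a)   # FALSE
is.numeric(b)   # TRUE: integers still count as numeric
class(a + b)    # "numeric": mixing integer and double gives a double
```

In everyday analysis the difference rarely matters, but it explains why class() and typeof() can report different things for the same object.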
Character: A character class is used to represent string values in R.
To generate a character object, use quotation marks " " and assign a string/text to an object:
var3 <- c("debit", "credit", "Paypal")
class(var3)
## [1] "character"
Factor class is used to represent qualitative data in R.
Factors can be ordered or unordered.
They store the nominal values as a vector of integers in the range 1…k (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.
Factor objects can be created with the factor() function:
var4 <- factor( c("Male", "Female", "Male", "Male") )
var4
## [1] Male   Female Male   Male  
## Levels: Female Male
class(var4)
## [1] "factor"
To see the levels of a factor object, the levels() function will be used:
levels(var4)
## [1] "Female" "Male"
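The integer storage described above can be inspected directly. The following is a minimal sketch that rebuilds var4 so it runs on its own; the name codes is only illustrative:

```r
var4 <- factor(c("Male", "Female", "Male", "Male"))

codes <- as.integer(var4)  # the internal integer codes in the range 1..k
codes                      # 2 1 2 2 (Female = 1, Male = 2)
levels(var4)[codes]        # maps the codes back to the original strings
nlevels(var4)              # k, the number of unique values: 2
```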
By default, levels of the factors will be ordered alphabetically.
Using the levels argument of factor(), we can control the ordering of the levels while creating a factor:
var5 <- factor( c("Male", "Female", "Male", "Male"), levels = c("Male", "Female") )
var5
## [1] Male   Female Male   Male  
## Levels: Male Female
levels(var5)
## [1] "Male" "Female"
To create an ordered factor, add the ordered = TRUE argument:
var6 <- factor( c("DI", "HD", "PA", "NN", "CR", "DI", "HD", "PA"), levels = c("NN", "PA", "CR", "DI", "HD"), ordered = TRUE )
var6
## [1] DI HD PA NN CR DI HD PA
## Levels: NN < PA < CR < DI < HD
Note the ordering NN < PA < CR < DI < HD in the output.
Factors are also created during data import. Base R import functions such as read.csv() and read.table() have a stringsAsFactors option that determines how character data is read into R.
Since R 4.0.0 the default is stringsAsFactors = FALSE, but setting it to TRUE converts all columns that are detected to be character/strings to factor variables. (read_csv() from the readr package has no such option and reads strings as characters by default.)
Let's read the VIC_pet.csv data using read.csv() with stringsAsFactors = TRUE:
pets <- read.csv("../data/VIC_pet.csv", stringsAsFactors = TRUE)
# pets1 <- read.csv("../data/VIC_pet.csv")
str(pets)
## 'data.frame': 40 obs. of 8 variables:
##  $ id               : Factor w/ 40 levels "10396v","104515v",..: 15 9 39 21 18 4 30 6 14 25 ...
##  $ State            : Factor w/ 1 level "Victoria": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Region           : Factor w/ 7 levels "Ballarat","Colac Otway",..: 2 7 1 3 3 1 3 5 7 3 ...
##  $ Animal_Type      : Factor w/ 7 levels "Cat","Cat ",..: 7 2 1 4 4 4 4 5 6 1 ...
##  $ Animal_Name      : Factor w/ 28 levels "","Bailey","Blacky",..: 9 1 2 24 28 20 14 1 1 10 ...
##  $ Breed_Description: Factor w/ 34 levels "","American Staffordshire Terrier",..: 23 11 6 14 24 16 28 28 2 11 ...
##  $ Colour           : Factor w/ 3 levels "","NULL","WHI ": 2 1 1 1 1 1 1 1 1 1 ...
##  $ Animal_Desexed   : Factor w/ 4 levels "","N","y","Y": 1 1 1 1 1 1 1 4 1 1 ...
Let's look at the Animal_Type variable and check its levels using:
levels(pets$Animal_Type)
## [1] "Cat"
## [2] "Cat "
## [3] "dog"
## [4] "Dog"
## [5] "DOG "
## [6] "Dog "
## [7] "Dog "
Note that there are actually only two distinct animal types in Animal_Type, i.e. dog and cat.
However, when strings are automatically converted to factors, every distinct string (including those that differ only in letter case or surrounding whitespace) becomes its own level, so we observe seven different levels in Animal_Type.
Therefore it is good practice to read such strings as characters and then apply string manipulations (which will be covered in Module 8) to standardize all strings to "dog" and "cat".
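As a rough preview of that clean-up, here is a minimal sketch using base R's trimws() and tolower(). The vector animal_raw is hypothetical: it only imitates the inconsistent entries shown above and is not read from VIC_pet.csv.

```r
# Hypothetical values imitating the inconsistent Animal_Type entries
animal_raw <- c("Cat", "Cat ", "dog", "Dog", "DOG ", "Dog ", "Dog  ")

# Remove surrounding whitespace, then convert everything to lower case
animal_clean <- tolower(trimws(animal_raw))

# Factorize only after the strings have been standardized
animal_type <- factor(animal_clean)
levels(animal_type)   # "cat" "dog"
```

Cleaning first and converting to a factor afterwards avoids carrying spurious levels through the rest of the analysis.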
Now let's look at the id variable, which contains the unique identification number of the pets:
levels(pets$id)
##  [1] "10396v"  "104515v" "110188v" "114898v" "129666v" "13234v"  "137135v"
##  [8] "141587v" "142785v" "143032v" "151452v" "151569v" "151921v" "154462v"
## [15] "17819v"  "1828v"   "18714v"  "35939v"  "3654v"   "39333v"  "46906v" 
## [22] "49872v"  "51127v"  "54848v"  "55483v"  "5754v"   "61112v"  "64560v" 
## [29] "66701v"  "70244v"  "70794v"  "7089v"   "77361v"  "81001v"  "84561v" 
## [36] "88946v"  "92359v"  "93485v"  "97268v"  "97957v"
There is no need to factorize the id variable, as there are 40 observations and 40 different levels, one per observation.
Therefore, any factorization of the id variable would be inefficient. For such cases, it is better to leave this column as character (stringsAsFactors = FALSE) during the data import.
A data set is a collection of measurements or records which can be in any class (i.e., logical, character, numeric, factor, etc.).
Typically, data sets contain many variables of different length and type of values.
In R, we can store data sets using vectors, lists, matrices and data frames and these are called "Data Structures".
A vector is the basic structure in R, which consists of a one-dimensional sequence of data elements of the same basic type (i.e., integer, double, logical, or character).
Vectors are created by combining multiple elements into a one-dimensional array using the combine c() function.
The one-dimensional examples illustrated previously are considered vectors:
var1 <- c(4, 7.5, 14.5) # a double numeric vector
var2 <- c(4L, 7L, 14L) # an integer vector
var3 <- c(TRUE, FALSE, TRUE, TRUE) # a logical vector
When elements of different types are combined in a vector, R coerces them to a common type:
Vector of characters + numerics:
ex1 <- c("a", "b", "c", 1, 2, 3)
class(ex1) # --> a character vector
## [1] "character"
Vector of numerics + logical:
ex2 <- c(1, 2, 3, TRUE, FALSE)
class(ex2) # --> a numeric vector
## [1] "numeric"
Vector of logical + characters:
ex3 <- c(TRUE, FALSE, "a", "b", "c")
class(ex3) # --> a character vector
## [1] "character"
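The pattern in these examples follows R's coercion order, logical < integer < double < character: a mixed vector takes the most flexible type present. A minimal sketch (the object names are illustrative only):

```r
t1 <- typeof(c(TRUE, 2L))    # logical + integer   -> "integer"
t2 <- typeof(c(2L, 2.5))     # integer + double    -> "double"
t3 <- typeof(c(2.5, "a"))    # double + character  -> "character"
t4 <- c(TRUE, "a")           # logical + character -> "TRUE" "a"

c(t1, t2, t3)
t4
```

Note that TRUE becomes 1 when coerced to a number but "TRUE" when coerced to a character string.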
To add additional elements to a vector, use the c() function.
Let's add two elements (4 and 6) to the ex2 vector:
ex4 <- c(ex2, 4, 6)
ex4
## [1] 1 2 3 1 0 4 6
To subset a vector, use square brackets [ ] with positive or negative integers, logical values or names.
ex4
## [1] 1 2 3 1 0 4 6
Take the third element of ex4:
ex4[3]
## [1] 3
Take the first three elements of ex4:
ex4[1:3]
## [1] 1 2 3
Take the 1st, 3rd, and 5th element:
ex4[c(1,3,5)]
## [1] 1 3 0
Take all elements except first:
ex4[-1]
## [1] 2 3 1 0 4 6
Take all elements less than 3:
ex4[ ex4 < 3 ]
## [1] 1 2 1 0
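The examples above cover integers and logical values; subsetting by names works the same way once a vector has names. A minimal sketch with a hypothetical named vector cups (not part of the slides):

```r
# Hypothetical named vector: cups of coffee per day
cups <- c(mon = 2, tue = 4, wed = 3)

cups["tue"]               # one element, selected by name
cups[c("mon", "wed")]     # several elements, selected by name
names(cups)[cups > 2]     # names of the elements greater than 2: "tue" "wed"
unname(cups["tue"])       # 4, with the name dropped
```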
A list is an R structure that allows you to combine elements of different types and lengths.
In order to create a list we can use the list() function.
list1 <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.5, 4.2))
The structure of a list can be inspected using str():
str(list1)
## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.5 4.2
To add elements to a list, use the append() function. Let's add a fifth element to list1 and store it as list2:
list2 <- append(list1, list(c("credit", "debit", "Paypal")))
str(list2)
## List of 5
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.5 4.2
##  $ : chr [1:3] "credit" "debit" "Paypal"
R objects can have attributes, which are like metadata for the object. These metadata can be very useful in that they help to describe the object. Some examples of R object attributes are names, dimensions, class and levels.
Attributes of an object (if any) can be accessed using the attributes() function. Let's check if list2 has any attributes.
attributes(list2)
## NULL
We can add names to a list using the names() function.
# add names to a pre-existing list
names(list2) <- c("item1", "item2", "item3", "item4", "item5")
str(list2)
## List of 5
##  $ item1: int [1:3] 1 2 3
##  $ item2: chr "a"
##  $ item3: logi [1:3] TRUE FALSE TRUE
##  $ item4: num [1:2] 2.5 4.2
##  $ item5: chr [1:3] "credit" "debit" "Paypal"
Named list items can then be referred to using the $ sign.
To subset lists we can use the $ sign, square brackets [ ] or double square brackets [[ ]]:
list2[1] # take the first list item in list2
## $item1
## [1] 1 2 3
list2[[1]] # take the first list item in list2 without attributes
## [1] 1 2 3
list2$item1 # take the first list item in list2 using $
## [1] 1 2 3
list2$item1[3] # take the third element out of first list item
## [1] 3
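The practical difference between the two bracket forms shows up when the result is used in a calculation. A minimal sketch with a small hypothetical list lst (rebuilt here so the block runs on its own):

```r
lst <- list(item1 = 1:3, item2 = "a")   # hypothetical list for illustration

sub_single <- lst[1]     # single brackets: a list of length 1
sub_double <- lst[[1]]   # double brackets: the integer vector itself

class(sub_single)   # "list"
class(sub_double)   # "integer"
sum(sub_double)     # 6
# sum(sub_single) would raise an error, because a list is not numeric
```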
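Note the difference between square brackets [ ] and double square brackets [[ ]] in subsetting lists: [ ] returns a list containing the selected item(s), whereas [[ ]] returns the item itself.
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.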
In R, the elements of a matrix must be of same class (i.e. all elements must be numeric, or character, etc.) and all columns of a matrix must be of same length.
Matrices can be created with the matrix() function using the nrow and ncol arguments.
m1 <- matrix(1:6, nrow = 2, ncol = 3)
m1
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
Matrices can also be created using the column-bind cbind() and row-bind rbind() functions.
Note that the vectors being bound must be of equal length and mode.
v1 <- c( 1, 4, 5)
v2 <- c( 6, 8, 10)
# create a matrix using column-bind
m2 <- cbind(v1, v2)
m2
##      v1 v2
## [1,]  1  6
## [2,]  4  8
## [3,]  5 10
# create a matrix using row-bind
m3 <- rbind(v1, v2)
m3
##    [,1] [,2] [,3]
## v1    1    4    5
## v2    6    8   10
We can also use the cbind() and rbind() functions to add onto matrices.
v3 <- c(9, 8, 7)
m4 <- rbind(m3, v3)
m4
##    [,1] [,2] [,3]
## v1    1    4    5
## v2    6    8   10
## v3    9    8    7
Row and column names can be added to a matrix using rownames() and colnames().
rownames(m4) <- c("subject1", "subject2", "subject3")
colnames(m4) <- c("var1", "var2", "var3")
attributes(m4)
## $dim
## [1] 3 3
## 
## $dimnames
## $dimnames[[1]]
## [1] "subject1" "subject2" "subject3"
## 
## $dimnames[[2]]
## [1] "var1" "var2" "var3"
In order to subset matrices we use the [ ] operator.
As matrices have two dimensions, we need to specify subsetting arguments for both row and column dimensions, like matrix[rows, columns]:
m4
##          var1 var2 var3
## subject1    1    4    5
## subject2    6    8   10
## subject3    9    8    7
m4[1,2] # take the value in the first row and second column
## [1] 4
m4[1:2, ] # subset for rows 1 and 2 but keep all columns
##          var1 var2 var3
## subject1    1    4    5
## subject2    6    8   10
m4[ , c(1, 3)] # subset for columns 1 and 3 but keep all rows
##          var1 var3
## subject1    1    5
## subject2    6   10
## subject3    9    7
m4[1:2, c(1, 3)] # subset for both rows and columns
##          var1 var3
## subject1    1    5
## subject2    6   10
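One detail worth knowing when subsetting with matrix[rows, columns]: selecting a single row or column returns a plain vector unless drop = FALSE is supplied. A minimal sketch with a small matrix m built from the same values as v1 and v2 (the names m, row_vec and row_mat are illustrative):

```r
m <- rbind(v1 = c(1, 4, 5), v2 = c(6, 8, 10))

row_vec <- m[1, ]                 # dimensions are dropped: a numeric vector
row_mat <- m[1, , drop = FALSE]   # dimensions are kept: a 1 x 3 matrix

is.matrix(row_vec)   # FALSE
is.matrix(row_mat)   # TRUE
dim(row_mat)         # 1 3
```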
A data frame is the most common way of storing data in R and, generally, is the data structure most often used for data analyses.
A data frame (DF) is a list of equal-length vectors, and each column can store a different class of object (i.e., numeric, character, factor).
DFs are usually created by importing/reading in a data set using the functions covered in Module 2.
They can also be created explicitly with the data.frame() function, or coerced from other types of objects like lists.
df1 <- data.frame( col1 = 1:3, col2 = c("credit", "debit", "Paypal"), col3 = c(TRUE, FALSE, TRUE), col4 = c(25.5, 44.2, 54.9), stringsAsFactors = TRUE)
str(df1)
## 'data.frame': 3 obs. of 4 variables:
##  $ col1: int 1 2 3
##  $ col2: Factor w/ 3 levels "credit","debit",..: 1 2 3
##  $ col3: logi TRUE FALSE TRUE
##  $ col4: num 25.5 44.2 54.9
In the example above, col2 is converted to a column of factors. This is because stringsAsFactors = TRUE converts character columns to factors.
Without this argument, character columns are kept as characters (the default is stringsAsFactors = FALSE):
df1 <- data.frame(col1 = 1:3, col2 = c("credit", "debit", "Paypal"), col3 = c(TRUE, FALSE, TRUE), col4 = c(25.5, 44.2, 54.9))
str(df1)
## 'data.frame': 3 obs. of 4 variables:
##  $ col1: int 1 2 3
##  $ col2: chr "credit" "debit" "Paypal"
##  $ col3: logi TRUE FALSE TRUE
##  $ col4: num 25.5 44.2 54.9
We can add columns and rows to a data frame using the cbind() and rbind() functions:
# create a new vector
v4 <- c("VIC", "NSW", "TAS")
# add a column (variable) to df1
df2 <- cbind(df1, v4)
We can also add row and column names using rownames() and colnames().
rownames(df2) <- c("subj1", "subj2", "subj3") # add row names
colnames(df2) <- c("number", "card_type", "fraud", "transaction", "state") # add column names
str(df2)
## 'data.frame': 3 obs. of 5 variables:
##  $ number     : int 1 2 3
##  $ card_type  : chr "credit" "debit" "Paypal"
##  $ fraud      : logi TRUE FALSE TRUE
##  $ transaction: num 25.5 44.2 54.9
##  $ state      : chr "VIC" "NSW" "TAS"
attributes(df2)
## $names
## [1] "number"      "card_type"   "fraud"       "transaction" "state"      
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] "subj1" "subj2" "subj3"
Data frames possess the characteristics of both lists and matrices.
If you subset with a single vector, they behave like lists and return the selected columns with all rows. If you subset with two vectors, they behave like matrices and can be subset by row and column.
df2
##       number card_type fraud transaction state
## subj1      1    credit  TRUE        25.5   VIC
## subj2      2     debit FALSE        44.2   NSW
## subj3      3    Paypal  TRUE        54.9   TAS
df2[2:3, ] # subset by row numbers, take second and third rows only
##       number card_type fraud transaction state
## subj2      2     debit FALSE        44.2   NSW
## subj3      3    Paypal  TRUE        54.9   TAS
df2[c("subj2", "subj3"), ] # same as above but uses row names
##       number card_type fraud transaction state
## subj2      2     debit FALSE        44.2   NSW
## subj3      3    Paypal  TRUE        54.9   TAS
df2[, c(1,4)] # subset by column numbers, take first and fourth columns only
##       number transaction
## subj1      1        25.5
## subj2      2        44.2
## subj3      3        54.9
df2[, c("number", "transaction")] # same as above but uses column names
##       number transaction
## subj1      1        25.5
## subj2      2        44.2
## subj3      3        54.9
df2[2:3, c(1, 4)] # subset by row and column numbers
##       number transaction
## subj2      2        44.2
## subj3      3        54.9
df2[c("subj2", "subj3"), c("number", "transaction")] # same as above but uses row and column names
##       number transaction
## subj2      2        44.2
## subj3      3        54.9
df2$fraud # subset using $: take the column (variable) fraud
## [1] TRUE FALSE TRUE
df2$fraud[2] # take the second element in the fraud column
## [1] FALSE
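The list-like behaviour (subsetting with a single vector) is not shown in the examples above, so here is a minimal sketch. The data frame df is a hypothetical cut-down version of df2, rebuilt so the block runs on its own:

```r
# Hypothetical data frame mirroring three columns of df2
df <- data.frame(number = 1:3,
                 fraud = c(TRUE, FALSE, TRUE),
                 transaction = c(25.5, 44.2, 54.9))

cols <- df[c("number", "fraud")]   # single vector: list-style, all rows kept
class(cols)                        # "data.frame"
dim(cols)                          # 3 2

col_vec <- df[["fraud"]]           # double brackets: the column itself
identical(col_vec, df$fraud)       # TRUE

fraud_rows <- df[df$fraud, ]       # two arguments: matrix-style row filter
nrow(fraud_rows)                   # 2
```

The last line also shows that a logical vector can be used in the row position, which is the usual way to filter observations by a condition.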
In many traditional programming languages, you need to declare the type of data a given variable can contain (e.g., integer, character, string or decimal).
R is smart enough to "guess/create" the data type based on the values provided for a variable. However, R is not smart enough (thankfully; otherwise, why would we need analysts?) to guess the correct data type within the context of the analysis.
library(readr)
bank <- read_csv("../data/banksim.csv")
str(bank)
## spc_tbl_ [15 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id       : num [1:15] 1 2 3 4 5 6 7 8 9 10 ...
##  $ age      : chr [1:15] "44" "88" "36" "25<=" ...
##  $ marital  : chr [1:15] "married" "married" "divorced" "single" ...
##  $ education: chr [1:15] "secondary" "secondary" "secondary" "secondary" ...
##  $ job      : chr [1:15] "blue-collar" "admin." "blue-collar" "technician" ...
##  $ balance  : num [1:15] 16178 330 853 616 310 ...
##  $ day      : num [1:15] 21 2 20 28 12 16 15 5 26 14 ...
##  $ month    : chr [1:15] "nov" "dec" "jun" "jul" ...
##  $ duration : num [1:15] 297 357 15 117 54 -268 129 156 168 216 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   age = col_character(),
##   ..   marital = col_character(),
##   ..   education = col_character(),
##   ..   job = col_character(),
##   ..   balance = col_double(),
##   ..   day = col_double(),
##   ..   month = col_character(),
##   ..   duration = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
The str() output reveals how R guesses the data type of each variable.
In the output above, id, balance, day and duration are read as numeric values, and the rest are read as characters. However, according to the variable definitions, the correct data type for the age variable should also be numeric (or integer).
As seen from the output, row 4 of the age column contains "<=" ("25<="), and this single entry forces the whole column to be read as character even though it has a numeric nature. A stray character in any other numeric column (e.g., a balance value entered as "528D") would have the same effect.
A good practice is always to:
check the definitions of variables, and understand their types within the context;
then apply proper type conversions if they are not in the correct data type.
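As a minimal sketch of such a conversion, the block below uses a hypothetical vector age_raw that imitates the first few age values. Whether "25<=" should really be treated as 25 depends on the variable definitions, so the stripping step is an assumption made for illustration only:

```r
# Hypothetical values imitating the start of the age column
age_raw <- c("44", "88", "36", "25<=")

# Direct conversion: entries that cannot be parsed become NA (with a warning)
age_na <- suppressWarnings(as.numeric(age_raw))
age_na       # 44 88 36 NA

# Assumption: "25<=" means 25, so strip everything except digits and dots
age_clean <- as.numeric(gsub("[^0-9.]", "", age_raw))
age_clean    # 44 88 36 25
```

Checking for NA values after a conversion is a quick way to find the entries that caused a column to be read as character.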
as. functions will convert the object to a given type (whenever possible) and is. functions will test for the given data type and return a logical value (TRUE or FALSE).

| as. Functions | Changes type to | is. Functions | Checks if type is |
|---|---|---|---|
| as.numeric() | numeric | is.numeric() | numeric |
| as.integer() | integer | is.integer() | integer |
| as.double() | double | is.double() | double |
| as.character() | character | is.character() | character |
| as.factor() | factor | is.factor() | factor |
| as.logical() | logical | is.logical() | logical |
| as.vector() | vector | is.vector() | vector |
| as.list() | list | is.list() | list |
| as.matrix() | matrix | is.matrix() | matrix |
| as.data.frame() | data frame | is.data.frame() | data frame |
Understand R’s basic data types (i.e., character, numeric, integer, factor, and logical).
Understand R’s basic data structures (i.e., vector, list, matrix, and data frame) and main differences between them.
Learn to check attributes (i.e., name, dimension, class, levels etc.) of R objects.
Learn how to convert between data types/structures and understand coercion rules.