Module 8-2 Demonstration

class: center, middle, inverse, title-slide

.title[
# Module 8-2 Demonstration
]
.subtitle[
## Special Operations: Dealing with character variables
]

---

# Creating Strings

- The most basic way to create strings is to use quotation marks and assign a string to an object.

``` r
quote <- "Most valuable thing you have as a leader is clear data"

author <- "Ruth Porat"
```

- The `paste()` function in base R builds strings by concatenating its arguments, separated by a space by default. `str_c()` from `stringr` plays a similar role, but its default separator is the empty string, like `paste0()`.

``` r
paste(quote, "by", author)  
```

```
## [1] "Most valuable thing you have as a leader is clear data by Ruth Porat"
```

- Use `paste0()` to paste strings together with no separator between them.

``` r
paste0("I", "love",  "Data", "Wrangling") 
```

```
## [1] "IloveDataWrangling"
```
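
- As a small sketch beyond the examples above (the vector `parts` is made up for illustration): in `paste()`, `sep` is placed between the arguments, while `collapse` joins the elements of a single vector into one string.

``` r
# hypothetical example vector, not part of the original slides
parts <- c("Data", "Wrangling", "in", "R")

# collapse joins the elements of one vector into a single string
one_string <- paste(parts, collapse = " ")   # "Data Wrangling in R"

# sep is placed between the arguments, element by element
labels <- paste("x", 1:3, sep = "_")         # "x_1" "x_2" "x_3"

one_string
labels
```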

---

# Converting to Strings

- Strings and characters can be tested with `is.character()`, and other data types can be converted into strings/characters with `as.character()`.

``` r
is.character(quote)
```

```
## [1] TRUE
```

``` r
as.character(3.54)
```

```
## [1] "3.54"
```
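
- A minimal sketch of a round trip between numbers and strings (the vector `nums` is an illustrative assumption): `as.character()` works element by element on a vector, and `as.numeric()` converts back.

``` r
# hypothetical numeric vector used only for illustration
nums <- c(1.5, 2, 10)

chars <- as.character(nums)   # "1.5" "2" "10"
is.character(chars)           # TRUE

# convert the strings back to numbers
back <- as.numeric(chars)
identical(back, nums)         # TRUE
```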

---

# Printing Strings

Printing strings/characters can be done with the following:

Function | Usage
---------|-------
`print()`  |                  generic printing
`noquote()` | print with no quotes
`cat()` | concatenate and print with no quotes and no `[1]` index prefix

``` r
print( paste(quote,author) , quote = FALSE)
```

```
## [1] Most valuable thing you have as a leader is clear data Ruth Porat
```

``` r
noquote( paste(quote,author) )
```

```
## [1] Most valuable thing you have as a leader is clear data Ruth Porat
```

``` r
cat( paste(quote,author) )
```

```
## Most valuable thing you have as a leader is clear data Ruth Porat
```

---

# Printing Strings Cont.

``` r
# basic printing of alphabet

cat(letters)             
```

```
## a b c d e f g h i j k l m n o p q r s t u v w x y z
```

``` r
# specify a separator between the combined characters

cat(letters, sep = "-") 
```

```
## a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z
```

---

# Printing Strings Cont.

- To format the line width for printing long strings use the  `fill` argument.

``` r
# No breaks between lines
cat(quote, author, quote, author, fill = FALSE)
```

```
## Most valuable thing you have as a leader is clear data Ruth Porat Most valuable thing you have as a leader is clear data Ruth Porat
```

``` r
# Breaks between lines
cat(quote, author, quote, author, fill = TRUE)
```

```
## Most valuable thing you have as a leader is clear data Ruth Porat 
## Most valuable thing you have as a leader is clear data Ruth Porat
```

---

# Counting string elements and characters

- To count the number of elements in a character vector use `length()`. A single string, however long, is one element.

``` r
length("How many elements are in this string?")
```

```
## [1] 1
```

``` r
length( c("How", "many", "elements", "are", "in", "this", "string?") )
```

```
## [1] 7
```

- To count the number of characters in a string use `nchar()`.

``` r
nchar("How many characters are in this string?")
```

```
## [1] 39
```

``` r
nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
```

```
## [1]  3  4 10  3  2  4  7
```

---

# String manipulation with Base R

- Basic string manipulation typically includes:
    - case conversion;
    - simple character replacement;
    - pattern replacement;
    - abbreviating; 
    - substring replacement; 
    - adding/removing white space;
    - set operations.

- These operations can all be performed with base R functions; however, some operations are greatly simplified with the  `stringr` package.

---

# Upper/lower case conversion

- To convert all upper case characters to lower case use `tolower()`.

- To convert all lower case characters to upper case use `toupper()`.

``` r
a <- "MATH2349 is AWesomE"

tolower(a)
```

```
## [1] "math2349 is awesome"
```

``` r
toupper(a)
```

```
## [1] "MATH2349 IS AWESOME"
```

---

# Simple Character Replacement

- To replace a character (or multiple characters) in a string use `chartr()`.

``` r
# replace 'z' with 's'

american <- "This is how we analyze."

chartr(old = "z", new = "s", american)
```

```
## [1] "This is how we analyse."
```

``` r
# replace 'i' with 'w', 'X' with 'h' and 's' with 'y'

x <- "MiXeD cAsE 123"
chartr(old ="iXs", new ="why", x)
```

```
## [1] "MwheD cAyE 123"
```

---

# Pattern Replacement

- To replace every occurrence of a pattern in a string use `gsub()`.

``` r
# replace "ot" pattern with "ut"

x <- "R Totorial"

gsub(pattern = "ot", replacement="ut", x)
```

```
## [1] "R Tutorial"
```
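
- As a short sketch (the string `fruit_name` is made up for illustration), base R also offers `sub()`, which replaces only the first match, whereas `gsub()` replaces all of them.

``` r
# hypothetical example string
fruit_name <- "banana"

first_only  <- sub(pattern = "a", replacement = "A", fruit_name)   # "bAnana"
all_matches <- gsub(pattern = "a", replacement = "A", fruit_name)  # "bAnAnA"

first_only
all_matches
```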

---

# String Abbreviations

- To abbreviate strings we can use `abbreviate()`.

``` r
streets <- c("Victoria", "Yarra", "Russell", "Williams", "Swanston")

# default abbreviations
abbreviate(streets)
```

```
## Victoria    Yarra  Russell Williams Swanston 
##   "Vctr"   "Yarr"   "Rssl"   "Wllm"   "Swns"
```

``` r
# set minimum length of abbreviation
abbreviate(streets, minlength = 5)
```

```
## Victoria    Yarra  Russell Williams Swanston 
##  "Victr"  "Yarra"  "Rssll"  "Wllms"  "Swnst"
```

---

# Extract/Replace Substrings

- The purpose of `substr()` is to extract and replace substrings with specified starting and stopping characters.

``` r
alphabet <- paste(LETTERS, collapse = "")

alphabet
```

```
## [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
```

``` r
# extract 18-24th characters in alphabet
substr(alphabet, start = 18, stop = 24)
```

```
## [1] "RSTUVWX"
```

``` r
# replace 19-24th characters with `R`

substr(alphabet, start = 19, stop = 24) <- "RRRRRR"
alphabet
```

```
## [1] "ABCDEFGHIJKLMNOPQRRRRRRRYZ"
```

---

# Extract/Replace Substrings

- To split the elements of a character string use `strsplit()`.

``` r
z <- "Victoria Yarra Russell Williams Swanston"
strsplit(z, split = " ")
```

```
## [[1]]
## [1] "Victoria" "Yarra"    "Russell"  "Williams" "Swanston"
```

``` r
a <- "Victoria-Yarra-Russell-Williams-Swanston"
strsplit(a, split = "-") 
```

```
## [[1]]
## [1] "Victoria" "Yarra"    "Russell"  "Williams" "Swanston"
```

- Note that the output of `strsplit()` is a list. To convert the output to a simple atomic vector simply use `unlist()`.

``` r
* unlist(strsplit(a, split = "-")) 
```

```
## [1] "Victoria" "Yarra"    "Russell"  "Williams" "Swanston"
```
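
- A brief sketch of why `strsplit()` returns a list (the vector `street_pairs` is an illustrative assumption): each input string gets its own list element, and these can have different lengths. Because `split` is treated as a regular expression by default, `fixed = TRUE` is one way to split on a literal character such as `.`.

``` r
# hypothetical input with two strings
street_pairs <- c("Victoria-Yarra", "Russell-Williams-Swanston")

pieces <- strsplit(street_pairs, split = "-")
n_pieces <- lengths(pieces)   # 2 3 : one count per input string

# "." is a regex wildcard, so treat it literally with fixed = TRUE
dotted <- unlist(strsplit("a.b.c", split = ".", fixed = TRUE))  # "a" "b" "c"

n_pieces
dotted
```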

---

# Set operations for character strings

Function | Usage
---------|-------
`union()`  | obtain union between two character vectors
`intersect()` | obtain the common elements of two character vectors
`setdiff()` | obtain the elements of the first vector that are not in the second (the difference)
`setequal()` | tests if two vectors contain the same elements regardless of order
`identical()` | tests if two character vectors are equal in content and order

---

# Set operations for character strings Cont.

``` r
set_1 <- c("VIC", "NSW", "WA", "TAS")
set_2 <- c("TAS", "QLD", "SA", "NSW")

union(set_1, set_2)
```

```
## [1] "VIC" "NSW" "WA"  "TAS" "QLD" "SA"
```

``` r
intersect(set_1, set_2)
```

```
## [1] "NSW" "TAS"
```

``` r
setdiff(set_1, set_2)
```

```
## [1] "VIC" "WA"
```

``` r
setdiff(set_2, set_1)
```

```
## [1] "QLD" "SA"
```
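
- The two comparison functions from the table can be sketched in the same way (the vectors `set_a` and `set_b` are made up for illustration): `setequal()` ignores order, `identical()` does not.

``` r
# hypothetical vectors with the same elements in a different order
set_a <- c("VIC", "NSW", "WA")
set_b <- c("WA", "VIC", "NSW")

same_elements <- setequal(set_a, set_b)    # TRUE: same elements
same_order    <- identical(set_a, set_b)   # FALSE: order differs

# sorting both vectors first makes identical() agree with setequal()
same_sorted   <- identical(sort(set_a), sort(set_b))  # TRUE
```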

---

# String manipulation with stringr

- The `stringr` package was developed by Hadley Wickham to provide consistent, simple wrappers for common string operations.

- These functions are closely related to their base R equivalents:

    - Concatenate with `str_c()` ( `$\sim$` `paste()` and `paste0()`).
    
    - Number of characters with `str_length()`  ( `$\sim$` `nchar()`).
    
    - Substring with `str_sub()` ( `$\sim$` `substr()` ).
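
- A minimal side-by-side sketch of these equivalents, assuming `stringr` is installed (the string `state` is made up for illustration). Note that `str_sub()` also accepts negative positions, which count from the end of the string.

``` r
library(stringr)

state <- "Victoria"   # hypothetical example string

# concatenation: str_c() has no separator by default, like paste0()
joined_same <- identical(str_c("Data", "Wrangling"), paste0("Data", "Wrangling"))

# number of characters
n_chars <- str_length(state)          # 8, same as nchar(state)

# substrings by position
first_three <- str_sub(state, 1, 3)   # "Vic", same as substr(state, 1, 3)
last_three  <- str_sub(state, -3, -1) # "ria"
```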

---

# Duplicate Characters within a String

- In addition, `stringr` provides `str_dup()`, which duplicates a string a given number of times and concatenates the copies.

``` r
library(stringr)   # needed for all str_*() functions

str_dup("Data", times = 4)
```

```
## [1] "DataDataDataData"
```

``` r
str_dup("Data", times = 1:4)
```

```
## [1] "Data"             "DataData"         "DataDataData"     "DataDataDataData"
```

---

# Remove Leading and Trailing White space

- In string processing, a common task is parsing text into individual words.

- Often, this results in words having blank spaces (white spaces) on either end of the word. `str_trim()` can be used to remove these spaces.

``` r
text <- c("Text ", "  with", " whitespace ")
text
```

```
## [1] "Text "        "  with"       " whitespace "
```

``` r
str_trim(text, side = "left")
```

```
## [1] "Text "       "with"        "whitespace "
```

``` r
str_trim(text, side = "right")
```

```
## [1] "Text"        "  with"      " whitespace"
```

``` r
str_trim(text, side = "both")
```

```
## [1] "Text"       "with"       "whitespace"
```
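
- Two related helpers, sketched under the assumption that a reasonably recent `stringr` is installed (the string `messy` is made up for illustration): base R's `trimws()` also trims both ends, and `str_squish()` additionally collapses repeated spaces inside the string.

``` r
library(stringr)

messy <- "  too   many   spaces "   # hypothetical example string

trimmed  <- trimws(messy)       # "too   many   spaces" (inner spaces kept)
squished <- str_squish(messy)   # "too many spaces"

trimmed
squished
```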

---

# Pad a String with White space

- Conversely, to add whitespace, or to pad a string, we can use `str_pad()`.

``` r
str_pad("Data", width = 10, side = "left")
```

```
## [1] "      Data"
```

``` r
str_pad("Data", width = 10, side = "both")
```

```
## [1] "   Data   "
```

- Use `str_pad()` to pad a string with a specified character. The `width` argument sets the minimum width of the padded string and the `pad` argument specifies the single padding character.

``` r
str_pad("Data", width = 10, side = "right", pad = "!")
```

```
## [1] "Data!!!!!!"
```
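
- One common use, sketched with made-up identifiers (`ids` is an illustrative assumption): left-padding with zeros so that all codes have the same width.

``` r
library(stringr)

ids <- c("7", "42", "123")   # hypothetical ID codes stored as strings

padded_ids <- str_pad(ids, width = 5, side = "left", pad = "0")
padded_ids                   # "00007" "00042" "00123"
```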

---

# Pattern matching

- The vast majority of string manipulations require pattern matching for a given text.

- The good news is that the `stringr` package has pattern matching functions to detect, subset, locate, count, extract, and replace strings.

---

# Pattern detection with str_detect()

- `str_detect()` detects the presence or absence of a pattern and returns a logical vector.

``` r
# detects pattern "ea"

x <- c("apple", "banana", "pear","pEAr")

str_detect(x, pattern ="ea")
```

```
## [1] FALSE FALSE  TRUE FALSE
```

``` r
#same as above

str_detect(x, "ea")
```

```
## [1] FALSE FALSE  TRUE FALSE
```

---

# Remark: Regular expressions (Regex)

- While matching patterns, you can also use **regular expressions**.

- Regular expressions (a.k.a. regexes) are a language that allows you to describe patterns in strings.

``` r
# Same as above using regex

x <- c("apple", "banana", "pear","pEAr")

str_detect(x, regex("ea"))
```

```
## [1] FALSE FALSE  TRUE FALSE
```

- You can perform a case-insensitive match using `ignore_case = TRUE`.

``` r
str_detect(x, regex("ea",ignore_case = TRUE))
```

```
## [1] FALSE FALSE  TRUE  TRUE
```

---

# Remark: Regular expressions (Regex) Cont.

- With regex, you can create your own character classes using `[ ]`. For example:

* `[abc]`: matches a, b, or c.
* `[a-z]`: matches every character between a and z (in Unicode code point order).
* `[^abc]`: matches anything except a, b, or c.
* `[\^\-]`: matches ^ or -.

- They take a little while to get your head around, but once you understand them, you’ll find them extremely useful.

- For more information on regex capabilities, refer to the [regular expressions vignette](https://stringr.tidyverse.org/articles/regular-expressions.html) of the `stringr` package.
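
- The character classes listed above can be tried out as follows. This is a sketch assuming `stringr` is installed, with a made-up vector `samples`. Inside an R string, each backslash of the regex has to be doubled.

``` r
library(stringr)

samples <- c("cat", "dog", "x-ray", "a^b")   # hypothetical test strings

# contains a, b, or c
has_abc <- str_detect(samples, "[abc]")            # TRUE FALSE TRUE TRUE

# contains anything that is not a lowercase letter a-z
non_lower <- str_detect(samples, "[^a-z]")         # FALSE FALSE TRUE TRUE

# contains a literal ^ or - (regex [\^\-], written with doubled backslashes)
caret_or_dash <- str_detect(samples, "[\\^\\-]")   # FALSE FALSE TRUE TRUE
```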

---

# Remark: Regular expressions (Regex) Cont.

- There are a number of **pre-built classes** that you can use inside `[ ]`:

* `[:punct:]`: punctuation.
* `[:alpha:]`: letters.
* `[:lower:]`: lowercase letters.
* `[:upper:]`: uppercase letters.
* `[:digit:]`: digits.
* `[:xdigit:]`: hex digits.
* `[:alnum:]`: letters and numbers.
* `[:cntrl:]`: control characters.
* `[:graph:]`: letters, numbers, and punctuation.
* `[:print:]`: letters, numbers, punctuation, and white space.
* `[:space:]`: space characters (basically equivalent to \s).
* `[:blank:]`: space and tab.

---

# Your turn!

- Use the `words` data set of commonly used English words that ships with `stringr`.

``` r
library(stringr)
head(words)
```

```
## [1] "a"        "able"     "about"    "absolute" "accept"   "account"
```

``` r
length(words)
```

```
## [1] 980
```

- Task 1. Find out how many words contain the pattern "ing".

- Task 2. Find out how many words end in "ing". Hint: [Use anchors](https://stringr.tidyverse.org/articles/regular-expressions.html#anchors).

- Task 3. Find out which words end in "ing".

???

``` r
#Task 1:

str_detect(words, pattern = regex("ing")) %>% sum()
```

```
## [1] 10
```

``` r
# Same as above:

str_detect(words, "ing") %>% sum()
```

```
## [1] 10
```

``` r
# Task 2:

str_detect(words, "ing$") %>% sum()
```

```
## [1] 9
```

``` r
# Task 3:

words[str_detect(words, "ing$")]
```

```
## [1] "bring"   "during"  "evening" "king"    "meaning" "morning" "ring"   
## [8] "sing"    "thing"
```

---

# String subsetting with str_subset()

- `str_subset()` returns the **elements** of a character vector that match a regular expression.

- Using the `starwars` data set (from the `dplyr` package), let's subset the character names that contain any punctuation.

``` r
head(starwars$name)
```

```
## [1] "Luke Skywalker" "C-3PO"          "R2-D2"          "Darth Vader"   
## [5] "Leia Organa"    "Owen Lars"
```

``` r
* str_subset(starwars$name, "[:punct:]") 
```

```
## [1] "C-3PO"          "R2-D2"          "R5-D4"          "Obi-Wan Kenobi"
## [5] "IG-88"          "Qui-Gon Jinn"   "Ki-Adi-Mundi"   "R4-P17"
```

---

# String extract using str_extract()

- `str_extract()` extracts text corresponding to the **first match**, returning a character vector.

``` r
str_extract(starwars$name, "[:punct:]")
```

```
##  [1] NA  "-" "-" NA  NA  NA  NA  "-" NA  "-" NA  NA  NA  NA  NA  NA  NA  NA  NA 
## [20] NA  NA  "-" NA  NA  NA  NA  NA  NA  NA  NA  "-" NA  NA  NA  NA  NA  NA  NA 
## [39] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  "-" NA  NA  NA  NA  NA  NA 
## [58] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  "-" NA  NA 
## [77] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
```
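
- To get every match rather than only the first, `stringr` also has `str_extract_all()`, which returns a list with one element per input string. A minimal sketch on a single name:

``` r
library(stringr)

name <- "Ki-Adi-Mundi"

# first uppercase letter only
first_upper <- str_extract(name, "[[:upper:]]")             # "K"

# all uppercase letters; [[1]] picks the result for the first (only) string
all_upper <- str_extract_all(name, "[[:upper:]]")[[1]]      # "K" "A" "M"
```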

---

# Finding pattern locations using str_locate()

- `str_locate()` locates the **first position** of a pattern in each string and returns an integer matrix with columns `start` and `end`, whereas `str_locate_all()` locates **all positions** of the pattern and returns a list of such matrices.

``` r
str_locate(starwars$name, "[:punct:]") %>% head()
```

```
##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]    NA  NA
## [5,]    NA  NA
## [6,]    NA  NA
```
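
- A minimal sketch of `str_locate_all()` on a single name with two hyphens. The result is a list, so `[[1]]` extracts the matrix for the first (and only) string.

``` r
library(stringr)

# one row per match, columns start and end
hyphens <- str_locate_all("Ki-Adi-Mundi", "-")[[1]]

hyphen_starts <- hyphens[, "start"]   # 3 7
nrow(hyphens)                         # 2 matches
```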

---

# Pattern counting using str_count()

- `str_count()` counts the number of matches of a pattern in each string.

``` r
str_count(starwars$name, "[:punct:]")
```

```
##  [1] 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [77] 0 0 0 0 0 0 0 0 0 0 0
```

---

# String replacing with str_replace()

- `str_replace()` replaces the **first match** of a pattern in each string, whereas `str_replace_all()` replaces **every match**.

- The `pattern` argument gives the pattern to be replaced and the `replacement` argument specifies the replacement string.

``` r
head(fruit)
```

```
## [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
## [6] "bilberry"
```

``` r
# Replace berry with berries

str_replace(fruit, pattern = "berry", replacement = "berries")
```

```
##  [1] "apple"             "apricot"           "avocado"          
##  [4] "banana"            "bell pepper"       "bilberries"       
##  [7] "blackberries"      "blackcurrant"      "blood orange"     
## [10] "blueberries"       "boysenberries"     "breadfruit"       
## [13] "canary melon"      "cantaloupe"        "cherimoya"        
## [16] "cherry"            "chili pepper"      "clementine"       
## [19] "cloudberries"      "coconut"           "cranberries"      
## [22] "cucumber"          "currant"           "damson"           
## [25] "date"              "dragonfruit"       "durian"           
## [28] "eggplant"          "elderberries"      "feijoa"           
## [31] "fig"               "goji berries"      "gooseberries"     
## [34] "grape"             "grapefruit"        "guava"            
## [37] "honeydew"          "huckleberries"     "jackfruit"        
## [40] "jambul"            "jujube"            "kiwi fruit"       
## [43] "kumquat"           "lemon"             "lime"             
## [46] "loquat"            "lychee"            "mandarine"        
## [49] "mango"             "mulberries"        "nectarine"        
## [52] "nut"               "olive"             "orange"           
## [55] "pamelo"            "papaya"            "passionfruit"     
## [58] "peach"             "pear"              "persimmon"        
## [61] "physalis"          "pineapple"         "plum"             
## [64] "pomegranate"       "pomelo"            "purple mangosteen"
## [67] "quince"            "raisin"            "rambutan"         
## [70] "raspberries"       "redcurrant"        "rock melon"       
## [73] "salal berries"     "satsuma"           "star fruit"       
## [76] "strawberries"      "tamarillo"         "tangerine"        
## [79] "ugli fruit"        "watermelon"
```

---

# String replacing with str_replace() Cont.

``` r
#replace first l with "" (delete first l)

str_replace("Hello world", pattern = "l", replacement = "")
```

```
## [1] "Helo world"
```

``` r
# replace all l's with "" (delete l's)

str_replace_all("Hello world", pattern = "l", replacement = "")
```

```
## [1] "Heo word"
```
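
- `str_replace_all()` can also take a named character vector of `pattern = replacement` pairs to perform several replacements in one call. A small sketch, with replacement values chosen only for illustration:

``` r
library(stringr)

# names are the patterns, values are the replacements
swapped <- str_replace_all("Hello world", c("l" = "L", "o" = "0"))
swapped   # "HeLL0 w0rLd"
```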

---

# Functions to Remember for Week 11

- String manipulations using base R and `stringr`.

- Usage of regular expressions.

- Pattern matching functions.

- Practice!

---

# Your turn! Class Worksheet

- Working in small groups, complete the following worksheet:

[Module 8-2 Worksheet](../worksheets/Week_11_Worksheet.html)

- Once completed, feel free to work on your Assessments.

[Return to Course Website](../index.html)