class: center, middle, inverse, title-slide

.title[
# Module 8-2 Demonstration
]
.subtitle[
## Special Operations: Dealing with character variables
]

---

# Creating Strings

- The most basic way to create strings is to use quotation marks and assign a string to an object.

``` r
quote <- "Most valuable thing you have as a leader is clear data"
author <- "Ruth Porat"
```

- The base R function `paste()` builds strings by joining its arguments, separated by a space by default. The `stringr` function `str_c()` serves the same purpose but joins with no separator by default, like `paste0()`.

``` r
paste(quote, "by", author)
```

```
## [1] "Most valuable thing you have as a leader is clear data by Ruth Porat"
```

- Use `paste0()` to paste without spaces between the strings.

``` r
paste0("I", "love", "Data", "Wrangling")
```

```
## [1] "IloveDataWrangling"
```

---

# Converting to Strings

- Test whether an object is a character vector with `is.character()`, and convert any other data type to character with `as.character()`.

``` r
is.character(quote)
```

```
## [1] TRUE
```

``` r
as.character(3.54)
```

```
## [1] "3.54"
```

---

# Printing Strings

Printing strings/characters can be done with the following:

Function | Usage
---------|-------
`print()` | generic printing
`noquote()` | print with no quotes
`cat()` | concatenate and print with no quotes and no `[1]` index prefix

``` r
print(paste(quote, author), quote = FALSE)
```

```
## [1] Most valuable thing you have as a leader is clear data Ruth Porat
```

``` r
noquote(paste(quote, author))
```

```
## [1] Most valuable thing you have as a leader is clear data Ruth Porat
```

``` r
cat(paste(quote, author))
```

```
## Most valuable thing you have as a leader is clear data Ruth Porat
```

---

# Printing Strings Cont.

``` r
# basic printing of alphabet
cat(letters)
```

```
## a b c d e f g h i j k l m n o p q r s t u v w x y z
```

``` r
# specify a separator between the combined characters
cat(letters, sep = "-")
```

```
## a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z
```

---

# Printing Strings Cont.
- To control line breaks when printing long strings use the `fill` argument of `cat()`. With `fill = TRUE` the output is wrapped at the console width; a positive number sets the width explicitly.

``` r
# No breaks between lines
cat(quote, author, quote, author, fill = FALSE)
```

```
## Most valuable thing you have as a leader is clear data Ruth Porat Most valuable thing you have as a leader is clear data Ruth Porat
```

``` r
# Breaks between lines
cat(quote, author, quote, author, fill = TRUE)
```

```
## Most valuable thing you have as a leader is clear data Ruth Porat
## Most valuable thing you have as a leader is clear data Ruth Porat
```

---

# Counting string elements and characters

- To count the number of elements in a character vector use `length()`. A single string, however long, counts as one element.

``` r
length("How many elements are in this string?")
```

```
## [1] 1
```

``` r
length(c("How", "many", "elements", "are", "in", "this", "string?"))
```

```
## [1] 7
```

- To count the number of characters in each string use `nchar()`.

``` r
nchar("How many characters are in this string?")
```

```
## [1] 39
```

``` r
nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
```

```
## [1] 3 4 10 3 2 4 7
```

---

# String manipulation with Base R

- Basic string manipulation typically includes:
    - case conversion;
    - simple character replacement;
    - pattern replacement;
    - abbreviating;
    - substring replacement;
    - adding/removing white space;
    - set operations.

- These operations can all be performed with base R functions; however, some operations are greatly simplified with the `stringr` package.

---

# Upper/lower case conversion

- To convert all upper case characters to lower case use `tolower()`.

- To convert all lower case characters to upper case use `toupper()`.

``` r
a <- "MATH2349 is AWesomE"
tolower(a)
```

```
## [1] "math2349 is awesome"
```

``` r
toupper(a)
```

```
## [1] "MATH2349 IS AWESOME"
```

---

# Simple Character Replacement

- To replace a character (or multiple characters) in a string use `chartr()`. Each character in `old` is replaced by the character at the same position in `new`.

``` r
# replace 'z' with 's'
american <- "This is how we analyze."
chartr(old = "z", new = "s", american)
```

```
## [1] "This is how we analyse."
```

``` r
# replace 'i' with 'w', 'X' with 'h' and 's' with 'y'
x <- "MiXeD cAsE 123"
chartr(old = "iXs", new = "why", x)
```

```
## [1] "MwheD cAyE 123"
```

---

# Pattern Replacement

- To replace every occurrence of a pattern in a string use `gsub()`; `sub()` replaces only the first occurrence.

``` r
# replace "ot" pattern with "ut"
x <- "R Totorial"
gsub(pattern = "ot", replacement = "ut", x)
```

```
## [1] "R Tutorial"
```

---

# String Abbreviations

- To abbreviate strings we can use `abbreviate()`.

``` r
streets <- c("Victoria", "Yarra", "Russell", "Williams", "Swanston")
# default abbreviations
abbreviate(streets)
```

```
## Victoria    Yarra  Russell Williams Swanston
##   "Vctr"   "Yarr"   "Rssl"   "Wllm"   "Swns"
```

``` r
# set minimum length of abbreviation
abbreviate(streets, minlength = 5)
```

```
## Victoria    Yarra  Russell Williams Swanston
##  "Victr"  "Yarra"  "Rssll"  "Wllms"  "Swnst"
```

---

# Extract/Replace Substrings

- `substr()` extracts or replaces the substring between specified start and stop positions.

``` r
alphabet <- paste(LETTERS, collapse = "")
alphabet
```

```
## [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
```

``` r
# extract the 18th to 24th characters in alphabet
substr(alphabet, start = 18, stop = 24)
```

```
## [1] "RSTUVWX"
```

``` r
# replace the 19th to 24th characters with `R`
substr(alphabet, start = 19, stop = 24) <- "RRRRRR"
alphabet
```

```
## [1] "ABCDEFGHIJKLMNOPQRRRRRRRYZ"
```

---

# Extract/Replace Substrings

- To split a string into pieces at a given separator use `strsplit()`.

``` r
z <- "Victoria Yarra Russell Williams Swanston"
strsplit(z, split = " ")
```

```
## [[1]]
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
```

``` r
a <- "Victoria-Yarra-Russell-Williams-Swanston"
strsplit(a, split = "-")
```

```
## [[1]]
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
```

- Note that the output of `strsplit()` is a list, with one element per input string. To convert the output to a simple atomic vector use `unlist()`.
``` r
* unlist(strsplit(a, split = "-"))
```

```
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
```

---

# Set operations for character strings

Function | Usage
---------|-------
`union()` | obtain the union of two character vectors
`intersect()` | obtain the common elements of two character vectors
`setdiff()` | obtain the elements of the first vector that are not in the second
`setequal()` | tests if two vectors contain the same elements regardless of order
`identical()` | tests if two character vectors are equal in content and order

---

# Set operations for character strings Cont.

``` r
set_1 <- c("VIC", "NSW", "WA", "TAS")
set_2 <- c("TAS", "QLD", "SA", "NSW")
union(set_1, set_2)
```

```
## [1] "VIC" "NSW" "WA" "TAS" "QLD" "SA"
```

``` r
intersect(set_1, set_2)
```

```
## [1] "NSW" "TAS"
```

``` r
setdiff(set_1, set_2)
```

```
## [1] "VIC" "WA"
```

``` r
setdiff(set_2, set_1)
```

```
## [1] "QLD" "SA"
```

---

# String manipulation with stringr

- The `stringr` package was developed by Hadley Wickham to provide consistent and simple wrappers for common string operations.

- These functions are closely related to their base R equivalents:

    - Concatenate with `str_c()` ( `\(\sim\)` `paste()` and `paste0()`).

    - Number of characters with `str_length()` ( `\(\sim\)` `nchar()`).

    - Substring with `str_sub()` ( `\(\sim\)` `substr()` ).

---

# Duplicate Characters within a String

- In addition, `stringr` provides `str_dup()` for repeating a string a given number of times (base R offers `strrep()` for the same task).

``` r
library(stringr)
str_dup("Data", times = 4)
```

```
## [1] "DataDataDataData"
```

``` r
str_dup("Data", times = 1:4)
```

```
## [1] "Data" "DataData" "DataDataData" "DataDataDataData"
```

---

# Remove Leading and Trailing White space

- In string processing, a common task is parsing text into individual words.

- Often, this results in words having blank spaces (white spaces) on either end of the word. `str_trim()` removes these spaces.
``` r
text <- c("Text ", " with", " whitespace ")
text
```

```
## [1] "Text " " with" " whitespace "
```

``` r
str_trim(text, side = "left")
```

```
## [1] "Text " "with" "whitespace "
```

``` r
str_trim(text, side = "right")
```

```
## [1] "Text" " with" " whitespace"
```

``` r
str_trim(text, side = "both")
```

```
## [1] "Text" "with" "whitespace"
```

---

# Pad a String with White space

- Conversely, to add whitespace, or to pad a string, we can use `str_pad()`.

``` r
str_pad("Data", width = 10, side = "left")
```

```
## [1] "      Data"
```

``` r
str_pad("Data", width = 10, side = "both")
```

```
## [1] "   Data   "
```

- `str_pad()` can also pad a string with a character other than a space. The `width` argument sets the minimum width of the padded string and the `pad` argument sets the single padding character.

``` r
str_pad("Data", width = 10, side = "right", pad = "!")
```

```
## [1] "Data!!!!!!"
```

---

# Pattern matching

- Many string manipulations rely on matching a pattern in a given text.

- The good news is that the `stringr` package has pattern matching functions to detect, subset, locate, count, extract, and replace strings.

<!-- - Note that, all functions in this section has the same first two arguments, a character vector of strings to process and a single pattern to match specified by the `pattern =` argument. -->

---

# Pattern detection with str_detect()

- `str_detect()` detects the presence or absence of a pattern and returns a logical vector with one value per input string.

``` r
# detect the pattern "ea"
x <- c("apple", "banana", "pear", "pEAr")
str_detect(x, pattern = "ea")
```

```
## [1] FALSE FALSE TRUE FALSE
```

``` r
# same as above, with the pattern argument matched by position
str_detect(x, "ea")
```

```
## [1] FALSE FALSE TRUE FALSE
```

---

# Remark: Regular expressions (Regex)

- When matching patterns, you can also use **regular expressions**.

- Regular expressions (a.k.a. regexes) are a language that allows you to describe patterns in strings.
``` r
# Same as above using regex
x <- c("apple", "banana", "pear", "pEAr")
str_detect(x, regex("ea"))
```

```
## [1] FALSE FALSE TRUE FALSE
```

- You can perform a case-insensitive match using `ignore_case = TRUE`.

``` r
str_detect(x, regex("ea", ignore_case = TRUE))
```

```
## [1] FALSE FALSE TRUE TRUE
```

---

# Remark: Regular expressions (Regex) Cont.

- With regex, you can create your own character classes using `[ ]`. For example:

    * `[abc]`: matches a, b, or c.

    * `[a-z]`: matches every character between a and z (in Unicode code point order).

    * `[^abc]`: matches anything except a, b, or c.

    * `[\^\-]`: matches ^ or -.

- They take a little while to get your head around, but once you understand them, you’ll find them extremely useful.

- For more information on regex capabilities, please refer to the [regular expressions vignette](https://stringr.tidyverse.org/articles/regular-expressions.html) of the `stringr` package.

---

# Remark: Regular expressions (Regex) Cont.

- There are a number of **pre-built classes** that you can use inside `[ ]`:

    * `[:punct:]`: punctuation.
    * `[:alpha:]`: letters.
    * `[:lower:]`: lowercase letters.
    * `[:upper:]`: uppercase letters.
    * `[:digit:]`: digits.
    * `[:xdigit:]`: hex digits.
    * `[:alnum:]`: letters and numbers.
    * `[:cntrl:]`: control characters.
    * `[:graph:]`: letters, numbers, and punctuation.
    * `[:print:]`: letters, numbers, punctuation, and white space.
    * `[:space:]`: space characters (basically equivalent to `\s`).
    * `[:blank:]`: space and tab.

---

# Your turn!

- Use the `words` data set of common English words that ships with `stringr`.

``` r
library(stringr)
head(words)
```

```
## [1] "a" "able" "about" "absolute" "accept" "account"
```

``` r
length(words)
```

```
## [1] 980
```

- Task 1. How many words contain the pattern "ing"?

- Task 2. How many words end in "ing"? Hint: [Use anchors](https://stringr.tidyverse.org/articles/regular-expressions.html#anchors).

- Task 3. Which words end in "ing"?

???
``` r
# Task 1:
library(dplyr) # provides the %>% pipe and, later, the starwars data
str_detect(words, pattern = regex("ing")) %>% sum()
```

```
## [1] 10
```

``` r
# Same as above:
str_detect(words, "ing") %>% sum()
```

```
## [1] 10
```

``` r
# Task 2:
str_detect(words, "ing$") %>% sum()
```

```
## [1] 9
```

``` r
# Task 3:
words[str_detect(words, "ing$")]
```

```
## [1] "bring" "during" "evening" "king" "meaning" "morning" "ring"
## [8] "sing" "thing"
```

---

# String subsetting with str_subset()

- `str_subset()` returns the **elements** of a character vector that match a regular expression.

- Using the `starwars` data set (from the `dplyr` package), let's subset the character names that contain any punctuation.

``` r
head(starwars$name)
```

```
## [1] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader"
## [5] "Leia Organa" "Owen Lars"
```

``` r
* str_subset(starwars$name, "[:punct:]")
```

```
## [1] "C-3PO" "R2-D2" "R5-D4" "Obi-Wan Kenobi"
## [5] "IG-88" "Qui-Gon Jinn" "Ki-Adi-Mundi" "R4-P17"
```

---

# String extract using str_extract()

- `str_extract()` extracts the text of the **first match** in each string and returns a character vector, with `NA` where there is no match.

``` r
str_extract(starwars$name, "[:punct:]")
```

```
## [1] NA "-" "-" NA NA NA NA "-" NA "-" NA NA NA NA NA NA NA NA NA
## [20] NA NA "-" NA NA NA NA NA NA NA NA "-" NA NA NA NA NA NA NA
## [39] NA NA NA NA NA NA NA NA NA NA NA NA "-" NA NA NA NA NA NA
## [58] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA "-" NA NA
## [77] NA NA NA NA NA NA NA NA NA NA NA
```

---

# Finding pattern locations using str_locate()

- `str_locate()` locates the **first position** of a pattern in each string and returns a numeric matrix with columns `start` and `end`. `str_locate_all()` locates **all positions** of a given pattern.

``` r
str_locate(starwars$name, "[:punct:]") %>% head()
```

```
##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]    NA  NA
## [5,]    NA  NA
## [6,]    NA  NA
```

---

# Pattern counting using str_count()

- `str_count()` counts the number of matches of a pattern in each string.
``` r
str_count(starwars$name, "[:punct:]")
```

```
## [1] 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [77] 0 0 0 0 0 0 0 0 0 0 0
```

---

# String replacing with str_replace()

- `str_replace()` replaces the first match of a pattern in each string.

- The `pattern` argument gives the pattern to be replaced and the `replacement` argument gives the replacement string.

``` r
head(fruit)
```

```
## [1] "apple" "apricot" "avocado" "banana" "bell pepper"
## [6] "bilberry"
```

``` r
# Replace berry with berries
str_replace(fruit, pattern = "berry", replacement = "berries")
```

```
## [1] "apple" "apricot" "avocado"
## [4] "banana" "bell pepper" "bilberries"
## [7] "blackberries" "blackcurrant" "blood orange"
## [10] "blueberries" "boysenberries" "breadfruit"
## [13] "canary melon" "cantaloupe" "cherimoya"
## [16] "cherry" "chili pepper" "clementine"
## [19] "cloudberries" "coconut" "cranberries"
## [22] "cucumber" "currant" "damson"
## [25] "date" "dragonfruit" "durian"
## [28] "eggplant" "elderberries" "feijoa"
## [31] "fig" "goji berries" "gooseberries"
## [34] "grape" "grapefruit" "guava"
## [37] "honeydew" "huckleberries" "jackfruit"
## [40] "jambul" "jujube" "kiwi fruit"
## [43] "kumquat" "lemon" "lime"
## [46] "loquat" "lychee" "mandarine"
## [49] "mango" "mulberries" "nectarine"
## [52] "nut" "olive" "orange"
## [55] "pamelo" "papaya" "passionfruit"
## [58] "peach" "pear" "persimmon"
## [61] "physalis" "pineapple" "plum"
## [64] "pomegranate" "pomelo" "purple mangosteen"
## [67] "quince" "raisin" "rambutan"
## [70] "raspberries" "redcurrant" "rock melon"
## [73] "salal berries" "satsuma" "star fruit"
## [76] "strawberries" "tamarillo" "tangerine"
## [79] "ugli fruit" "watermelon"
```

---

# String replacing with str_replace() Cont.
``` r
# replace the first l with "" (delete the first l)
str_replace("Hello world", pattern = "l", replacement = "")
```

```
## [1] "Helo world"
```

``` r
# replace all l's with "" (delete all l's)
str_replace_all("Hello world", pattern = "l", replacement = "")
```

```
## [1] "Heo word"
```

---

# Functions to Remember for Week 11

- String manipulations using base R and `stringr`.

- Usage of regular expressions.

- Pattern matching functions.

- Practice!

---

# Your turn! Class Worksheet

<center><img src="../images/done.gif" width="300px" /></center>

- Working in small groups, complete the following worksheet: [Module 8-2 Worksheet](../worksheets/Week_11_Worksheet.html)

- Once completed, feel free to work on your Assessments.

<br>
<br>
<br>

[Return to Course Website](../index.html)