class: center, middle, inverse, title-slide

# Regular Expressions and stringr

## Pavitra Chakravarty

### R-Ladies Cologne, R-Ladies Gaborone

---

<style>
hide { display: none; }
.remark-slide-content h1 { font-size: 45px; }
h1 { font-size: 2em; margin-block-start: 0.67em; margin-block-end: 0.67em; }
.remark-slide-content { font-size: 16px }
.remark-code { font-size: 14px; }
code.r { font-size: 14px; }
pre { margin-top: 0px; margin-bottom: 0px; }
.red { color: #FF0000; }
.footnote { color: #800020; font-size: 9px; }
</style>

### What are regular expressions?

+ A regular expression is a pattern that describes a set of strings with a common structure
+ Heavily used for string matching and replacing in most programming languages
+ Heart and soul of string operations

---

### Regular expression syntax

6 basic canonical characteristics of regular expressions

+ __basic pattern matching__: using functions from the stringr package with an exact sequence of characters
  + `str_detect()`, `str_subset()`, `str_view()`, `str_view_all()`
+ __anchors__: indicate the start and end of a string
  + `^`: start of the string, `$`: end of the string
+ __escape characters__: special characters cannot be written directly in a string
  + `\`: to find strings containing a single quote `'`, "escape" the single quote by preceding it with `\`

---

+ __character classes__: specify entire classes of characters, such as numbers or letters, using either `[:` and `:]` around a predefined name or `\` and a special character
  + `[:digit:]` or `\d`: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to `[0-9]`
  + `\D`: non-digits, equivalent to `[^0-9]`
  + `[:lower:]`: lower-case letters, equivalent to `[a-z]`
  + `[:upper:]`: upper-case letters, equivalent to `[A-Z]`
  + `[:alpha:]`: alphabetic characters, equivalent to `[[:lower:][:upper:]]` or `[A-Za-z]`
  + `[:alnum:]`: alphanumeric characters, equivalent to `[[:alpha:][:digit:]]` or `[A-Za-z0-9]`
  + `\w`: word characters, equivalent to `[[:alnum:]_]` or `[A-Za-z0-9_]`
  + `\W`: non-word characters, equivalent to `[^A-Za-z0-9_]`
  + `[:blank:]`: blank characters, i.e. space and tab
  + `[:space:]`: whitespace characters: tab, newline, vertical tab, form feed, carriage return, space
  + `\s`: any whitespace character
  + `\S`: any non-whitespace character
+ __quantifiers__: specify how many repetitions of the pattern to match
  + `*`: matches at least 0 times
  + `+`: matches at least 1 time
  + `?`: matches at most 1 time
  + `{n}`: matches exactly n times
  + `{n,}`: matches at least n times
  + `{n,m}`: matches between n and m times
+ __character clusters__: use parentheses to keep a pattern together
  + `()`: use with pattern-matching characters to create groups

---

### Dataset being used today

```r
library(tidyverse)
enron <- read_csv("https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/data/enron/enron.csv") %>%
  drop_na()
glimpse(enron)
```

```
## Rows: 214,195
## Columns: 3
## $ mail_num <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10, 10, 10, 1…
## $ person   <chr> "allen-p", "allen-p", "allen-p", "allen-p", "allen-p", "allen…
## $ email    <chr> "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>", …
```

```r
head(enron, n = 50)
```

```
## # A tibble: 50 × 3
##    mail_num person  email
##       <dbl> <chr>   <chr>
##  1        1 allen-p Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
##  2        1 allen-p Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
##  3        1 allen-p From: phillip.allen@enron.com
##  4        1 allen-p To: tim.belden@enron.com
##  5        1 allen-p Subject:
##  6        1 allen-p Mime-Version: 1.0
##  7        1 allen-p Content-Type: text/plain; charset=us-ascii
##  8        1 allen-p Content-Transfer-Encoding: 7bit
##  9        1 allen-p X-From: Phillip K Allen
## 10        1 allen-p X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
## # … with 40 more rows
```

---

### Canonical principle #1: Basic pattern-matching

```r
# Matching is case-sensitive: person values are lower-case ("allen-p"),
# so "Allen" finds no rows
enron %>% filter(str_detect(person, "Allen"))
```

```
## # A tibble: 0 × 3
## # … with 3 variables: mail_num <dbl>, person <chr>, email <chr>
```

```r
# Note: an unescaped "." matches any single character, not only a literal dot
str_subset(enron$email, "tracy.ngo")
```
```
## [1] "To: tracy.ngo@enron.com"
## [2] "To: tracy.ngo@enron.com"
## [3] "To: tim.belden@enron.com, steve.c.hall@enron.com, tracy.ngo@enron.com,"
```

```r
str_view_all(enron$email, "tracy.ngo")
```

---

### Canonical principle #2: Anchors

+ `^`: matches the start of the string.
+ `$`: matches the end of the string.
+ `\b`: matches the empty string at either edge of a _word_. Don't confuse it with `^` and `$`, which mark the edges of a _string_.
+ `\B`: matches the empty string provided it is not at an edge of a word.

```r
# select() with no arguments keeps the matching rows but drops every column
enron %>% filter(str_detect(email, "@ECT")) %>% select()
```

```
## # A tibble: 6,524 × 0
```

```r
# "$" anchors the match to the end of the string
enron %>% filter(str_detect(email, "weekend$"))
```

```
## # A tibble: 45 × 3
##    mail_num person   email
##       <dbl> <chr>    <chr>
##  1       69 allen-p  morning I sent you the roll did you get it? Did you need m…
##  2       94 carson-m Subject: This weekend
##  3       94 carson-m Subject: This weekend
##  4       95 carson-m Subject: Re: This weekend
##  5       69 davis-d  Subject: Manual JE info for cutover weekend
##  6       69 davis-d  Subject: Manual JE info for cutover weekend
##  7       69 davis-d  Subject: Manual JE info for cutover weekend
##  8        1 dean-c   Subject: RE: This weekend
##  9        1 dean-c   Subject: RE: This weekend
## 10        1 dean-c   Subject: RE: This weekend
## # … with 35 more rows
```

---

### Canonical principle #3: Escape characters

```r
x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")
# Unescaped parentheses create a group; they do not match literal "(" and ")"
str_view(x, "(\\d\\d\\d)\\d\\d\\d-\\d\\d\\d\\d")
```
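
A minimal sketch of why the escapes matter, using the same `x`. The escaped variant with `\\(` and `\\)` is an illustration added here, not a pattern from the original slides.

```r
library(stringr)

x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")

# Unescaped parentheses only group, so this pattern asks for six digits,
# a hyphen, then four digits. Nothing in x has six digits in a row.
unescaped <- str_detect(x, "(\\d\\d\\d)\\d\\d\\d-\\d\\d\\d\\d")

# Escaped parentheses match a literal "(" and ")" around the area code.
escaped <- str_detect(x, "\\(\\d\\d\\d\\)\\d\\d\\d-\\d\\d\\d\\d")

unescaped  # FALSE FALSE FALSE FALSE
escaped    # FALSE  TRUE FALSE FALSE
```

Only `"(123)456-7890"` matches the escaped pattern, because the third string has a space after the closing parenthesis.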
---

```r
str_view("so it goes $^$ here", "\\$\\^\\$")
```
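
Two layers of escaping are at work in this pattern: R's string parser turns `\\$` into `\$`, and the regex engine then reads `\$` as a literal dollar sign. A small sketch, assuming stringr is loaded; `fixed()` is stringr's helper for literal matching and is shown as an alternative, not as the approach used on the slide.

```r
library(stringr)

s <- "so it goes $^$ here"

# Print the pattern as the regex engine receives it after R parses the string
writeLines("\\$\\^\\$")

# Escaped regex: each metacharacter is matched literally
regex_hit <- str_extract(s, "\\$\\^\\$")

# fixed() skips regex interpretation, so no escapes are needed
fixed_hit <- str_extract(s, fixed("$^$"))
```

Both `regex_hit` and `fixed_hit` hold the three characters `$^$`.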
---

### Canonical principle #4: Character Classes

```r
str_view(stringr::words, "^[yx]", match = TRUE)
```
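
A hedged sketch of character classes on small made-up inputs (the vector `w` and the sentence are hypothetical, not taken from `stringr::words` or the Enron data). It also shows why a range such as `[A-z]` is wider than the letters: in ASCII, several punctuation characters sit between `Z` and `a`.

```r
library(stringr)

# Hypothetical example vector
w <- c("yes", "xylophone", "day", "box")
starts_yx <- str_subset(w, "^[yx]")  # strings that start with "y" or "x"

# The predefined class [[:digit:]] and the shorthand \\d both match digits
digits_posix <- str_extract_all("Call 555-0199 at 9am", "[[:digit:]]+")[[1]]
digits_short <- str_extract_all("Call 555-0199 at 9am", "\\d+")[[1]]

# Pitfall: the range A-z also covers "[", "\", "]", "^", "_" and "`"
underscore_wide   <- str_detect("_", "[A-z]")     # TRUE
underscore_strict <- str_detect("_", "[A-Za-z]")  # FALSE
```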
---

```r
str_view(stringr::words, "[^e]ed$", match = TRUE)
```
---

```r
str_view(c("red", "reed"), "[^e]ed$", match = FALSE)
```
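
Inside square brackets a leading `^` negates the class, while outside brackets it anchors the start of the string. A small sketch on a hypothetical vector, using `str_detect()` instead of the viewer:

```r
library(stringr)

v <- c("red", "reed", "bed", "need")

# "[^e]ed$": any character except "e", followed by "ed" at the end
ends_ed_not_eed <- str_detect(v, "[^e]ed$")

# "^[^e]": the first character of the string is anything except "e"
first_not_e <- str_detect(c("red", "eel"), "^[^e]")

ends_ed_not_eed  # TRUE FALSE TRUE FALSE
first_not_e      # TRUE FALSE
```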
---

```r
str_view(stringr::words, "^(thr)*", match = TRUE)
```
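
Because `*` also allows zero repetitions, `^(thr)*` succeeds on every string: it can match nothing at the start. A sketch on a hypothetical vector, comparing it with `+`, which requires at least one `thr`:

```r
library(stringr)

v <- c("three", "through", "the", "apple")

star_hits <- str_detect(v, "^(thr)*")  # zero or more "thr" at the start
plus_hits <- str_detect(v, "^(thr)+")  # one or more "thr" at the start

star_hits  # TRUE TRUE TRUE TRUE
plus_hits  # TRUE TRUE FALSE FALSE
```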
---

### Canonical principle #5: Quantifiers

+ `*`: matches at least 0 times.
+ `+`: matches at least 1 time.
+ `?`: matches at most 1 time.
+ `{n}`: matches exactly n times.
+ `{n,}`: matches at least n times.
+ `{n,m}`: matches between n and m times.

```r
x <- c("dkl kls. klk. _", "(425) 591-6020", "her number is (581) 434-3242", "442", " dsi")
# Requires the whole string to consist only of "d", "k" and "h";
# none of the strings in x qualify
str_view(x, "^[dkh]*$")
```
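
To see each quantifier from the list above in isolation, here is a small sketch on a made-up string. `str_extract()` returns the first match, and quantifiers are greedy by default, so each one takes as many characters as it is allowed.

```r
library(stringr)

s <- "baaad"  # hypothetical input

zero_or_more <- str_extract(s, "ba*")     # "baaa"
one_or_more  <- str_extract(s, "ba+")     # "baaa"
at_most_one  <- str_extract(s, "ba?")     # "ba"
exactly_two  <- str_extract(s, "a{2}")    # "aa"
two_or_more  <- str_extract(s, "a{2,}")   # "aaa"
one_to_two   <- str_extract(s, "a{1,2}")  # "aa"
```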
---

```r
x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")
str_view(x, "\\([0-9][0-9][0-9]\\)[ ]*[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
```

```r
x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")
str_view(x, "\\([0-9][0-9][0-9]\\)[ ]+[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
```
---

```r
x <- c("123456-7890", "(123) 456-7890", "(123)456-7890", "1235-2351")
str_view(x, "\\([0-9][0-9][0-9]\\)[ ]?[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
```
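
The three phone-number patterns differ only in the quantifier after `[ ]`. A sketch that puts them side by side; the helper `phone()` and the two-space entry in `x` are hypothetical additions, and `{3}` and `{4}` stand in for the repeated `[0-9]`.

```r
library(stringr)

# The last element (two spaces) is a made-up addition to the slide's vector
x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "(123)  456-7890")

# Hypothetical helper: builds the phone pattern around a given space quantifier
phone <- function(space_quantifier) {
  paste0("\\([0-9]{3}\\)[ ]", space_quantifier, "[0-9]{3}-[0-9]{4}")
}

any_spaces  <- str_detect(x, phone("*"))  # zero or more spaces
some_spaces <- str_detect(x, phone("+"))  # one or more spaces
one_or_none <- str_detect(x, phone("?"))  # zero or one space

any_spaces   # FALSE  TRUE  TRUE  TRUE
some_spaces  # FALSE FALSE  TRUE  TRUE
one_or_none  # FALSE  TRUE  TRUE FALSE
```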
---

```r
x <- c("4444-22-22", "test", "333-4444-22")
str_view(x, "\\d{4}-\\d{2}-\\d{2}")
```
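
Without anchors, a counted pattern such as this one can also match inside a longer run of digits. A sketch that combines it with `^` and `$`; the last element of `x` is a hypothetical addition.

```r
library(stringr)

# "14444-22-221" is made up to show a match inside a longer string
x <- c("4444-22-22", "test", "333-4444-22", "14444-22-221")

loose  <- str_detect(x, "\\d{4}-\\d{2}-\\d{2}")    # match anywhere in the string
strict <- str_detect(x, "^\\d{4}-\\d{2}-\\d{2}$")  # the whole string must fit

loose   # TRUE FALSE FALSE  TRUE
strict  # TRUE FALSE FALSE FALSE
```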
---

### Canonical principle #6: Character Clusters

```r
# (edu|net): the group limits the alternation to what follows the dot
enron %>% filter(str_detect(email, "@.*\\.(edu|net)")) %>% select(email)
```

```
## # A tibble: 1,646 × 1
##    email
##    <chr>
##  1 "<retwell@mail.sanmarcos.net>"
##  2 "cc: \"Larry Lewter\" <retwell@mail.sanmarcos.net>, \"Claudia L. Crocker\""
##  3 "\"Bob McKinney\" <capstone@texas.net> on 11/27/2000 09:46:13 AM"
##  4 "To: \"Capstone\" <capstone@texas.net>"
##  5 "Brian_Hoskins@enron.net"
##  6 "Brian_Hoskins@enron.net"
##  7 "Brian_Hoskins@enron.net"
##  8 "Brian_Hoskins@enron.net"
##  9 "To: adam.r.bayer@vanderbilt.edu"
## 10 "X-To: \"Adam Bayer\" <adam.r.bayer@vanderbilt.edu> @ ENRON"
## # … with 1,636 more rows
```

```r
# Matches any text after "@" that contains "ns.net", e.g. wans.net or midplains.net
enron %>% filter(str_detect(email, "@.*(ns)\\.(net)")) %>% select(email)
```

```
## # A tibble: 6 × 1
##   email
##   <chr>
## 1 "\"Karen Edson\" <kedson@ns.net> on 07/08/2000 03:06:40 PM"
## 2 "cc: \"Julee Malinowski-Ball (E-mail)\" <jmball@ns.net>, \"Ray McNally (E-mai…
## 3 "kedson@ns.net"
## 4 "<fotinb@bc-mail.com>; \"Bill Hannah\" <hannahs@wans.net>; \"Bill Harvey\""
## 5 "\"Harvey Wax\" <HLWAX@aol.com>; \"J. D Zikuda\" <jdzikuda@netins.net>; \"Jam…
## 6 "<rndyhbnr@midplains.net>; \"Ray Clary\" <rclrec@mindspring.com>; \"Rich Hari…
```

---

### Let's Play!

https://regexcrossword.com/challenges/beginner/puzzles/1

---

### Acknowledgements

Material has been borrowed heavily from the STAT 545 course, which was started by Jenny Bryan: https://stat545.stat.ubc.ca/notes/notes-b05/

More STAT 545 resources: https://stat545.com/character-vectors.html, https://youtu.be/I0dJ1zpxAtU

R for Data Science chapter on Strings: https://r4ds.had.co.nz/strings.html

Solution set for R4DS on Strings: https://brshallo.github.io/r4ds_solutions/14-strings.html#matching-patterns-w-regex

Regex Puzzle Builder: https://regexcrossword.com/puzzlebuilder