regex

What is regex

A regular expression (regex) is a powerful tool for managing strings. In this course, we will use regex with the stringr package (part of the tidyverse) to find, extract, or remove bits of text. Regex is actually a mini-language of its own, and you can use it in R, Python, C, Java, JavaScript, and other languages (with slight local changes). In the examples below, we’ll use regex in the pattern argument of stringr commands such as str_detect() and str_count(). Let’s start with an example:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.4     ✔ purrr   1.0.2
## ✔ tibble  3.2.1     ✔ dplyr   1.1.2
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

some_text <- c("2c","apple pie","3ss","dw")
str_detect(string = some_text, pattern = "\\d")

## [1]  TRUE FALSE  TRUE FALSE

The strings “2c” and “3ss” in some_text don’t have the letter “d”, but str_detect() returned TRUE. Strangely, str_detect() returned FALSE for “dw”. What’s going on?

The power of regex comes from metacharacters, which represent ranges or types of characters. In R, many regex metacharacters are written with a double backslash: adding the \\ transforms the meaning of a character. (The regex itself uses a single backslash, \d, but a backslash inside an R string must itself be escaped, so we type \\d.) In the example above, \\d means a digit, not the letter “d”. If we remove the double backslash, we search for the “literal” letter “d”. That returns TRUE only for the last string, which contains the “literal” letter “d”. As you can see, the double backslash changes “d” from a literal into a metacharacter.

some_text

## [1] "2c"        "apple pie" "3ss"       "dw"

str_detect(string = some_text, pattern = "d")

## [1] FALSE FALSE FALSE  TRUE
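To see why the backslash is doubled, it can help to look at what R actually stores in the string. The sketch below assumes R 4.0 or later for the raw-string syntax; the variable names are only illustrative.

```r
library(stringr)

digit_pattern <- "\\d"      # typed with two backslashes in R source code

nchar(digit_pattern)        # the stored string has only 2 characters: \ and d
writeLines(digit_pattern)   # prints the pattern as the regex engine receives it

# A raw string (R >= 4.0) avoids the extra escaping and gives the same pattern
raw_pattern <- r"(\d)"
identical(digit_pattern, raw_pattern)

# Both spellings therefore behave the same way in stringr
str_detect(c("2c", "dw"), raw_pattern)
```

The doubled backslash is a feature of R string literals rather than of regex itself, which is why the same pattern is usually written \d in other languages' documentation.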

Metacharacters: class and position

There are three main metacharacters for classes: “d”, “s”, and “w”. Each has an uppercase version that matches the opposite.

metacharacter	name	what it matches
\\d	digit	0 through 9
\\D	non-digit	anything not 0 through 9
\\w	word	“word” characters: a-z, A-Z, 0-9, and the underscore
\\W	non-word	anything that is not a “word” character
\\s	whitespace	whitespace characters, including space, tab, and new line
\\S	non-whitespace	non-whitespace characters

Let’s try “s” as both a “literal” and a metacharacter. We’ll switch to str_count() and drop verbose syntax.

some_text

## [1] "2c"        "apple pie" "3ss"       "dw"

str_count(some_text,"s")

## [1] 0 0 2 0

str_count(some_text,"\\s")

## [1] 0 1 0 0
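The remaining class metacharacters work the same way. As a small sketch on the same some_text vector (the result variable names are made up for illustration):

```r
library(stringr)

some_text <- c("2c", "apple pie", "3ss", "dw")

word_counts     <- str_count(some_text, "\\w")  # letters, digits, and underscore
nonword_counts  <- str_count(some_text, "\\W")  # here, only the space in "apple pie"
nondigit_counts <- str_count(some_text, "\\D")  # everything except 0 through 9

word_counts      # 2 8 3 2
nonword_counts   # 0 1 0 0
nondigit_counts  # 1 9 2 2
```

Note that the word and non-word counts for each string add up to its total number of characters, since every character falls into exactly one of the two classes.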

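There are also metacharacters that focus on position rather than class, shown below together with the dot. Unlike the class metacharacters, these do not need the double backslash: they are metacharacters by default, and we use the double backslash to turn them into literals (for example, \\. matches an actual period). The dot is possibly the most powerful regex character. It matches ANY single character (except, by default, a new line).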

metacharacter	name	what it matches
^	caret	start of string or line
$	dollar	end of string or line
.	dot	any single character

We’ll use str_extract() to get the first character of every element of some_text.

some_text

## [1] "2c"        "apple pie" "3ss"       "dw"

str_extract(some_text,"^.")

## [1] "2" "a" "3" "d"
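The dollar sign and the escaped dot can be tried in the same way. This is a minimal sketch; the heights vector is a pair of illustrative values, not data loaded from a file.

```r
library(stringr)

some_text <- c("2c", "apple pie", "3ss", "dw")

# "." is any single character and "$" anchors it to the end of the string
last_chars <- str_extract(some_text, ".$")
last_chars   # "c" "e" "s" "w"

heights <- c("5-08.5", "5-08")             # illustrative values
any_char    <- str_detect(heights, ".")    # unescaped dot: any character will do
literal_dot <- str_detect(heights, "\\.")  # escaped dot: only a real period matches

any_char      # TRUE TRUE
literal_dot   # TRUE FALSE
```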

Metacharacters: quantifiers and operators

Regex also has operators for repetition, grouping, and logical functions. The logical operators are fairly straightforward.

metacharacter	name	what it matches/does	example
[ ]	square bracket	anything within the bracket	[ab] means “a” or “b”
[^ ]	negation	anything NOT within the bracket	[^ab] means neither “a” nor “b”
|	or operator	or	a|b also means “a” or “b”
( )	parenthesis	groups letters	(cat)|(dog) means “cat” or “dog”

Here are some examples:

str_extract(c("apple","pie","Peter","polo","append"),"p[aeiou]") ## any lowercase "p" followed by [aeiou], which means "any lowercase vowel"

## [1] NA   "pi" NA   "po" "pe"

str_extract(c("apple","pie","Peter","polo","append"),"[Pp][aeiou]") ## any "p" or "P" followed by "any lowercase vowel"

## [1] NA   "pi" "Pe" "po" "pe"

str_extract(c("apple","pie","Peter","polo","append"),"((pp)[aeiou])") ## "pp" followed by "any lowercase vowel"

## [1] NA    NA    NA    NA    "ppe"

str_extract(c("apple","pie","Peter","polo","append"),"((pp)[^aeiou])") ## "pp" followed by anything other than a lowercase vowel

## [1] "ppl" NA    NA    NA    NA
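The or operator and parentheses are not used in the examples above, so here is a short sketch of both. The animals and colours vectors are invented for illustration.

```r
library(stringr)

animals <- c("my cat", "hot dog", "bird")

# "cat|dog" matches either whole word
either_animal <- str_extract(animals, "cat|dog")
either_animal   # "cat" "dog" NA

colours <- c("grey", "gray", "groy")

# Parentheses limit the "or" to a single position: "gr", then "a" or "e", then "y"
grey_or_gray <- str_detect(colours, "gr(a|e)y")
grey_or_gray    # TRUE TRUE FALSE
```

Without the parentheses, "gra|ey" would mean “gra” or “ey”, which is a different pattern.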

Quantifiers for repetition are a bit more conceptual.

metacharacter	name	what it matches/does
?	question mark	the preceding character zero or one time
*	star	the preceding character zero or more times
+	plus	the preceding character one or more times
{}	curly bracket	the preceding character a set number or range of times, e.g. {3,4}

Here are some examples:

str_extract(string = "a", pattern = "a.") ## there's nothing after "a", so the result is NA

## [1] NA

str_extract(string = "a", pattern = "a.*") ## there's nothing after "a", but ".*" accepts zero repetitions of "."

## [1] "a"

str_extract(string = "a", pattern = "a.?") ## there's nothing after "a", but ".?" accepts zero repetitions of "."

## [1] "a"

str_extract(string = "a", "a.+") ## there's nothing after "a", and ".+" wants at least one repetition of "."

## [1] NA

For longer strings, the “*” and “?” return completely different results. The “star” is greedy and matches as many repetitions as possible, while the “?” accepts at most one repetition, so it stops after a single character.

str_extract("Captain Ahab", "a.?") # at most one repetition, so stops after one character

## [1] "ap"

str_extract("Captain Ahab", "a.+") # greedy, so runs from the first "a" to the end of the string

## [1] "aptain Ahab"

str_extract("Captain Ahab", "a.*") # greedy, so runs from the first "a" to the end of the string

## [1] "aptain Ahab"
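The curly brackets from the table have not been demonstrated yet, and greediness can also be switched off. The sketch below goes slightly beyond the table: putting a “?” directly after another quantifier (as in “.*?”) makes that quantifier “lazy”, so it matches as few repetitions as possible.

```r
library(stringr)

# "a" followed by exactly two more characters
exactly_two <- str_extract("Captain Ahab", "a.{2}")
exactly_two   # "apt"

# between two and three "a"s in a row, taking as many as allowed
range_match <- str_extract("aaaa", "a{2,3}")
range_match   # "aaa"

# greedy: from the first "a" to the LAST "a" in the string
greedy <- str_extract("Captain Ahab", "a.*a")
greedy        # "aptain Aha"

# lazy: from the first "a" to the NEXT "a"
lazy <- str_extract("Captain Ahab", "a.*?a")
lazy          # "apta"
```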

“Data cleaning”: an example

With these limited metacharacters we already have enough regex to clean up some messy data on height.

umpire_biodata.df <- read_csv("https://www.retrosheet.org/BIOFILE.TXT")

## Rows: 21913 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (30): PLAYERID, LAST, FIRST, NICKNAME, BIRTHDATE, BIRTH CITY, BIRTH STAT...
## dbl  (1): WEIGHT
## lgl  (2): NAME CHG, BAT CHG
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

unique(umpire_biodata.df$HEIGHT)

##  [1] "6-05"   "6-00"   "6-03"   "6-01"   "6-02"   "5-11"   "5-08.5" "5-10"  
##  [9] "6-06"   "6-04"   "5-09"   NA       "6-02.5" "5-08"   "6-07"   "6-01.5"
## [17] "5-06"   "5-07"   "6-08"   "5-10.5" "6-00.5" "5-11.5" "5-05.5" "5-06.5"
## [25] "5-09.5" "5-04.5" "5-04"   "6-03.5" "5-07.5" "6-04.5" "5-05"   "6-0"   
## [33] "6-05.5" "6-10"   "6-09"   "5-03"   "5-03.5" "5-9.5"  "3-07"   "5-9"   
## [41] "6-11"   "6-06.5" "6-07.5" "5-7"

It looks like there are some “gremlins” in the data, like “3-07.” All the good height data begins with either 5 or 6. We can extract “5 or 6 at the start of a string” with ^[56].

umpire_biodata.df$height_feet <- str_extract(umpire_biodata.df$HEIGHT, "^[65]") ## "^" means beginning of string and [65] means 6 or 5
unique(umpire_biodata.df$height_feet)

## [1] "6" "5" NA

For the height in inches we want everything after the “-” until the end of the string. We can use a “greedy” quantifier.

umpire_biodata.df$height_inches <- str_extract(umpire_biodata.df$HEIGHT, "-.*") ## "-" is a literal dash and ".*" means "all repetitions of anything"
unique(umpire_biodata.df$height_inches)

##  [1] "-05"   "-00"   "-03"   "-01"   "-02"   "-11"   "-08.5" "-10"   "-06"  
## [10] "-04"   "-09"   NA      "-02.5" "-08"   "-07"   "-01.5" "-10.5" "-00.5"
## [19] "-11.5" "-05.5" "-06.5" "-09.5" "-04.5" "-03.5" "-07.5" "-0"    "-9.5" 
## [28] "-9"    "-7"

Looks good, but we need to remove the dash.

umpire_biodata.df$height_inches <- str_remove(umpire_biodata.df$height_inches, "-")
unique(umpire_biodata.df$height_inches)

##  [1] "05"   "00"   "03"   "01"   "02"   "11"   "08.5" "10"   "06"   "04"  
## [11] "09"   NA     "02.5" "08"   "07"   "01.5" "10.5" "00.5" "11.5" "05.5"
## [21] "06.5" "09.5" "04.5" "03.5" "07.5" "0"    "9.5"  "9"    "7"
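As an alternative sketch (not the approach used above), stringr's str_match() can pull out both pieces in one step by using parentheses as capture groups. The heights vector below is a handful of values in the style of the HEIGHT column, typed in by hand.

```r
library(stringr)

heights <- c("6-05", "5-08.5", "5-9", "3-07", NA)

# Group 1: a leading 5 or 6. Group 2: everything after the dash.
parts <- str_match(heights, "^([56])-(.*)$")

# Column 1 of the result is the whole match; columns 2 and 3 are the groups
feet   <- parts[, 2]
inches <- parts[, 3]

feet     # "6" "5" "5" NA NA
inches   # "05" "08.5" "9" NA NA
```

One difference in behavior: because the whole pattern must match, a suspicious value like “3-07” gives NA for both feet and inches, whereas the two-step approach above still extracts its inches.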

If we convert the strings into numbers, we can now calculate the total height in inches. As you may know, height data commonly follows a “bell curve” or “normal distribution” for a common demographic group (e.g. adult men). A simple histogram of the heights of umpires suggests, at first glance, that the data is fairly “normal.”

umpire_biodata.df$height_inches <- as.numeric(umpire_biodata.df$height_inches)
umpire_biodata.df$height_feet <- as.numeric(umpire_biodata.df$height_feet)
umpire_biodata.df$calculated_height <- umpire_biodata.df$height_feet*12+umpire_biodata.df$height_inches
hist(umpire_biodata.df$calculated_height)
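To make the arithmetic concrete, here is the same conversion on three hand-typed values (illustrative only). Leading zeros disappear when the strings become numbers, and a missing value in either part makes the total missing.

```r
feet   <- as.numeric(c("6", "5", NA))
inches <- as.numeric(c("05", "08.5", "07"))

# 12 inches per foot, plus the leftover inches
total <- feet * 12 + inches
total   # 77.0 68.5 NA
```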

As an exploratory exercise, let’s use regex to create some coherent categories from the skin color data for whaling crews. For example, see how we can get all the strange variants of “mulatto”:

crewentries.df <- read.csv("http://laits.utexas.edu/~mr56267/TLAH_Maps_2023/Mapping_textbook/Data/AmericanOffshoreWhalingCrewlists/crewentries_20200302.csv")

sort(unique(crewentries.df$skin))

##   [1] ""                   " "                  " black"            
##   [4] " Dark"              " Dark _"            " Indian"           
##   [7] "-"                  "?"                  "? Ruddy"           
##  [10] "??"                 "'"                  "(--)"              
##  [13] "(-)"                "(Boy)"              "(brown)"           
##  [16] "(Brown)"            "(Copper cld)"       "(Copper)"          
##  [19] "(Dark)"             "(Fair)"             "(flesh)"           
##  [22] "(L-)"               "(light)"            "(Mullatto)"        
##  [25] "(rather) dark"      "(Sambo)"            "(Yellow)"          
##  [28] "[ dark ]"           "[-------]"          "[-]"               
##  [31] "[?]"                "[Coloured]"         "[dark]"            
##  [34] "[Light]"            "[Mixed]"            "`dark"             
##  [37] "african"            "African"            "Albino"            
##  [40] "B"                  "bblk"               "bl"                
##  [43] "black"              "Black"              "black "            
##  [46] "Black "             "Black Indian"       "Black man"         
##  [49] "Black Man"          "black mulatto"      "Black Negro"       
##  [52] "Black Wooly"        "black, colored"     "blackish"          
##  [55] "blackman"           "Blackman"           "blk"               
##  [58] "Blk"                "blk "               "blk man"           
##  [61] "blk Negro"          "Blond"              "Blonde"            
##  [64] "Blue"               "Bricky"             "brown"             
##  [67] "Brown"              "brown "             "Brown "            
##  [70] "Brown(Dark)"        "Brsh"               "coffee"            
##  [73] "Coffee"             "Cold"               "cold."             
##  [76] "Cold."              "Collerd"            "Collourd"          
##  [79] "Colloured"          "Collured"           "colored"           
##  [82] "Colored"            "Colored "           "Colored (Black)"   
##  [85] "Colored [black]"    "Colored Boy"        "Colored man"       
##  [88] "Colored Man"        "Colored sable"      "Colored. Black"    
##  [91] "coloured"           "Coloured"           "coloured "         
##  [94] "Coloured "          "Coloured Dark"      "Coloured man"      
##  [97] "Coloured Man"       "cop? col?"          "Cop. Cold."        
## [100] "Cop. Coloured"      "cop.col."           "cop.cold."         
## [103] "copper"             "Copper"             "copper cd."        
## [106] "Copper cold."       "Copper Coloured"    "copper?"           
## [109] "copperish"          "Copr."              "Coulerd"           
## [112] "coulered"           "Coulered"           "couloured"         
## [115] "Dandy"              "dark"               "Dark"              
## [118] "dark "              "Dark "              "Dark  "            
## [121] "Dark _"             "Dark (Colored)"     "Dark (Mulatto)"    
## [124] "Dark (Negro)"       "Dark Brown"         "Dark Colored"      
## [127] "Dark Complexion"    "dark copper"        "Dark Copper"       
## [130] "Dark Mulatto"       "Dark Negro"         "Dark Ruddy"        
## [133] "Dark Sallow"        "dark yellow"        "Darker"            
## [136] "darkish"            "Darkish"            "darky"             
## [139] "Darl"               "Dlack"              "Do"                
## [142] "F"                  "fair"               "Fair"              
## [145] "fair "              "Fair "              "Fair Ruddy"        
## [148] "fare"               "Flored"             "florid"            
## [151] "Florid"             "florid "            "Florid "           
## [154] "Florid Complection" "flush"              "Fr"                
## [157] "Frecked"            "freckled"           "Freckled"          
## [160] "Freckles"           "Free"               "fresh"             
## [163] "Fresh"              "G"                  "Gray"              
## [166] "H"                  "Honed"              "indian"            
## [169] "Indian"             "Indian "            "Indian ?"          
## [172] "Indian black"       "Indian Dark"        "K"                 
## [175] "Lb"                 "Lifht"              "light"             
## [178] "lIght"              "Light"              "LIght"             
## [181] "light "             "Light "             "Light  "           
## [184] "light   "           "Light ?"            "light (-)"         
## [187] "Light Black"        "light brown"        "Light Brown"       
## [190] "light chestnut"     "Light Dark"         "Light Freckled"    
## [193] "Light Mulatto"      "Light Sambo"        "Light Sandy"       
## [196] "lightish"           "Lightish"           "lite"              
## [199] "Lite"               "Llght"              "Loight"            
## [202] "lt"                 "lt black"           "Lught"             
## [205] "Mac"                "Malatto"            "Medium"            
## [208] "mixed"              "Mixed"              "Mohegan Indian"    
## [211] "Molatto"            "Mulato"             "mulatto"           
## [214] "Mulatto"            "Mulatto Black"      "Mulatto Dark"      
## [217] "Mulatto or Indian"  "Mullato"            "mullatto"          
## [220] "Mullatto"           "Mullatto "          "Mulotto"           
## [223] "N"                  "native"             "Native"            
## [226] "Native Indian"      "negro"              "Negro"             
## [229] "Negro Black"        "negroe"             "of colour"         
## [232] "Olive"              "Pale"               "Pockmarked"        
## [235] "Portguguese"        "Portugal"           "Portugal "         
## [238] "Portuges"           "Portugese"          "portugues"         
## [241] "Portugues"          "Portugues dark"     "Portuguese"        
## [244] "Quadroon"           "rather dark"        "Red"               
## [247] "Redish"             "ruddy"              "Ruddy"             
## [250] "rudy"               "Rudy"               "S"                 
## [253] "Sable"              "sable "             "sallow"            
## [256] "Sallow"             "Sallow ?"           "Sambo"             
## [259] "San"                "sandy"              "Sandy"             
## [262] "Seppia"             "Sw"                 "swarthy"           
## [265] "Swarthy"            "Tawny"              "Tolerably fair"    
## [268] "Unknown"            "Very Dark"          "white"             
## [271] "White"              "Wooley"             "wooly"             
## [274] "yellow"             "Yellow"             "yellow "           
## [277] "Yellow "            "yellow   "          "yellow Indian"     
## [280] "Yellow/Mullatto"    "yellowish"          "Yellowish"         
## [283] "Yellowish "

unique(crewentries.df$skin[which(str_detect(crewentries.df$skin, "[mM][ou]l"))])

##  [1] "mulatto"           "Mulatto"           "Mulatto Dark"     
##  [4] "Molatto"           "Mulato"            "Mulatto or Indian"
##  [7] "Mullatto"          "Mullatto "         "Mulatto Black"    
## [10] "Yellow/Mullatto"   "(Mullatto)"        "mullatto"         
## [13] "Mullato"           "Light Mulatto"     "Mulotto"          
## [16] "black mulatto"     "Dark (Mulatto)"    "Dark Mulatto"
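The pattern above requires “o” or “u” as the second letter, so it does not pick up the spelling “Malatto” that appears in the full list. The sketch below shows one possible broader pattern on a small hand-copied sample of the values; the pattern is a suggestion for experimentation, not a definitive solution, and it uses stringr's regex() helper with ignore_case = TRUE instead of [mM].

```r
library(stringr)

# A few values copied from the list above, plus two that should not match
skin_sample <- c("Mulatto", "Malatto", "Molatto", "Mullato", "Mulotto",
                 "Yellow/Mullatto", "Dark", "Black ")

original_hits <- str_detect(skin_sample, "[mM][ou]l")

# m, then a/o/u, one or more l, then a/o, one or more t, then o
broader_hits <- str_detect(
  skin_sample,
  regex("m[aou]l+[ao]t+o", ignore_case = TRUE)
)

# Values found only by the broader pattern
missed <- skin_sample[broader_hits & !original_hits]
missed   # "Malatto"
```

A broader pattern can also match things you did not intend, so it is worth checking the results with unique() on the full column before recoding anything.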

Notes:

This page borrows heavily from Sara A. Metwalli’s medium.com post on regex for Python.