Regular expressions

Lecture 13

2025-04-20

Introduction to Regular Expressions

Prerequisites

  • This lecture focuses on functions that use regular expressions, a concise and powerful language for describing patterns within strings.

# Install necessary packages if not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("babynames", quietly = TRUE)) install.packages("babynames")

# Load the libraries
library(tidyverse)
library(babynames)

Pattern basics

When you supply a regular expression as its second argument, str_view() shows only the elements of the string vector that match, surrounding each match with <> and, where possible, highlighting the match in blue.

str_view(fruit, "berry")
 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>

Pattern basics

  • Letters and numbers match exactly and are called literal characters.

  • Most punctuation characters, like ., +, *, [, ], and ?, have special meanings and are called metacharacters.

Pattern example 1

For example, . matches any character, so "a." matches any string that contains an “a” followed by another character:

str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
[2] │ <ab>
[3] │ <ae>
[6] │ e<ab>

Pattern example 2

For example, we could find all the fruits that contain an “a”, followed by any three characters, followed by an “e”. Because . matches any character, not just letters, the space in “salal berry” counts:

str_view(fruit, "a...e")
 [1] │ <apple>
 [7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
[62] │ pine<apple>
[64] │ pomegr<anate>
[70] │ r<aspbe>rry
[73] │ sal<al be>rry

Quantifiers

Quantifiers control how many times a pattern can match:

  • ? makes a pattern optional (i.e. it matches 0 or 1 times)

  • + lets a pattern repeat (i.e. it matches at least once)

  • * lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).

Quantifier ?

# ab? matches an "a", optionally followed by a "b".
str_view(c("a", "ab", "abb"), "ab?")
[1] │ <a>
[2] │ <ab>
[3] │ <ab>b

Quantifier +

# ab+ matches an "a", followed by at least one "b".
str_view(c("a", "ab", "abb"), "ab+")
[2] │ <ab>
[3] │ <abb>

Quantifier *

# ab* matches an "a", followed by any number of "b"s.
str_view(c("a", "ab", "abb"), "ab*")
[1] │ <a>
[2] │ <ab>
[3] │ <abb>

Character classes

Character classes are defined by [] and let you match a set of characters:

  • [abcd] matches “a”, “b”, “c”, or “d”.

  • You can also invert the match by starting with ^: [^abcd] matches anything except “a”, “b”, “c”, or “d”.

Example 1

We can use this idea to find the words containing an “x” surrounded by vowels:

str_view(words, "[aeiou]x[aeiou]")
[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st

Example 2

We can use the same idea to find words containing a “y” surrounded by consonants (strictly, by any characters that are not lowercase vowels):

str_view(words, "[^aeiou]y[^aeiou]")
[836] │ <sys>tem
[901] │ <typ>e

Alternation

You can use alternation, |, to pick between one or more alternative patterns. For example, the following pattern looks for fruits containing “apple”, “melon”, or “nut”.

str_view(fruit, "apple|melon|nut")
 [1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
[52] │ <nut>
[62] │ pine<apple>
[72] │ rock <melon>
[80] │ water<melon>

Key functions

Detect matches

str_detect() returns a logical vector that is TRUE if the pattern matches an element of the character vector and FALSE otherwise.

str_detect(c("a", "b", "c"), "[aeiou]")
[1]  TRUE FALSE FALSE
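Because str_detect() returns a logical vector, it combines naturally with filter() and with numeric summaries. The following is a minimal sketch; the `people` tibble and its names are made up for illustration and are not part of the lecture data.

```r
library(tidyverse)

# Hypothetical toy data, invented for this illustration
people <- tibble(name = c("Alex", "Maxine", "Xavier", "Tom"))

# TRUE for each name containing a lowercase "x"; matching is case
# sensitive, so the capital "X" in "Xavier" does not count
has_x <- str_detect(people$name, "x")
has_x

# Keep only the rows whose name matches
with_x <- people |> filter(str_detect(name, "x"))
with_x

# TRUE counts as 1, so sum() gives the number of matching strings
# and mean() gives the proportion
n_match <- sum(has_x)
prop_match <- mean(has_x)
```

Here two of the four names match, so `n_match` is 2 and `prop_match` is 0.5.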

Count matches

The next step up in complexity from str_detect() is str_count(): rather than a true or false, it tells you how many matches there are in each string.

x <- c("apple", "banana", "pear")
str_count(x, "p")
[1] 2 0 1
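One detail worth checking for yourself is that matches never overlap: each match starts after the end of the previous one. A small sketch with a made-up input string:

```r
library(tidyverse)

# "aba" could start at positions 1, 3, and 5, but matches cannot overlap,
# so only the occurrences starting at positions 1 and 5 are counted
n_aba <- str_count("abababa", "aba")
n_aba

# str_view() shows which parts of the string were actually matched
str_view("abababa", "aba")
```

The count is 2 rather than 3, because the middle occurrence shares characters with the first match.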

Replace values

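We can modify values with str_replace() and str_replace_all(): str_replace() replaces only the first match in each string, while str_replace_all() replaces every match.

str_remove() and str_remove_all() are handy shortcuts for str_replace(x, pattern, "") and str_replace_all(x, pattern, ""), respectively.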

Replace values

x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
[1] "-ppl-"  "p--r"   "b-n-n-"
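For comparison, here is a short sketch of the other three functions on the same vector. str_replace() and str_remove() act on the first match in each string only, while the _all variants act on every match.

```r
library(tidyverse)

x <- c("apple", "pear", "banana")

# Replace only the first vowel in each string
first_only <- str_replace(x, "[aeiou]", "-")
first_only
# [1] "-pple"  "p-ar"   "b-nana"

# Remove only the first vowel in each string
removed_first <- str_remove(x, "[aeiou]")
removed_first
# [1] "pple"  "par"   "bnana"

# Remove every vowel
removed_all <- str_remove_all(x, "[aeiou]")
removed_all
# [1] "ppl" "pr"  "bnn"
```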

Extract variables

Consider the following dataset:

df <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Kisha>-F_45", 
  "<Brandon>-N_33",
  "<Sharon>-F_38", 
  "<Penny>-F_58",
  "<Justin>-M_41", 
  "<Patricia>-F_84", 
)

Extract variables

df |> 
  separate_wider_regex(
    str,
    patterns = c(
      "<", 
      name = "[A-Za-z]+", 
      ">-", 
      gender = ".",
      "_",
      age = "[0-9]+"
    )
  )
# A tibble: 7 × 3
  name     gender age  
  <chr>    <chr>  <chr>
1 Sheryl   F      34   
2 Kisha    F      45   
3 Brandon  N      33   
4 Sharon   F      38   
5 Penny    F      58   
6 Justin   M      41   
7 Patricia F      84   
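Note that every extracted column, including age, is a character vector. If you need age as a number, one option is to convert it afterwards. The sketch below assumes a shortened two-row copy of the data and uses the hypothetical name `parsed` for the result.

```r
library(tidyverse)

# Two rows copied from the dataset above, to keep the sketch short
df <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Brandon>-N_33"
)

parsed <- df |>
  separate_wider_regex(
    str,
    patterns = c(
      "<",
      name = "[A-Za-z]+",
      ">-",
      gender = ".",
      "_",
      age = "[0-9]+"
    )
  ) |>
  # The regex guarantees age contains only digits, so the conversion is safe
  mutate(age = as.integer(age))

parsed
```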

Pattern details

Escaping

In order to match a literal ., you need an escape which tells the regular expression to match metacharacters literally.

Like strings, regexps use the backslash for escaping.

  • So, to match a literal ., you need the regexp \. (a backslash followed by a dot).

  • Unfortunately this creates a problem. We use strings to represent regular expressions, and \ is also used as an escape symbol in strings.

Escaping

So to create the regular expression \. we need the string "\\.", as the following example shows.

# To create the regular expression \., we need to use \\.
dot <- "\\."

# But the expression itself only contains one \
str_view(dot)
[1] │ \.
cat("\n") 
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
[2] │ <a.c>

Escaping and literal

If \ is used as an escape character in regular expressions, how do you match a literal \?

  • Well, you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means to match a literal \ you need to write "\\\\" — you need four backslashes to match one!

Escaping and literal

x <- "a\\b"
str_view(x)
[1] │ a\b
cat("\n") 
str_view(x, "\\\\")
[1] │ a<\>b
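If the doubled backslashes become hard to read, raw strings can help. This is a sketch that assumes R 4.0.0 or later, where r"(...)" is taken literally and backslashes inside it need no string-level escaping.

```r
library(tidyverse)

x <- "a\\b"

# The raw string r"(\\)" holds two backslashes, exactly like "\\\\"
same_pattern <- identical(r"(\\)", "\\\\")
same_pattern

# So the regular expression \\ can be written as it reads
str_view(x, r"(\\)")
found <- str_detect(x, r"(\\)")

# The same idea works for a literal dot: the regexp a\.c
dot_match <- str_detect(c("abc", "a.c", "bef"), r"(a\.c)")
dot_match
```

With a raw string you write the regular expression itself, so only the regex-level escape remains.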

Matching literals

If you’re trying to match a literal ., $, |, *, +, ?, {, }, (, or ), there’s an alternative to using a backslash escape: you can use a character class. [.], [$], [|], and so on all match the literal values.

str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
[2] │ <a.c>

Anchors

By default, regular expressions will match any part of a string. If you want to match at the start or end you need to anchor the regular expression using ^ to match the start or $ to match the end:

str_view(fruit, "^a")
[1] │ <a>pple
[2] │ <a>pricot
[3] │ <a>vocado

Anchors

str_view(fruit, "a$")
 [4] │ banan<a>
[15] │ cherimoy<a>
[30] │ feijo<a>
[36] │ guav<a>
[56] │ papay<a>
[74] │ satsum<a>

Anchors

To force a regular expression to match only the full string, anchor it with both ^ and $:

str_view(fruit, "apple")
 [1] │ <apple>
[62] │ pine<apple>
cat("\n") 
str_view(fruit, "^apple$")
[1] │ <apple>

Anchors

You can also match the boundary between words (i.e. the start or end of a word) with \b.

x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
[1] │ <sum>mary(x)
[2] │ <sum>marize(df)
[3] │ row<sum>(x)
[4] │ <sum>(x)
cat("\n") 
str_view(x, "\\bsum\\b")
[4] │ <sum>(x)

Anchors

When used alone, anchors produce a zero-width match:

str_view("abc", c("$", "^", "\\b"))
[1] │ abc<>
[2] │ <>abc
[3] │ <>abc<>

Anchors

str_replace_all("abc", c("$", "^", "\\b"), "--")
[1] "abc--"   "--abc"   "--abc--"

Character classes

A character class, or character set, allows you to match any character in a set. As we discussed above, you can construct your own sets with [], where [abc] matches “a”, “b”, or “c” and [^abc] matches any character except “a”, “b”, or “c”. Apart from ^ there are two other characters that have special meaning inside of []:

  • - defines a range, e.g., [a-z] matches any lower case letter and [0-9] matches any digit.
  • \ escapes special characters, so [\^\-\]] matches ^, -, or ].

Character classes example 1

x <- "abcd ABCD 12345 -!@#%."
str_view(x, "[abc]+")
[1] │ <abc>d ABCD 12345 -!@#%.
cat("\n") 
str_view(x, "[a-z]+")
[1] │ <abcd> ABCD 12345 -!@#%.
cat("\n") 
str_view(x, "[^a-z0-9]+")
[1] │ abcd< ABCD >12345< -!@#%.>

Character classes example 2

# You need an escape to match characters that are otherwise
# special inside of []
str_view("a-b-c", "[a-c]")
[1] │ <a>-<b>-<c>
cat("\n")
str_view("a-b-c", "[a\\-c]")
[1] │ <a><->b<-><c>

Character classes

Some character classes are used so commonly that they get their own shortcut. You’ve already seen ., which matches any character apart from a newline. There are three other particularly useful pairs:

  • \d matches any digit;
    \D matches anything that isn’t a digit.
  • \s matches any whitespace (e.g., space, tab, newline);
    \S matches anything that isn’t whitespace.
  • \w matches any “word” character, i.e. letters, numbers, and the underscore;
    \W matches any “non-word” character.

Character classes

The following code demonstrates the six shortcuts with a selection of letters, numbers, and punctuation characters.

x <- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+")
[1] │ abcd ABCD <12345> -!@#%.
str_view(x, "\\D+")
[1] │ <abcd ABCD >12345< -!@#%.>
str_view(x, "\\s+")
[1] │ abcd< >ABCD< >12345< >-!@#%.
str_view(x, "\\S+")
[1] │ <abcd> <ABCD> <12345> <-!@#%.>
str_view(x, "\\w+")
[1] │ <abcd> <ABCD> <12345> -!@#%.
str_view(x, "\\W+")
[1] │ abcd< >ABCD< >12345< -!@#%.>

Quantifiers

Quantifiers control how many times a pattern matches. Previously, we learned about ? (0 or 1 matches), + (1 or more matches), and * (0 or more matches). For example, colou?r will match American or British spelling, \d+ will match one or more digits, and \s? will optionally match a single item of whitespace. You can also specify the number of matches precisely with {}:

  • {n} matches exactly n times.
  • {n,} matches at least n times.
  • {n,m} matches between n and m times.
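The three {} forms can be compared side by side. The sketch below uses str_extract(), a stringr function not covered in this section, which returns the first match in each string (or NA when there is none); the input vector is made up for illustration.

```r
library(tidyverse)

x <- c("a", "aa", "aaa", "aaaa")

# Exactly two "a"s: the first match stops after two characters
exactly_two <- str_extract(x, "a{2}")
exactly_two
# [1] NA   "aa" "aa" "aa"

# At least two "a"s: quantifiers are greedy and take as many as they can
at_least_two <- str_extract(x, "a{2,}")
at_least_two
# [1] NA     "aa"   "aaa"  "aaaa"

# Between two and three "a"s
two_to_three <- str_extract(x, "a{2,3}")
two_to_three
# [1] NA    "aa"  "aaa" "aaa"
```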

Operator precedence and parentheses

  • Regular expressions have their own precedence rules.

  • Quantifiers have high precedence, so ab+ is equivalent to a(b+).

  • Alternation has low precedence, so ^a|b$ is equivalent to (^a)|(b$).

Just like with algebra, you can use parentheses to override the usual order. But unlike algebra you’re unlikely to remember the precedence rules for regexes, so feel free to use parentheses liberally.
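To see how much difference the parentheses make, here is a small sketch with made-up inputs. It uses str_detect() for the alternation case and str_extract(), which returns the first match, for the quantifier case.

```r
library(tidyverse)

x <- c("a", "apple", "cab", "kiwi")

# Without parentheses: starts with "a" OR ends with "b"
either_anchor <- str_detect(x, "^a|b$")
either_anchor
# [1]  TRUE  TRUE  TRUE FALSE

# With parentheses: the whole string is exactly "a" or exactly "b"
whole_string <- str_detect(x, "^(a|b)$")
whole_string
# [1]  TRUE FALSE FALSE FALSE

y <- "ababab"

# The quantifier applies only to "b": an "a" followed by one or more "b"s
one_pair <- str_extract(y, "ab+")
one_pair
# [1] "ab"

# The quantifier applies to the group: "ab" repeated one or more times
all_pairs <- str_extract(y, "(ab)+")
all_pairs
# [1] "ababab"
```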

Grouping and capturing

As well as overriding operator precedence, parentheses have another important effect: they create capturing groups that allow you to use sub-components of the match.

For example, the following pattern finds all fruits that have a repeated pair of letters:

str_view(fruit, "(..)\\1")
 [4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry

Grouping and capturing

And this one finds all words that start and end with the same pair of letters:

str_view(words, "^(..).*\\1$")
[152] │ <church>
[217] │ <decide>
[617] │ <photograph>
[699] │ <require>
[739] │ <sense>

Grouping and capturing

You can also use back references in str_replace(). For example, this code switches the order of the second and third words in sentences:

sentences |> 
  str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |> 
  str_view()
 [1] │ The canoe birch slid on the smooth planks.
 [2] │ Glue sheet the to the dark blue background.
 [3] │ It's to easy tell the depth of a well.
 [4] │ These a days chicken leg is a rare dish.
 [5] │ Rice often is served in round bowls.
 [6] │ The of juice lemons makes fine punch.
 [7] │ The was box thrown beside the parked truck.
 [8] │ The were hogs fed chopped corn and garbage.
 [9] │ Four of hours steady work faced us.
[10] │ A size large in stockings is hard to sell.
[11] │ The was boy there when the sun rose.
[12] │ A is rod used to catch pink salmon.
[13] │ The of source the huge river is the clear spring.
[14] │ Kick ball the straight and follow through.
[15] │ Help woman the get back to her feet.
[16] │ A of pot tea helps to pass the evening.
[17] │ Smoky lack fires flame and heat.
[18] │ The cushion soft broke the man's fall.
[19] │ The breeze salt came across from the sea.
[20] │ The at girl the booth sold fifty bonds.
... and 700 more