Lecture 13
2025-04-20
When a regular expression is supplied as its second argument, `str_view()` will show only the elements of the string vector that match, surrounding each match with `<>` and, where possible, highlighting the match in blue.
Letters and numbers match exactly and are called literal characters.
Most punctuation characters, like `.`, `+`, `*`, `[`, `]`, and `?`, have special meanings and are called metacharacters.
For example, `.` will match any character, so `"a."` will match any string that contains an “a” followed by another character.
For example, we could find all the fruits that contain an “a”, followed by three letters, followed by an “e”.
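A minimal sketch of that search, assuming the stringr package is installed and using its bundled `fruit` vector. The pattern `"a...e"` is our reading of the description; note that `.` matches any character, not only letters.

```r
library(stringr)

# "a", then any three characters, then "e"
pattern <- "a...e"

# Show the matching fruits with the match wrapped in <>
str_view(fruit, pattern)

# Keep only the matching elements as a character vector
a_three_e <- str_subset(fruit, pattern)
```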
Quantifiers control how many times a pattern can match:
`?` makes a pattern optional (i.e. it matches 0 or 1 times).
`+` lets a pattern repeat (i.e. it matches at least once).
`*` lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
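The three quantifiers can be compared side by side on a small input. This is an illustrative sketch: the vector `x` is made up for the example, and `str_extract()` is used so the matched text is returned as a value.

```r
library(stringr)

x <- c("a", "ab", "abb")

# "a" optionally followed by one "b"
opt <- str_extract(x, "ab?")    # "a"  "ab" "ab"

# "a" followed by at least one "b"
plus <- str_extract(x, "ab+")   # NA   "ab" "abb"

# "a" followed by any number of "b"s, including none
star <- str_extract(x, "ab*")   # "a"  "ab" "abb"
```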
Character classes are defined by `[]` and let you match a set of characters: `[abcd]` matches “a”, “b”, “c”, or “d”.
You can also invert the match by starting with `^`: `[^abcd]` matches anything except “a”, “b”, “c”, or “d”.
We can use this idea to find the words containing an “x” surrounded by vowels:
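One way to write this, assuming stringr's bundled `words` vector is the intended data and treating only a, e, i, o, u as vowels:

```r
library(stringr)

# A vowel, then "x", then another vowel
str_view(words, "[aeiou]x[aeiou]")

# The same matches as a plain character vector
x_between_vowels <- str_subset(words, "[aeiou]x[aeiou]")
```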
We can use this idea to find “y” surrounded by consonants:
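A sketch using the inverted class, again assuming the `words` vector from stringr. Note that `[^aeiou]` matches any character that is not a lowercase vowel, which here means consonants because `words` contains mostly lowercase letters.

```r
library(stringr)

# A non-vowel, then "y", then another non-vowel
str_view(words, "[^aeiou]y[^aeiou]")

# The same matches as a plain character vector
y_between_consonants <- str_subset(words, "[^aeiou]y[^aeiou]")
```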
You can use alternation, `|`, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “melon”, or “nut”.
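A minimal sketch of the alternation described above, assuming the `fruit` vector from stringr:

```r
library(stringr)

# Match any fruit containing "apple", "melon", or "nut"
str_view(fruit, "apple|melon|nut")

# The same matches as a plain character vector
alt_matches <- str_subset(fruit, "apple|melon|nut")
```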
`str_detect()` returns a logical vector that is `TRUE` if the pattern matches an element of the character vector and `FALSE` otherwise.
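For illustration, here is `str_detect()` on a small made-up vector. Because the result is logical, it can also be summed to count the matching elements.

```r
library(stringr)

x <- c("a", "b", "c")

# TRUE where the element contains a lowercase vowel
has_vowel <- str_detect(x, "[aeiou]")   # TRUE FALSE FALSE

# Logical vectors sum to the number of TRUE values
n_with_vowel <- sum(has_vowel)          # 1
```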
The next step up in complexity from `str_detect()` is `str_count()`: rather than a true or false, it tells you how many matches there are in each string.
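A short sketch on a made-up vector, counting how often the letter "p" appears in each string:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# Number of "p" matches per element
p_counts <- str_count(x, "p")   # 2 0 1
```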
We can modify values with `str_replace()` and `str_replace_all()`.
`str_replace()` replaces the first match; `str_replace_all()` replaces all matches.
`str_remove()` and `str_remove_all()` are handy shortcuts for `str_replace(x, pattern, "")` and `str_replace_all(x, pattern, "")`, respectively.
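The difference between the four functions is easiest to see on one input. The vector `x` below is made up for the example.

```r
library(stringr)

x <- c("apple", "pear", "banana")

# Replace only the first vowel in each string
first_only <- str_replace(x, "[aeiou]", "-")      # "-pple"  "p-ar"  "b-nana"

# Replace every vowel
all_vowels <- str_replace_all(x, "[aeiou]", "-")  # "-ppl-"  "p--r"  "b-n-n-"

# Remove the first vowel, or all vowels
removed_first <- str_remove(x, "[aeiou]")         # "pple"   "par"   "bnana"
removed_all <- str_remove_all(x, "[aeiou]")       # "ppl"    "pr"    "bnn"
```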
Consider the following dataset
In order to match a literal `.`, you need an escape, which tells the regular expression to match metacharacters literally.
Like strings, regexps use the backslash for escaping.
So, to match a `.`, you need the regexp `\.`.
Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings.
So to create the regular expression `\.` we need the string `"\\."`, as the following example shows.
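A small sketch of the double escaping, with a made-up test vector. `writeLines()` prints the string's actual contents, which is what the regex engine sees.

```r
library(stringr)

# The string "\\." holds two characters: a backslash and a dot
dot <- "\\."
writeLines(dot)     # prints \.

# As a regex, \. matches only a literal dot
x <- c("abc", "a.c", "bef")
literal_dot <- str_detect(x, "a\\.c")   # FALSE TRUE FALSE
str_view(x, "a\\.c")
```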
If `\` is used as an escape character in regular expressions, how do you match a literal `\`? You need to escape it, which gives the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"`: you need four backslashes to match one!
If you’re trying to match a literal `.`, `$`, `|`, `*`, `+`, `?`, `{`, `}`, `(`, `)`, there’s an alternative to using a backslash escape: you can use a character class. `[.]`, `[$]`, `[|]`, … all match the literal values.
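Both ideas can be checked on small made-up inputs: four backslashes in the R string to match one literal backslash, and a character class as an alternative way to match a literal metacharacter.

```r
library(stringr)

# The string "a\\b" contains a single backslash between "a" and "b"
x <- "a\\b"
writeLines(x)                                 # prints a\b
has_backslash <- str_detect(x, "\\\\")        # TRUE

# Character classes as an alternative to backslash escapes
y <- c("abc", "a.c", "a*c", "a c")
dot_in_class <- str_detect(y, "a[.]c")        # FALSE TRUE FALSE FALSE
star_in_class <- str_detect(y, ".[*]c")       # FALSE FALSE TRUE FALSE
```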
By default, regular expressions will match any part of a string. If you want to match at the start or end you need to anchor the regular expression using `^` to match the start or `$` to match the end:
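A sketch of both anchors, assuming the `fruit` vector from stringr is the intended data:

```r
library(stringr)

# Fruits that start with "a"
str_view(fruit, "^a")
starts_with_a <- str_subset(fruit, "^a")

# Fruits that end with "a"
str_view(fruit, "a$")
ends_with_a <- str_subset(fruit, "a$")
```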
To force a regular expression to match only the full string, anchor it with both `^` and `$`:
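To see the effect of anchoring both ends, compare an unanchored and a fully anchored pattern. This sketch assumes the `fruit` vector from stringr.

```r
library(stringr)

# Unanchored: matches "apple" anywhere in the string
contains_apple <- str_subset(fruit, "apple")

# Anchored at both ends: the whole string must be "apple"
exactly_apple <- str_subset(fruit, "^apple$")
```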
You can also match the boundary between words (i.e. the start or end of a word) with `\b`.
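Word boundaries are handy when a short word also appears inside longer ones. The vector below is a made-up example of searching for calls to `sum()` without also matching `summary()`, `summarize()`, or `rowsum()`.

```r
library(stringr)

x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")

# Unanchored: every element contains "sum"
any_sum <- str_subset(x, "sum")

# With word boundaries on both sides: only the standalone word "sum"
whole_word_sum <- str_subset(x, "\\bsum\\b")   # "sum(x)"
```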
When used alone, anchors will produce a zero-width match.
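Zero-width matches are easiest to see by replacing them: nothing is consumed, so the replacement is inserted at the matched position. A minimal sketch on the made-up string "abc":

```r
library(stringr)

# str_view() marks each zero-width match with an empty <>
str_view("abc", c("$", "^", "\\b"))

# Replacing a zero-width match inserts text without removing anything
inserted <- str_replace_all("abc", c("$", "^", "\\b"), "--")
# "abc--"   "--abc"   "--abc--"
```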
A character class, or character set, allows you to match any character in a set. As we discussed above, you can construct your own sets with `[]`, where `[abc]` matches “a”, “b”, or “c” and `[^abc]` matches any character except “a”, “b”, or “c”. Apart from `^` there are two other characters that have special meaning inside of `[]`:
`-` defines a range, e.g., `[a-z]` matches any lower case letter and `[0-9]` matches any digit.
`\` escapes special characters, so `[\^\-\]]` matches `^`, `-`, or `]`.
Some character classes are used so commonly that they get their own shortcut. You’ve already seen `.`, which matches any character apart from a newline. There are three other particularly useful pairs:
`\d` matches any digit; `\D` matches anything that isn’t a digit.
`\s` matches any whitespace (e.g., space, tab, newline); `\S` matches anything that isn’t whitespace.
`\w` matches any “word” character, i.e. letters and numbers; `\W` matches any “non-word” character.
The following code demonstrates the six shortcuts with a selection of letters, numbers, and punctuation characters.
[1] │ abcd ABCD <12345> -!@#%.
[1] │ <abcd ABCD >12345< -!@#%.>
[1] │ abcd< >ABCD< >12345< >-!@#%.
[1] │ <abcd> <ABCD> <12345> <-!@#%.>
[1] │ <abcd> <ABCD> <12345> -!@#%.
[1] │ abcd< >ABCD< >12345< -!@#%.>
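The six lines of output above are consistent with the calls below. The input string `x` is reconstructed from that output, so treat it as an assumption; each shortcut is combined with `+` so that runs of matching characters are grouped into one match.

```r
library(stringr)

# Assumed input: letters, digits, and punctuation separated by spaces
x <- "abcd ABCD 12345 -!@#%."

str_view(x, "\\d+")   # runs of digits
str_view(x, "\\D+")   # runs of non-digits
str_view(x, "\\s+")   # runs of whitespace
str_view(x, "\\S+")   # runs of non-whitespace
str_view(x, "\\w+")   # runs of word characters
str_view(x, "\\W+")   # runs of non-word characters

# The same matches as values, for two of the shortcuts
digits <- str_extract(x, "\\d+")
word_runs <- str_extract_all(x, "\\w+")[[1]]
```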
Quantifiers control how many times a pattern matches. Previously, we learned about `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches). For example, `colou?r` will match American or British spelling, `\d+` will match one or more digits, and `\s?` will optionally match a single item of whitespace. You can also specify the number of matches precisely with `{}`:
`{n}` matches exactly n times.
`{n,}` matches at least n times.
`{n,m}` matches between n and m times.
Regular expressions have their own precedence rules: quantifiers have high precedence and alternation has low precedence, which means that `ab+` is equivalent to `a(b+)`, and `^a|b$` is equivalent to `(^a)|(b$)`.
Just like with algebra, you can use parentheses to override the usual order. But unlike algebra you’re unlikely to remember the precedence rules for regexes, so feel free to use parentheses liberally.
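The counted quantifiers and the precedence rules can both be checked on small inputs. The vectors below are made up for illustration; parentheses are used to show how grouping changes what a quantifier or an alternation applies to.

```r
library(stringr)

# Counted quantifiers: {n,m} is greedy and takes as many as allowed
nums <- c("1", "123", "12345")
two_to_three <- str_extract(nums, "\\d{2,3}")   # NA "123" "123"
four_or_more <- str_extract(nums, "\\d{4,}")    # NA NA "12345"

# Quantifiers bind tightly: ab+ repeats only the "b"
x <- c("ab", "abb", "abab")
b_repeats <- str_extract(x, "ab+")              # "ab" "abb" "ab"
ab_repeats <- str_extract(x, "(ab)+")           # "ab" "ab"  "abab"

# Alternation binds loosely: ^a|b$ means "starts with a" OR "ends with b"
y <- c("apple", "crab", "banana")
either_anchor <- str_detect(y, "^a|b$")         # TRUE TRUE FALSE
single_letter <- str_detect(y, "^(a|b)$")       # FALSE FALSE FALSE
```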
As well as overriding operator precedence, parentheses have another important effect: they create capturing groups that allow you to use sub-components of the match.
For example, the following pattern finds all fruits that have a repeated pair of letters:
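A sketch of such a pattern, assuming the `fruit` vector from stringr. `(..)` captures any two characters and the back reference `\1` requires the same two characters to follow immediately.

```r
library(stringr)

# Two characters, immediately followed by the same two characters
str_view(fruit, "(..)\\1")

# The same matches as a plain character vector
repeated_pair <- str_subset(fruit, "(..)\\1")
```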
And this one finds all words that start and end with the same pair of letters:
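One way to express this, assuming the `words` vector from stringr. The first two characters are captured at the start and the back reference `\1` must appear again right before the end of the string.

```r
library(stringr)

# Capture the first two characters, allow anything in between,
# then require the same two characters at the end
str_view(words, "^(..).*\\1$")

# The same matches as a plain character vector
same_ends <- str_subset(words, "^(..).*\\1$")
```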
You can also use back references in `str_replace()`. For example, this code switches the order of the second and third words in `sentences`:
[1] │ The canoe birch slid on the smooth planks.
[2] │ Glue sheet the to the dark blue background.
[3] │ It's to easy tell the depth of a well.
[4] │ These a days chicken leg is a rare dish.
[5] │ Rice often is served in round bowls.
[6] │ The of juice lemons makes fine punch.
[7] │ The was box thrown beside the parked truck.
[8] │ The were hogs fed chopped corn and garbage.
[9] │ Four of hours steady work faced us.
[10] │ A size large in stockings is hard to sell.
[11] │ The was boy there when the sun rose.
[12] │ A is rod used to catch pink salmon.
[13] │ The of source the huge river is the clear spring.
[14] │ Kick ball the straight and follow through.
[15] │ Help woman the get back to her feet.
[16] │ A of pot tea helps to pass the evening.
[17] │ Smoky lack fires flame and heat.
[18] │ The cushion soft broke the man's fall.
[19] │ The breeze salt came across from the sea.
[20] │ The at girl the booth sold fifty bonds.
... and 700 more
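The output above is consistent with the following call on stringr's bundled `sentences` vector. The exact pattern is an assumption reconstructed from the description: three capturing groups for the first three words, reordered in the replacement with `\1 \3 \2`.

```r
library(stringr)

# Capture the first three space-separated words and swap the second and third
swapped <- str_replace(sentences, "(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2")

# Display the result the same way as the output above
str_view(swapped)
```

Because `str_replace()` only touches the first match, the rest of each sentence is left unchanged.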