Jorge Cimentada
Imagine you are playing a treasure hunt game and you are given a clue that says “the treasure is hidden in a sentence with the word ‘treasure’ in it.” You would use regex to help you quickly find the sentence with the word “treasure” in it, like a treasure hunter using a metal detector to find gold.
[1] "I like hourglasses, maps, [DATE] century typography."
Not very handy because you know the word you’re extracting 😅
Matches any letter, number or any type of character.
borges <- "I like hourglasses, maps, eighteenth Eighteenth century typography."
str_view_all(borges, ".ighteenth")
Also spaces..😥
The .
is often used with +
which means that the character .
needs to be repeated one or more times. This example is a great motivation:
What you want is:
What if you wanted to actually match .
?
borges <- "I like hourglasses. I also like maps, eighteenth century typography"
str_view_all(borges, "I like .+")
Good but matches the entire phrase. How do we match until the .
?
|
) operatorYou’ll often want to match either one regexp or another. Suppose we want to extract the century from the vector below.
|
) operatorUse |
to separate regex. Here we use it to say: replace either map,
(contains the space) or century
(contains a space):
Anchors are ^
for beginning and $
for ending. If you start with power, you end up with money.
borges_long <- c(
"I like cars. I like hourglasses, maps, eighteenth century typography",
"I like computers. I like hourglasses, maps, eighteenth century typography"
)
str_view_all(borges_long, "I like .+")
Both phrases have “I like” at the beginning but also have “I like” after the end of the first sentence. What if we want to match only the first sentence?
Instead, we use anchors:
🤩
\\d
\\d
matches digits
\\d
\\d
matches single digits
\\d
Instead we want:
[]
Shortcut to extend the regexp |
(or) and make it more flexible:
# A tibble: 41 × 6
Country Men Women Year Notes References
<chr> <chr> <chr> <chr> <chr> <chr>
1 Albania 65 61 2020 "" [1]
2 Austria 65 60 2018 "In … [1][3]
3 Belarus 62.5 57.5 2021 "By … [5]
4 Belgium 65 65 2018 "In … [5]
5 Bosnia and Herzegovina 65 65 2011 "" [1]
6 Bulgaria 64 (and 4 months) 61 (and 8 mo… 2021 "In … [2]
7 Croatia 65 62 2018 "By … [2]
8 Cyprus 65 65 2018 "" [1][3]
9 Czech Republic 63 (and 4 months) 58 (and 8 mo… 2018 "In … [7]
10 Denmark 67 67 2022 "In … [3][5]
# … with 31 more rows
[]
See output at here.
[]
Note that []
works the same way for anything: numbers, letters, punctuation, spaces, etc..
[^]
[^abc]
: matches anything except a, b, or c.All ages which are below 65 will be matched.
scrapex
contains a copy of the Wikipedia page “History of France”:
Let’s load the HTML:
ul
tags are very common and subsetting only for ul
tags will bring many matches. XPath is the topic of next chapter but bear with me.
//ul
: bring all ul
tags from the document
[.//a]
: subset all a
tags that are below all ul
tags (notice the .
)
[contains(@title, "House of Valois")]
: where this a
tag needs to have a title
attribute that contains “House of Valois”.
Let’s pick the second slot and redo the XPath
all_txt <- history_france %>%
xml_find_all("//ul[.//a[contains(@title, 'House of Valois')]][2]") %>%
xml_text()
all_txt
[1] "House of Valois\nLouis XI the Prudent, 1461–83\nCharles VIII the Affable, 1483–98\nLouis XII, 1498–1515\nFrancis I, 1515–47\nHenry II, 1547–59\nFrancis II, 1559–60\nCharles IX, 1560–74 (1560–63 under regency of Catherine de' Medici)\nHenry III, 1574–89\nHouse of Bourbon\nHenry IV the Great, 1589–1610\nthe Regency of Marie de Medici, 1610–17\nLouis XIII the Just and his minister Cardinal Richelieu, 1610–43\nthe Regency of Anne of Austria and her minister Cardinal Mazarin, 1643–51\nLouis XIV the Sun King and his minister Jean-Baptiste Colbert, 1643–1715\nthe Régence, a period of regency under Philip II of Orléans, 1715–23\nLouis XV the Beloved and his minister Cardinal André-Hercule de Fleury, 1715–74\nLouis XVI, 1774–92"
This is where regex comes in. Let’s use our ninja skills 🦾!
[1] "House of Valois"
[2] "Louis XI the Prudent, 1461–83"
[3] "Charles VIII the Affable, 1483–98"
[4] "Louis XII, 1498–1515"
[5] "Francis I, 1515–47"
[6] "Henry II, 1547–59"
[7] "Francis II, 1559–60"
[8] "Charles IX, 1560–74 (1560–63 under regency of Catherine de' Medici)"
[9] "Henry III, 1574–89"
[10] "House of Bourbon"
[11] "Henry IV the Great, 1589–1610"
[12] "the Regency of Marie de Medici, 1610–17"
[13] "Louis XIII the Just and his minister Cardinal Richelieu, 1610–43"
[14] "the Regency of Anne of Austria and her minister Cardinal Mazarin, 1643–51"
[15] "Louis XIV the Sun King and his minister Jean-Baptiste Colbert, 1643–1715"
[16] "the Régence, a period of regency under Philip II of Orléans, 1715–23"
[17] "Louis XV the Beloved and his minister Cardinal André-Hercule de Fleury, 1715–74"
[18] "Louis XVI, 1774–92"
We have 16 kings/queens but there are two slots that separated them by houses: House of Valois and House of Bourbon.
[1] "Louis XI the Prudent, 1461–83"
[2] "Charles VIII the Affable, 1483–98"
[3] "Louis XII, 1498–1515"
[4] "Francis I, 1515–47"
[5] "Henry II, 1547–59"
[6] "Francis II, 1559–60"
[7] "Charles IX, 1560–74 (1560–63 under regency of Catherine de' Medici)"
[8] "Henry III, 1574–89"
[9] "Henry IV the Great, 1589–1610"
[10] "the Regency of Marie de Medici, 1610–17"
[11] "Louis XIII the Just and his minister Cardinal Richelieu, 1610–43"
[12] "the Regency of Anne of Austria and her minister Cardinal Mazarin, 1643–51"
[13] "Louis XIV the Sun King and his minister Jean-Baptiste Colbert, 1643–1715"
[14] "the Régence, a period of regency under Philip II of Orléans, 1715–23"
[15] "Louis XV the Beloved and his minister Cardinal André-Hercule de Fleury, 1715–74"
[16] "Louis XVI, 1774–92"
There we go, 16 names and their corresponding period.
We want to extract:
name of monarch
years in power
Problem is:
These are just clarifications so we should just remove.
Strategy: (1) escape parenthesis and (2) match anything inside the parenthesis
[1] "Louis XI the Prudent, 1461–83"
[2] "Charles VIII the Affable, 1483–98"
[3] "Louis XII, 1498–1515"
[4] "Francis I, 1515–47"
[5] "Henry II, 1547–59"
[6] "Francis II, 1559–60"
[7] "Charles IX, 1560–74 "
[8] "Henry III, 1574–89"
[9] "Henry IV the Great, 1589–1610"
[10] "the Regency of Marie de Medici, 1610–17"
[11] "Louis XIII the Just and his minister Cardinal Richelieu, 1610–43"
[12] "the Regency of Anne of Austria and her minister Cardinal Mazarin, 1643–51"
[13] "Louis XIV the Sun King and his minister Jean-Baptiste Colbert, 1643–1715"
[14] "the Régence, a period of regency under Philip II of Orléans, 1715–23"
[15] "Louis XV the Beloved and his minister Cardinal André-Hercule de Fleury, 1715–74"
[16] "Louis XVI, 1774–92"
Extract years in power:
[[1]]
[1] "1461" "83"
[[2]]
[1] "1483" "98"
[[3]]
[1] "1498" "1515"
[[4]]
[1] "1515" "47"
[[5]]
[1] "1547" "59"
[[6]]
[1] "1559" "60"
[[7]]
[1] "1560" "74"
[[8]]
[1] "1574" "89"
[[9]]
[1] "1589" "1610"
[[10]]
[1] "1610" "17"
[[11]]
[1] "1610" "43"
[[12]]
[1] "1643" "51"
[[13]]
[1] "1643" "1715"
[[14]]
[1] "1715" "23"
[[15]]
[1] "1715" "74"
[[16]]
[1] "1774" "92"
Problem: two or four digit year.
convert_time_period <- function(x) {
start_year <- x[1]
end_year <- x[2]
# If end year has only 2 digits
if (nchar(end_year) == 2) {
# Extract the first two years from the start year
end_year_prefix <- str_sub(start_year, 1, 2)
# Paste together the correct year for the end year
end_year <- paste0(end_year_prefix, end_year)
}
# Replace correct end year
x[2] <- end_year
as.numeric(x)
}
sequence_kings <- lapply(res, convert_time_period)
[[1]]
[1] 1461 1483
[[2]]
[1] 1483 1498
[[3]]
[1] 1498 1515
[[4]]
[1] 1515 1547
[[5]]
[1] 1547 1559
[[6]]
[1] 1559 1560
[[7]]
[1] 1560 1574
[[8]]
[1] 1574 1589
[[9]]
[1] 1589 1610
[[10]]
[1] 1610 1617
[[11]]
[1] 1610 1643
[[12]]
[1] 1643 1651
[[13]]
[1] 1643 1715
[[14]]
[1] 1715 1723
[[15]]
[1] 1715 1774
[[16]]
[1] 1774 1792
Extract name of monarch
[1] "Louis XI the Prudent,"
[2] "Charles VIII the Affable,"
[3] "Louis XII,"
[4] "Francis I,"
[5] "Henry II,"
[6] "Francis II,"
[7] "Charles IX,"
[8] "Henry III,"
[9] "Henry IV the Great,"
[10] "the Regency of Marie de Medici,"
[11] "Louis XIII the Just and his minister Cardinal Richelieu,"
[12] "the Regency of Anne of Austria and her minister Cardinal Mazarin,"
[13] "Louis XIV the Sun King and his minister Jean-Baptiste Colbert,"
[14] "the Régence, a period of regency under Philip II of Orléans,"
[15] "Louis XV the Beloved and his minister Cardinal André-Hercule de Fleury,"
[16] "Louis XVI,"
We just have to replace the comma with an empty space to make it cleaner:
[1] "Louis XI the Prudent"
[2] "Charles VIII the Affable"
[3] "Louis XII"
[4] "Francis I"
[5] "Henry II"
[6] "Francis II"
[7] "Charles IX"
[8] "Henry III"
[9] "Henry IV the Great"
[10] "the Regency of Marie de Medici"
[11] "Louis XIII the Just and his minister Cardinal Richelieu"
[12] "the Regency of Anne of Austria and her minister Cardinal Mazarin"
[13] "Louis XIV the Sun King and his minister Jean-Baptiste Colbert"
[14] "the Régence a period of regency under Philip II of Orléans"
[15] "Louis XV the Beloved and his minister Cardinal André-Hercule de Fleury"
[16] "Louis XVI"
Combine everything into tidy data frame
# Combine into data frames
sequence_kings_df <- lapply(sequence_kings, function(x) data.frame(start = x[1], end = x[2]))
final_kings <- do.call(rbind, sequence_kings_df)
# Add king names
final_kings$event <- names_kings
final_kings$start <- make_date(final_kings$start, 1, 1)
final_kings$end <- make_date(final_kings$end, 1, 1)
# Final data frame
final_kings <- as_tibble(final_kings)
final_kings
# A tibble: 16 × 3
start end event
<date> <date> <chr>
1 1461-01-01 1483-01-01 Louis XI the Prudent
2 1483-01-01 1498-01-01 Charles VIII the Affable
3 1498-01-01 1515-01-01 Louis XII
4 1515-01-01 1547-01-01 Francis I
5 1547-01-01 1559-01-01 Henry II
6 1559-01-01 1560-01-01 Francis II
7 1560-01-01 1574-01-01 Charles IX
8 1574-01-01 1589-01-01 Henry III
9 1589-01-01 1610-01-01 Henry IV the Great
10 1610-01-01 1617-01-01 the Regency of Marie de Medici
11 1610-01-01 1643-01-01 Louis XIII the Just and his minister Cardinal Richelieu
12 1643-01-01 1651-01-01 the Regency of Anne of Austria and her minister Cardin…
13 1643-01-01 1715-01-01 Louis XIV the Sun King and his minister Jean-Baptiste …
14 1715-01-01 1723-01-01 the Régence a period of regency under Philip II of Orl…
15 1715-01-01 1774-01-01 Louis XV the Beloved and his minister Cardinal André-H…
16 1774-01-01 1792-01-01 Louis XVI