I was wondering the realms of StackOver Flow answering some questions when I encoutered a question that looked to extract some parts of a string based on a regex. I thought I knew how to do this with the package stringr
using, for example, str_sub
but I found it a bit difficult to map how str_locate
complements str_sub
.
str_locate
and str_locate_all
give back the locations of your regex inside the desired string as a matrix
or a list
respectively. However, that didn’t look very intuitive to pass on to str_sub
which (I thought) only accepted numeric vectors with the indices of the parts of the strings that you want to extract. However, to my surprise, str_sub
accepts not only numeric vectors but also a matrix with two columns, precisely the result of str_locate
.
Let’s create a set of random strings from which we want to extract the word special*word
, where *
represents a random number.
library(stringr)
test_string <-
replicate(
100,
paste0(
sample(c(letters, LETTERS, paste0("special", sample(1:10, 1),"word")), 15),
collapse = "")
)
head(test_string)
## [1] "pZTQHcDVObnaCFS" "qBxfbIHjauyEmgspecial10word"
## [3] "TKgbmQAEFoJHOVh" "VoBdUAuzfPrmCGX"
## [5] "dGgJOspecial5wordiFpbvXzUD" "WOfLjNospecial4wordEeGkyTA"
Using str_locate
returns a matrix with the positions of all matches for every string. This is what’s called vectorised functions in R.
location_matrix <-
str_locate(test_string, pattern = "special[0-9]word")
head(location_matrix)
## start end
## [1,] NA NA
## [2,] NA NA
## [3,] NA NA
## [4,] NA NA
## [5,] 6 17
## [6,] 8 19
For this example this wouldn’t work, but I was also interested in checking how the result of str_locate_all
would fit in this workflow. str_locate_all
is the same as str_locate
but since it can find more than one match per string, it returns a list with the same slots as there are strings in test_string
with a matrix per slot showing the indices of the matches. Since many of the strings in test_string
might not have special*word
, we need to fill out those matches with NA
:
location_list <-
str_locate_all(test_string, pattern = "special[0-9]word") %>%
lapply(function(.x) if (all(is.na(.x))) matrix(c(NA, NA), ncol = 2) else .x) %>%
{do.call(rbind, .)}
head(location_list)
## start end
## [1,] NA NA
## [2,] NA NA
## [3,] NA NA
## [4,] NA NA
## [5,] 6 17
## [6,] 8 19
Now that we have everything ready, str_sub
can give our desires results using both numeric vectors as well as the entire matrix:
# Using numeric vectors from str_locate
str_sub(test_string, location_matrix[, 1], location_matrix[, 2])
## [1] NA NA NA NA "special5word"
## [6] "special4word" NA NA "special5word" NA
## [11] NA NA NA NA NA
## [16] NA NA NA NA NA
## [21] NA NA NA "special5word" "special6word"
## [26] NA NA NA NA NA
## [31] "special4word" NA NA NA NA
## [36] NA NA NA "special7word" NA
## [41] NA NA NA NA NA
## [46] NA NA NA NA NA
## [51] NA NA NA NA NA
## [56] NA NA NA NA NA
## [61] NA NA "special4word" NA NA
## [66] NA NA NA NA NA
## [71] NA NA NA "special7word" "special9word"
## [76] NA NA NA NA NA
## [81] "special4word" NA NA "special5word" NA
## [86] NA NA NA "special9word" "special9word"
## [91] NA NA NA NA NA
## [96] "special6word" NA NA "special3word" "special1word"
# Using numeric vectors from str_locate_all
str_sub(test_string, location_list[, 1], location_list[, 2])
## [1] NA NA NA NA "special5word"
## [6] "special4word" NA NA "special5word" NA
## [11] NA NA NA NA NA
## [16] NA NA NA NA NA
## [21] NA NA NA "special5word" "special6word"
## [26] NA NA NA NA NA
## [31] "special4word" NA NA NA NA
## [36] NA NA NA "special7word" NA
## [41] NA NA NA NA NA
## [46] NA NA NA NA NA
## [51] NA NA NA NA NA
## [56] NA NA NA NA NA
## [61] NA NA "special4word" NA NA
## [66] NA NA NA NA NA
## [71] NA NA NA "special7word" "special9word"
## [76] NA NA NA NA NA
## [81] "special4word" NA NA "special5word" NA
## [86] NA NA NA "special9word" "special9word"
## [91] NA NA NA NA NA
## [96] "special6word" NA NA "special3word" "special1word"
# Using the entire matrix
str_sub(test_string, location_matrix)
## [1] NA NA NA NA "special5word"
## [6] "special4word" NA NA "special5word" NA
## [11] NA NA NA NA NA
## [16] NA NA NA NA NA
## [21] NA NA NA "special5word" "special6word"
## [26] NA NA NA NA NA
## [31] "special4word" NA NA NA NA
## [36] NA NA NA "special7word" NA
## [41] NA NA NA NA NA
## [46] NA NA NA NA NA
## [51] NA NA NA NA NA
## [56] NA NA NA NA NA
## [61] NA NA "special4word" NA NA
## [66] NA NA NA NA NA
## [71] NA NA NA "special7word" "special9word"
## [76] NA NA NA NA NA
## [81] "special4word" NA NA "special5word" NA
## [86] NA NA NA "special9word" "special9word"
## [91] NA NA NA NA NA
## [96] "special6word" NA NA "special3word" "special1word"
A much easier approach to doing the above (which is cumbersome and verbose) is to use str_extract
:
str_extract(test_string, "special[0-9]word")
## [1] NA NA NA NA "special5word"
## [6] "special4word" NA NA "special5word" NA
## [11] NA NA NA NA NA
## [16] NA NA NA NA NA
## [21] NA NA NA "special5word" "special6word"
## [26] NA NA NA NA NA
## [31] "special4word" NA NA NA NA
## [36] NA NA NA "special7word" NA
## [41] NA NA NA NA NA
## [46] NA NA NA NA NA
## [51] NA NA NA NA NA
## [56] NA NA NA NA NA
## [61] NA NA "special4word" NA NA
## [66] NA NA NA NA NA
## [71] NA NA NA "special7word" "special9word"
## [76] NA NA NA NA NA
## [81] "special4word" NA NA "special5word" NA
## [86] NA NA NA "special9word" "special9word"
## [91] NA NA NA NA NA
## [96] "special6word" NA NA "special3word" "special1word"
However, the whole objecive behind this exercise was to clearly map out how to connect str_locate
to str_sub
and it’s much clearer if you can pass the entire matrix. However, converting str_locate_all
is still a bit tricky.