Locating parts of a string with `stringr`

PUBLISHED ON DEC 8, 2019 — R

I was wandering the realms of Stack Overflow answering some questions when I encountered one that asked how to extract parts of a string based on a regex. I thought I knew how to do this with the stringr package using, for example, str_sub, but I found it a bit difficult to map how str_locate complements str_sub.

str_locate and str_locate_all return the locations of your regex matches inside each string, as a matrix and a list of matrices respectively. That didn’t look very intuitive to pass on to str_sub, which (I thought) only accepted numeric vectors with the start and end indices of the part of each string you want to extract. To my surprise, str_sub also accepts a two-column matrix of start and end positions, which is precisely what str_locate returns.
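As a minimal sketch of those return shapes, here is a toy vector (hypothetical, not the data used in the rest of the post) run through both functions and then through str_sub:

```r
library(stringr)

# Hypothetical toy input, chosen so the three strings have 1, 0 and 2 matches
x <- c("abc123", "no digits", "45 and 678")

# str_locate(): one row per string, describing the first match only
first_match <- str_locate(x, "[0-9]+")
is.matrix(first_match)
dim(first_match)

# str_locate_all(): one matrix per string, one row per match
all_matches <- str_locate_all(x, "[0-9]+")
is.list(all_matches)
vapply(all_matches, nrow, integer(1))

# str_sub() takes the two-column matrix from str_locate() directly
str_sub(x, first_match)
## [1] "123" NA    "45"
```

The second string has no digits, so its row in first_match is NA and str_sub returns NA for it.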

Let’s create a set of random strings from which we want to extract the word special*word, where * represents a random integer from 1 to 10.

library(stringr)    

test_string <-
  replicate(
    100,
    paste0(
      sample(c(letters, LETTERS, paste0("special", sample(1:10, 1),"word")), 15),
      collapse = "")
  )

head(test_string)
## [1] "pZTQHcDVObnaCFS"             "qBxfbIHjauyEmgspecial10word"
## [3] "TKgbmQAEFoJHOVh"             "VoBdUAuzfPrmCGX"            
## [5] "dGgJOspecial5wordiFpbvXzUD"  "WOfLjNospecial4wordEeGkyTA"

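str_locate returns a matrix with one row per string holding the start and end positions of the first match in that string, or NA where there is no match. It works on the whole vector at once, which is what R calls a vectorised function.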

location_matrix <-
  str_locate(test_string, pattern =  "special[0-9]word")

head(location_matrix)
##      start end
## [1,]    NA  NA
## [2,]    NA  NA
## [3,]    NA  NA
## [4,]    NA  NA
## [5,]     6  17
## [6,]     8  19

str_locate is enough for this example, since each string contains at most one match, but I was also interested in how the result of str_locate_all would fit into this workflow. str_locate_all works like str_locate but, since it can find more than one match per string, it returns a list with one element per string in test_string, each holding a matrix with one row per match. Strings without special*word produce a matrix with zero rows, so we need to replace those with a row of NA before binding everything into a single matrix:

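# A string with no match yields a zero-row matrix; swap it for a single NA row
# so that every string contributes exactly one row to the final matrix
location_list <-
  str_locate_all(test_string, pattern =  "special[0-9]word") %>% 
  lapply(function(.x) if (nrow(.x) == 0) matrix(NA_integer_, nrow = 1, ncol = 2) else .x) %>%
  {do.call(rbind, .)}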

head(location_list)
##      start end
## [1,]    NA  NA
## [2,]    NA  NA
## [3,]    NA  NA
## [4,]    NA  NA
## [5,]     6  17
## [6,]     8  19
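Binding the list into one matrix only lines up with test_string because no string here has more than one match. If a string matched twice it would contribute two rows and the matrix would no longer have one row per string. A hedged sketch of an alternative for that case, with a hypothetical input, is to keep the list and call str_sub once per string:

```r
library(stringr)

# Hypothetical input: the first string contains two matches, the second none
x <- c("special1word-and-special2word", "nothing here", "xxspecial3word")
pattern <- "special[0-9]word"

locs <- str_locate_all(x, pattern)

# One str_sub() call per string, each using that string's own location matrix
pieces <- lapply(seq_along(x), function(i) str_sub(x[i], locs[[i]]))
pieces

# Number of extracted pieces per string
lengths(pieces)
## [1] 2 0 1
```

The result is a list rather than a vector, which is the same shape that str_extract_all returns.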

Now that we have everything ready, str_sub can give us the desired results using either numeric vectors or the entire matrix:

# Using numeric vectors from str_locate
str_sub(test_string, location_matrix[, 1], location_matrix[, 2])
##   [1] NA             NA             NA             NA             "special5word"
##   [6] "special4word" NA             NA             "special5word" NA            
##  [11] NA             NA             NA             NA             NA            
##  [16] NA             NA             NA             NA             NA            
##  [21] NA             NA             NA             "special5word" "special6word"
##  [26] NA             NA             NA             NA             NA            
##  [31] "special4word" NA             NA             NA             NA            
##  [36] NA             NA             NA             "special7word" NA            
##  [41] NA             NA             NA             NA             NA            
##  [46] NA             NA             NA             NA             NA            
##  [51] NA             NA             NA             NA             NA            
##  [56] NA             NA             NA             NA             NA            
##  [61] NA             NA             "special4word" NA             NA            
##  [66] NA             NA             NA             NA             NA            
##  [71] NA             NA             NA             "special7word" "special9word"
##  [76] NA             NA             NA             NA             NA            
##  [81] "special4word" NA             NA             "special5word" NA            
##  [86] NA             NA             NA             "special9word" "special9word"
##  [91] NA             NA             NA             NA             NA            
##  [96] "special6word" NA             NA             "special3word" "special1word"
# Using numeric vectors from str_locate_all
str_sub(test_string, location_list[, 1], location_list[, 2])
##   [1] NA             NA             NA             NA             "special5word"
##   [6] "special4word" NA             NA             "special5word" NA            
##  [11] NA             NA             NA             NA             NA            
##  [16] NA             NA             NA             NA             NA            
##  [21] NA             NA             NA             "special5word" "special6word"
##  [26] NA             NA             NA             NA             NA            
##  [31] "special4word" NA             NA             NA             NA            
##  [36] NA             NA             NA             "special7word" NA            
##  [41] NA             NA             NA             NA             NA            
##  [46] NA             NA             NA             NA             NA            
##  [51] NA             NA             NA             NA             NA            
##  [56] NA             NA             NA             NA             NA            
##  [61] NA             NA             "special4word" NA             NA            
##  [66] NA             NA             NA             NA             NA            
##  [71] NA             NA             NA             "special7word" "special9word"
##  [76] NA             NA             NA             NA             NA            
##  [81] "special4word" NA             NA             "special5word" NA            
##  [86] NA             NA             NA             "special9word" "special9word"
##  [91] NA             NA             NA             NA             NA            
##  [96] "special6word" NA             NA             "special3word" "special1word"
# Using the entire matrix
str_sub(test_string, location_matrix)
##   [1] NA             NA             NA             NA             "special5word"
##   [6] "special4word" NA             NA             "special5word" NA            
##  [11] NA             NA             NA             NA             NA            
##  [16] NA             NA             NA             NA             NA            
##  [21] NA             NA             NA             "special5word" "special6word"
##  [26] NA             NA             NA             NA             NA            
##  [31] "special4word" NA             NA             NA             NA            
##  [36] NA             NA             NA             "special7word" NA            
##  [41] NA             NA             NA             NA             NA            
##  [46] NA             NA             NA             NA             NA            
##  [51] NA             NA             NA             NA             NA            
##  [56] NA             NA             NA             NA             NA            
##  [61] NA             NA             "special4word" NA             NA            
##  [66] NA             NA             NA             NA             NA            
##  [71] NA             NA             NA             "special7word" "special9word"
##  [76] NA             NA             NA             NA             NA            
##  [81] "special4word" NA             NA             "special5word" NA            
##  [86] NA             NA             NA             "special9word" "special9word"
##  [91] NA             NA             NA             NA             NA            
##  [96] "special6word" NA             NA             "special3word" "special1word"

A much easier approach than the above (which is cumbersome and verbose) is to use str_extract:

str_extract(test_string, "special[0-9]word")
##   [1] NA             NA             NA             NA             "special5word"
##   [6] "special4word" NA             NA             "special5word" NA            
##  [11] NA             NA             NA             NA             NA            
##  [16] NA             NA             NA             NA             NA            
##  [21] NA             NA             NA             "special5word" "special6word"
##  [26] NA             NA             NA             NA             NA            
##  [31] "special4word" NA             NA             NA             NA            
##  [36] NA             NA             NA             "special7word" NA            
##  [41] NA             NA             NA             NA             NA            
##  [46] NA             NA             NA             NA             NA            
##  [51] NA             NA             NA             NA             NA            
##  [56] NA             NA             NA             NA             NA            
##  [61] NA             NA             "special4word" NA             NA            
##  [66] NA             NA             NA             NA             NA            
##  [71] NA             NA             NA             "special7word" "special9word"
##  [76] NA             NA             NA             NA             NA            
##  [81] "special4word" NA             NA             "special5word" NA            
##  [86] NA             NA             NA             "special9word" "special9word"
##  [91] NA             NA             NA             NA             NA            
##  [96] "special6word" NA             NA             "special3word" "special1word"
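As a quick sanity check, the two routes can be compared directly. The sketch below uses two strings copied from head(test_string). It also shows one thing that locations allow and str_extract does not: widening the window around the match. The two-character padding is an arbitrary choice for illustration.

```r
library(stringr)

# Two strings copied from the head(test_string) output above
x <- c("dGgJOspecial5wordiFpbvXzUD", "TKgbmQAEFoJHOVh")
pattern <- "special[0-9]word"

loc <- str_locate(x, pattern)

via_locate  <- str_sub(x, loc)
via_extract <- str_extract(x, pattern)

# Both routes give the same answer, NA included
identical(via_locate, via_extract)
## [1] TRUE

# With the positions in hand, the window can be widened by two characters
# on each side; NA positions stay NA
with_context <- str_sub(x, loc[, "start"] - 2, loc[, "end"] + 2)
with_context
## [1] "JOspecial5wordiF" NA
```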

However, the whole objective of this exercise was to map out clearly how to connect str_locate to str_sub, and that is much clearer when you can pass the entire matrix. Converting the result of str_locate_all is still a bit tricky.

TAGS: R, REGEX