Web Scraping

Jorge Cimentada

A primer on webscraping

Primer on webscraping

  • Skip the usual ‘basics’ right into coding

  • Webscraping is the subtle art of being a ninja

  • The plot above shows the election results for all political parties in Spain since 1978, all completely scraped from Wikipedia.

  • Chapter 2 from book

Scrapex

  1. Webscraping tutorials are doomed to change


  2. API examples are doomed to change


  3. Predictability is important

scrapex is an R package with completely self-standing web scraping/API examples for eternity

Getting website data into R

library(scrapex)
library(rvest)
library(httr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

link <- history_elections_spain_ex()
link
[1] "/home/runner/.local/share/renv/cache/v5/R-4.2/x86_64-pc-linux-gnu/scrapex/0.0.1.9999/9aa78cbb3eccaaa811756b4bf07fdd1a/scrapex/extdata/history_elections_spain//Elections_Spain.html"

Getting website data into R

browseURL(prep_browser(link))

Getting website data into R

Ethical web scraping is identifying yourself

set_config(
  user_agent("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0; Jorge Cimentada / cimentadaj@gmail.com")
)

Where can you find it?

Getting website data into R

html_website <- link %>% read_html()
html_website
{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...


  • Difficult to read – raw HTML code

  • Often too long to actually explore it

  • This output is mostly to confirm it was read and the header of the HTML file

Getting website data into R

What we actually want:

Getting website data into R

  • The output from read_html() is difficult to understand

  • Goal is to find actual election data on the website

  • rvest package has html_table() function that can extract tables from website and import into R.


all_tables <-
  html_website %>%
  html_table()

Getting website data into R

  • Grabs all tables

  • We only pick the one we’re interested in.

elections_data <- all_tables[[5]]
elections_data
# A tibble: 16 × 18
   Election  `UCD[a]`     PSOE `PP[b]` `IU[c]` `CDC[d]`   PNV `ERC[e]` `BNG[f]`
   <chr>     <chr>       <dbl>   <dbl> <chr>      <dbl> <dbl>    <dbl>    <dbl>
 1 Election  ""           NA      NA   ""          NA    NA       NA       NA  
 2 1977      "34.4"       29.3     8.3 "9.3"        2.8   1.7      0.8      0.1
 3 1979      "34.8"       30.4     6.1 "10.8"       1.7   1.6      0.7      0.3
 4 1982      "6.8"        48.1    26.4 "4.0"        3.7   1.9      0.7      0.2
 5 1986      "Dissolved"  44.1    26   "4.6"        5     1.5      0.4      0.1
 6 1989      "Dissolved"  39.6    25.8 "9.1"        5     1.2      0.4      0.2
 7 1993      "Dissolved"  38.8    34.8 "9.6"        4.9   1.2      0.8      0.5
 8 1996      "Dissolved"  37.6    38.8 "10.5"       4.6   1.3      0.7      0.9
 9 2000      "Dissolved"  34.2    44.5 "5.4"        4.2   1.5      0.8      1.3
10 2004      "Dissolved"  42.6    37.7 "5.0"        3.2   1.6      2.5      0.8
11 2008      "Dissolved"  43.9    39.9 "3.8"        3     1.2      1.2      0.8
12 2011      "Dissolved"  28.8    44.6 "6.9"        4.2   1.3      1.1      0.8
13 2015      "Dissolved"  22      28.7 "3.7"        2.2   1.2      2.4      0.3
14 2016      "Dissolved"  22.6    33   "[k]"        2     1.2      2.6      0.2
15 Apr. 2019 "Dissolved"  28.7    16.7 "[l]"        1.9   1.5      3.9      0.4
16 Nov. 2019 "Dissolved"  28      20.8 "[l]"        2.2   1.6      3.6      0.5
# … with 9 more variables: `EHB[g]` <chr>, `CDS[h]` <chr>, `CC[i]` <dbl>,
#   UPyD <chr>, Cs <chr>, Com. <chr>, `Pod.[j]` <dbl>, Vox <dbl>, MP <dbl>

Data cleaning

  1. The first row is empty
  2. The “Election” column is a character column and contains string values
  3. Columns contain values which are not numbers
  4. Column names also have footnote values

The process of web scraping requires knowledge of regular expressions and data cleaning techniques.

Data cleaning

elections_data %>% select_if(is.character)
# A tibble: 16 × 8
   Election  `UCD[a]`    `IU[c]` `EHB[g]`    `CDS[h]`    UPyD        Cs    Com. 
   <chr>     <chr>       <chr>   <chr>       <chr>       <chr>       <chr> <chr>
 1 Election  ""          ""      ""          ""          ""          ""    ""   
 2 1977      "34.4"      "9.3"   "0.2"       ""          ""          ""    ""   
 3 1979      "34.8"      "10.8"  "1.0"       ""          ""          ""    ""   
 4 1982      "6.8"       "4.0"   "1.0"       "2.9"       ""          ""    ""   
 5 1986      "Dissolved" "4.6"   "1.1"       "9.2"       ""          ""    ""   
 6 1989      "Dissolved" "9.1"   "1.1"       "7.9"       ""          ""    ""   
 7 1993      "Dissolved" "9.6"   "0.9"       "1.8"       ""          ""    ""   
 8 1996      "Dissolved" "10.5"  "0.7"       "0.2"       ""          ""    ""   
 9 2000      "Dissolved" "5.4"   "Boycotted" "0.1"       ""          ""    ""   
10 2004      "Dissolved" "5.0"   "Banned"    "0.1"       ""          ""    ""   
11 2008      "Dissolved" "3.8"   "Banned"    "0.0"       "1.2"       "0.2" ""   
12 2011      "Dissolved" "6.9"   "1.4"       "Dissolved" "4.7"       "Did… "0.5"
13 2015      "Dissolved" "3.7"   "0.9"       "Dissolved" "0.6"       "13.… "[k]"
14 2016      "Dissolved" "[k]"   "0.8"       "Dissolved" "0.2"       "13.… "[k]"
15 Apr. 2019 "Dissolved" "[l]"   "1.0"       "Dissolved" "Did not r… "15.… "0.7"
16 Nov. 2019 "Dissolved" "[l]"   "1.2"       "Dissolved" "[m]"       "6.8" "[n]"

Data cleaning

  • Build regular expression to replace all unwanted parts of the string
wrong_labels <- c(
  "Dissolved",
  "[k]",
  "[l]",
  "[m]",
  "n",
  "Banned",
  "Boycotted",
  "Did not run"
)

wrong_labels <- paste0(wrong_labels, collapse = "|")
wrong_labels
[1] "Dissolved|[k]|[l]|[m]|n|Banned|Boycotted|Did not run"

Data cleaning

  • Replace the regex using str_replace_all
semi_cleaned_data <-
  elections_data %>%
  mutate_if(
    is.character,
    ~ str_replace_all(string = .x, pattern = wrong_labels, replacement = NA_character_)
  )

semi_cleaned_data %>% select_if(is.character)
# A tibble: 16 × 8
   Election  `UCD[a]` `IU[c]` `EHB[g]` `CDS[h]` UPyD  Cs     Com. 
   <chr>     <chr>    <chr>   <chr>    <chr>    <chr> <chr>  <chr>
 1 <NA>      ""       ""      ""       ""       ""    ""     ""   
 2 1977      "34.4"   "9.3"   "0.2"    ""       ""    ""     ""   
 3 1979      "34.8"   "10.8"  "1.0"    ""       ""    ""     ""   
 4 1982      "6.8"    "4.0"   "1.0"    "2.9"    ""    ""     ""   
 5 1986       <NA>    "4.6"   "1.1"    "9.2"    ""    ""     ""   
 6 1989       <NA>    "9.1"   "1.1"    "7.9"    ""    ""     ""   
 7 1993       <NA>    "9.6"   "0.9"    "1.8"    ""    ""     ""   
 8 1996       <NA>    "10.5"  "0.7"    "0.2"    ""    ""     ""   
 9 2000       <NA>    "5.4"    <NA>    "0.1"    ""    ""     ""   
10 2004       <NA>    "5.0"    <NA>    "0.1"    ""    ""     ""   
11 2008       <NA>    "3.8"    <NA>    "0.0"    "1.2" "0.2"  ""   
12 2011       <NA>    "6.9"   "1.4"     <NA>    "4.7"  <NA>  "0.5"
13 2015       <NA>    "3.7"   "0.9"     <NA>    "0.6" "13.9"  <NA>
14 2016       <NA>     <NA>   "0.8"     <NA>    "0.2" "13.1"  <NA>
15 Apr. 2019  <NA>     <NA>   "1.0"     <NA>     <NA> "15.9" "0.7"
16 Nov. 2019  <NA>     <NA>   "1.2"     <NA>     <NA> "6.8"   <NA>

Data cleaning

  • Replace extra patterns of months

  • We replace again here because we want to replace with "" instead of NA

semi_cleaned_data <-
  semi_cleaned_data %>%
  mutate(
    Election = str_replace_all(string = Election, pattern = "Apr. |Nov. ", replacement = "")
  )

semi_cleaned_data %>% select_if(is.character)
# A tibble: 16 × 8
   Election `UCD[a]` `IU[c]` `EHB[g]` `CDS[h]` UPyD  Cs     Com. 
   <chr>    <chr>    <chr>   <chr>    <chr>    <chr> <chr>  <chr>
 1 <NA>     ""       ""      ""       ""       ""    ""     ""   
 2 1977     "34.4"   "9.3"   "0.2"    ""       ""    ""     ""   
 3 1979     "34.8"   "10.8"  "1.0"    ""       ""    ""     ""   
 4 1982     "6.8"    "4.0"   "1.0"    "2.9"    ""    ""     ""   
 5 1986      <NA>    "4.6"   "1.1"    "9.2"    ""    ""     ""   
 6 1989      <NA>    "9.1"   "1.1"    "7.9"    ""    ""     ""   
 7 1993      <NA>    "9.6"   "0.9"    "1.8"    ""    ""     ""   
 8 1996      <NA>    "10.5"  "0.7"    "0.2"    ""    ""     ""   
 9 2000      <NA>    "5.4"    <NA>    "0.1"    ""    ""     ""   
10 2004      <NA>    "5.0"    <NA>    "0.1"    ""    ""     ""   
11 2008      <NA>    "3.8"    <NA>    "0.0"    "1.2" "0.2"  ""   
12 2011      <NA>    "6.9"   "1.4"     <NA>    "4.7"  <NA>  "0.5"
13 2015      <NA>    "3.7"   "0.9"     <NA>    "0.6" "13.9"  <NA>
14 2016      <NA>     <NA>   "0.8"     <NA>    "0.2" "13.1"  <NA>
15 2019      <NA>     <NA>   "1.0"     <NA>     <NA> "15.9" "0.7"
16 2019      <NA>     <NA>   "1.2"     <NA>     <NA> "6.8"   <NA>

Data cleaning

  • Remove first row
semi_cleaned_data <-
  semi_cleaned_data %>%
  mutate_all(as.numeric) %>%
  filter(!is.na(Election))

semi_cleaned_data
# A tibble: 15 × 18
   Election `UCD[a]`  PSOE `PP[b]` `IU[c]` `CDC[d]`   PNV `ERC[e]` `BNG[f]`
      <dbl>    <dbl> <dbl>   <dbl>   <dbl>    <dbl> <dbl>    <dbl>    <dbl>
 1     1977     34.4  29.3     8.3     9.3      2.8   1.7      0.8      0.1
 2     1979     34.8  30.4     6.1    10.8      1.7   1.6      0.7      0.3
 3     1982      6.8  48.1    26.4     4        3.7   1.9      0.7      0.2
 4     1986     NA    44.1    26       4.6      5     1.5      0.4      0.1
 5     1989     NA    39.6    25.8     9.1      5     1.2      0.4      0.2
 6     1993     NA    38.8    34.8     9.6      4.9   1.2      0.8      0.5
 7     1996     NA    37.6    38.8    10.5      4.6   1.3      0.7      0.9
 8     2000     NA    34.2    44.5     5.4      4.2   1.5      0.8      1.3
 9     2004     NA    42.6    37.7     5        3.2   1.6      2.5      0.8
10     2008     NA    43.9    39.9     3.8      3     1.2      1.2      0.8
11     2011     NA    28.8    44.6     6.9      4.2   1.3      1.1      0.8
12     2015     NA    22      28.7     3.7      2.2   1.2      2.4      0.3
13     2016     NA    22.6    33      NA        2     1.2      2.6      0.2
14     2019     NA    28.7    16.7    NA        1.9   1.5      3.9      0.4
15     2019     NA    28      20.8    NA        2.2   1.6      3.6      0.5
# … with 9 more variables: `EHB[g]` <dbl>, `CDS[h]` <dbl>, `CC[i]` <dbl>,
#   UPyD <dbl>, Cs <dbl>, Com. <dbl>, `Pod.[j]` <dbl>, Vox <dbl>, MP <dbl>

Data cleaning

  • Clean up column names
semi_cleaned_data <-
  semi_cleaned_data %>%
  rename_all(~ str_replace_all(.x, "\\[.+\\]", ""))

semi_cleaned_data
# A tibble: 15 × 18
   Election   UCD  PSOE    PP    IU   CDC   PNV   ERC   BNG   EHB   CDS    CC
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1977  34.4  29.3   8.3   9.3   2.8   1.7   0.8   0.1   0.2  NA    NA  
 2     1979  34.8  30.4   6.1  10.8   1.7   1.6   0.7   0.3   1    NA    NA  
 3     1982   6.8  48.1  26.4   4     3.7   1.9   0.7   0.2   1     2.9  NA  
 4     1986  NA    44.1  26     4.6   5     1.5   0.4   0.1   1.1   9.2   0.3
 5     1989  NA    39.6  25.8   9.1   5     1.2   0.4   0.2   1.1   7.9   0.3
 6     1993  NA    38.8  34.8   9.6   4.9   1.2   0.8   0.5   0.9   1.8   0.9
 7     1996  NA    37.6  38.8  10.5   4.6   1.3   0.7   0.9   0.7   0.2   0.9
 8     2000  NA    34.2  44.5   5.4   4.2   1.5   0.8   1.3  NA     0.1   1.1
 9     2004  NA    42.6  37.7   5     3.2   1.6   2.5   0.8  NA     0.1   0.9
10     2008  NA    43.9  39.9   3.8   3     1.2   1.2   0.8  NA     0     0.7
11     2011  NA    28.8  44.6   6.9   4.2   1.3   1.1   0.8   1.4  NA     0.6
12     2015  NA    22    28.7   3.7   2.2   1.2   2.4   0.3   0.9  NA     0.3
13     2016  NA    22.6  33    NA     2     1.2   2.6   0.2   0.8  NA     0.3
14     2019  NA    28.7  16.7  NA     1.9   1.5   3.9   0.4   1    NA     0.5
15     2019  NA    28    20.8  NA     2.2   1.6   3.6   0.5   1.2  NA     0.5
# … with 6 more variables: UPyD <dbl>, Cs <dbl>, Com. <dbl>, Pod. <dbl>,
#   Vox <dbl>, MP <dbl>

Data cleaning

# Pivot from wide to long to plot it in ggplot
cleaned_data <-
  semi_cleaned_data %>%
  pivot_longer(-Election, names_to = "parties")

# Plot it
cleaned_data %>%
  ggplot(aes(Election, value, color = parties)) +
  geom_line() +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  scale_color_viridis_d() +
  theme_minimal()

Data cleaning