RSelenium and scraping Catalan educational data

PUBLISHED ON MAR 22, 2018

Yesterday I found a public dataset on schools in Barcelona and their performance on 6th-grade tests. I wanted to scrape it to investigate the relationship between performance and whether schools receive special government funds for social integration. The dataset is available here, but the site is different from the kind I usually scrape (HTML or XML). Although the page contains some HTML, the widget that swipes through the schools is driven by JavaScript. Well, that’s a job for RSelenium, an R package that lets you drive a web browser from R.

Using Docker, the process was actually much easier than I thought. I followed the answer in this post on setting up Docker. Note that this is for Windows 10.

  • install docker
  • run it, restart computer as requested
  • pull image by running in command line: docker pull selenium/standalone-firefox (or chrome instead of firefox) or in R shell('docker pull selenium/standalone-firefox')
  • start server by running in command line: docker run -d -p 4445:4444 selenium/standalone-firefox or in R shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
  • Then run remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox"). The docs suggest something different with a virtual machine, but I couldn’t get it to work. Depending on your setup, you may need to replace "localhost" with the IP address that your Docker server provides.

I used chrome for all of the above and got this working just fine in no time!
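
The setup steps above could be collected into a single R helper. This is only a sketch under a few assumptions: `docker_run_args()` and `start_selenium()` are hypothetical names (they are not part of RSelenium), the five-second pause is a guess at how long the container needs to boot, and `system2()` stands in for `shell()` so the code is not tied to Windows.

```r
# Build the arguments for `docker run`, mapping a host port to the container's port 4444.
docker_run_args <- function(image, host_port = 4445L) {
  c("run", "-d", "-p", paste0(host_port, ":4444"), image)
}

# Hypothetical helper: pull the image, start the container, and return a driver object.
start_selenium <- function(browser = "firefox", host_port = 4445L) {
  image <- paste0("selenium/standalone-", browser)
  system2("docker", c("pull", image))
  system2("docker", docker_run_args(image, host_port))
  Sys.sleep(5)  # assumed grace period while the Selenium server starts
  RSelenium::remoteDriver(remoteServerAddr = "localhost",
                          port = host_port,
                          browserName = browser)
}
```

With Docker running, `remDr <- start_selenium("chrome")` followed by `remDr$open()` would then mirror the manual steps.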

With that set up, I scraped the data without much hassle.

    1. Load packages and create an empty data frame to fill out (I looked at the website to get the columns)
library(RSelenium)
library(xml2)
library(tidyverse)

the_df <-
  as_tibble(set_names(rerun(4, character()),
                      c("school_name", "complexity", "social_fund", "score_6th")))
    2. Open the website with RSelenium
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100",
                      port = 4445L,
                      browserName = "chrome")

remDr$open()
remDr$navigate("https://view-awesome-table.com/-L4lo3r-JA2iaWk1puUT/view")

At this point you can use remDr$screenshot(display = TRUE) to display a screenshot of the page you’re currently on.
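
If you would rather keep the screenshot on disk and also confirm where the browser ended up, something like the following might do. It is a rough sketch: `snapshot_page()` is a hypothetical helper, and "page.png" is just a placeholder file name; it relies only on the driver's `screenshot()`, `getTitle()` and `getCurrentUrl()` methods.

```r
# Hypothetical helper: save a screenshot to a file and report the current page.
snapshot_page <- function(driver, file = "page.png") {
  driver$screenshot(file = file)           # write the image instead of displaying it
  list(title = driver$getTitle()[[1]],     # page title as a character string
       url   = driver$getCurrentUrl()[[1]],
       file  = file)
}
```

With the session opened above, `snapshot_page(remDr)` would return the title and URL alongside the path of the saved image.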

    3. Define a function that clicks once on the swipe arrow on the right, scrapes the table, and turns it into a tibble
navigate_click <- function() {
  # Locate the paging control that swipes to the next school
  webElem <- remDr$findElement(using = "class name",
                               "google-visualization-table-div-page")

  # Brief pause before clicking
  Sys.sleep(0.5)
  webElem$clickElement()

  # Parse the rendered page and turn the four table cells into a one-row tibble
  remDr$getPageSource()[[1]] %>%
    read_xml() %>%
    xml_ns_strip() %>%
    xml_find_all(xpath = '//td') %>%
    xml_text() %>%
    set_names(c("school_name", "complexity", "social_fund", "score_6th")) %>%
    as.list() %>% as_tibble()
}
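
To see what the parsing half of that function does without a browser, here is a small offline sketch. The page source is a made-up, minimal stand-in for what Selenium returns (I am assuming the real page declares a default XHTML namespace, which would explain why `xml_ns_strip()` is needed before `//td` matches anything); the cell values are copied from one school in the data.

```r
library(xml2)
library(tibble)

cols <- c("school_name", "complexity", "social_fund", "score_6th")

# Made-up page source: a default XHTML namespace and a single table row.
page_source <- paste0(
  '<html xmlns="http://www.w3.org/1999/xhtml"><body><table><tr>',
  '<td>Escuela Castella</td><td>Muy alta</td><td>25%</td><td>Mediano-bajo</td>',
  '</tr></table></body></html>'
)

doc <- read_xml(page_source)
xml_ns_strip(doc)  # drop the default namespace so the unprefixed XPath below can match

# Extract the text of every cell, name the values, and build a one-row tibble.
cells <- xml_text(xml_find_all(doc, "//td"))
row <- as_tibble(as.list(setNames(cells, cols)))
row
```

Naming the vector before converting it to a list is what makes each cell its own column, so the tibble has one row and four character columns.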
    4. Run that function 160 times (the number of schools in the data) and bind all of the resulting tibbles together
complete_df <-
  map(1:160, ~ navigate_click()) %>%
  bind_rows()
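
The loop above worked for me in one go, but if a click ever fails halfway through, the whole map() call is lost. One possible safeguard, sketched here with a stand-in function rather than the real scraper, is purrr's `possibly()`. `fake_click()` is hypothetical and fails on every third call only to imitate a page that has not rendered yet.

```r
library(purrr)
library(dplyr)

# Stand-in for navigate_click(): errors on every third call.
calls <- 0
fake_click <- function() {
  calls <<- calls + 1
  if (calls %% 3 == 0) stop("table not rendered yet")
  tibble(school_name = paste("placeholder", calls))
}

# possibly() returns NULL instead of raising the error.
safe_click <- possibly(fake_click, otherwise = NULL)

partial_df <-
  map(1:6, ~ safe_click()) %>%
  compact() %>%   # drop the NULLs left by failed calls
  bind_rows()

nrow(partial_df)
```

Swapping `fake_click` for `navigate_click` would keep whatever was scraped successfully, although the failed pages would still need another pass.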

Aaaaandddd, we got our nicely formatted dataset ready for some analysis.

complete_df
## # A tibble: 160 x 4
##    school_name                   complexity   social_fund score_6th   
##    <chr>                         <chr>        <chr>       <chr>       
##  1 Escuela Collaso i Gil         Muy alta     52%         Bajo        
##  2 Escuela Ruben Darío           Muy alta     66%         Bajo        
##  3 Escuela Castella              Muy alta     25%         Mediano-bajo
##  4 Escuela Drassanes             Muy alta     41%         Bajo        
##  5 Escuela Milà i Fontanals      Muy alta     49%         Bajo        
##  6 Escuela Baixeras              Mediana-alta 24%         Bajo        
##  7 Escuela Cervantes             Mediana-alta 38%         Mediano-alto
##  8 Escuela Parc de la Ciutadella Mediana-baja 15%         Mediano-bajo
##  9 Escuela Pere Vila             Alta         30%         Mediano-alto
## 10 Escuela Alexandre Galí        Alta         27%         Bajo        
## # ... with 150 more rows
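
Every column comes back as character, so the percentages need converting before any analysis. A minimal sketch of that step, using three rows retyped from the output above (`sample_df` and `clean_df` are names I made up for the illustration):

```r
library(dplyr)

# Three rows copied by hand from the printed tibble.
sample_df <- tibble(
  school_name = c("Escuela Collaso i Gil", "Escuela Castella", "Escuela Drassanes"),
  complexity  = c("Muy alta", "Muy alta", "Muy alta"),
  social_fund = c("52%", "25%", "41%"),
  score_6th   = c("Bajo", "Mediano-bajo", "Bajo")
)

# Strip the percent sign and convert to numeric.
clean_df <- sample_df %>%
  mutate(social_fund = as.numeric(sub("%", "", social_fund, fixed = TRUE)))

# Average percentage within each score category.
clean_df %>%
  group_by(score_6th) %>%
  summarise(mean_social_fund = mean(social_fund))
```

The same mutate() call should apply unchanged to complete_df, assuming every value in social_fund follows the "NN%" pattern seen here.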

PS: If they ever remove that dataset from the website, this post might stop working, but at least there’s a record of how to use Docker with RSelenium.
