RSelenium and scraping Catalan educational data


Yesterday I found a public dataset on schools in Barcelona and their performance on 6th-grade tests. I wanted to scrape it to investigate the relationship between performance and schools that receive special government funds for social integration. I found this dataset here, but it was different from the kinds of websites I usually scrape (HTML or XML). Although the website has some HTML, the widget that swipes through the schools is actually driven by JavaScript. Well, that’s a job for RSelenium, an R package that lets you drive a web browser from R.

The process was actually much easier than I thought, thanks to Docker. I followed the answer on setting up Docker from this post. Note that this is for Windows 10.

  • install docker
  • run it, restart computer as requested
  • pull image by running in command line: docker pull selenium/standalone-firefox (or chrome instead of firefox) or in R shell('docker pull selenium/standalone-firefox')
  • start server by running in command line: docker run -d -p 4445:4444 selenium/standalone-firefox or in R shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
  • Then run remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox"). The docs suggest something different with a virtual machine, but I couldn’t get it to work. Replace "localhost" with the IP that your Docker server provides.

I used chrome for all of the above and got this working just fine in no time!
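
The Docker steps above can also be assembled from R. Here is a minimal sketch, assuming Docker is installed and on the PATH. `docker_cmds()` is a hypothetical helper (it is not part of RSelenium) that only builds the two command strings so you can inspect them before running anything.

```r
# Hypothetical helper: build the docker commands for a given browser image.
# host_port is the port on your machine; 4444 is the port the Selenium
# container listens on.
docker_cmds <- function(browser = "chrome", host_port = 4445L) {
  image <- paste0("selenium/standalone-", browser)
  c(
    pull = paste("docker pull", image),
    run  = paste0("docker run -d -p ", host_port, ":4444 ", image)
  )
}

cmds <- docker_cmds("chrome")
print(cmds)

# Uncomment to actually run them (shell() is Windows-only; use system() elsewhere):
# shell(cmds[["pull"]])
# shell(cmds[["run"]])
```

Keeping the commands as plain strings makes it easy to switch between the chrome and firefox images without retyping the port mapping.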

Now that we got that down, I scraped the data with not much hassle.

    1. Load packages and create empty data frame to fill out (I looked at the website to get the columns)

library(tidyverse)
library(RSelenium)
library(xml2)
the_df <-
  as_tibble(set_names(rerun(4, character()),
                      c("school_name", "complexity", "social_fund", "score_6th")))
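
`rerun()` is deprecated as of purrr 1.0.0, so on a recent tidyverse the snippet above emits a deprecation warning. A minimal alternative, assuming only the tibble package, declares the four empty character columns directly:

```r
library(tibble)

# Empty tibble with the four columns seen on the website, all character
the_df <- tibble(
  school_name = character(),
  complexity  = character(),
  social_fund = character(),
  score_6th   = character()
)

the_df
```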

    2. Open the website with RSelenium

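remDr <- remoteDriver(remoteServerAddr = "",  # the IP your Docker server provides
                      port = 4445L,
                      browserName = "chrome")
remDr$open()
# remDr$navigate("...")  # address of the dataset page
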
At this point you can use remDr$screenshot(display = TRUE) to print a screenshot of the website that you’re at.

    3. Define a function that clicks once on the right-hand swipe arrow, scrapes the table and turns it into a tibble

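navigate_click <- function() {
  # Click the right-hand arrow to move to the next school
  # ("..." stands for the arrow's class name on the page)
  webElem <- remDr$findElement(using = "class name", value = "...")
  webElem$clickElement()

  # Parse the page source and turn the four table cells into a one-row tibble
  remDr$getPageSource()[[1]] %>% 
    read_xml() %>%
    xml_ns_strip() %>%
    xml_find_all(xpath = '//td') %>%
    xml_text() %>%
    set_names(c("school_name", "complexity", "social_fund", "score_6th")) %>%
    as.list() %>% as_tibble()
}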
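
To see what the parsing half of this function does without a running browser, here is a small sketch on a made-up page source. The XML string below is invented for illustration (the real page's markup may well differ); it only shows how four `<td>` cells become a one-row tibble.

```r
library(xml2)
library(tibble)

# Made-up stand-in for remDr$getPageSource()[[1]]; the default namespace
# mimics XHTML, which is why xml_ns_strip() is needed before the XPath query.
page_source <- '<html xmlns="http://www.w3.org/1999/xhtml"><body><table><tr>
  <td>Escuela Example</td><td>Alta</td><td>30%</td><td>Bajo</td>
</tr></table></body></html>'

doc <- read_xml(page_source)
xml_ns_strip(doc)                       # modifies doc in place
cells <- xml_text(xml_find_all(doc, "//td"))

# Name the four cell values and turn them into a one-row tibble
names(cells) <- c("school_name", "complexity", "social_fund", "score_6th")
one_school <- as_tibble(as.list(cells))

one_school
```

Without the `xml_ns_strip()` call, the plain `//td` query would find nothing here, because every element lives in the document's default namespace.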

    4. Run that function 160 times (the number of schools in the data) and bind the resulting tibbles together

complete_df <-
  map(1:160, ~ navigate_click()) %>%
  bind_rows()
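
The same map-then-bind pattern can be tried offline. In this sketch `fake_click()` is a hypothetical stand-in for `navigate_click()` that returns a made-up one-row tibble instead of touching the browser:

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical stand-in for navigate_click(): one row per call, invented values
fake_click <- function(i) {
  tibble(
    school_name = paste("School", i),
    complexity  = "Alta",
    social_fund = "30%",
    score_6th   = "Bajo"
  )
}

# Call the scraper three times and stack the one-row tibbles
demo_df <- map(1:3, ~ fake_click(.x)) %>%
  bind_rows()

demo_df
```

Against a live page it may help to add a short `Sys.sleep()` inside the scraping function, so the JavaScript has time to render before the page source is read.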

Aaaaandddd, we got our nicely formatted dataset ready for some analysis.

## # A tibble: 160 x 4
##    school_name                   complexity   social_fund score_6th   
##    <chr>                         <chr>        <chr>       <chr>       
##  1 Escuela Collaso i Gil         Muy alta     52%         Bajo        
##  2 Escuela Ruben Darío           Muy alta     66%         Bajo        
##  3 Escuela Castella              Muy alta     25%         Mediano-bajo
##  4 Escuela Drassanes             Muy alta     41%         Bajo        
##  5 Escuela Milà i Fontanals      Muy alta     49%         Bajo        
##  6 Escuela Baixeras              Mediana-alta 24%         Bajo        
##  7 Escuela Cervantes             Mediana-alta 38%         Mediano-alto
##  8 Escuela Parc de la Ciutadella Mediana-baja 15%         Mediano-bajo
##  9 Escuela Pere Vila             Alta         30%         Mediano-alto
## 10 Escuela Alexandre Galí        Alta         27%         Bajo        
## # ... with 150 more rows
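
All four columns come back as character, so `social_fund` needs converting before any numeric analysis. A minimal base R sketch, using a few of the values printed above:

```r
# A few social_fund values as scraped (strings with a percent sign)
social_fund <- c("52%", "66%", "25%")

# Drop the percent sign and convert to a proportion
social_fund_prop <- as.numeric(sub("%", "", social_fund, fixed = TRUE)) / 100

social_fund_prop
```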

PS: If they ever remove that dataset from the website, this post might not work in the future, but at least there’s a record of how to use Docker with RSelenium.
