Yesterday I found a public dataset on schools in Barcelona and their performance on 6th-grade tests. I wanted to scrape it to investigate the relationship between performance and schools that receive special government funds for social integration. I found the dataset here, but it was different from the types of websites I usually scrape (HTML or XML). Although the website has some HTML, the engine paging through the schools is actually based on JavaScript. Well, that's a job for RSelenium, an R package that allows you to browse a website with R.
The process was actually much easier than I thought using Docker. I followed the answer on setting up Docker from this post. Note that this is for Windows 10.
1. Pull the image: docker pull selenium/standalone-firefox (or chrome instead of firefox), or from R: shell('docker pull selenium/standalone-firefox')
2. Start the container: docker run -d -p 4445:4444 selenium/standalone-firefox, or from R: shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
3. Connect: remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
The docs suggest something different with a virtual machine, but I couldn't get it to work. Replace "localhost" with the IP your Docker server provides. I used Chrome for all of the above and got this working just fine in no time!
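If you're on Docker Toolbox (common on Windows 10), you can ask the VM for that IP directly from R. This is a small convenience of my own, not part of the original answer:

# With Docker Toolbox, query the VM for its IP from within R
docker_ip <- shell('docker-machine ip', intern = TRUE)
docker_ip  # e.g. "192.168.99.100"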
Now that we got that down, I scraped the data without much hassle.
library(RSelenium)
library(xml2)
library(tidyverse)
# Empty tibble with four character columns that will hold the scraped schools
the_df <-
  as_tibble(set_names(rerun(4, character()),
                      c("school_name", "complexity", "social_fund", "score_6th")))
Next, we connect to the Selenium server running inside the container with RSelenium and navigate to the dataset:
# 192.168.99.100 is the IP my Docker server provides; yours may differ
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100",
                      port = 4445L,
                      browserName = "chrome")
remDr$open()
remDr$navigate("https://view-awesome-table.com/-L4lo3r-JA2iaWk1puUT/view")
At this point you can use remDr$screenshot(display = TRUE) to view a screenshot of the page you're currently on.
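You can also write the screenshot to disk instead of displaying it, which is handy for debugging a scraper that runs unattended (the file name here is arbitrary):

# Save the current browser view to a PNG for later inspection
remDr$screenshot(file = "awesome_table_view.png")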
Next, I wrote a function that clicks the widget's paging control to advance to the next school and parses the visible row into a tibble:
navigate_click <- function() {
  # Find the paging control of the awesome-table widget
  webElem <- remDr$findElement(using = "class name",
                               "google-visualization-table-div-page")
  Sys.sleep(0.5)  # give the JavaScript table time to render
  webElem$clickElement()  # advance to the next page

  # Parse the page source and pull out the four table cells
  remDr$getPageSource()[[1]] %>%
    read_xml() %>%
    xml_ns_strip() %>%
    xml_find_all(xpath = '//td') %>%
    xml_text() %>%
    set_names(c("school_name", "complexity", "social_fund", "score_6th")) %>%
    as.list() %>%
    as_tibble()
}
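A quick sanity check before the full loop; note that each call advances the table one page, so I navigate back to the start afterwards (this check is my own addition):

# One call should return a single school as a one-row tibble
navigate_click()

# Reset to the first page before looping over all schools
remDr$navigate("https://view-awesome-table.com/-L4lo3r-JA2iaWk1puUT/view")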
Finally, I run navigate_click() 160 times (the number of schools in the data) and bind all of the resulting one-row tibbles together:

# One click-and-scrape per school, stacked into a single tibble
complete_df <-
map(1:160, ~ navigate_click()) %>%
bind_rows()
Aaaaandddd, we got our nicely formatted dataset ready for some analysis.
complete_df
## # A tibble: 160 x 4
## school_name complexity social_fund score_6th
## <chr> <chr> <chr> <chr>
## 1 Escuela Collaso i Gil Muy alta 52% Bajo
## 2 Escuela Ruben Darío Muy alta 66% Bajo
## 3 Escuela Castella Muy alta 25% Mediano-bajo
## 4 Escuela Drassanes Muy alta 41% Bajo
## 5 Escuela Milà i Fontanals Muy alta 49% Bajo
## 6 Escuela Baixeras Mediana-alta 24% Bajo
## 7 Escuela Cervantes Mediana-alta 38% Mediano-alto
## 8 Escuela Parc de la Ciutadella Mediana-baja 15% Mediano-bajo
## 9 Escuela Pere Vila Alta 30% Mediano-alto
## 10 Escuela Alexandre Galí Alta 27% Bajo
## # ... with 150 more rows
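Once you're done, it's good practice to close the browser session and stop the container. The container ID below is a placeholder; use whatever docker ps reports for the Selenium image:

# Close the remote browser session
remDr$close()

# Find the Selenium container's ID, then stop it
shell('docker ps')
shell('docker stop <container_id>')  # substitute the ID printed above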
PS: If they ever remove that dataset from the website this post might not work in the future, but at least there's a trace of how to use Docker with RSelenium.