Lightning post. Earlier today I was trying to scrape the emails of all the PhD candidates in my program, and I had to log in through our ‘Aula Global’. I did so using httr, but something was off: I entered both my username and password, but the website did not log me in. Apparently, when logging in through POST, there are sometimes things called hidden fields that you also need to fill out! I would’ve never thought about this. Below is a case study that excludes my credentials.
The first thing we have to do is identify the POST method and the inputs to the request. Using Google Chrome, go to the website https://sso.upf.edu/CAS/index.php/login?service=https%3A%2F%2Faulaglobal.upf.edu%2Flogin%2Findex.php and then open the Chrome menu and go to More tools -> Developer tools. Here we have the complete HTML of the website.
The POST method is the branch with the form tag that has method='post'.
Expand the POST branch and find all the fields. We can see the two ‘hidden’ fields.
Below the form tag, we see two input tags set to hidden. There they are! Even though we only want to log in, we also have to provide the two hidden fields. Take note of both their name and value attributes.
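Instead of copying the hidden fields by hand, they could also be read out of the form with xml2. The following is only a minimal sketch under assumptions: the HTML string is a made-up stand-in for the login page (only the field names and values mirror the ones used in this post), and the names login_html, hidden_fields and all_fields_sketch are hypothetical. On a real page the method attribute might be written as 'POST', so the XPath may need adjusting.

```r
library(xml2)

# Hypothetical stand-in for the login page; the real markup may differ.
login_html <- '
<html><body>
  <form method="post" action="/CAS/index.php/login">
    <input type="hidden" name="adAS_i18n_theme" value="en">
    <input type="hidden" name="adAS_mode" value="authn">
    <input type="text" name="adAS_username">
    <input type="password" name="adAS_password">
  </form>
</body></html>'

page <- read_html(login_html)

# Select every hidden input that sits inside a form with method='post'
hidden <- xml_find_all(page, "//form[@method='post']//input[@type='hidden']")

# Turn the name/value attributes into a named list
hidden_fields <- as.list(setNames(xml_attr(hidden, "value"),
                                  xml_attr(hidden, "name")))

# Combine placeholder credentials with the hidden fields
all_fields_sketch <- c(
  list(adAS_username = "private", adAS_password = "private"),
  hidden_fields
)
str(all_fields_sketch)
```

The resulting list has the same shape as the one written by hand, so it could be passed as the body of the POST request in the same way.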
For username, the input’s name is adAS_username.
For password, it is adAS_password.
# All form fields: credentials plus the two hidden fields from the form
all_fields <-
  list(
    adAS_username = "private",
    adAS_password = "private",
    adAS_i18n_theme = "en",
    adAS_mode = "authn"
  )
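library(tidyverse)
library(httr)
library(xml2)
# Login page (on sso.upf.edu) and the page to scrape (on aulaglobal.upf.edu)
login <- "https://sso.upf.edu/CAS/index.php/login?service=https%3A%2F%2Faulaglobal.upf.edu%2Flogin%2Findex.php"
website <- "https://aulaglobal.upf.edu/user/index.php?page=0&perpage=5000&mode=1&accesssince=0&search&roleid=5&contextid=185837&id=9829"
# One shared handle so cookies from the login carry over to later requests
upf <- handle("https://aulaglobal.upf.edu")
# Submit the credentials together with the hidden fields
access <- POST(login,
               body = all_fields,
               handle = upf)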
Note how I set the handle. If the website you want to visit and the website that hosts the login form share the same root URL (aulaglobal.upf.edu, for example), then you can skip handle (it’s done behind the scenes). In my case, I set the handle to the root URL of the website I WANT to visit after I log in (because the two have different root URLs). This way the cookies and login information from the login are preserved throughout the session.
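To decide whether an explicit handle is needed, one option is to compare the root of the two URLs. The sketch below uses httr's parse_url() for that. The helper same_root is a hypothetical name introduced here for illustration, and the website URL is shortened; it compares scheme, hostname and port, which is what httr uses to pick a handle when none is supplied.

```r
library(httr)

login   <- "https://sso.upf.edu/CAS/index.php/login?service=https%3A%2F%2Faulaglobal.upf.edu%2Flogin%2Findex.php"
website <- "https://aulaglobal.upf.edu/user/index.php?page=0&perpage=5000"

# Hypothetical helper: TRUE when both URLs share scheme, hostname and port
same_root <- function(a, b) {
  pa <- parse_url(a)
  pb <- parse_url(b)
  identical(pa$scheme, pb$scheme) &&
    identical(pa$hostname, pb$hostname) &&
    identical(pa$port, pb$port)
}

# The login lives on sso.upf.edu and the target on aulaglobal.upf.edu
needs_handle <- !same_root(login, website)
needs_handle
```

Here needs_handle comes out TRUE because the hostnames differ, which matches the situation in this post where a shared handle is passed to both requests.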
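# Request the target page, reusing the handle that holds the login cookies
emails <- GET(website, handle = upf)
# Pull the text of every link in the table and keep the UPF addresses
all_emails <-
  read_html(emails) %>%
  xml_ns_strip() %>%
  xml_find_all("//table//a") %>%
  as_list() %>%
  unlist() %>%
  str_subset(".+@upf\\.edu$")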
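Since the real page sits behind a login, the extraction step can be tried on a small made-up table. This is a sketch under assumptions: the HTML, names and addresses are invented for illustration, and it swaps in xml_text() and base grep() for the as_list(), unlist() and str_subset() steps to keep the example dependent on xml2 only.

```r
library(xml2)

# Hypothetical table; the real page has more rows and columns.
participants_html <- '
<table>
  <tr>
    <td><a href="/user/1">Ada Example</a></td>
    <td><a href="mailto:ada.example@upf.edu">ada.example@upf.edu</a></td>
  </tr>
  <tr>
    <td><a href="/user/2">Bo Sample</a></td>
    <td><a href="mailto:bo.sample@gmail.com">bo.sample@gmail.com</a></td>
  </tr>
</table>'

# Every link inside the table, as plain text
links <- xml_find_all(read_html(participants_html), "//table//a")
link_text <- xml_text(links)

# Keep only the strings that end in @upf.edu (the dot is escaped on purpose)
upf_emails <- grep("@upf\\.edu$", link_text, value = TRUE)
upf_emails
```

The name links and the non-UPF address are dropped by the pattern, so only the UPF address remains.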
Unfortunately, you won’t be able to reproduce this script because you don’t have login credentials unless you belong to the same PhD program as I do. However, I hope you find the hidden fields explanation useful. I’m sure I will come back to this in the near future for reference!