1 Welcome
Every day, millions of gigabytes of data are shared across the internet. The standard way for most internet users to download this data is by manually clicking and saving files in formats such as Excel, CSV, and Word. Clicking through files works well when you only need one or two of them. But what if you needed to download 500 CSV files? Or what happens when the data you need is refreshed every 20 seconds? Manually clicking each file is simply not feasible when downloading data at scale or at frequent intervals. That’s why, alongside the increasing amount of data on the internet, there has also been an increase in the tools that allow you to access that data programmatically. This course will focus on exploring these technologies.
Let me give you an example. When the COVID pandemic struck the world in 2020, it was paramount to understand the mortality rates it was causing and the general decomposition of this mortality (how it was affecting males versus females as well as different age groups). A team of researchers at the Max Planck Institute for Demographic Research set out to collect administrative data on reported deaths by country, region, gender and age. This might sound like fun, but it’s actually awfully difficult: some countries uploaded and refreshed their data every day while others did it on a weekly basis. Different regions of a given country might update their data on different schedules. Each website is different, and simply getting to know a website takes time. Scale that up to hundreds of countries and collecting mortality data becomes incredibly cumbersome: you might spend hours on the web simply clicking links to download data (see here for a sample of countries and the dates they had to collect data from). Aside from the long and tedious task of clicking links for hours, chances are you’ll also introduce errors along the way. You might confuse one region with another and assign the wrong name to a CSV file, skip a particular country by mistake or simply misspell a country’s name, messing up the order of the files you were saving.
The example above is the typical task you want to automate: wouldn’t it be great if we had some sort of robot that could automatically collect the data from all the different websites and save those files with correct, explicit names for us? That’s exactly what the team did. They created the COVerAGE-DB dataset. With the help of dozens of collaborators, they wrote web scraping scripts that automatically collected the data for them and scheduled the process to run as frequently as they needed. It’s as if they managed to create hundreds of small robots that did the work and then brought back the data for them to analyze. This is a perfect motivation for why automated data acquisition is an extremely important tool and why everyone who works with data would benefit from knowing these tools.
Throughout the course, we’ll describe and build upon the basic formats for data transfer such as JSON, HTML and XML. We will learn how to subset, manipulate and transform these formats into shapes suitable for data analysis. Most of these data formats can be accessed by scraping a website: programmatically accessing the data on a website and asking the program to download as much data as we need, as frequently as we need it. An integral part of the course will focus on how to perform ethical web scraping, how to scale your web scraping programs to download massive amounts of data and how to program your scraper to access data at frequent intervals.
The course will also focus on an emerging technology for websites to share data: Application Programming Interfaces (APIs). We will touch upon the basic formats for data sharing through APIs as well as how to make dozens of data requests in just seconds.
A big part of the course will emphasize automation as a way for students to create robust and scalable data acquisition pipelines. At each step of the process, we will focus on the practical applications of these techniques and encourage active participation through hands-on data acquisition challenges. The course will make heavy use of GitHub, where students will share their homework as well as explore the immense repository of data acquisition technology already available as open source software.
The goal of the course is to empower students with the right tool set and ideas to create data acquisition programs, automate their data extraction pipelines and quickly transform that data into formats suitable for analysis. The course will place all of this material in light of the legal and ethical consequences that data acquisition can entail, always informing students of best practices when grabbing data from the internet.
The course assumes students are familiar with the R programming language, with transforming and manipulating datasets, and with saving their work using Git and GitHub. No prior knowledge of software development or data acquisition techniques is needed.
1.1 Packages we’ll use
Web scraping and APIs are inherently online resources: you cannot access a website or an API without an internet connection. Many tutorials on data harvesting use live websites as examples, but this has a major problem: websites and APIs change. An example that works today for grabbing some data from the web might not work in six months because the developers changed the website. This book aims to break with that pattern by using local copies of websites that I saved. This means that all of the examples in this book will work just as well one, two or three years from now. Similarly, I’ve created local copies of APIs that will also persist over time, making this book reproducible in the long term.
You might ask: but where are these examples saved? This book was written with a companion package, scrapex. scrapex contains complete websites saved as examples for us to use throughout the book. Moreover, it contains several APIs that can be launched locally, just as you would access an online API. You can install scrapex with the code below:
install.packages("devtools")
devtools::install_github("cimentadaj/scrapex")
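To give you a feel for how this works, here is a minimal sketch of loading scrapex and grabbing the path to one of its locally saved websites. The function name below is only an illustration of the kind of helpers the package provides; the actual functions are introduced in later chapters.
library(scrapex)

# Illustrative only: scrapex exposes helper functions that return the path to
# a locally saved copy of a website. The exact function names are covered in
# later chapters; this one stands in for that pattern.
link <- history_sociology_ex()

# 'link' is a path to an HTML file on your computer, so the example works
# without an internet connection.
link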
To actually scrape data and make requests, we’ll use these packages:
install.packages(c("httr2", "xml2", "rvest"))
httr2 will allow us to make requests to websites and APIs, while xml2 and rvest allow us to extract data from a website. rvest is the more user-friendly package and the one we’ll use throughout the book, but once we get more advanced we’ll switch to xml2. Just note that we’ll use these two packages interchangeably throughout the book. We’ll also make heavy use of the tidyverse ecosystem of packages, so be sure to install all of them with:
install.packages("tidyverse")
Apart from these, we’ll use other packages, mainly for manipulating data and for reading other data sources such as JSON. At the beginning of each chapter we’ll load the packages we need, so be sure to install any that we haven’t mentioned here.
1.2 What to expect
In general, web scraping and APIs are an inexact art. The data you gather from a website will be dirty, messy and unstructured. There’s no single clear way to get what you want, and you’ll need to come up with creative solutions both to get the data and to clean it. In particular, we’ll develop basic and intermediate skills for cleaning strings. Most of the data you grab will arrive as strings of one kind or another (numbers, sentences, paragraphs, chunks of incomplete text), and you’ll need to learn how to clean them. Don’t be intimidated by this: we’ll work hard at understanding the logic behind each string manipulation exercise and we’ll establish clear feedback loops to quickly learn where we went wrong.
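As a small preview, here is a minimal sketch of the kind of string cleaning we’ll practice, using the stringr and readr packages from the tidyverse. The messy strings below are made up for illustration:
library(stringr)
library(readr)

# Messy strings of the kind a scraper typically returns
raw <- c("Deaths: 1,204 (provisional)", "Deaths: 987", "Deaths:  15,300*")

# Keep only the digits and commas, then convert them to proper numbers
numbers <- str_extract(raw, "[0-9,]+")
parse_number(numbers)
#> [1]  1204   987 15300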
Without further ado, let’s begin.