Jorge Cimentada
Ever wondered how you can grab real-time data, on demand, without lifting a finger?
Welcome to the world of automation!
One-off scraping is just as good
One-off scraping will solve many problems and is usually the first starting point
However, you might need to collect data that changes constantly
Sometimes you want to gather information that disappears over time
Ex: weather data, financial data, sports data, political data
Since we want to automate a program, we first need one. Let’s recycle one from our example on El País:
# Load all our libraries
library(scrapex)
library(xml2)
library(magrittr)
library(purrr)
library(tibble)
library(tidyr)
library(readr)
# If this were being done on the real website of the newspaper, you'd want to
# replace the line below with the real link of the website.
newspaper_link <- elpais_newspaper_ex()
newspaper <- read_html(newspaper_link)
all_sections <-
newspaper %>%
# Find all <section> tags which have an <article> tag
# somewhere below them. Keep only the <section>
# tags that have a @data-dtm-region attribute.
xml_find_all("//section[.//article][@data-dtm-region]")
final_df <-
all_sections %>%
# Count the number of articles for each section
map(~ length(xml_find_all(.x, ".//article"))) %>%
# Name all sections
set_names(all_sections %>% xml_attr("data-dtm-region")) %>%
# Convert to data frame
enframe(name = "sections", value = "num_articles") %>%
unnest(num_articles)
Our goal with this scraper is to monitor El País:
How it distributes news across different categories
Whether there are important patterns
# A tibble: 11 × 2
sections num_articles
<chr> <int>
1 portada_apertura 5
2 portada_arrevistada 1
3 portada_tematicos_science,-tech-&-health 5
4 portada_tematicos_business-&-economy 2
5 portada_tematicos_undefined 1
6 portada_branded_ 2
7 portada_arrevistada_culture 5
8 portada_tematicos_work-&-lifestyle 3
9 portada_arrevistada 1
10 portada_tematicos_celebrities,-movies-&-tv 4
11 portada_tematicos_our-selection 4
The scraper is missing one step: saving the data to CSV.
Logic:
If this is the first time the scraper is run, save a CSV file with the article counts per section
If the CSV with the counts already exists, open it and append the newest data with the current timestamp
This approach will add rows with new counts every time the scraper is run.
library(scrapex)
library(xml2)
library(magrittr)
library(purrr)
library(tibble)
library(tidyr)
library(readr)
newspaper_link <- elpais_newspaper_ex()
all_sections <-
newspaper_link %>%
read_html() %>%
xml_find_all("//section[.//article][@data-dtm-region]")
final_df <-
all_sections %>%
map(~ length(xml_find_all(.x, ".//article"))) %>%
set_names(all_sections %>% xml_attr("data-dtm-region")) %>%
enframe(name = "sections", value = "num_articles") %>%
unnest(num_articles)
# Save the current date time as a column
final_df$date_saved <- format(Sys.time(), "%Y-%m-%d %H:%M")
# Where the CSV will be saved. Note that this directory
# doesn't exist yet.
file_path <- "~/newspaper/newspaper_section_counter.csv"
# *Try* reading the file. If the file doesn't exist, this silently returns an error object instead of stopping the script
res <- try(read_csv(file_path, show_col_types = FALSE), silent = TRUE)
# If the file doesn't exist
if (inherits(res, "try-error")) {
# Save the data frame we scraped above
print("File doesn't exist; Creating it")
write_csv(final_df, file_path)
} else {
# If the file was read successfully, append the
# new rows and save the file again
rbind(res, final_df) %>% write_csv(file_path)
}
Summary:
This script will read the website of “El País”
Count the number of articles in each section
Save the results as a CSV file at ~/newspaper/newspaper_section_counter.csv
That directory still doesn’t exist, so we’ll create it first.
New tool: The Terminal
Open with CTRL + ALT + t
Programmatically create directories and files, search for files, and execute scripts 🦾
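To create the ~/newspaper directory the scraper expects, one option is the following (a minimal sketch; the -p flag also creates parent directories and does nothing if the directory already exists):
# Create the folder where the scraper will save its CSV file
mkdir -p ~/newspaper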
With the directory created, we copy the R script into it and check that it is there with ls.
You can change directories with the cd command, which stands for change directory, followed by the path you want to switch to.
For our case, this would be cd ~/newspaper/
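Assuming the scraper was saved as newspaper_scraper.R in your current working directory (only its location is an assumption; the file name is the one used below), the sequence could look like this:
# Copy the script into the newspaper folder
cp newspaper_scraper.R ~/newspaper/
# Switch into the folder and list its contents to confirm the script is there
cd ~/newspaper/
ls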
To execute an R script from the terminal, use the Rscript command followed by the file name.
For our case it should be Rscript newspaper_scraper.R
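Running it should look roughly like this (a sketch: the startup messages depend on your installed packages, and the last line comes from the print() call in the script):
Rscript newspaper_scraper.R
# ... package startup messages from the library() calls ...
# [1] "File doesn't exist; Creating it"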
The first few lines of output show the packages being loaded
The message "File doesn't exist; Creating it" shows that the script is creating the CSV file for the first time
Our scraper works
All infrastructure is ready (directories, CSV file)
How do we automate it?
Here's where cron comes in
Confirm it works:
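One quick check, assuming cron is already installed (it ships with most Linux distributions), is to list your current schedule from the terminal:
# List the cron jobs scheduled for your user; on a fresh setup this
# prints something like "no crontab for <your username>"
crontab -l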
The output means you have no scheduled scripts on your computer.
The command we want to schedule is Rscript ~/newspaper/newspaper_scraper.R
The schedule itself is written as five fields, for example * * * * *
which means: every minute, of every hour, of every day of the month, of every month, on every day of the week.
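For reference, this is how the five fields map to time units (the command at the end is our scraper's; # lines are comments in a crontab file):
# ┌───────────── minute (0-59)
# │ ┌─────────── hour (0-23)
# │ │ ┌───────── day of the month (1-31)
# │ │ │ ┌─────── month (1-12)
# │ │ │ │ ┌───── day of the week (0-7; both 0 and 7 mean Sunday)
# │ │ │ │ │
  * * * * *  Rscript ~/newspaper/newspaper_scraper.R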
30 * * * *
run at minute 30 of each hour, each day, each month, each day of the week
30 * * * 3
run at minute 30 of every hour on Wednesdays
30 5 * * 6,7
run at minute 30 of the 5th hour (05:30) every Saturday and Sunday
If the day-of-week field (the last slot) and the day-of-month field (the third slot) are both restricted in a schedule, then the job runs on any day that matches either the day of month or the day of week.
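For example, this hypothetical schedule (not part of our scraper) shows the rule in action:
# Runs at 05:30 on the 1st of every month AND at 05:30 every Monday,
# because day of month (1) and day of week (1 = Monday) are matched independently
30 5 1 * 1  Rscript ~/newspaper/newspaper_scraper.R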
Let's say we wanted to run our newspaper scraper every 4 hours, every day; what would that look like?
With plain numbers we have no way of saying "regardless of the day, hour, or minute, run the scraper every X hours"; for that, cron has the */N step syntax.
1 */4 * * *
run at minute 1, every 4 hours, every day of the year
1 */4 * * */2
run at minute 1, every 4 hours, on every second day of the week
These simple rules will allow you to go very far in scheduling scripts for your scrapers or APIs.
Schedule our newspaper scraper to run every minute, just to make sure it works.
It will get messy because it'll append the same results to the CSV file continuously.
However, it will give proof that the script is running on a schedule.
If we want this to run every minute, our cron expression should be * * * * *
The first step with cron is to pick an editor:
Pick nano, the easiest one.
Here is where we write * * * * * Rscript ~/newspaper/newspaper_scraper.R
Depending on your setup, you might also need to add a PATH line so that cron can find Rscript: PATH=/usr/local/bin:/usr/bin:/bin
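Putting both lines together, the crontab file could look like this (a sketch; adjust the PATH if Rscript lives somewhere else on your machine):
# Make sure cron can find Rscript
PATH=/usr/local/bin:/usr/bin:/bin
# Run the scraper every minute
* * * * * Rscript ~/newspaper/newspaper_scraper.R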
To exit the cron interface, follow these steps:
Hit CTRL and X (this is for exiting the cron interface)
It will prompt you to save the file. Press Y to save it.
Press enter to save the cron schedule file with the same name it has.
Nothing special should be happening at the moment. Wait two or three minutes and then look at the CSV file; new rows should be appended every minute.
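One quick way to confirm this from the terminal (the path is the one used in the scraper above):
# Print the CSV and count its rows; the row count should grow every minute
cat ~/newspaper/newspaper_section_counter.csv
wc -l ~/newspaper/newspaper_section_counter.csv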
To remove the schedule, open the crontab again, delete the entire line, and save the file again.
The computer needs to be on all the time; this is why servers are used
cron can also become complex if your schedule patterns are difficult
Backtracking a failed cron job is tricky because it's not interactive
Setting up production-ready scrapers is difficult: databases, interactivity, persistence, avoiding bans, saving data in real time, etc.