I’ve just started a project which I’m very excited about. Everyday I take my bike to work and most days I have one of two problems. First, whenever I get to my station there are no bikes available; no problem, there’s an app that shows the closest stations with bikes available. The problem is that these stations might be far and sometimes I’m relucant to walk that much. I’d love for bicing to give me some time estimation until a new bike arrives.
Second, whenever you’re trying to return a bike the station might not have any parking spaces available. Similarly, it would be very cool if bicing (the public bicycle company) gave me an estimate of how much time I should wait until a new bike will be taken. I started thinking on how I could implement this and started looking for bicing data online. To my surprise, bicing actually releases their live data as a json! But for this type of estimation I need historical data. I want to know the pattern usage of the station and use that information for the prediction.
With that idea in mind, I got to work. I needed to set up my Virtual Private Server (VPS) to pull the data from the bicing API everyday. Because this is still a work in progress, I will only describe here how I set my VPS to scrape the bicing API everyday and how I set cron
to send me an email after every scrape.
I have a VPS from Digital Ocean with an Ubuntu OS and 512 mb of RAM and 2 GB of hard disk. That’s enough for this task because the data should not be very big, even in the long run. In any case you can adjust for your VPS to have more memory/ram without losing information. Assuming you have R installed in your Ubuntu VPS with your favorite packages, then make sure your script works by running Rscript path/to/your/script.R
. It might be better to type which Rscript
in the terminal and paste the path to the executable, similar to /usr/bin/Rscript path/to/your/script.R
My workflow is as follows: I first create an empty dataset saved as .rds
and my script reads the data, scrapes the bicing data and then saves the data by appending both the empty and the scraped data. It finishes by saving the same .rds
for a later scrape. I tested this very thoroughly to make sure the script wouldn’t fail and I always get the expected data.
All good so far, right? This took me no time. The hard problem came when setting the cron
job, which is a way of scheduling tasks in OSx and Ubuntu. For an explanation of how cron
works, check out how I set my PISA twitter bot.
First, make sure you have cron
installed. I followed a lot of tutorials and dispered information. What worked for me perhaps does not work for you, but here it is.
Type crontab -e
and the cron interface should appear. The lines starting with #
are coments, so scroll down until the end of the comments. First we have to set a few environmental variables that cron
uses to execute your script. I followed these tips.
When I finished my crontab looked like this:
SHELL=/bin/bash
PATH=/home/cimentadaj/bin:/home/cimentadaj/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
HOME=/home/cimentadaj/bicycle
MAILTO=my_email # set email here!
15,50 16 * * * /usr/bin/Rscript scrape_bicing.R
SHELL is the path to the pre-determined program to run on the cron job. Be default I set it to bash (but it could be anything else you want).
PATH I’m not sure what’s for but I pasted the output of echo $PATH
, as the tips suggested.
HOME is the root directory where the script will be executed, I set it to where the script is (or where your project is at).
MAILTO is the email where I will get the cron job alert when it finishes.
15,50 16 * * * /usr/bin/Rscript scrape_bicing.R
is the schedule, program and script to run. Here I set arbitrary times, so the the script is scheduled to run at 16:15
and 16:50
every day, every month and every year. I will run using Rscript
and the name of the script to run.
WARNING: remember that the cron
is set relative to the time of where your server is. Mine did not have the same timezone of where I lived, so I had to set the cron
one hour before of my actual time. Use date
to print the time of your VPS.
Even after this, the cron
job was still not running. Nothing, no email, no log, no change in the data. I then figured out that Ubuntu systems have some pecularities when it comes to cron
. So I went to ./etc/
and renamed every cron.
file for cron-
with rename 's/cron./cron-/g' *
, thanks to this answer.
Run again and it worked! Great. However, I didn’t receive an email stating that the cron
job finished. I looked up many solutions and ended up installing ssmtp
which is a library for sending emails from terminal. I won’t bore you with the details. Here are the steps I took:
ssmtp
with sudo apt-get update
and sudo apt-get install ssmtp
.ssmtp.conf
with sudo nano /etc/ssmtp/ssmtp.conf
Here’s the config that worked for me using gmail
:
# Config file for sSMTP sendmail
#
# The person who gets all mail for userids < 1000
# Make this empty to disable rewriting.
root=your_email@gmail.com
# The place where the mail goes. The actual machine name is required no
# MX records are consulted. Commonly mailhosts are named mail.domain.com
mailhub=smtp.gmail.com:587
AuthUser=your_email@gmail.com
AuthPass=your_password
UseTLS=YES
UseSTARTTLS=yes
TLS_CA_FILE=/etc/ssl/certs/ca-certificates.crt
# Where will the mail seem to come from?
#rewriteDomain=gmail.com
# The full hostname
hostname=your_host_name
# Are users allowed to set their own From: address?
# YES - Allow the user to specify their own From: address
# NO - Use the system generated From: address
#FromLineOverride=YES
Three caveats that took me a lot of time to figure out.
First, some docs say you should use another port in mailhub
, but 587
worked for me.
TLS_CA_FILE
: make sure that this file exists! For Ubuntu/Debian the file is at /etc/ssl/certs/ca-certificates.crt
while on other platforms it might be in /etc/pki/tls/certs/ca-bundle.crt
. Note the different file names!
hostname
should be the result of typing hostname
in your server.
Lastly, I also added the line root:your_EMAIL_@gmail.com:smtp.gmail.com:587
with sudo nano /etc/ssmtp/revaliases
.
After an entire day figuring out all this information, the cron
job worked! I now set my cron
job and whenever it finished I receive an email directly showing the log of the script.
I wrote this primarily for me not to forget any of this, but it might be useful for other people.