Jorge Cimentada

I’ve just started a project which I’m very excited about. Every day I take a bike to work, and most days I run into one of two problems. First, when I get to my station there are often no bikes available. That alone is no problem, because there’s an app that shows the closest stations with bikes available. The problem is that these stations might be far away and sometimes I’m reluctant to walk that much. I’d love for bicing to give me an estimate of how long it will take until a new bike arrives.

Second, when you’re trying to return a bike, the station might not have any parking spaces available. Similarly, it would be very cool if bicing (the public bicycle company) gave me an estimate of how long I should wait until a bike is taken. I started thinking about how I could implement this and started looking for bicing data online. To my surprise, bicing actually releases their live data as JSON! But for this type of estimation I need historical data. I want to know the usage pattern of each station and use that information for the prediction.

With that idea in mind, I got to work. I needed to set up my Virtual Private Server (VPS) to pull the data from the bicing API every day. Because this is still a work in progress, I will only describe here how I set up my VPS to scrape the bicing API every day and how I set cron to send me an email after every scrape.

I have a VPS from Digital Ocean running Ubuntu, with 512 MB of RAM and 2 GB of hard disk. That’s enough for this task because the data should not be very big, even in the long run. In any case, you can resize your VPS to get more storage or RAM without losing information. Assuming you have R installed on your Ubuntu VPS with your favorite packages, make sure your script works by running Rscript path/to/your/script.R. It is safer to type which Rscript in the terminal and use the full path to the executable, as in /usr/bin/Rscript path/to/your/script.R, because cron runs jobs with a much shorter PATH than your interactive shell.
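As a small sketch of that check, the snippet below looks up the absolute path of Rscript so it can be pasted into a crontab. The helper name resolve_bin is made up for illustration, and path/to/your/script.R is still a placeholder.

```shell
# Print the absolute path of an executable, or fail with a message on stderr.
# resolve_bin is a hypothetical helper name, not part of any tool.
resolve_bin() {
  command -v "$1" || { echo "not found on PATH: $1" >&2; return 1; }
}

# Only print the suggested command line when Rscript is actually installed.
if rscript_bin="$(resolve_bin Rscript)"; then
  echo "Test by hand, then reuse in cron: $rscript_bin path/to/your/script.R"
fi
```

If nothing is printed apart from the "not found" message, R is probably not installed for the user you are logged in as.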

My workflow is as follows: I first create an empty dataset saved as .rds. On each run, my script reads that file, scrapes the bicing data, appends the scraped data to what was already stored, and saves the result to the same .rds for the next scrape. I tested this very thoroughly to make sure the script wouldn’t fail and that I always get the expected data.

All good so far, right? This took me no time. The hard part came when setting up the cron job; cron is a way of scheduling tasks on macOS and Ubuntu. For an explanation of how cron works, check out how I set my PISA twitter bot.

First, make sure you have cron installed. I followed a lot of tutorials and scattered information. What worked for me might not work for you, but here it is.

Type crontab -e and your crontab should open in an editor. The lines starting with # are comments, so scroll down to the end of the comments. First we have to set a few environment variables that cron uses to execute your script. I followed these tips.

When I finished, my crontab looked like this:

SHELL=/bin/bash
PATH=/home/cimentadaj/bin:/home/cimentadaj/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
HOME=/home/cimentadaj/bicycle
# set your email here! (cron does not allow comments on the same line as a setting)
MAILTO=my_email
15,50 16 * * * /usr/bin/Rscript scrape_bicing.R

SHELL is the shell cron uses to run the job. I set it to bash (cron falls back to /bin/sh if you leave it out), but it could be any other shell you want.
PATH is the list of directories where cron looks for executables. Cron’s own default is very short, so I pasted the output of echo $PATH, as the tips suggested.
HOME is the directory cron switches to before running the job. I set it to where the script is (or where your project is at), which is why the script can be given without a full path.
MAILTO is the email address that receives the output of the cron job when it finishes.
15,50 16 * * * /usr/bin/Rscript scrape_bicing.R is the schedule, program and script to run. Here I set arbitrary times: the script is scheduled to run at 16:15 and 16:50 every day of every month. The job is run with Rscript, followed by the name of the script.
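To make the layout of that last line easier to see, here is a small sketch that splits a crontab entry into its five time fields and the command. The function name explain_cron and the field labels are my own, not something cron provides.

```shell
# Split one crontab entry into its five time fields and the command.
# explain_cron is a hypothetical helper; the subshell body keeps variables local.
explain_cron() (
  # read does no glob expansion, so the literal '*' fields are safe.
  read -r minute hour dom month dow cmd <<EOF
$1
EOF
  printf 'minute=%s\nhour=%s\nday_of_month=%s\nmonth=%s\nday_of_week=%s\ncommand=%s\n' \
    "$minute" "$hour" "$dom" "$month" "$dow" "$cmd"
)

explain_cron '15,50 16 * * * /usr/bin/Rscript scrape_bicing.R'
# minute=15,50
# hour=16
# day_of_month=*
# month=*
# day_of_week=*
# command=/usr/bin/Rscript scrape_bicing.R
```

The comma in the minute field is what produces two runs within the same hour.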

WARNING: remember that cron schedules are relative to the time zone of your server. Mine was not in the same time zone as where I live, so I had to set the cron job one hour earlier than my local time. Use date to print the current time on your VPS.
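Instead of doing the arithmetic by hand, a sketch like the following converts a time in your own zone into the server’s clock. It assumes GNU date (the default on Ubuntu) and tz database names; Europe/Madrid is only an example zone, and to_server_time is a made-up helper name.

```shell
# Convert a wall-clock time in your own time zone to the server's local time.
# Relies on GNU date's TZ="..." syntax inside the -d string.
to_server_time() {
  # $1 = your zone, e.g. Europe/Madrid (assumed example); $2 = HH:MM in that zone
  date -d "TZ=\"$1\" $2" '+%H:%M'
}

# Show the server clock, then the time to put in the crontab.
date '+Server clock: %H:%M %Z'
echo "16:15 in Europe/Madrid is $(to_server_time Europe/Madrid 16:15) on this server"
```

The hour and minute it prints are the values that go in the first two crontab fields.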

Even after this, the cron job was still not running. Nothing: no email, no log, no change in the data. I then figured out that Ubuntu systems have some peculiarities when it comes to cron. So I went to /etc/ and renamed every cron. file to cron- with rename 's/cron\./cron-/g' *, thanks to this answer.
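When a job seems to do nothing at all, one way to see whether cron even tried to start it is to search the system log. This is only a sketch: it assumes cron writes lines tagged CRON to /var/log/syslog, as Ubuntu does by default, and recent_cron_runs is a name I made up.

```shell
# Show the most recent cron entries from a log file.
# recent_cron_runs is a hypothetical helper name.
recent_cron_runs() {
  # $1 = log file to search (assumed default: /var/log/syslog on Ubuntu)
  log_file="${1:-/var/log/syslog}"
  if [ ! -r "$log_file" ]; then
    echo "cannot read $log_file (you may need sudo)" >&2
    return 1
  fi
  # Keep only lines written by cron and show the last 20 of them.
  grep 'CRON' "$log_file" | tail -n 20
}
```

Calling recent_cron_runs with no argument reads the default log. A line mentioning your command means cron started the job, so the problem is in the script or its environment rather than in the schedule.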

I ran it again and it worked! Great. However, I didn’t receive an email stating that the cron job finished. I looked up many solutions and ended up installing ssmtp, which is a small program for sending email from the terminal. I won’t bore you with the details. Here are the steps I took:

Install ssmtp with sudo apt-get update and sudo apt-get install ssmtp.
Edit ssmtp.conf with sudo nano /etc/ssmtp/ssmtp.conf

Here’s the config that worked for me using Gmail:

# Config file for sSMTP sendmail
#
# The person who gets all mail for userids < 1000
# Make this empty to disable rewriting.
root=your_email@gmail.com

# The place where the mail goes. The actual machine name is required no 
# MX records are consulted. Commonly mailhosts are named mail.domain.com
mailhub=smtp.gmail.com:587

AuthUser=your_email@gmail.com
AuthPass=your_password
UseTLS=YES
UseSTARTTLS=yes
TLS_CA_FILE=/etc/ssl/certs/ca-certificates.crt

# Where will the mail seem to come from?
#rewriteDomain=gmail.com

# The full hostname
hostname=your_host_name

# Are users allowed to set their own From: address?
# YES - Allow the user to specify their own From: address
# NO - Use the system generated From: address
#FromLineOverride=YES

Three caveats that took me a lot of time to figure out:

mailhub: some docs say you should use another port, but 587 worked for me.
TLS_CA_FILE: make sure that this file exists! For Ubuntu/Debian the file is at /etc/ssl/certs/ca-certificates.crt while on other platforms it might be in /etc/pki/tls/certs/ca-bundle.crt. Note the different file names!
hostname: this should be the result of typing hostname on your server.

Lastly, I also added the line root:your_email@gmail.com:smtp.gmail.com:587 to /etc/ssmtp/revaliases, which I edited with sudo nano /etc/ssmtp/revaliases.
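Before waiting for the next scheduled run, it may be worth sending one message by hand to confirm that the configuration works. The sketch below assumes ssmtp is installed and configured as above; build_test_mail and send_test_mail are hypothetical helper names, and the recipient address is a placeholder.

```shell
# Build a minimal message: headers, a blank line, then the body.
build_test_mail() {
  # $1 = recipient address (placeholder, replace with your own)
  printf 'To: %s\nSubject: ssmtp test from %s\n\nIf you can read this, ssmtp is configured.\n' \
    "$1" "$(uname -n)"
}

# Pipe the message to ssmtp, which reads the mail from standard input.
send_test_mail() {
  command -v ssmtp >/dev/null 2>&1 || { echo "ssmtp is not installed" >&2; return 1; }
  build_test_mail "$1" | ssmtp "$1"
}
```

Running send_test_mail your_email@gmail.com should deliver a short message within a minute or so. If it fails, the error ssmtp prints is usually more informative than a silent cron job.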

After an entire day figuring out all this information, the cron job worked! Now, whenever the cron job finishes, I receive an email with the log of the script.

I wrote this primarily for me not to forget any of this, but it might be useful for other people.

How long should I wait for my bike?

PUBLISHED ON DEC 1, 2017

TAGS: PROJECTS, SCRAPING
