Welcome

Jorge Cimentada

Welcome to the world of Data Harvesting

  • Jorge Cimentada

  • Chief Data Scientist at Cognitiva Labs

  • PhD in Sociology

How about you?

What can you expect from the course?

We’ll learn to scrape data from the internet.

What can you expect from the course?

We’ll talk to APIs

All content of this course

How will this course work?


  • Tutoring on Tuesdays from 18:00 to 20:00, online over video call. Request a session by email and wait for a confirmation by email.
  • My email: cimentadaj@gmail.com
  • Classes on Thursdays
  • Class from 18:00 to 19:30
  • 15-minute break
  • Class from 19:45 to 20:45

How will this course work?

  • Readings and exercises will be assigned for the next class.

  • This is the material for the next class: read it and run the code.


Scoring for the class:

  • Class participation 20%

  • Final project 80%

Course outline


Webscraping – First three classes (Jan 30th / Feb 6th / Feb 13th)


APIs - Next three classes (Feb 20th / Feb 27th / March 6th)


Automating Data Harvesting (Final class) - March 13th (ONLINE CLASS)


Presentation of projects - March 27th / 16:30-19:15

Final project

Final project spreadsheet is here.

  • 30 students / 15 groups

  • Try to find your partner as soon as possible – deadline February 13th (third class)

  • Final project ideas submission – deadline February 27th (fifth class)

  • 2 weeks of work on final project

  • Final project submission – deadline March 13th (seventh class)

  • Final project presentation – March 27th / 16:30-19:15

  • Every team will have 10 minutes to present.

Project expectations

  • Hand-in: a GitHub repository, private or public.

  • A clear README on how to reproduce the scraper/API program.

  • The key is to make it reproducible: I should be able to clone the repository and execute whatever is needed to run the scraper.

  • Document what the output is, where it is saved and what each script in the program does.

Project expectations

  • Aim for a medium-to-hard scraping/API project.

    • Scrape several sources of information

    • Same website or combining several websites

    • Meaningful dataset / Something that might help you in another class

    • Remember that most of the mark is for this project.

  • API Projects: tokens should not be hosted on your repository. Provide clear instructions where to place tokens for reproducibility.
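
One common way to keep tokens out of the repository is to read them from an environment variable at run time. The following is a minimal sketch in Python, not a required approach: the variable name `MY_API_TOKEN` and the helper `load_token` are hypothetical names chosen for illustration, and your project may use a different language or mechanism (for example a local file listed in `.gitignore`).

```python
import os

# Hypothetical variable name, chosen only for this sketch.
TOKEN_ENV_VAR = "MY_API_TOKEN"


def load_token(env=None, name=TOKEN_ENV_VAR):
    """Return the API token stored in an environment variable.

    Raises a RuntimeError with setup instructions when the token is
    missing, so whoever reproduces the project knows where to put it.
    """
    # Allow a plain dict to be passed in, which makes the function easy to test.
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"No token found. Set the {name} environment variable "
            "before running the scripts (see the README)."
        )
    return token
```

With this pattern, the README only needs to say which variable to set; the token itself never appears in a committed file.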

Project expectations


Project ideas must be discussed with me and approved before the deadline


Send emails to cimentadaj@gmail.com

Examples from previous classes:

  • https://github.com/myanesp/media-streaming-tmdb
  • https://github.com/myanesp/tmdbR

Contribute

Contribute to the book!

Motivation

Motivation: https://www.coverage-db.org/


Demotivation: https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release