Welcome

Jorge Cimentada

Welcome to the world of Data Harvesting

  • Jorge Cimentada

  • Chief Data Scientist at Cognitiva Labs

  • PhD in Sociology

How about you?

What can you expect from the course?

We’ll learn to scrape data from the internet.

What can you expect from the course?

We’ll talk to APIs

All content of this course

How will this course work?


  • Tutoring on Tuesdays, 18:00-20:00. Request a session by email and wait for an email confirmation. Tutoring takes place online over video call.
  • My email: cimentadaj@gmail.com
  • Classes on Wednesdays
  • Class from 18:00 to 19:30
  • 15-minute break
  • Class from 19:45 to 20:45

How will this course work?

  • Readings and exercises will be assigned for the next class.

  • This is the material for the next class – read and execute the code.


Scoring for the class:

  • Class participation 20%

  • Final project 80%

Course outline


Web scraping – First three classes (31st Jan / 7th Feb / 14th Feb)


APIs - Next three classes (21st Feb / 28th Feb / 6th March)


Automating Data Harvesting (Final class) - 13th March


22nd February

Presentation of projects - 22nd March / 16:30-19:15

Final project

Final project spreadsheet is here.

  • 30 students / 15 groups

  • Try to find your partner as soon as possible – deadline 14th Feb (third class)

  • Final project ideas submission – deadline 28th Feb (fifth class)

  • 2 weeks of work on final project

  • Final project submission – deadline 13th March (seventh class)

  • Final project presentation – 22nd March / 18:00-20:45

  • Every team will have 10 minutes to present.

Project expectations

  • Hand-in: a GitHub repository, private or public.

  • A clear README on how to reproduce the scraper/API program.

  • The key is to make it reproducible: I should be able to clone the repository and run whatever is needed to reproduce the scraper's results.

  • Document what the output is, where it is saved, and what each script in the program does.

Project expectations

  • The idea is to build a medium-to-hard scraping/API project.

    • Scrape several sources of information

    • From the same website or by combining several websites

    • A meaningful dataset / something that might help you in another class

    • Remember that most of the mark comes from this project.

  • API projects: tokens should not be committed to your repository. Provide clear instructions on where to place tokens so the project stays reproducible.
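
One possible way to keep tokens out of the repository is to read them from an environment variable at run time. The sketch below is only an illustration: the function name `load_api_token` and the variable name `API_TOKEN` are assumptions made for this example, not course requirements, and the same idea works in any language.

```python
import os


def load_api_token(var_name: str = "API_TOKEN") -> str:
    """Return the token stored in the environment variable `var_name`.

    `API_TOKEN` is a hypothetical default name chosen for this sketch.
    """
    # Read the value from the environment; nothing is stored in the repository.
    token = os.environ.get(var_name, "").strip()
    if not token:
        # Fail early with a message that points the user to the README.
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the "
            "program (see the README for instructions)."
        )
    return token
```

The README would then only need to say which variable to set (for example, `export API_TOKEN=...` in a shell) before running the scripts.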

Project expectations


Project ideas must be discussed with me and approved before the deadline.


Send emails to cimentadaj@gmail.com

Examples from previous classes:

  • https://github.com/myanesp/media-streaming-tmdb
  • https://github.com/myanesp/tmdbR

Contribute

Contribute to the book!

Motivation

Motivation: https://www.coverage-db.org/


Demotivation: https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release