Jorge Cimentada
Chief Data Scientist at Cognitiva Labs
PhD in Sociology
How about you?
We’ll learn to scrape data from the internet.
We’ll talk to APIs.
Readings and exercises will be assigned for next class.
This is the material for next class – read and execute the code.
Scoring for the class:
Class participation 20%
Final project 80%
Webscraping – First three classes (Jan 30th / Feb 6th / Feb 13th)
APIs - Next three classes (Feb 20th / Feb 27th / March 6th)
Automating Data Harvesting (Final class) - March 13th (ONLINE CLASS)
Presentation of projects - March 27th / 16:30-19:15
Final project spreadsheet is here.
30 students / 15 groups
Try to find your partner as soon as possible – deadline February 13th (third class)
Final project ideas submission – deadline February 27th (fifth class)
2 weeks of work on final project
Final project submission – deadline March 13th (seventh class)
Final project presentation – March 27th / 16:30-19:15
Every team will have 10 minutes to present.
Handout: a GitHub repository, private or public.
A clear README on how to reproduce the scraper/API program.
The key is to make it reproducible: I should be able to clone the repository and run whatever steps you give me to reproduce the scraper.
Document what the output is, where it is saved and what each script in the program does.
The idea is to build a medium-to-hard scraping/API project.
Scrape several sources of information
Same website or combining several websites
Meaningful dataset / Something that might help you in another class
Remember most of the mark is for this project.
API Projects: tokens should not be hosted on your repository. Provide clear instructions where to place tokens for reproducibility.
Project ideas should be discussed with me and approved before the deadline.
Emails can be sent to cimentadaj@gmail.com
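For the rule that API tokens must stay out of the repository, one possible approach is to read the token from an environment variable and fail with a clear message when it is missing. This is only a minimal sketch in Python: the variable name `MY_API_TOKEN` and the function `load_token` are hypothetical, and any equivalent mechanism documented in your README works just as well.

```python
import os

# Hypothetical variable name; use whatever name your README documents.
TOKEN_ENV_VAR = "MY_API_TOKEN"


def load_token(env_var: str = TOKEN_ENV_VAR) -> str:
    """Return the API token stored in an environment variable.

    Raises RuntimeError with setup instructions if the variable is
    missing or empty, so someone cloning the repository knows what to do.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set. "
            "See the README for where to place your token."
        )
    return token


if __name__ == "__main__":
    token = load_token()
    # Never print the token itself; only confirm that it was found.
    print(f"Token loaded ({len(token)} characters).")
```

With this pattern the README only needs one line telling the reader to set `MY_API_TOKEN` before running the scripts, and the token never appears in a committed file.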
Examples from previous classes:
Contribute to the book!
Motivation: https://www.coverage-db.org/
Demotivation: https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release