Jorge Cimentada
Chief Data Scientist at Cognitiva Labs
PhD in Sociology
How about you?
We’ll learn to scrape data from the internet.
We’ll talk to APIs
Readings and exercises will be assigned for next class.
This is the material for next class – read and execute the code.
Scoring for the class:
Class participation 20%
Final project 80%
Webscraping – First three classes (31st Jan / 7th Feb / 14th Feb)
APIs - Next three classes (21st Feb / 28th Feb / 6th March)
Automating Data Harvesting (Final class) - 13th March
22nd February
Presentation of projects - 22nd March / 16:30-19:15
Final project spreadsheet is here.
30 students / 15 groups
Try to find your partner as soon as possible – deadline 14 February (third class)
Final project ideas submission – deadline 28th Feb (fifth class)
2 weeks of work on final project
Final project submission – deadline 13th March (seventh class)
Final project presentation – 22nd March / 18:00-20:45
Every team will have 10 minutes to present.
Handout: GitHub repository (private or public).
A clear README on how to reproduce the scraper/API program.
The key is to make it reproducible: I should be able to clone the repository and run whatever is needed to reproduce the scraper.
Document what the output is, where it is saved and what each script in the program does.
The idea is to build a medium-to-hard scraping/API project.
Scrape several sources of information
Same website or combining several websites
Meaningful dataset / Something that might help you on another class
Remember most of the mark is for this project.
API Projects: tokens should not be hosted on your repository. Provide clear instructions where to place tokens for reproducibility.
Project ideas should be consulted and approved by me before the deadline
Emails can be directed at cimentadaj@gmail.com
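One common way to follow the token rule above is to read the token from an environment variable instead of writing it in a script. The sketch below is only an illustration in Python: the variable name `MY_API_TOKEN` and the helper `load_api_token` are hypothetical, and your project can use any equivalent mechanism as long as the README explains where the token goes.

```python
import os


def load_api_token(var_name="MY_API_TOKEN"):
    """Return the API token stored in an environment variable.

    `MY_API_TOKEN` is a hypothetical name; document the real one in the README.
    """
    token = os.environ.get(var_name)
    if not token:
        # Fail early with instructions rather than sending unauthenticated requests.
        raise RuntimeError(
            f"Set the {var_name} environment variable before running; see the README."
        )
    return token
```

Whoever reproduces the project sets the variable in their own shell (for example `export MY_API_TOKEN=...`) before running the scripts, so the token never enters the repository.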
Examples from previous classes:
Contribute to the book!
Motivation: https://www.coverage-db.org/
Demotivation: https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release