Chapter 7 Syllabus

7.1 Course description

With the increasing amounts of data being collected on a daily basis, the field of machine learning has gained mainstream attention. By shifting away from focusing on inference, machine learning is a field at the intersection of statistics and computer science that is focused on maximizing predictive performance by learning patterns from data. That is, the goal of machine learning is to predict something – and predict it very well, regardless of whether you understand it. These techniques are common in business settings where, for example, stakeholders are interested in knowing the probability of a client leaving a company or the propensity of a client for buying a particular product. The field can be intimidating as it is vast and growing every year.

However, scholars in the social sciences are beginning to understand the importance of the machine learning framework and how it can unlock new knowledge in fields such as sociology, political science, economics and psychology. On this course we will introduce students to the basic ideas of the machine learning framework and touch upon the basic algorithms used for prediction and discussing the potential it can have in the social sciences.

In particular, we will introduce predictive algorithms such as regularized regressions, classification trees and clustering techniques through basic examples. We will discuss their advantages and disadvantages while paying great attention to how it’s been used in research. Although many social scientists do not see how predictive models can help explain social phenomena, we will also focus on how machine learning can play a role as a tool for discovery, improving causal inference and generalizing our classical models through cross validation.

We will end the course with a prediction challenge that will put to test all of your acquired knowledge. Starting with a discussion on the role of predictive challenges such as the Fragile Families Challenge in the social sciences, our predictive challenge will require the student to run machine learning algorithms, test their out-of-sample error rate and discuss strategies on how the results are useful. This will give the class a real hands-on example of how to incorporate machine learning into their research right away. Below is a detailed description of the syllabus.

7.2 Schedule

Session 1
July 6th 09:00h-10:45h

Introduction to the Machine Learning Framework
- Inference vs Prediction
- Can inference and prediction complement each other?
- Bias-variance / Interpretability-prediction tradeoffs
- Resampling for understanding your model
- “The Fragile Families Challenge”

Readings:

Sections 2.1 and 2.2 from James, Gareth, et al. An Introduction To Statistical Learning. Vol. 112. New York: springer, 2013
Molina, M., & Garip, F. (2019). Machine Learning for Sociology. Annual Review of Sociology, 45.
Hofman, J. M., Sharma, A., & Watts, D. J. (2017). Prediction and explanation in social systems. Science, 355(6324), 486-488.
Mullainathan, S., & Spiess, J. (2017). Machine learning: an applied econometric approach. Journal of Economic Perspectives, 31(2), 87-106.
Watts, D. J. (2014). Common sense and sociological explanations. American Journal of Sociology, 120(2), 313-351.
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical science, 16(3), 199-231.

Break 10:45h-11:15h

Session 2
July 6th 11:15h-13:00h

Linear regression and regularization
- Continuous predictions and loss functions
- Lasso
  - Advantages/Disadvantages
  - R example
- Ridge regression
  - Advantages/Disadvantages
  - R example
- Elastic Net
  - Advantages/Disadvantages
  - R example
Exercises

Readings:

For a theoretical introduction to Lasso/Ridge, sections 6.1, 6.2 and 6.6 from James, Gareth, et al. (2013) An Introduction To Statistical Learning. Vol. 112. New York: springer
For hands-on examples, Chapter 6 of Boehmke & Greenwell (2019) Hands-On Machine Learning with R, 1st Edition, Chapman & Hall/CRC The R Series. Accessible at: https://bradleyboehmke.github.io/HOML/

Session 3
July 7th 09:00h-10:45h

Supervised Regression
- Introduction to supervised regression
- Classification
  - Confusion matrices
  - ROC Curves
- Classification Trees
  - Advantages/Disadvantages
  - R example
Exercises

Readings:

For an introduction to classification trees, Section 8.1, 8.3.1 and 8.3.2 from James, Gareth, et al. An Introduction To Statistical Learning. Vol. 112. New York: springer, 2013
For hands-on examples, chapter 9 from Boehmke & Greenwell (2019) Hands-On Machine Learning with R, 1st Edition, Chapman & Hall/CRC The R Series. Accessible at: https://bradleyboehmke.github.io/HOML/
For real-world applications of Classification Trees:
- Billari, F. C., Fürnkranz, J., & Prskawetz, A. (2006). Timing, sequencing, and quantum of life course events: A machine learning approach. European Journal of Population/Revue Européenne de Démographie, 22(1), 37-65.
- Chapter 3 of Nolan, D., & Lang, D. T. (2015). Data science in R: a case studies approach to computational reasoning and problem solving. CRC Press.

Break 10:45h-11:15h

Session 4
July 7th 11:15h-13:00h

Supervised Regression
- Bagging
  - Advantages/Disadvantages
  - R example
- Random Forest
  - Advantages/Disadvantages
  - R example
- Gradient Boosting
  - Advantages/Disadvantages
  - R example
Exercises

Readings:

For an introduction to bagging/random forests/boosting, Chapter 8 from James, Gareth, et al. An Introduction To Statistical Learning. Vol. 112. New York: springer, 2013
For hands-on examples, chapter 10, 11 and 12 from Boehmke & Greenwell (2019) Hands-On Machine Learning with R, 1st Edition, Chapman & Hall/CRC The R Series. Accessible at: https://bradleyboehmke.github.io/HOML/
For real-world applications of Random Forests:
- Perry, C. (2013). Machine learning and conflict prediction: a use case. Stability: International Journal of Security and Development, 2(3), 56.
- Berk, R. A., Sorenson, S. B., & Barnes, G. (2016). Forecasting domestic violence: A machine learning approach to help inform arraignment decisions. Journal of Empirical Legal Studies, 13(1), 94-115.
- Arpino, B., Le Moglie, M., & Mencarini, L. (2018, April). Machine-learning techniques for family demography: an application of random forests to the analysis of divorce determinants in Germany. In ANNUAL MEETING PAA (Vol. 83).
- Rojo, Pedro Salas, and Juan Gabriel Rodríguez. “The Role of Inheritance on Wealth Inequality: A Machine Learning Approach.”

Session 5
July 8th 09h-10:45h

Unsupervised Regression
- Introduction to unsupervised learning
- Principal Component Analysis (PCA)
  - Advantages/Disadvantages
  - R example
- K-Means clustering
  - Advantages/Disadvantages
  - R example
Exercises

Readings:

For an introduction to unsupervised learning, Section 10.1 from James, Gareth, et al. An Introduction To Statistical Learning. Vol. 112. New York: springer, 2013
For an introduction to PCA
- Section 10.2 and 10.4 from James, Gareth, et al. An Introduction To Statistical Learning. 112. New York: springer, 2013
- For hands-on examples, chapter 17 from Boehmke & Greenwell (2019) Hands-On Machine Learning with R, 1st Edition, Chapman & Hall/CRC The R Series. Accessible at: https://bradleyboehmke.github.io/HOML/
For an introduction to K-Means clustering
- Section 10.5 from James, Gareth, et al. An Introduction To Statistical Learning. Vol. 112. New York: springer, 2013
- For hands-on examples, chapter 20 from Boehmke & Greenwell (2019) Hands-On Machine Learning with R, 1st Edition, Chapman & Hall/CRC The R Series. Accessible at: https://bradleyboehmke.github.io/HOML/
For real-world applications of K-means clustering:
- Garip, F. (2012). Discovering diverse mechanisms of migration: The Mexico–US Stream 1970–2000. Population and Development Review, 38(3), 393-433.
- Bail, C. A. (2008). The configuration of symbolic boundaries against immigrants in Europe. American Sociological Review, 73(1), 37-59.

Break 10:45h-11:15h

Session 6
July 8th 11:15h-13:00h

Unsupervised Regression
- Hierarchical clustering
  - Advantages/Disadvantages
  - R example
Final challenge: Prediction competition
- Explanation of strategies
- No free lunch theorem
- Presentation of results

Readings:

For an introduction to hierarchical clustering, sections 10.3.2, 10.3.3, 10.5.2 from James, Gareth, et al. An Introduction To Statistical Learning. Vol. 112. New York: springer, 2013
For hands-on examples, chapter 21 from Boehmke & Greenwell (2019) Hands-On Machine Learning with R, 1st Edition, Chapman & Hall/CRC The R Series. Accessible at: https://bradleyboehmke.github.io/HOML/
For examples on prediction competitions:
- Glaeser, E. L., Hillis, A., Kominers, S. D., & Luca, M. (2016). Crowdsourcing city government: Using tournaments to improve inspection accuracy. American Economic Review, 106(5), 114-18.
- Salganik, M. J., Lundberg, I., Kindel, A. T., & McLanahan, S. (2019). Introduction to the Special Collection on the Fragile Families Challenge. Socius, 5, 2378023119871580. Accessible at https://www.researchgate.net/publication/335733962_Introduction_to_the_Special_Collection_on_the_Fragile_Families_Challenge

7.3 Software

We will be using the R software together with the Rstudio interface. No laptop is required as the seminars will take place in the RECSM facilities. Any packages we plan to use will be already downloaded previous to the session.

7.4 Prerequisites

The course assumes that the student is familiar with R and should be familiar with reading, manipulating and cleaning data frames. Ideally, the student has conducted some type of research using the software.
Students should have solid knowledge of basic statistics such as linear and logistic regression, ideally with more advanced concepts such as multilevel modeling.

7.5 About the author

Jorge Cimentada has a PhD in Sociology from Pompeu Fabra University and is currently a Research Scientist at the Laboratory of Digital and Computational Demography at the Max Planck Institute for Demographic Research. His research is mainly focused on the study of educational inequality, inequality in spatial mobility and computational social science. He has worked on data science projects both in the private sector and in academic research and is interested in merging cutting edge machine learning techniques with classical social statistics. You can check out his blog at cimentadaj.github.io or contact him through twitter at @cimentadaj.

Athey, Susan, and Guido Imbens. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” Proceedings of the National Academy of Sciences 113 (27): 7353–60.

Boehmke, Brad, and Brandon M Greenwell. 2019. Hands-on Machine Learning with R. CRC Press.

Breiman, Leo, and others. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3): 199–231.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Mincer, Jacob. 1958. “Investment in Human Capital and Personal Income Distribution.” Journal of Political Economy 66 (4): 281–302. https://doi.org/10.1086/258055.

Salganik, Matthew J, Ian Lundberg, Alexander T Kindel, Caitlin E Ahearn, Khaled Al-Ghoneim, Abdullah Almaatouq, Drew M Altschul, et al. 2020. “Measuring the Predictability of Life Outcomes with a Scientific Mass Collaboration.” Proceedings of the National Academy of Sciences 117 (15): 8398–8403.

Watts, Duncan J. 2014. “Common Sense and Sociological Explanations.” American Journal of Sociology 120 (2): 313–51.

Yarkoni, Tal, and Jacob Westfall. 2017. “Choosing Prediction over Explanation in Psychology: Lessons from Machine Learning.” Perspectives on Psychological Science 12 (6): 1100–1122.