library(dplyr)
library(ggplot2)

data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)
K-Means is a method for finding clusters in a dataset of P variables
K-Means clustering is particularly useful for exploration in the social sciences
Suppose we have a scatterplot of two variables:
How does K-Means identify clusters?
It starts by randomly assigning each point to a cluster
K-Means clustering works by creating something called 'centroids'
These represent the center of the different clusters
The centroid is the mean of the P variables for the points in each cluster
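Before looking at the toy plot below, here is a minimal sketch of that idea using made-up toy data (not the PISA data): each centroid is just the per-cluster mean of each variable.

# Hypothetical toy data: six points already assigned to three clusters
toy <- data.frame(
  cluster = c("orange", "orange", "purple", "purple", "green", "green"),
  x = c(.50, .58, .53, .59, .48, .56),
  y = c(.50, .61, .52, .58, .55, .57)
)

# Each centroid is the column-wise mean of the points in its cluster
toy %>%
  group_by(cluster) %>%
  summarise(x = mean(x), y = mean(y))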
centroids_df <- data.frame(
  type = factor(c("orange", "purple", "green"), levels = c("orange", "purple", "green")),
  x = c(.54, .56, .52),
  y = c(.553, .55, .56)
)

ggplot(centroids_df, aes(x, y, color = type)) +
  geom_point(size = 4) +
  scale_color_manual(values = c("orange", "purple", "green")) +
  lims(x = c(0, 1), y = c(0, 1)) +
  theme_minimal()
centroids_df %>%
  ggplot(aes(x, y)) +
  geom_point(aes(color = type), size = 4) +
  geom_point(data = data.frame(x = 0.25, y = 0.75)) +
  scale_color_manual(values = c("orange", "purple", "green")) +
  lims(x = c(0, 1), y = c(0, 1)) +
  theme_minimal()
The distance between a point and a centroid is the Euclidean distance:

√((x2 − x1)² + (y2 − y1)²)
Applied to our problem:
Orange: √((0.54 − 0.25)² + (0.553 − 0.75)²) ≈ 0.3506
Purple: √((0.56 − 0.25)² + (0.550 − 0.75)²) ≈ 0.3689
Green: √((0.52 − 0.25)² + (0.560 − 0.75)²) ≈ 0.3302
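As a quick check, here is a minimal R sketch that reproduces these three distances; the euclidean() helper is just for illustration, while centroids_df and the point (0.25, 0.75) come from the toy example above.

# Helper for illustration only: straight-line (Euclidean) distance between two points
euclidean <- function(p, q) sqrt(sum((p - q)^2))

new_point <- c(x = 0.25, y = 0.75)

# Distance from the new point to each of the three centroids
dists <- apply(centroids_df[, c("x", "y")], 1, euclidean, q = new_point)
names(dists) <- centroids_df$type
dists
which.min(dists)  # the green centroid has the smallest distance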
The random point is closest to the green centroid, as that distance is the smallest (≈ 0.33). Let's assign it to that cluster:
centroids_df %>%
  ggplot(aes(x, y, color = type)) +
  geom_point(size = 4) +
  geom_point(data = data.frame(type = factor("green"), x = 0.25, y = 0.75)) +
  scale_color_manual(values = c("orange", "purple", "green")) +
  lims(x = c(0, 1), y = c(0, 1)) +
  theme_minimal()
The K-Means clustering algorithm applies this calculation to every point, assigning each point the color of its closest centroid.
It then recomputes the centroids and repeats exactly the same strategy until the cluster assignments stop changing.
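A minimal sketch of one such iteration, assuming a hypothetical data frame points_df with columns x and y and the centroids_df from above (this shows the idea only, not the internals of kmeans()):

# Hypothetical points; in practice these would be your observations
points_df <- data.frame(x = runif(20), y = runif(20))

# Assignment step: label each point with its closest centroid
points_df$cluster <- apply(points_df[, c("x", "y")], 1, function(p) {
  d <- apply(centroids_df[, c("x", "y")], 1, function(q) sqrt(sum((p - q)^2)))
  as.character(centroids_df$type)[which.min(d)]
})

# Update step: recompute each centroid as the mean of its assigned points
points_df %>%
  group_by(cluster) %>%
  summarise(x = mean(x), y = mean(y))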
Respondents are very similar within each cluster with respect to the P variables and very different between clusters
You need to provide the number of clusters that you want
K-Means will always produce the number of clusters you ask for, whether or not those clusters reflect real groups
The clusters need to make substantive sense rather than statistical sense.
K-Means also has a stability problem: the initial assignment is random, so different runs can yield different clusters (see the sketch after this list)
Exploratory
Should make substantive sense
Robustness
Replicability
Centering and scaling might be appropriate
Outliers
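Two of these points, replicability and centering/scaling, can be handled with a few lines of R. The hedged sketch below uses the same two PISA variables as the example further down, together with set.seed() for reproducibility, scale() for standardizing, and the nstart argument of kmeans() to try several random starts.

# Fix the random seed so the random starting assignment is reproducible
set.seed(2340)

# Center and scale the variables so both contribute on a comparable scale
scaled_vars <- pisa %>%
  select(ESCS, BSMJ) %>%
  scale()

# nstart = 25 runs K-Means from 25 random starts and keeps the best solution
res_scaled <- kmeans(scaled_vars, centers = 3, nstart = 25)
res_scaled$size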
How can we fit this in R?
Suppose that there are different clusters between the socio-economic status of a family and a student's expected socio-economic status:
Students from low socio-economic status families might not have high aspirations
Students from middle socio-economic status families have average aspirations
Students from high socio-economic status families might have high aspirations.
We fit this using kmeans(), passing it a data frame with the columns ESCS and BSMJ:
res <- pisa %>%
  select(ESCS, BSMJ) %>%
  kmeans(centers = 3)

pisa$clust <- factor(res$cluster, levels = 1:3, ordered = TRUE)

ggplot(pisa, aes(ESCS, BSMJ, color = clust)) +
  geom_point(alpha = 1/3) +
  scale_x_continuous("Index of economic, social and cultural status of family") +
  scale_y_continuous("Students expected occupational status") +
  theme_minimal()
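Beyond the plot, the fitted object itself can be inspected; these are standard components of the object kmeans() returns:

res$centers       # the centroids: mean ESCS and BSMJ within each cluster
res$size          # number of students assigned to each cluster
res$tot.withinss  # total within-cluster sum of squares (lower = tighter clusters)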
The 'no free lunch' theorem states that, because every predictive algorithm makes different assumptions, no single model is known to perform better than all others a priori
Lucky for us: social scientists are not only interested in predictive accuracy
Growing interest in the social science literature in using tree-based methods for causal inference:
Athey, Susan, and Guido Imbens. "Recursive partitioning for heterogeneous causal effects." Proceedings of the National Academy of Sciences 113.27 (2016): 7353-7360
Brand, Jennie E., et al. "Uncovering Sociological Effect Heterogeneity using Machine Learning." arXiv preprint arXiv:1909.09138 (2019)
Tease out effect heterogeneity to aid causal inference
Explore interactions in a causal fashion
We can use machine learning methods for exploring new hypotheses in the data
Avoid overfitting through train/test splits and resampling (a minimal sketch follows this list)
Tree-based methods and regularized regressions can help us identify variables that are very good for prediction but that we weren't aware of:
Understand the role of interactions from a more intuitive point of view through exploration
This includes unsupervised methods such as PCA and K-Means clustering.
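A minimal sketch of a random train/test split in base R, using the pisa data loaded earlier; the 70/30 proportion is an arbitrary choice for illustration.

set.seed(1234)

# Hold out 30% of the rows as a test set; fit models on the remaining 70%
train_idx <- sample(nrow(pisa), size = floor(0.7 * nrow(pisa)))
pisa_train <- pisa[train_idx, ]
pisa_test  <- pisa[-train_idx, ]

nrow(pisa_train)
nrow(pisa_test)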
If prediction is the aim, then there's evidence that some models consistently achieve greater accuracy in different settings:
Tree-based methods
Neural Networks
Don't forget our training: we need to explore our data and understand it. This can help a lot in figuring out why some models work better than others.
2019 Summer Institute In Computational Social Science (SICSS)
Mark Verhagen
Christopher Barrie
Arun Frey
Pablo Beytía
Arran Davis
Jorge Cimentada
All metadata on counties in the United States
Counties with different poverty levels have varying edits and pageviews
Your task: build a predictive model of the number of edits
Dependent variable is revisions
Can help identify which sites are not being captured by poverty/metadata indicators
150 columns, including Wiki data on the website and characteristics of the county
Ideas
Let's see the variables and data:
https://cimentadaj.github.io/ml_socsci/no-free-lunch.html#prediction-challenge
You have 45 minutes, start!