
Machine Learning for Social Scientists

K-Means clustering and competition

Jorge Cimentada

2020-07-08


Load the data

library(dplyr)
library(ggplot2)
data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)

K-Means Clustering

  • K-Means is a method for finding clusters in a dataset of P variables

  • K-Means clustering is particularly useful for exploration in the social sciences

Suppose we have a scatterplot of two variables:
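A minimal sketch of how such a scatterplot could be simulated (the group means and spreads below are made up for illustration):

# Hedged sketch: simulate two variables containing three latent groups
# (all values here are invented for illustration)
set.seed(1234)
sim <- data.frame(
  x = c(rnorm(50, 0.3, 0.05), rnorm(50, 0.5, 0.05), rnorm(50, 0.7, 0.05)),
  y = c(rnorm(50, 0.7, 0.05), rnorm(50, 0.4, 0.05), rnorm(50, 0.6, 0.05))
)

ggplot(sim, aes(x, y)) +
  geom_point() +
  lims(x = c(0, 1), y = c(0, 1)) +
  theme_minimal()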


K-Means Clustering

  • How does K-Means identify clusters?

  • It starts by randomly assigning each point to one of K clusters

  • Each point now has an associated color. However, these colors were randomly assigned.
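A minimal sketch of that random assignment, reusing sim from the previous sketch:

# Assign each point to one of three clusters completely at random
sim$cluster <- factor(sample(c("orange", "purple", "green"), nrow(sim), replace = TRUE),
                      levels = c("orange", "purple", "green"))

ggplot(sim, aes(x, y, color = cluster)) +
  geom_point() +
  scale_color_manual(values = c("orange", "purple", "green")) +
  lims(x = c(0, 1), y = c(0, 1)) +
  theme_minimal()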

K-Means Clustering

  • K-Means clustering works by creating something called 'centroids'

  • These represent the centers of the different clusters

  • Each centroid is the mean of the P variables across the points in its cluster

  • So far, everything is random!
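A sketch of that computation on the simulated data; with random assignments, all three centroids land near the overall mean of x and y:

# Each centroid is just the mean of x and y within a cluster
centroids <- sim %>%
  group_by(cluster) %>%
  summarize(x = mean(x), y = mean(y))

centroids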

K-Means Clustering

  • Let's work this out manually:
centroids_df <- data.frame(
  type = factor(c("orange", "purple", "green"),
                levels = c("orange", "purple", "green")),
  x = c(.54, .56, .52),
  y = c(.553, .55, .56)
)

ggplot(centroids_df, aes(x, y, color = type)) +
  geom_point(size = 4) +
  scale_color_manual(values = c("orange", "purple", "green")) +
  lims(x = c(0, 1), y = c(0, 1)) +
  theme_minimal()


K-Means Clustering

  • Suppose we add a random point
centroids_df %>%
  ggplot(aes(x, y)) +
  geom_point(aes(color = type), size = 4) +
  geom_point(data = data.frame(x = 0.25, y = 0.75)) +
  scale_color_manual(values = c("orange", "purple", "green")) +
  lims(x = c(0, 1), y = c(0, 1)) +
  theme_minimal()

  • How do we assign that point a cluster?

K-Means Clustering

  • We calculate the Euclidean distance between the point and each centroid:

$$\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

  • Applied to our problem:

    • Orange: $\sqrt{(0.54 - 0.25)^2 + (0.553 - 0.75)^2} = 0.3505838$

    • Purple: $\sqrt{(0.56 - 0.25)^2 + (0.550 - 0.75)^2} = 0.3689173$

    • Green: $\sqrt{(0.52 - 0.25)^2 + (0.560 - 0.75)^2} = 0.3301515$
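A quick check of these three distances in R, reusing centroids_df from the previous slide:

# Distance from the random point (0.25, 0.75) to each centroid
point_x <- 0.25
point_y <- 0.75

data.frame(
  type = centroids_df$type,
  dist = sqrt((centroids_df$x - point_x)^2 + (centroids_df$y - point_y)^2)
)
#>     type      dist
#> 1 orange 0.3505838
#> 2 purple 0.3689173
#> 3  green 0.3301515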


K-Means Clustering

The random point is closest to the green centroid, as its distance is the smallest (0.33). Let's assign it to that cluster:

centroids_df %>%
  ggplot(aes(x, y, color = type)) +
  geom_point(size = 4) +
  geom_point(data = data.frame(type = factor("green"), x = 0.25, y = 0.75)) +
  scale_color_manual(values = c("orange", "purple", "green")) +
  lims(x = c(0, 1), y = c(0, 1)) +
  theme_minimal()


K-Means Clustering

The K-Means algorithm applies this calculation to every point, assigning each one the color of its closest centroid.

  • The centroids are still positioned near the center, reflecting the initial random allocation of points
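A hypothetical helper sketching that step, reusing sim and centroids from the earlier sketches:

# Give every point the cluster of its closest centroid
assign_closest <- function(points, cents) {
  factor(mapply(function(px, py) {
    dists <- sqrt((cents$x - px)^2 + (cents$y - py)^2)
    as.character(cents$cluster[which.min(dists)])
  }, points$x, points$y), levels = levels(points$cluster))
}

sim$cluster <- assign_closest(sim, centroids)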

K-Means Clustering

  • New centroids are then calculated as the average of the X and Y values of the newly assigned points

  • Then exactly the same strategy is repeated (see the sketch below):

    • Calculate the distance between each point and every centroid
    • Reassign each point to the cluster of the closest centroid
    • Recalculate the centroids
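A sketch of that loop, reusing the assign_closest() helper from the previous sketch:

# Recompute centroids from the current assignment, reassign every point,
# and repeat until no point changes cluster
# (simplified; real implementations also cap the number of iterations)
repeat {
  old <- sim$cluster
  centroids <- sim %>%
    group_by(cluster) %>%
    summarize(x = mean(x), y = mean(y))
  sim$cluster <- assign_closest(sim, centroids)
  if (identical(as.character(old), as.character(sim$cluster))) break
}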

K-Means Clustering

  • After N iterations, each point settles into a particular cluster and stops being reassigned. The algorithm thus aims to:

  • Minimize within-cluster variance
  • Maximize between-cluster variance

Respondents are very similar within each cluster with respect to the P variables and very different between clusters (see the sketch below)
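Both quantities are reported by kmeans() itself; a quick check on the simulated data from the earlier sketches:

res_sim <- kmeans(sim[, c("x", "y")], centers = 3)
res_sim$tot.withinss  # total within-cluster sum of squares (lower = tighter clusters)
res_sim$betweenss     # between-cluster sum of squares (higher = better separated)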


Disadvantages of K-Means Clustering

  • You need to specify the number of clusters you want in advance

  • K-Means will always produce the number of clusters you supply, whether or not they exist in the data

  • The clusters need to make substantive sense rather than just statistical sense

  • K-Means also has a stability problem: different random starting points can yield different clusters (see the sketch below)
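A hedged sketch of the stability problem on the simulated data; two seeds can give different solutions, and nstart mitigates this by keeping the best of several random starts:

set.seed(1)
kmeans(sim[, c("x", "y")], centers = 3)$tot.withinss
set.seed(2)
kmeans(sim[, c("x", "y")], centers = 3)$tot.withinss

# nstart = 25 runs 25 random initializations and keeps the best solution
kmeans(sim[, c("x", "y")], centers = 3, nstart = 25)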


Caveats of K-Means Clustering

  • Exploratory

  • Should make substantive sense

  • Robustness

  • Replicability

  • Centering and scaling might be appropriate (see the sketch after this list)

  • Outliers
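For the centering and scaling caveat, a minimal sketch using scale() (ESCS and BSMJ are the columns used in the example a few slides below):

# Center and scale before clustering so that no variable dominates
# just because of its units
scaled_vars <- scale(pisa[, c("ESCS", "BSMJ")])
kmeans(scaled_vars, centers = 3, nstart = 25)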


K-Means Clustering

  • How can we fit this in R?

  • Suppose that there are distinct clusters in the relationship between the socio-economic status of a family and the student's expected socio-economic status:

    • Students from low socio-economic status families might not have great aspirations

    • Students from middle socio-economic status families have average aspirations

    • Students from high socio-economic status families might have great aspirations

  • We fit this using kmeans, passing it a data frame with these two columns


K-Means Clustering

  • K-Means can find clusters even when there aren't any

res <- pisa %>%
  select(ESCS, BSMJ) %>%
  kmeans(centers = 3)

pisa$clust <- factor(res$cluster, levels = 1:3, ordered = TRUE)

ggplot(pisa, aes(ESCS, BSMJ, color = clust)) +
  geom_point(alpha = 1/3) +
  scale_x_continuous("Index of economic, social and cultural status of family") +
  scale_y_continuous("Students' expected occupational status") +
  theme_minimal()
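The returned object also contains the centroids and cluster sizes, which help judge whether the clusters make substantive sense:

res$centers  # mean ESCS and BSMJ within each cluster
res$size     # number of students in each cluster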


No free lunch

The 'no free lunch' theorem states that, because every predictive algorithm makes different assumptions, no single model is known a priori to perform better than all others.

Lucky for us: social scientists are not only interested in predictive accuracy.


Causal Inference

  • Growing interest from the social science literature on achieving causal inference using tree-based methods:

    • Athey, Susan, and Guido Imbens. "Recursive partitioning for heterogeneous causal effects." Proceedings of the National Academy of Sciences 113.27 (2016): 7353-7360

    • Brand, Jennie E., et al. "Uncovering Sociological Effect Heterogeneity using Machine Learning." arXiv preprint arXiv:1909.09138 (2019)

  • Tease out heterogeneity in effects to strengthen causal inference

  • Explore interactions in a causal fashion


Inference

  • We can use machine learning methods for exploring new hypotheses in the data

  • Avoid overfitting through train/test splits and resampling (see the sketch after this list)

  • Tree-based methods and regularized regressions can surface variables that are highly predictive but that we weren't aware of:

    • Arpino, B., Le Moglie, M., and Mencarini, L. (2018). Machine-Learning techniques for family demography: An application of random forests to the analysis of divorce determinants in Germany

  • Understand the role of interactions from a more intuitive point of view through exploration

  • This includes unsupervised methods such as PCA and K-Means clustering.
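A minimal base-R sketch of the train/test idea mentioned above (the 70/30 split is an arbitrary choice):

# Explore hypotheses on the training set, confirm on the held-out test set
set.seed(4321)
train_idx <- sample(nrow(pisa), size = floor(0.7 * nrow(pisa)))
train <- pisa[train_idx, ]
test  <- pisa[-train_idx, ]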


Prediction

If prediction is the aim, there's evidence that some models consistently achieve greater accuracy in different settings:

  • Tree based methods

    • Random Forests
    • Gradient Boosting
  • Neural Networks

  • Support Vector Machines

Don't forget our training: we need to explore our data and understand it. This can help a lot in figuring out why some models work better than others.


Prediction challenge

  • 2019 Summer Institute In Computational Social Science (SICSS)

    • Mark Verhagen

    • Christopher Barrie

    • Arun Frey

    • Pablo Beytía

    • Arran Davis

    • Jorge Cimentada

  • All metadata on counties in the United States

  • Counties with different poverty levels have varying edits and pageviews


Prediction challenge

  • Your task: build a predictive model of the number of edits

  • Dependent variable is revisions

  • Can help identify which sites are not being captured by poverty/metadata indicators

  • 150 columns, including Wiki data on the website and characteristics of the county

  • Ideas

    • Does it make sense to reduce the number of correlated variables into a few principal components?
    • Do some counties cluster on very correlated variables? Is it feasible to summarize some of these variables by predicting cluster membership?
    • Do we really need to use all variables?
    • Does regularized regression or tree-based methods do better?

Let's see the variables and data:

https://cimentadaj.github.io/ml_socsci/no-free-lunch.html#prediction-challenge

You have 45 minutes, start!

