Chapter 6 No free lunch
Throughout this course we’ve covered several different methods used in machine learning for predictive problems. Although we presented the benefits and pitfalls of each one where possible, there is no clear-cut rule on which one to use. The ‘No free lunch’ theorem states that, because every predictive algorithm makes different assumptions, no single model is known a priori to perform better than all others. In other words, machine learning practitioners need to try different models to check which one predicts best for their task.
However, social scientists are not only interested in predictive accuracy, and the best choice of method depends on the goal of the analysis. Let’s discuss some of these scenarios.
6.1 Causal Inference
There is growing interest in the social science literature in achieving causal inference using tree-based methods (Athey and Imbens 2016). By definition, this type of analysis is not interested in predictive accuracy alone. This means that we would not simply try several different models and compare their predictive accuracy. Instead, we need to carefully understand how tree-based methods work and how they can help us estimate a causal effect. In this area, machine learning serves as a tool for exploration and for estimating causal effects rather than for maximizing predictive accuracy.
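To make this concrete, below is a minimal sketch of estimating an average treatment effect with a causal forest. It relies on the grf package and on simulated data, neither of which is part of this course; it is purely an illustration of how tree-based methods can be pointed at a causal quantity instead of at predictive accuracy.
# Illustrative only: 'grf' is one implementation of tree-based causal estimation
# install.packages("grf")
library(grf)

set.seed(2131)

# Simulated data: x holds covariates, w is a binary treatment, y is the outcome
n <- 2000
x <- matrix(rnorm(n * 5), ncol = 5)
w <- rbinom(n, 1, 0.5)
y <- 2 * w + x[, 1] + rnorm(n)

# Fit a causal forest and estimate the average treatment effect (true value is 2)
cf <- causal_forest(X = x, Y = y, W = w)
average_treatment_effect(cf)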
6.2 Explaining complex models
In business settings, there are scenarios where interpretability is needed more than accuracy. This is also the case for the social sciences. For example, explaining a complex model to key stakeholders can be challenging. It is sometimes better to have a simple model that performs worse than a more complex one but that is easier to interpret. I’ve experienced situations like this, where we used simple decision trees that performed worse than other tree-based methods simply because it was much more important that the stakeholders understand how we arrived at a final prediction and which variables were the most important ones.
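As an illustration, here is a minimal sketch of such a simple decision tree, fitted with the rpart package on R’s built-in mtcars data (both are stand-ins chosen only for this example). Printing the tree yields a handful of splits that can be walked through with a stakeholder.
# A small, interpretable decision tree: predict miles per gallon from weight and horsepower
library(rpart)

tree_fit <- rpart(mpg ~ wt + hp, data = mtcars)

# The printed output lists each split and the predicted value in each leaf
print(tree_fit)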
6.3 Inference
For social scientists, machine learning methods can also be used for exploring hypotheses in the data. In particular, tree-based methods and regularized regressions can help us identify variables that are very good predictors but that we weren’t aware of. Moreover, they can help us understand the role of interactions from a more intuitive point of view through exploration. This also includes unsupervised methods such as PCA and K-means clustering.
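For instance, a quick exploratory pass with base R might look like the sketch below, using mtcars as a stand-in dataset: PCA to see how much of the variation a few components capture, and K-means to see whether the observations fall into a few groups.
# Standardize the numeric columns before any distance-based method
numeric_vars <- scale(mtcars)

# PCA: how much variation do the first few principal components capture?
pca_res <- prcomp(numeric_vars)
summary(pca_res)

# K-means: do the observations cluster into a few groups?
set.seed(2131)
km_res <- kmeans(numeric_vars, centers = 3)
table(km_res$cluster)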
6.4 Prediction
If your aim is to achieve the best possible predictive accuracy, then there is also evidence that some models tend to perform better than others. Tree-based methods such as random forests and gradient boosting consistently perform among the best in predictive competitions, together with more advanced models such as neural networks and support vector machines. Still, for raw accuracy there is no rule on which model to use. You might have a hunch depending on the distribution and exploration of your data, but since these methods are quite complex, there is no single rule stating that one will perform better. We simply need to try several of them.
Having said this, we need to explore our data and understand it. This can help a lot in figuring out why some models work better than others.
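As a hedged sketch of what ‘trying several of them’ can look like in practice, the example below fits two very different models on the same training set and compares them on a holdout, using the ranger package as one random forest implementation and mtcars as a stand-in dataset.
library(ranger)

set.seed(2131)
train_idx <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# A plain linear model versus a random forest
lm_fit <- lm(mpg ~ ., data = train)
rf_fit <- ranger(mpg ~ ., data = train)

lm_pred <- predict(lm_fit, newdata = test)
rf_pred <- predict(rf_fit, data = test)$predictions

# Compare the two models on the holdout with the root mean squared error
sqrt(mean((test$mpg - lm_pred)^2))
sqrt(mean((test$mpg - rf_pred)^2))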
6.5 Prediction challenge
As part of the end of the course, we will have a prediction competition. This means you’ll get to use all the methods we’ve discussed so far and compare your predictions to those of your fellow classmates.
At the 2019 Summer Institute in Computational Social Science (SICSS), Mark Verhagen, Christopher Barrie, Arun Frey, Pablo Beytía, Arran Davis and I collected data on the number of people who visit the Wikipedia page of every county in the United States. This data can be used to understand whether counties with different poverty levels get more edits from the Wikipedia community, which can help assess whether there is a fundamental bias in Wikipedia contributions toward richer counties.
We will use this data to predict the total number of edits to each county’s Wikipedia page. We’ve matched this Wikipedia data with census-level indicators for each county, including indicators on poverty level, racial composition and general metadata on the county’s Wikipedia page. This amounts to a total of 150 columns. The variable you will be trying to predict is revisions: the total number of edits for each county over its entire history since 2001. Below is the complete codebook:
county_fips: the county code
longitude/latitude: the location of the county
population: total population of the county
density: population density of the county
watchers: number of Wikipedia users who ‘watch’ the page
pageviews: number of pageviews
pageviews_offset: minimum number of pageviews which is visible to the user (censored)
revisions: total number of edits (from the creation of the page)
editors: total number of editors (from the creation of the page)
secs_since_last_edit: seconds since the last edit
characters: number of characters in the page
words: number of words in the page
references: number of references in the article
unique_references: number of unique references in the article
sections: number of sections in the Wikipedia article
external_links: number of external links
links_from_this_page: number of hyperlinks used in this page
links_to_this_page: number of hyperlinks that point to this page (from other Wikipedia pages)
male_*_*: number of males within different age groups
female_*_*: number of females within different age groups
total_*: total population for different demographic groups
latino: total count of Latinos
latino_*: total count of Latinos from different races
no_schooling_completed: total respondents with no schooling
nursery_school: total respondents with only nursery school
kindergarten: total respondents with kindergarten
grade_*: number of people who completed each school grade
hs_diploma: total respondents with a high school diploma
ged: total respondents with a GED diploma
less_than_1_year_college: total respondents with less than one year of college
more_than_1_year_college: total respondents with more than one year of college
associates_degree: total respondents with an associate’s degree
bachelors_degree: total respondents with a bachelor’s degree
masters_degree: total respondents with a master’s degree
professional_degree: total respondents with a professional degree
doctorate_degree: total respondents with a doctorate degree
total_with_poverty_status: total respondents with poverty status
income_below_poverty: total respondents with income below the poverty level
born_in_usa: total respondents born in the USA
foreign_born: total respondents who are foreign born
speak_only_english: total respondents who speak only English
speak_other_languages: total respondents who speak other languages
count_*: total number of respondents within age groups
percent_age_*: percentage of people within different age groups
percent_*: percentage of people from different demographic groups (for example, whites, blacks, less than high school, born in the USA, etc.)
internet_usage: percentage of internet usage in the county
For all of your analyses, use the rmse loss function so that we can compare results across participants.
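As a reminder, the RMSE is the square root of the average squared difference between the observed and predicted values. A small helper you can reuse throughout your analysis:
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}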
Here are some ideas you can try in your analysis:
Does it make sense to reduce the number of correlated variables into a few principal components?
Do some counties cluster on very correlated variables? Is it feasible to summarize some of these variables through predicting the cluster membership?
Do we really need to use all variables?
Do regularized regressions or tree-based methods do better?
You can read the data with:
wiki_dt <- read.csv("https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/wikipedia_final.csv")
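As a starting point, one possible (not prescribed) workflow is sketched below: split the data, fit a simple baseline model for revisions using a few variables from the codebook, and evaluate it on the holdout with the RMSE. The predictors chosen here are only an example.
set.seed(2131)
train_idx <- sample(nrow(wiki_dt), size = floor(0.7 * nrow(wiki_dt)))
train <- wiki_dt[train_idx, ]
test <- wiki_dt[-train_idx, ]

# A simple baseline: predict revisions from a few codebook variables
baseline_fit <- lm(revisions ~ population + pageviews + watchers, data = train)
baseline_pred <- predict(baseline_fit, newdata = test)

# RMSE on the holdout (na.rm guards against missing values in the test set)
sqrt(mean((test$revisions - baseline_pred)^2, na.rm = TRUE))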
You have 45 minutes. Start!
References
Athey, Susan, and Guido Imbens. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” Proceedings of the National Academy of Sciences 113 (27): 7353–60.