Chapter 6 No free lunch

Throughout this course we’ve covered several methods used in machine learning for predictive problems. Although we presented the benefits and pitfalls of each one where possible, there’s no clear-cut rule on which one to use. The ‘No free lunch’ theorem captures the idea that, since every predictive algorithm makes different assumptions, no single model can be known to perform better than all others a priori. In other words, machine learning practitioners need to try different models to check which one predicts best for their task.

However, social scientists are not only interested in predictive accuracy, so the right way to choose a model depends on the scenario. Let’s discuss some of these scenarios.

6.1 Causal Inference

There is growing interest in the social science literature in using tree-based methods for causal inference (Athey and Imbens 2016). By definition, this type of analysis is not concerned with predictive accuracy alone, so we would not simply try several different models and compare their predictive accuracy. Instead, we need to carefully understand how tree-based methods work and how they can help us estimate a causal effect. In this area, machine learning serves as a tool for exploration and for estimating causal effects rather than for maximizing predictive accuracy.
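
As a rough illustration, the sketch below uses the grf package (my own choice for this example, not something we covered in the course) to estimate an average treatment effect with a causal forest on simulated data:

# A minimal sketch, assuming the 'grf' package is installed; the data are
# simulated, so the true treatment effect is known by construction.
library(grf)

set.seed(1234)
n <- 2000
X <- matrix(rnorm(n * 5), n, 5)        # covariates
W <- rbinom(n, 1, 0.5)                 # randomly assigned binary treatment
Y <- 2 * W * (X[, 1] > 0) + rnorm(n)   # effect is heterogeneous in X1

# Fit a causal forest and summarize the average treatment effect
cf <- causal_forest(X, Y, W)
average_treatment_effect(cf)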

6.2 Explaining complex models

In business settings, there are scenarios where interpretability is needed more than accuracy, and this is also the case in the social sciences. For example, explaining a complex model to key stakeholders can be challenging. It is sometimes better to have a simple model that performs worse than a more complex one than to gain accuracy at the expense of interpretability. I’ve experienced situations like this, where we used simple decision trees that performed worse than other tree-based methods simply because it was much more important that the stakeholders understood how we arrived at a final prediction and which variables were the most important ones.
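
As a sketch of what this looks like in practice, a deliberately shallow tree (here on R’s built-in mtcars data; the depth limit is my own illustrative choice) can be read as a handful of if/else rules:

# A small, easy-to-explain tree; rpart ships with standard R installations
library(rpart)

shallow_tree <- rpart(mpg ~ ., data = mtcars,
                      control = rpart.control(maxdepth = 2))

print(shallow_tree)              # the splits read as plain if/else rules
plot(shallow_tree, margin = 0.1) # quick visualization of the tree
text(shallow_tree, use.n = TRUE)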

6.3 Inference

As social scientists, we can use machine learning methods to explore hypotheses in the data. In particular, tree-based methods and regularized regressions can help us identify variables that are very good for prediction but that we weren’t aware of. Moreover, they can help us understand the role of interactions from a more intuitive point of view through exploration. This also includes unsupervised methods such as PCA and K-means clustering.
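
For example, a quick pass with PCA and K-means (here on R’s built-in USArrests data, chosen only for illustration) can already reveal which variables move together and whether observations form natural groups:

# Unsupervised exploration with PCA and K-means on built-in data
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)      # variance captured by each principal component
biplot(pca)       # which variables load on which components

set.seed(42)
km <- kmeans(scale(USArrests), centers = 3, nstart = 25)
table(km$cluster) # how many observations fall in each cluster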

6.4 Prediction

If your aim is to achieve the best predictive accuracy possible, then there is evidence that some models tend to perform better than others. Tree-based methods such as random forests and gradient boosting are consistently among the top performers in predictive competitions, together with more advanced models such as neural networks and support vector machines. Still, for raw accuracy there’s no rule on which model to use. You might have a hunch depending on the distribution and exploration of your data, but since these methods are quite complex, no single rule tells us which one will perform better. We simply need to try several of them.

Having said this, we still need to explore and understand our data. This can help a lot in figuring out why some models work better than others.
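
The sketch below shows the kind of head-to-head comparison this implies, evaluated with RMSE on a holdout set (the Boston housing data and the 70/30 split are my own illustrative choices, and the randomForest package is assumed to be installed):

# Comparing a linear model against a random forest on a holdout set
library(randomForest)
data(Boston, package = "MASS")

set.seed(2131)
train_rows <- sample(nrow(Boston), 0.7 * nrow(Boston))
train <- Boston[train_rows, ]
test  <- Boston[-train_rows, ]

lm_fit <- lm(medv ~ ., data = train)
rf_fit <- randomForest(medv ~ ., data = train)

sqrt(mean((test$medv - predict(lm_fit, test))^2))  # RMSE of the linear model
sqrt(mean((test$medv - predict(rf_fit, test))^2))  # RMSE of the random forest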

6.5 Prediction challenge

As part of the end of the course, we will have a prediction competition. This means you’ll get to use all the methods we’ve discussed so far and compare your predictions to those of your classmates.

At the 2019 Summer Institute in Computational Social Science (SICSS), Mark Verhagen, Christopher Barrie, Arun Frey, Pablo Beytía, Arran Davis and I collected data on the number of people who visit the Wikipedia page of each county in the United States. This data can be used to understand whether counties with different poverty levels get more edits from the Wikipedia community, which can help assess whether there is a fundamental bias in Wikipedia contributions toward richer counties.

We will use this data to predict the total number of edits to each county’s Wikipedia page. We’ve matched the Wikipedia data with census-level indicators for each county, including indicators on poverty level and racial composition, as well as general metadata on each county’s Wikipedia page, for a total of 150 columns. The variable you will be trying to predict is revisions, the total number of edits for each county over its entire history since 2001. Below is the complete codebook:

  • county_fips: the county code
  • longitude/latitude: the location of the county
  • population: total population of county
  • density: density of population in the county
  • watchers: number of Wikipedia users who ‘watch’ the page
  • pageviews: number of pageviews
  • pageviews_offset: minimum number of pageviews which is visible to the user (censored)
  • revisions: total number of edits (from the creation of the website)
  • editors: total number of editors (from the creation of the website)
  • secs_since_last_edit: seconds since last edit
  • characters: number of characters in the website
  • words: number of words in the website
  • references: number of references in the article
  • unique_references: number of unique references in the article
  • sections: number of sections in the Wikipedia article
  • external_links: number of external links
  • links_from_this_page: number of hyperlinks used in this page
  • links_to_this_page: number of hyperlinks that point to this page (from other Wikipedia pages)
  • male_*_*: these are the number of males within different age groups
  • female_*_*: these are the number of females within different age groups
  • total_*: This is the total population for different demographic groups
  • latino: total count of Latinos
  • latino_*: total count of Latinos from different races
  • no_schooling_completed: total respondents with no schooling
  • nursery_school: total respondents with only nursery school
  • kindergarten: total respondents with kindergarten
  • grade_*: these are the number of people who completed certain grades of schooling
  • hs_diploma: total respondents with high school diploma
  • ged: total respondents with a GED diploma
  • less_than_1_year_college: total respondents with less than one year of college
  • more_than_1_year_college: total respondents with more than one year of college
  • associates_degree: total respondents with associates degree
  • bachelors_degree: total respondents with bachelors degree
  • masters_degree: total respondents with masters degree
  • professional_degree: total respondents with professional degree
  • doctorate_degree: total respondents with doctorate degree
  • total_with_poverty_status: total respondents with poverty status
  • income_below_poverty: total respondents with income below poverty levels
  • born_in_usa: total respondents born in USA
  • foreign_born: total respondents who are foreign-born
  • speak_only_english: total respondents who speak only English
  • speak_other_languages: total respondents who speak other languages
  • count_*: total number of respondents within age groups
  • percent_age_*: percentage of people within different age groups
  • percent_*: percentage of people from different demographic groups, for example whites, blacks, less than high school, born in USA, etc.
  • internet_usage: percentage of internet usage in county

For all of your analyses, use the RMSE loss function so that we can compare results across participants.
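
If you want to compute it by hand, RMSE is just the square root of the average squared error:

# Root mean squared error: the metric we will use to compare submissions
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}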

Here are some ideas you can try in your analysis:

  • Does it make sense to reduce the number of correlated variables into a few principal components?

  • Do some counties cluster on very correlated variables? Is it feasible to summarize some of these variables by predicting cluster membership?

  • Do we really need to use all variables?

  • Do regularized regressions or tree-based methods do better?

You can read the data with:

wiki_dt <- read.csv("https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/wikipedia_final.csv")
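
Continuing from the wiki_dt object above, here is one possible baseline to get you started: a cross-validated lasso with glmnet. The package choice and the column handling below (keeping only numeric columns, dropping the county_fips identifier and rows with missing values) are assumptions for illustration, not the required approach.

# A lasso baseline: keep numeric columns, drop the identifier and missing rows
library(glmnet)

num_dt <- na.omit(wiki_dt[sapply(wiki_dt, is.numeric)])
num_dt$county_fips <- NULL

set.seed(1)
train_rows <- sample(nrow(num_dt), 0.7 * nrow(num_dt))

x_train <- as.matrix(num_dt[train_rows, setdiff(names(num_dt), "revisions")])
y_train <- num_dt$revisions[train_rows]
x_test  <- as.matrix(num_dt[-train_rows, setdiff(names(num_dt), "revisions")])
y_test  <- num_dt$revisions[-train_rows]

cv_fit <- cv.glmnet(x_train, y_train)                         # lasso with cross-validated lambda
preds  <- as.numeric(predict(cv_fit, newx = x_test, s = "lambda.min"))

sqrt(mean((y_test - preds)^2))                                # RMSE on the holdout set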

You have 45 minutes, start!

References

Athey, Susan, and Guido Imbens. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” Proceedings of the National Academy of Sciences 113 (27): 7353–60.