class: center, middle, inverse, title-slide

# Machine Learning for Social Scientists
## Introduction
### Jorge Cimentada
### 2022-02-17

---
layout: true

<!-- background-image: url(./figs/upf.png) -->
background-position: 100% 0%, 100% 0%, 50% 100%
background-size: 10%, 10%, 10%

---

# An introduction to the Machine Learning Framework

**What is Machine Learning after all?**

.left-column[
.center[
> Using statistical methods to **learn** the data enough to be able to predict it accurately on new data
]
]

--

<br>
<br>

That sounds somewhat familiar to social scientists 🤔

- Perhaps our goal is not to **predict** it, but it is certainly to **learn** it and **understand** it

<br>
<br>

--

Here comes the catch:

> ML doesn't want to **understand** the problem; it wants to learn it enough to **predict** it well.

---
class: center, middle

# How do social scientists work?

<img src="../../img/socsci_wflow1.svg" width="90%" />

---

# Prediction vs Inference

- Social scientists are concerned with making inferences about their data

> If a new data source comes along, their results should replicate.

<br>

- Data scientists are concerned with making predictions about their data

> If a new data source comes along, they want to be able to predict it accurately.

<br>
<br>

--

.center[
.large[
**What's the common framework?**
]
]

---
name: fat
class: inverse, top, center
background-image: url(../../img/bart_replicability.png)
background-size: cover

---

<br>
<br>
<br>

.center[
<img src="01_introduction_files/figure-html/unnamed-chunk-3-1.png" height="90%" />
]

.center[Very important to ML! (as it should be in Social Science)]

---
class: center, middle

# Where social scientists have gone wrong

Tell me a strategy that you were taught to make sure your results are replicable on a new dataset

--

**I can tell you several that Machine Learning researchers have thought of**

---

<img src="../../img/socsci_wflow1.svg" width="90%" />

---

<img src="../../img/socsci_wflow2.svg" width="90%" />

---

<img src="../../img/socsci_wflow3.svg" width="90%" />

---

<img src="../../img/socsci_wflow4.svg" width="90%" />

---

# Difference in workflow

- Machine Learning practitioners have renamed things statisticians have been doing for 100 years

--

* Features --> Variables
* Feature Engineering --> Creating Variables
* Supervised Learning --> Models that have a dependent variable
* Unsupervised Learning --> Models that don't have a dependent variable

> I won't discuss the first two, since we already have a lot of experience with them. Throughout the course we'll cover the main models they use for prediction.

--

- Machine Learning practitioners have developed extra steps to make sure we don't overfit the data

--

* Training/Testing data --> Unknown to us
* Cross-validation --> Unknown to us
* Loss functions --> Model fit --> Known to us but not predominant (RMSE, `\(R^2\)`, etc...)

> These are very useful concepts. Let's focus on those.

---

# Objective

<!-- hahahahaha, worst idea but I don't want to search how to create the HTML tag, etc... -->

**Minimize**

**Maximize:**

<img src="01_introduction_files/figure-html/unnamed-chunk-8-1.png" height="90%" />

---

## Testing/Training data

.pull-left[
.center[

## Data

<img src="../../img/raw_data.svg" width="70%" />
]
]

--

.pull-right[

<br>
<br>
<br>
<br>
<br>
<br>

- A social scientist would fit the model on this data

  * How do you know if you're overfitting?
  * Is there a metric?
  * Is there a method?

> Nothing fancy! Just split the data

]
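---

## Testing/Training data: a quick sketch in code

This is not the course's own code — just a minimal sketch of what "just split the data" can look like in base R. The data frame `df` and its columns are made-up placeholders.

```r
set.seed(2313)

# Placeholder data: 500 made-up observations of age and income
df <- data.frame(age = sample(18:65, 500, replace = TRUE))
df$income <- 1000 + 20 * df$age + rnorm(500, sd = 300)

# Random 75/25 split into training and testing rows
train_rows <- sample(nrow(df), size = floor(0.75 * nrow(df)))
train_df <- df[train_rows, ]   # fit and tweak models here
test_df  <- df[-train_rows, ]  # look at this only once, at the very end
```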
---

# Testing/Training data

.center[
<img src="../../img/train_testing_df.svg" width="80%" />
]

---

# Testing/Training data

- Iterative process

  * Fit your model on the **training** data
  * Test different models/specifications
  * Settle on a final model
  * Fit your model on the **testing** data
  * Compare model fit between **training** and **testing**

> If you train/test on the same data you'll inadvertently tweak the model to overfit both the training and the testing data

<br>
<br>

.center[
.middle[
**Too abstract** 😕
<br>
Let's run an example
]
]

---

## Testing/Training data

.pull-left[
.center[
<img src="../../img/training_df.svg" width="95%" />
]

> Fit the model here, tweak and refit until happy.
]

.pull-right[
.center[
<img src="../../img/testing_df.svg" width="95%" />
]

> Test the final model here and compare model fit between training/testing
]

--

- For the first prediction this works well, because the **testing** data is still "pristine"

- However, if we repeat the train/test iteration 2, 3, 4, ... times, we'll start to learn the **testing** data too well (**overfitting**)!

---

## Hello cross-validation!

.pull-left[
.center[
<img src="../../img/train_cv1.svg" width="95%" />
]
]

---

## Hello cross-validation!

.pull-left[
.center[
<img src="../../img/train_cv2.svg" width="95%" />
]
]

---

## Hello cross-validation!

.pull-left[
.center[
<img src="../../img/train_cv3.svg" width="95%" />
]
]

---

## Hello cross-validation!

.pull-left[
.center[
<img src="../../img/train_cv4.svg" width="95%" />
]
]

---

## Hello cross-validation!

.pull-left[
.center[
<img src="../../img/train_cv4.svg" width="95%" />
]
]

.pull-right[

<br>
<br>
<br>
<br>

- Why is this a good approach?

> It's the least bad approach we have: it gives us 10 different chances of a "pristine" check
]

---

# Goodbye cross-validation!

<br>
<br>
<br>
<br>
<br>
<br>

- I know what you're thinking... we'll also overfit these 10 folds if we repeat this 2, 3, 4, ... times.

- That's why I said: **it's the least bad approach**

> The model fitted on the training data (in any way, be it on the whole data or through cross-validation) will always have a lower error than on the testing data.

---

# Loss functions

> As I told you, machine learning practitioners like to give new names to things that already exist.

Loss functions are metrics that evaluate your model fit:

* `\(R^2\)`
* AIC
* BIC
* RMSE
* etc...

These are familiar to us!

---

# Loss functions

However, they also work with several others that are specific to prediction:

* Confusion matrix
* Accuracy
* Precision
* Specificity
* etc...

These are the topic of the next class!

---

# Bias-Variance tradeoff

.pull-left[
![](../../img/bias_variance.svg)<!-- -->
]

.pull-right[
![](01_introduction_files/figure-html/unnamed-chunk-20-1.png)<!-- -->

<br>

![](01_introduction_files/figure-html/unnamed-chunk-21-1.png)<!-- -->
]

---

# A unified example

Let's combine all the new steps into a complete machine learning pipeline. Let's say we have each person's age and income, and we want to predict income from age. The data looks like:
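Since the original dataset isn't reproduced here, a hypothetical stand-in with a similar shape (made-up numbers, a noisy non-linear age–income relationship) could be simulated like this:

```r
set.seed(2313)

# Hypothetical data: 500 people, income rises and then falls with age
age_income <- data.frame(age = runif(500, 18, 65))
age_income$income <- 500 + 90 * age_income$age -
  1 * age_income$age^2 + rnorm(500, sd = 300)

head(age_income)
```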
---

# A unified example

<img src="01_introduction_files/figure-html/unnamed-chunk-23-1.png" height="90%" />

---

# A unified example

Let's partition our data into training and testing:

<br>

.pull-left[
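One way to do the split (a sketch with the `rsample` package, using the hypothetical `age_income` data from before):

```r
library(rsample)

set.seed(2313)
split <- initial_split(age_income, prop = 0.75)

# Training chunk: fit and tweak models here
train_df <- training(split)
head(train_df)
```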
]

.pull-right[
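And the held-out chunk from the same `split` object (continuing the sketch on the left):

```r
# Testing chunk: set aside until the very end
test_df <- testing(split)
head(test_df)
```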
]

---

# A unified example

Run a simple regression `income ~ age` on the **training** data and plot the predicted values:

<img src="01_introduction_files/figure-html/unnamed-chunk-27-1.png" height="80%" />

---

# A unified example

It seems we're underfitting the relationship. To measure the **fit** of the model, we'll use the Root Mean Square Error (RMSE). Remember it?

$$ RMSE = \sqrt{\sum_{i = 1}^N{\frac{(\hat{y}_i - y_i)^2}{N}}} $$

The current `\(RMSE\)` of our model is 388.08. This means that on average our predictions are off by around 388.08 euros.

---

# A unified example

- How do we increase the fit?

<br>

- It seems that the relationship is non-linear, so we would need to add non-linear terms to the model, for example `\(age^2\)`, `\(age^3\)`, ..., `\(age^{10}\)`.

<br>

- However, remember: by fitting these non-linear terms repeatedly to the data, we might tweak the model to **learn** the data so much that it starts to capture noise rather than signal.

<br>

- This is where cross-validation comes in!

---

# A unified example

.pull-left[
.center[

```
# A tibble: 10 x 2
   training           testing          
   <list>             <list>           
 1 <df[,2] [337 × 2]> <df[,2] [38 × 2]>
 2 <df[,2] [337 × 2]> <df[,2] [38 × 2]>
 3 <df[,2] [337 × 2]> <df[,2] [38 × 2]>
 4 <df[,2] [337 × 2]> <df[,2] [38 × 2]>
 5 <df[,2] [337 × 2]> <df[,2] [38 × 2]>
 6 <df[,2] [338 × 2]> <df[,2] [37 × 2]>
 7 <df[,2] [338 × 2]> <df[,2] [37 × 2]>
 8 <df[,2] [338 × 2]> <df[,2] [37 × 2]>
 9 <df[,2] [338 × 2]> <df[,2] [37 × 2]>
10 <df[,2] [338 × 2]> <df[,2] [37 × 2]>
```
]
]

.pull-right[
.center[
<img src="../../img/train_cv4.svg" width="95%" />
]
]

---

# A unified example

.center[
<img src="../../img/train_cv5.svg" width="50%" />
]

---

# A unified example

.center[
<img src="../../img/train_cv6.svg" width="50%" />
]

---

# A unified example

.center[
<img src="../../img/train_cv7.svg" width="50%" />
]

---

# A unified example

<img src="01_introduction_files/figure-html/unnamed-chunk-34-1.png" height="80%" />

---

# A unified example

We can run the model on the entire **training** data with 3 non-linear terms and check the fit:

<img src="01_introduction_files/figure-html/unnamed-chunk-35-1.png" height="80%" />

The `\(RMSE\)` on the training data for the model with three polynomial terms is 283.32.

---

# A unified example

Finally, once our final model has been fit and we're done tweaking, we use the model fitted on the training data to predict on the **testing** data:

.center[
<img src="01_introduction_files/figure-html/unnamed-chunk-36-1.png" height="70%" />
]

* Training RMSE is 283.32
* Testing RMSE is 304.49

Testing RMSE will almost always be higher, since even with cross-validation we still overfit the training data to some degree.

---
class: center, middle

# Practical examples

https://cimentadaj.github.io/ml_socsci/machine-learning-for-social-scientists.html#an-example

## **Break**
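---

# A unified example: the pipeline in code

Not the course's original code — a compact sketch of the pipeline described above, using `rsample` and base `lm()` on the hypothetical `age_income` / `train_df` / `test_df` objects from the earlier sketches:

```r
library(rsample)

rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))

# 10-fold cross-validation on the training data
set.seed(2313)
folds <- vfold_cv(train_df, v = 10)

# Try polynomial degrees 1 to 10 and average the RMSE across the folds
cv_rmse <- sapply(1:10, function(degree) {
  mean(sapply(folds$splits, function(s) {
    fit <- lm(income ~ poly(age, degree), data = analysis(s))
    held_out <- assessment(s)
    rmse(held_out$income, predict(fit, newdata = held_out))
  }))
})

best_degree <- which.min(cv_rmse)  # the slides settle on three terms

# Refit on the full training data, then evaluate once on the testing data
final_fit <- lm(income ~ poly(age, best_degree), data = train_df)
rmse(train_df$income, predict(final_fit))
rmse(test_df$income, predict(final_fit, newdata = test_df))
```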