class: center, middle, inverse, title-slide

# Machine Learning for Social Scientists
## Loss functions and decision trees
### Jorge Cimentada
### 2022-02-18

---

layout: true

<!-- background-image: url(./figs/upf.png) -->
background-position: 100% 0%, 100% 0%, 50% 100%
background-size: 10%, 10%, 10%

---

# What are loss functions?

* Social Scientists use metrics such as the `\(R^2\)`, `\(AIC\)`, `\(Log\text{ }likelihood\)` or `\(BIC\)`.

* We almost always use these metrics to inform our modeling choices.

* In machine learning, metrics such as the `\(R^2\)` and the `\(AIC\)` are called 'loss functions'.

* There are two broad types of loss functions: continuous (for numeric outcomes) and binary (for categorical outcomes).

---

# Root Mean Square Error (RMSE)

For each respondent, subtract the actual `\(y_{i}\)` score from the predicted `\(\hat{y_{i}}\)`, square the difference, average across respondents and take the square root:

<img src="03_loss_trees_files/figure-html/unnamed-chunk-3-1.png" width="70%" style="display: block; margin: auto;" />

`$$RMSE = \sqrt{\sum_{i = 1}^N{\frac{(\hat{y_{i}} - y_{i})^2}{N}}}$$`

---

# Mean Absolute Error (MAE)

* This approach doesn't square the errors, so large errors aren't penalized extra; it just takes the absolute error of each prediction.

* Fundamentally simpler to interpret than the `\(RMSE\)` since it's just the average absolute error.

<img src="03_loss_trees_files/figure-html/unnamed-chunk-4-1.png" width="70%" style="display: block; margin: auto;" />

`$$MAE = \sum_{i = 1}^N{\frac{|\hat{y_{i}} - y_{i}|}{N}}$$`

---

# Confusion Matrices

* The city of Berlin is developing an 'early warning' system aimed at predicting whether a family is in need of childcare support.

* Families which received childcare support are flagged with a 1 and families which didn't receive childcare support are flagged with a 0:

<img src="../../../img/base_df_lossfunction.svg" width="15%" style="display: block; margin: auto;" />

---

# Confusion Matrices

* Suppose we fit a logistic regression that returns a predicted probability for each family:

<img src="../../../img/df_lossfunction_prob.svg" width="35%" style="display: block; margin: auto;" />

---

# Confusion Matrices

* We could assign a 1 to every respondent with a probability above `0.5` and a 0 to every respondent with a probability below `0.5`:

<img src="../../../img/df_lossfunction_class.svg" width="45%" style="display: block; margin: auto;" />

---

# Confusion Matrices

The accuracy is the number of correctly predicted rows divided by the total number of predictions:

<img src="../../../img/confusion_matrix_50_accuracy.svg" width="55%" style="display: block; margin: auto;" />

* Accuracy: `\((3 + 1) / (3 + 1 + 1 + 2) = 50\%\)`

---

# Confusion Matrices

* The **sensitivity** of a model is a fancy name for the **true positive rate**.

* Sensitivity measures how many of the actual `1`s were correctly predicted:

<img src="../../../img/confusion_matrix_50_sensitivity.svg" width="55%" style="display: block; margin: auto;" />

* Sensitivity: `\(3 / (3 + 1) = 75\%\)`

---

# Confusion Matrices

* The **specificity** of a model measures the **true negative rate**.

* Specificity measures how many of the actual `0`s were correctly predicted:

<img src="../../../img/confusion_matrix_50_specificity.svg" width="55%" style="display: block; margin: auto;" />

* Specificity: `\(1 / (1 + 2) = 33\%\)`
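---

# Confusion Matrices

These quantities are easy to compute by hand. Below is a minimal base R sketch using hypothetical `observed` and `predicted` vectors (made up for illustration, not the Berlin data):

```r
# Hypothetical observed and predicted classes (toy values, not the Berlin data)
observed  <- c(1, 1, 1, 1, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 1, 0)

tp <- sum(predicted == 1 & observed == 1)  # true positives
tn <- sum(predicted == 0 & observed == 0)  # true negatives
fp <- sum(predicted == 1 & observed == 0)  # false positives
fn <- sum(predicted == 0 & observed == 1)  # false negatives

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
sensitivity <- tp / (tp + fn)                   # true positive rate
specificity <- tn / (tn + fp)                   # true negative rate

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With these toy vectors, sensitivity is 0.75 and specificity is 0.33, matching the rates above.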
---

# ROC Curves and Area Under the Curve

* The ROC curve is just a fancy name for a representation of sensitivity and specificity across different cutoff points.

<br>
<br>
<br>

* In our previous example, we calculated the sensitivity and specificity assuming that a respondent is classified as a `1` when their predicted probability is above `0.5`.

<br>
<br>

> What if we tried different cutoff points?
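---

# ROC Curves and Area Under the Curve

As a sketch of what "trying a different cutoff" means in code, here is a toy example. The data are simulated and the object names (`dv`, `prob`) are chosen purely for illustration:

```r
set.seed(123)

# Toy data: a binary outcome and a predicted probability for each respondent
x    <- rnorm(300)
dv   <- rbinom(300, size = 1, prob = plogis(x))
prob <- plogis(x)  # stand-in for a model's predicted probability

# Helper functions for the true positive and true negative rates
sensitivity <- function(pred, obs) sum(pred == 1 & obs == 1) / sum(obs == 1)
specificity <- function(pred, obs) sum(pred == 0 & obs == 0) / sum(obs == 0)

# Re-threshold the same probabilities at two different cutoffs
pred_30 <- as.numeric(prob >= 0.3)
pred_70 <- as.numeric(prob >= 0.7)

c(sens_30 = sensitivity(pred_30, dv), spec_30 = specificity(pred_30, dv))
c(sens_70 = sensitivity(pred_70, dv), spec_70 = specificity(pred_70, dv))
```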
---

# ROC Curves and Area Under the Curve

* Assigning a 1 if the probability is above `0.7` is associated with a true positive rate (sensitivity) of `0.74`.

* Lowering the cutoff to `0.3` increases the true positive rate to `0.95`, quite an impressive benchmark.

* This gain in sensitivity comes at a cost: the true negative rate (specificity) decreases from `0.87` to `0.53`.

---

# ROC Curves and Area Under the Curve

* We want a cutoff that maximizes both the true positive rate and the true negative rate.

* Try all possible cutoff points, as in the sketch below:
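One way to do this is to loop over a grid of cutoffs and store the sensitivity and specificity of each one. A minimal sketch, re-using the toy `dv`, `prob`, `sensitivity()` and `specificity()` objects from the earlier sketch:

```r
# Sensitivity and specificity for every cutoff between 0.01 and 0.99
cutoffs <- seq(0.01, 0.99, by = 0.01)

res <- data.frame(
  cutoff      = cutoffs,
  sensitivity = sapply(cutoffs, function(ct) sensitivity(as.numeric(prob >= ct), dv)),
  specificity = sapply(cutoffs, function(ct) specificity(as.numeric(prob >= ct), dv))
)

head(res)
```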
---

# ROC Curves and Area Under the Curve

* This result contains the sensitivity and specificity for many different cutoff points. These results are easiest to understand by visualizing them.

* Cutoffs that improve the specificity do so at the expense of sensitivity.

<img src="03_loss_trees_files/figure-html/unnamed-chunk-13-1.png" width="90%" style="display: block; margin: auto;" />

---

# ROC Curves and Area Under the Curve

* Instead of visualizing the specificity directly as the true negative rate, let's plot `\(1 - specificity\)` so that the error increases as the `X` axis increases:

<img src="03_loss_trees_files/figure-html/unnamed-chunk-14-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

* Ideal result: most points cluster in the top left quadrant.

* There, sensitivity (the true positive rate) is high and specificity is also high (because plotting `\(1 - specificity\)` places high-specificity cutoffs at the lower values of the `X` axis).

---

# ROC Curves and Area Under the Curve

* There is one thing we're missing: the actual cutoff points!

* Hover over the plot to see them.
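---

# ROC Curves and Area Under the Curve

A base R sketch of this plot, re-using the `res` data frame of cutoffs from the earlier sketch (a real analysis would more likely use `yardstick::roc_curve()`):

```r
# The ROC curve: sensitivity against 1 - specificity, one point per cutoff
plot(
  1 - res$specificity, res$sensitivity, type = "l",
  xlab = "1 - specificity", ylab = "Sensitivity"
)
abline(0, 1, lty = 2)  # the diagonal = a model no better than chance

# Label a few cutoffs so we can see which threshold each point corresponds to
keep <- round(res$cutoff, 2) %in% c(0.1, 0.3, 0.5, 0.7, 0.9)
text(
  1 - res$specificity[keep], res$sensitivity[keep],
  labels = res$cutoff[keep], pos = 4
)
```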
---

# ROC Curves and Area Under the Curve

* The last loss function we'll discuss is a small extension of the ROC curve: the **A**rea **U**nder the **C**urve or `\(AUC\)`.

* `\(AUC\)` is the percentage of the plot that is under the curve. For example:

<img src="03_loss_trees_files/figure-html/unnamed-chunk-16-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

* The more points are located in the top left quadrant, the higher the overall accuracy of our model.

* Here, 90% of the plot's area is under the curve.

---

# Precision and recall

<img src="../../../img/precision_recall.png" width="55%" style="display: block; margin: auto;" />

---

# Precision and recall

<img src="../../../img/precision_recall_plot.png" width="55%" style="display: block; margin: auto;" />

---

# Precision and recall

- When to use ROC curves and precision-recall: <br>

1. ROC curves should be used when there are roughly equal numbers of observations for each class <br>

2. Precision-Recall curves should be used when there is a moderate to large class imbalance

---

<br>
<br>
<br>

.center[
# Decision trees
]

---

# Decision trees

* Decision trees are tree-like diagrams.

* They work by defining `yes-or-no` rules based on the data and assigning the most common value (or, for continuous outcomes, the average value) to the respondents in each final branch.

<img src="03_loss_trees_files/figure-html/unnamed-chunk-19-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-20-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-21-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-22-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-23-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-24-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# How do decision trees work?

<img src="../../../img/decision_trees_adv1.png" width="55%" style="display: block; margin: auto;" />

---

# How do decision trees work?

<img src="../../../img/decision_trees_adv2.png" width="55%" style="display: block; margin: auto;" />

---

# How do decision trees work?

<img src="../../../img/decision_trees_adv3.png" width="55%" style="display: block; margin: auto;" />

---

# How do decision trees work?

<img src="../../../img/decision_trees_adv4.png" width="55%" style="display: block; margin: auto;" />

---

# How do decision trees work?

<img src="../../../img/decision_trees_adv5.png" width="55%" style="display: block; margin: auto;" />

---

# Bad things about Decision trees

* They overfit a lot

<img src="03_loss_trees_files/figure-html/unnamed-chunk-31-1.png" width="70%" height="70%" style="display: block; margin: auto;" />
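---

# Decision trees in code

A minimal sketch of growing a single tree with the `rpart` package. The data are simulated and the variable names are purely illustrative, not the course data:

```r
library(rpart)

# Made-up data: an outcome loosely related to two predictors
set.seed(123)
toy <- data.frame(books_home = sample(0:200, 500, replace = TRUE),
                  hisei      = rnorm(500))
toy$math_score <- 400 + 0.3 * toy$books_home + 20 * toy$hisei + rnorm(500, sd = 30)

# rpart finds the yes-or-no splits for us
tree_fit <- rpart(math_score ~ books_home + hisei, data = toy)

# Each final branch predicts the average outcome of the respondents in it
print(tree_fit)
predict(tree_fit, newdata = toy[1:5, ])
```

Constraining the tree through `rpart.control()` (its `maxdepth` and `minsplit` arguments) is roughly what `tree_depth` and `min_n` control on the slides that follow.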
---

# Bad things about Decision trees

How can you address this?

* Not straightforward

* `min_n` and `tree_depth` are sometimes useful

* You need to tune these (see the code sketch at the end of these slides)

<img src="03_loss_trees_files/figure-html/unnamed-chunk-32-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Tuning decision trees

* Model tuning can help select the best `min_n` and `tree_depth`

```
# A tibble: 5 x 7
  tree_depth min_n .metric .estimator  mean     n std_err
       <dbl> <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
1          9    50 rmse    standard   0.459     5  0.0126
2          9   100 rmse    standard   0.459     5  0.0126
3          3    50 rmse    standard   0.518     5  0.0116
4          3   100 rmse    standard   0.518     5  0.0116
5          1    50 rmse    standard   0.649     5  0.0102
```

---

# Tuning decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-34-1.png" width="95%" height="95%" style="display: block; margin: auto;" />

---

# Best tuned decision tree

* As usual, once we have our model, we predict on our test set and compare:

.center[

```
Testing error Training error 
    0.4644939      0.4512248 
```

]

---

.center[
# Break
]
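---

# Tuning decision trees: a code sketch

A sketch of what the tuning workflow above can look like with `tidymodels`. The data are simulated stand-ins and all variable names are illustrative, not the course data:

```r
library(tidymodels)

# Made-up stand-in data
set.seed(123)
pisa <- data.frame(books_home = sample(0:200, 1000, replace = TRUE),
                   hisei      = rnorm(1000))
pisa$math_score <- 400 + 0.3 * pisa$books_home + 20 * pisa$hisei + rnorm(1000, sd = 30)

pisa_split <- initial_split(pisa)
pisa_train <- training(pisa_split)
folds      <- vfold_cv(pisa_train, v = 5)

# Mark tree_depth and min_n as parameters to be tuned
tree_spec <-
  decision_tree(tree_depth = tune(), min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

# Evaluate a small grid of candidate values across the resamples
tree_grid <- grid_regular(tree_depth(), min_n(), levels = 3)

tree_res <- tune_grid(
  tree_spec,
  math_score ~ books_home + hisei,
  resamples = folds,
  grid      = tree_grid,
  metrics   = metric_set(rmse)
)

show_best(tree_res, metric = "rmse")
```

The best combination can then be plugged back into the model spec with `finalize_model()`, refit on the training set, and evaluated on the test set.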