class: center, middle, inverse, title-slide

# Machine Learning for Social Scientists
## Loss functions and decision trees
### Jorge Cimentada
### 2020-07-08

---
layout: true

<!-- background-image: url(./figs/upf.png) -->
background-position: 100% 0%, 100% 0%, 50% 100%
background-size: 10%, 10%, 10%

---

# Load the data

```r
library(tidymodels)
library(tidyflow)
library(rpart.plot)
library(vip)
library(plotly)

data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)
```

---

# What are loss functions?

* Social Scientists use metrics such as the `\(R^2\)`, `\(AIC\)`, `\(Log\text{ }likelihood\)` or `\(BIC\)`.

* We almost always use these metrics and their purpose is to inform some of our modeling choices.

* In machine learning, metrics such as the `\(R^2\)` and the `\(AIC\)` are called 'loss functions'.

* There are two types of loss functions: continuous and binary.

---

# Root Mean Square Error (RMSE)

Subtract the actual `\(Y_{i}\)` score of each respondent from the predicted `\(\hat{Y_{i}}\)` for each respondent:

<img src="03_loss_trees_files/figure-html/unnamed-chunk-4-1.png" width="70%" style="display: block; margin: auto;" />

`$$RMSE = \sqrt{\sum_{i = 1}^n{\frac{(\hat{y_{i}} - y_{i})^2}{N}}}$$`

---

# Mean Absolute Error (MAE)

* Unlike the `\(RMSE\)`, this approach doesn't penalize large errors disproportionately; it just takes the absolute error of each prediction.

* Fundamentally simpler to interpret than the `\(RMSE\)` since it's just the average absolute error.

<img src="03_loss_trees_files/figure-html/unnamed-chunk-5-1.png" width="70%" style="display: block; margin: auto;" />

`$$MAE = \sum_{i = 1}^n{\frac{|\hat{y_{i}} - y_{i}|}{N}}$$`

---

# Confusion Matrices

* The city of Berlin is working on developing an 'early warning' system aimed at predicting whether a family is in need of childcare support.

* Families which received childcare support are flagged with a 1 and families which didn't receive childcare support are flagged with a 0:

<img src="../../img/base_df_lossfunction.svg" width="15%" style="display: block; margin: auto;" />

---

# Confusion Matrices

* Suppose we fit a logistic regression that returns a predicted probability for each family:

<img src="../../img/df_lossfunction_prob.svg" width="35%" style="display: block; margin: auto;" />

---

# Confusion Matrices

* We could assign a 1 to every family with a probability above `0.5` and a 0 to every family with a probability below `0.5`:

<img src="../../img/df_lossfunction_class.svg" width="45%" style="display: block; margin: auto;" />

---

# Confusion Matrices

The accuracy is the sum of all correctly predicted rows divided by the total number of predictions:

<img src="../../img/confusion_matrix_50_accuracy.svg" width="55%" style="display: block; margin: auto;" />

* Accuracy: `\((3 + 1) / (3 + 1 + 1 + 2) \approx 57\%\)`

---

# Confusion Matrices

* **Sensitivity** of a model is a fancy name for the **true positive rate**.

* Sensitivity measures those that were correctly predicted only for the `1`:

<img src="../../img/confusion_matrix_50_sensitivity.svg" width="55%" style="display: block; margin: auto;" />

* Sensitivity: `\(3 / (3 + 1) = 75\%\)`

---

# Confusion Matrices

* The **specificity** of a model measures the **true negative rate**.
* Specificity measures those that were correctly predicted only for the `0`:

<img src="../../img/confusion_matrix_50_specificity.svg" width="55%" style="display: block; margin: auto;" />

* Specificity: `\(1 / (1 + 2) = 33\%\)`

---

# ROC Curves and Area Under the Curve

* The ROC curve is just a fancy name for a representation of sensitivity and specificity.

<br>
<br>
<br>

* In our previous example, we calculated the sensitivity and specificity assuming that each respondent is classified as a 1 when their predicted probability is above `0.5`.

<br>
<br>

> What if we tried different cutoff points?

---

# ROC Curves and Area Under the Curve
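Before trying different cutoffs, here is a minimal sketch (not the slides' own code) of how the `0.5`-cutoff metrics from the previous slides could be computed with `yardstick`. The data frame, column names and simulated probabilities are all hypothetical:

```r
# Hypothetical data: a binary outcome ('truth') and a predicted probability
# of being a 1 ('.pred_1'); the first factor level is treated as the event
set.seed(23151)
pred_df <- tibble(
  truth = factor(rbinom(500, 1, 0.4), levels = c("1", "0")),
  .pred_1 = ifelse(truth == "1", rbeta(500, 4, 2), rbeta(500, 2, 4)),
  .pred_class = factor(ifelse(.pred_1 > 0.5, "1", "0"), levels = c("1", "0"))
)

conf_mat(pred_df, truth, .pred_class)  # the confusion matrix at a 0.5 cutoff
accuracy(pred_df, truth, .pred_class)  # correctly predicted / total
sens(pred_df, truth, .pred_class)      # sensitivity: the true positive rate
spec(pred_df, truth, .pred_class)      # specificity: the true negative rate
```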
* Assigning a 1 if the probability was above `0.3` is associated with a true positive rate (sensitivity) of `0.74`.

* Switching the cutoff to `0.7` increases the true positive rate to `0.95`, quite an impressive benchmark.

* At the expense of increasing sensitivity, the true negative rate (specificity) decreases from `0.87` to `0.53`.

---

# ROC Curves and Area Under the Curve

* We want a cutoff that maximizes both the true positive rate and the true negative rate.

* Try all possible combinations, as in the sketch below:
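One way to do this (a sketch, reusing the hypothetical `pred_df` simulated earlier) is `yardstick`'s `roc_curve()`, which returns the sensitivity and specificity at every possible cutoff:

```r
# Every cutoff, one row each, with its sensitivity and specificity
all_cutoffs <- roc_curve(pred_df, truth, .pred_1)
all_cutoffs
```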
---

# ROC Curves and Area Under the Curve

* This result contains the sensitivity and specificity for many different cutoff points. These results are easiest to understand by visualizing them.

* Cutoffs that improve the specificity do so at the expense of sensitivity.

<img src="03_loss_trees_files/figure-html/unnamed-chunk-14-1.png" width="90%" style="display: block; margin: auto;" />

---

# ROC Curves and Area Under the Curve

* Instead of plotting the specificity as the true negative rate, let's plot `\(1 - specificity\)` so that as the `X` axis increases, the error increases:

<img src="03_loss_trees_files/figure-html/unnamed-chunk-15-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

* Ideal result: most points cluster in the top left quadrant.

* There, sensitivity (the true positive rate) is high and specificity is also high (because `\(1 - specificity\)` pushes the most accurate cutoffs toward the lower values of the `X` axis).

---

# ROC Curves and Area Under the Curve

* There is one thing we're missing: the actual cutoff points!

* Hover over the plot below:
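A sketch of how such an interactive plot could be built with `plotly` (again using the hypothetical `all_cutoffs` from the earlier sketch): map the cutoff to the `label` aesthetic so that `ggplotly()` shows it on hover.

```r
# Interactive ROC plot: hovering over a point reveals its cutoff (.threshold)
p <-
  all_cutoffs %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, label = .threshold)) +
  geom_point() +
  theme_minimal()

ggplotly(p, tooltip = "label")
```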
---

# ROC Curves and Area Under the Curve

* The last loss function we'll discuss is a very small extension of the ROC curve: the **A**rea **U**nder the **C**urve or `\(AUC\)`.

* `\(AUC\)` is the percentage of the plot that is under the curve. For example:

<img src="03_loss_trees_files/figure-html/unnamed-chunk-17-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

* The more points are located in the top left quadrant, the higher the overall accuracy of our model.

* In this example, 90% of the space of the plot is under the curve.

---

# Decision trees

* Decision trees are tree-like diagrams.

* They work by defining `yes-or-no` rules based on the data and assigning the most common value to each respondent within their final branch.

<img src="03_loss_trees_files/figure-html/unnamed-chunk-18-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-19-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-20-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-21-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-22-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-23-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Decision trees

* How do we fit one in R?

```r
# Define the decision tree and tell it that the dependent
# variable is continuous ('mode' = 'regression')
mod1 <- decision_tree(mode = "regression") %>% set_engine("rpart")

tflow <-
  # Plug the data
  pisa %>%
  # Begin the tidyflow
  tidyflow(seed = 23151) %>%
  # Separate the data into training/testing
  plug_split(initial_split) %>%
  # Plug the formula
  plug_formula(math_score ~ FISCED + HISEI + REPEAT) %>%
  # Plug the model
  plug_model(mod1)

vanilla_fit <- fit(tflow)
```

---

# Decision trees

* All `plug_*` functions serve to build your machine learning workflow.

* `tidyflow`: receives the data and seed

* `plug_split`: adds the training/testing split

* `plug_formula`/`plug_recipe`: the formula or preprocessing definition

* `plug_model`: the model used

---

# Bad things about Decision trees

* They overfit a lot

```r
# We can recycle the entire `tflow` from above and just replace the formula:
tflow <-
  tflow %>%
  replace_formula(ST102Q01TA ~ .)

fit_complex <- fit(tflow)
rpart.plot(pull_tflow_fit(fit_complex)$fit)
```

<img src="03_loss_trees_files/figure-html/unnamed-chunk-25-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Bad things about Decision trees

How can you address this?
* Not straightforward

* `min_n` and `tree_depth` are sometimes useful

```r
dectree <- update(mod1, min_n = 200, tree_depth = 3)

tflow <-
  tflow %>%
  replace_model(dectree)

fit_complex <- fit(tflow)
rpart.plot(pull_tflow_fit(fit_complex)$fit)
```

<img src="03_loss_trees_files/figure-html/unnamed-chunk-26-1.png" width="70%" height="70%" style="display: block; margin: auto;" />

---

# Tuning decision trees

* Model tuning can help select the best `min_n` and `tree_depth`

```r
tflow <-
  tflow %>%
  plug_resample(vfold_cv, v = 5) %>%
  plug_grid(expand.grid, tree_depth = c(1, 3, 9), min_n = c(50, 100)) %>%
  replace_model(update(dectree, min_n = tune(), tree_depth = tune()))

fit_tuned <- fit(tflow)

fit_tuned %>%
  pull_tflow_fit_tuning() %>%
  show_best(metric = "rmse")
```

```
# A tibble: 5 x 7
  tree_depth min_n .metric .estimator  mean     n std_err
       <dbl> <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
1          9    50 rmse    standard   0.459     5  0.0126
2          9   100 rmse    standard   0.459     5  0.0126
3          3    50 rmse    standard   0.518     5  0.0116
4          3   100 rmse    standard   0.518     5  0.0116
5          1    50 rmse    standard   0.649     5  0.0102
```

---

# Tuning decision trees

<img src="03_loss_trees_files/figure-html/unnamed-chunk-28-1.png" width="95%" height="95%" style="display: block; margin: auto;" />

---

# Best tuned decision tree

* Best model

```r
final_model <- complete_tflow(fit_tuned,
                              metric = "rmse",
                              tree_depth,
                              method = "select_by_one_std_err")

train_err <-
  final_model %>%
  predict_training() %>%
  rmse(ST102Q01TA, .pred)

test_err <-
  final_model %>%
  predict_testing() %>%
  rmse(ST102Q01TA, .pred)

c("Testing error" = test_err$.estimate,
  "Training error" = train_err$.estimate)
```

```
Testing error Training error 
    0.4644939      0.4512248 
```

---

# Exercises

Exercises 1-4

.center[
https://cimentadaj.github.io/ml_socsci/tree-based-methods.html#exercises-1
]