class: center, middle, inverse, title-slide

# Machine Learning for Social Scientists
## Regularization
### Jorge Cimentada
### 2020-07-08

---
layout: true

<!-- background-image: url(./figs/upf.png) -->
background-position: 100% 0%, 100% 0%, 50% 100%
background-size: 10%, 10%, 10%

---

# What is regularization?

* Machine Learning is almost always about prediction

* **It is important to make sure that out-of-sample accuracy is high**

* Overfitting is our weak spot: including redundant or unimportant variables hurts out-of-sample accuracy

* The correct theoretical model is not always the aim

--

<br>
<br>

> How do we make sure our model makes good predictions on unseen data? We regularize how much it overfits the data. How do we do that? By forcing unimportant coefficients towards zero.

<br>

* ML parlance: reduce variance in favor of increasing bias
* SocSci parlance: make sure your model fits unseen data about as well as it fits the current data

---

# What is regularization?

Regularization is when you force your estimates towards specific values:

* Bayesian: restrict coefficients based on prior distributions

<br>

* Machine Learning: shrink coefficients towards (or all the way to) zero

<br>

--

### What is this good for? It depends on your context

* Increasing predictive power
* Including important confounders in large models
* Understanding the strength of variables
* Testing the generalization of your model

---

# A first example: ridge regression

* OLS minimizes the Residual Sum of Squares (RSS)
* Among all candidate lines, it keeps the one with the smallest RSS

`\begin{equation} RSS = \sum_{i = 1}^n (actual_i - predicted_i)^2 \end{equation}`

.center[
<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-2-1.png" width="80%" />
]

---

# A first example: ridge regression

Ridge regression adds one term:

`\begin{equation} RSS + \lambda \sum_{j = 1}^p \beta_j^2 \end{equation}`

**The regularization term** or **penalty term**

* `\(RSS\)` measures how well the model fits the data
* `\(\sum_{j = 1}^p \beta_j^2\)` limits how much you overfit the data
* `\(\lambda\)` is the weight given to the penalty term (called **lambda**): the higher the weight, the more the coefficients are shrunk towards zero

In plain words:

> We want the smallest coefficients that don't affect the fit of the line (RSS).

---

# A first example: ridge regression

Some caveats:

* Since we're penalizing coefficients, their scale *matters*.

> Suppose that you have the income of a particular person (measured in thousands per month) and the time spent with their family (measured in seconds), and you're trying to predict happiness. A one unit increase in salary could be penalized much more than a one unit increase in family time **just** because a one unit increase in salary means much more due to its metric.
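A tiny, made-up sketch of that caveat (the numbers and variable names below are invented, not from the PISA data): after standardizing, a one unit change means one standard deviation for both variables, so the penalty treats them comparably.

```r
# Made-up toy values: income in thousands per month, family time in seconds
income      <- c(2, 3, 5, 4, 6)
family_time <- c(3600, 7200, 1800, 5400, 900)

# After standardizing, both variables have mean 0 and sd 1,
# so a "one unit" change is one standard deviation for each
cbind(scale(income), scale(family_time))
```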
<br>
<br>

.center[
### **Always standardize your predictors before running a regularized regression**
]

---

# A first example: ridge regression

Download the data:

```r
library(tidymodels)
library(tidyflow)

# Read the PISA data
data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)

head(pisa, n = c(5, 8))
```

```
  CNTSCHID CNTSTUID BOOKID ST001D01T ST003D02T ST003D03T ST004D01T ST005Q01TA
1 84000001 84000250      3        11         7      2002         2          1
2 84000001 84000304     14        11         9      2002         1          1
3 84000001 84000353     15         9         6      2003         2          1
4 84000001 84000536     24        10         5      2003         2          3
5 84000001 84001240      6        10         5      2003         2          1
```

---

# A first example: ridge regression

Remember, we do everything with the **training** data and keep the **testing** data untouched until we have our best model. Adding a split:

```r
tflow <-
  pisa %>%
  tidyflow(seed = 23151) %>%
  plug_split(initial_split)

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe/Formula: None
Resample: None
Grid: None
Model: None
```

---

# A first example: ridge regression

Adding the preprocessing:

```r
rcp <-
  ~ recipe(math_score ~ MISCED + FISCED + HISEI + REPEAT + IMMIG + DURECEC + BSMJ,
           data = .x) %>%
    step_center(all_predictors()) %>%
    step_scale(all_predictors())

tflow <-
  tflow %>%
  plug_recipe(rcp)

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: None
Grid: None
Model: None
```

---

# A first example: ridge regression

Adding the resample. `vfold_cv` is the function that creates the cross-validation folds:

```r
tflow <-
  tflow %>%
  plug_resample(vfold_cv)

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: None
Model: None
```

---

# A first example: ridge regression

Adding the model:

```r
# mixture = 0 is the same as ridge regression
regularized_reg <-
  linear_reg(penalty = tune(), mixture = 0) %>%
  set_engine("glmnet")

tflow <-
  tflow %>%
  plug_model(regularized_reg)

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: None
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = 0

Computational engine: glmnet
```

---

# A first example: ridge regression

Adding the grid:

```r
tflow <-
  tflow %>%
  plug_grid(expand.grid, penalty = seq(0, 3, length.out = 300))

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: expand.grid w/ penalty = ~seq(0, 3, length.out = 300)
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = 0

Computational engine: glmnet
```

---

# A first example: ridge regression

```r
(res <- fit(tflow))
```

```
══ Tidyflow [tuned] ════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: expand.grid w/ penalty = ~seq(0, 3, length.out = 300)
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = 0

Computational engine: glmnet

══ Results ═════════════════════════════════════════════════════════════════════

Tuning results:

# A tibble: 5 x 4
  splits             id     .metrics           .notes          
  <list>             <chr>  <list>             <list>          
1 <split [3.3K/363]> Fold01 <tibble [600 × 4]> <tibble [0 × 1]>
2 <split [3.3K/363]> Fold02 <tibble [600 × 4]> <tibble [0 × 1]>
3 <split [3.3K/363]> Fold03 <tibble [600 × 4]> <tibble [0 × 1]>
4 <split [3.3K/363]> Fold04 <tibble [600 × 4]> <tibble [0 × 1]>
5 <split [3.3K/363]> Fold05 <tibble [600 × 4]> <tibble [0 × 1]>

... and 5 more lines.
```
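---

# A first example: ridge regression

Before finalizing, we can inspect the tuning results directly. A minimal sketch, assuming the object returned by `pull_tflow_fit_tuning()` behaves like a standard `{tune}` result (so `collect_metrics()` and `select_best()` apply to it):

```r
# Cross-validated metrics averaged across folds: one row per penalty value
res %>%
  pull_tflow_fit_tuning() %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean)

# The penalty with the lowest cross-validated RMSE,
# the one complete_tflow() should pick below
res %>%
  pull_tflow_fit_tuning() %>%
  select_best(metric = "rmse")
```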
---

# A first example: ridge regression

```r
res %>%
  pull_tflow_fit_tuning()
```

```
#  10-fold cross-validation 
# A tibble: 10 x 4
   splits             id     .metrics           .notes          
   <list>             <chr>  <list>             <list>          
 1 <split [3.3K/363]> Fold01 <tibble [600 × 4]> <tibble [0 × 1]>
 2 <split [3.3K/363]> Fold02 <tibble [600 × 4]> <tibble [0 × 1]>
 3 <split [3.3K/363]> Fold03 <tibble [600 × 4]> <tibble [0 × 1]>
 4 <split [3.3K/363]> Fold04 <tibble [600 × 4]> <tibble [0 × 1]>
 5 <split [3.3K/363]> Fold05 <tibble [600 × 4]> <tibble [0 × 1]>
 6 <split [3.3K/363]> Fold06 <tibble [600 × 4]> <tibble [0 × 1]>
 7 <split [3.3K/363]> Fold07 <tibble [600 × 4]> <tibble [0 × 1]>
 8 <split [3.3K/363]> Fold08 <tibble [600 × 4]> <tibble [0 × 1]>
 9 <split [3.3K/363]> Fold09 <tibble [600 × 4]> <tibble [0 × 1]>
10 <split [3.3K/362]> Fold10 <tibble [600 × 4]> <tibble [0 × 1]>
```

---

# A first example: ridge regression

```r
res %>%
  pull_tflow_fit_tuning() %>%
  autoplot()
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-11-1.png" width="80%" />

---

# A first example: ridge regression

```r
final_ridge <- complete_tflow(res, metric = "rmse")

final_ridge %>%
  pull_tflow_fit() %>%
  .[['fit']] %>%
  plot(xvar = "lambda", label = TRUE)
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-12-1.png" width="90%" style="display: block; margin: auto;" />

---

# A first example: ridge regression

```r
final_ridge
```

```
══ Tidyflow [trained] ══════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: expand.grid w/ penalty = ~seq(0, 3, length.out = 300)
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0
  mixture = 0

Computational engine: glmnet

══ Results ═════════════════════════════════════════════════════════════════════

Tuning results:

# A tibble: 5 x 4
  splits             id     .metrics           .notes          
  <list>             <chr>  <list>             <list>          
1 <split [3.3K/363]> Fold01 <tibble [600 × 4]> <tibble [0 × 1]>
2 <split [3.3K/363]> Fold02 <tibble [600 × 4]> <tibble [0 × 1]>
3 <split [3.3K/363]> Fold03 <tibble [600 × 4]> <tibble [0 × 1]>
4 <split [3.3K/363]> Fold04 <tibble [600 × 4]> <tibble [0 × 1]>
5 <split [3.3K/363]> Fold05 <tibble [600 × 4]> <tibble [0 × 1]>

... and 5 more lines.

Fitted model:

Call:  glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~0) 

  Df     %Dev  Lambda
1  7 0.000000 27820.0

... and 99 more lines.
```
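---

# A first example: ridge regression

We can also look at the shrunken coefficients themselves. A minimal sketch using `coef()` on the underlying `glmnet` fit (the penalty chosen above was 0):

```r
# Coefficients of the final ridge model at (or interpolated near) s = 0,
# the penalty selected by cross-validation above
final_ridge %>%
  pull_tflow_fit() %>%
  .[['fit']] %>%
  coef(s = 0)
```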
---

# A first example: ridge regression

```r
train_rmse_ridge <-
  final_ridge %>%
  predict_training() %>%
  rmse(math_score, .pred)

test_ridge <-
  final_ridge %>%
  predict_testing() %>%
  rmse(math_score, .pred)

train_rmse_ridge$type <- "training"
test_ridge$type <- "testing"

ridge <- as.data.frame(rbind(train_rmse_ridge, test_ridge))
ridge$model <- "ridge"
ridge
```

```
  .metric .estimator .estimate     type model
1    rmse   standard  76.87668 training ridge
2    rmse   standard  77.88607  testing ridge
```

---

# A first example: lasso regression

Lasso regression is very similar to ridge, but the penalty term is different:

`\begin{equation} RSS + \lambda \sum_{j = 1}^p |\beta_j| \end{equation}`

The same notes for ridge apply, with one caveat:

- The lasso penalty can shrink coefficients **completely to 0**, meaning that it excludes variables.

> Lasso excludes variables which are not adding anything useful to the model, whereas ridge keeps them close to 0.

---

# A first example: lasso regression

<br>
<br>
<br>

.center[
## **Always standardize your predictors before running a regularized regression**
]

---

# A first example: lasso regression

`tflow` has all our steps; just replace the model:

```r
# mixture = 1 is lasso
lasso_mod <- update(regularized_reg, mixture = 1)

tflow <-
  tflow %>%
  replace_model(lasso_mod)

res_lasso <- fit(tflow)
```

---

# A first example: lasso regression

```r
res_lasso %>%
  pull_tflow_fit_tuning()
```

```
#  10-fold cross-validation 
# A tibble: 10 x 4
   splits             id     .metrics           .notes          
   <list>             <chr>  <list>             <list>          
 1 <split [3.3K/363]> Fold01 <tibble [600 × 4]> <tibble [0 × 1]>
 2 <split [3.3K/363]> Fold02 <tibble [600 × 4]> <tibble [0 × 1]>
 3 <split [3.3K/363]> Fold03 <tibble [600 × 4]> <tibble [0 × 1]>
 4 <split [3.3K/363]> Fold04 <tibble [600 × 4]> <tibble [0 × 1]>
 5 <split [3.3K/363]> Fold05 <tibble [600 × 4]> <tibble [0 × 1]>
 6 <split [3.3K/363]> Fold06 <tibble [600 × 4]> <tibble [0 × 1]>
 7 <split [3.3K/363]> Fold07 <tibble [600 × 4]> <tibble [0 × 1]>
 8 <split [3.3K/363]> Fold08 <tibble [600 × 4]> <tibble [0 × 1]>
 9 <split [3.3K/363]> Fold09 <tibble [600 × 4]> <tibble [0 × 1]>
10 <split [3.3K/362]> Fold10 <tibble [600 × 4]> <tibble [0 × 1]>
```

---

# A first example: lasso regression

```r
res_lasso %>%
  pull_tflow_fit_tuning() %>%
  autoplot()
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-17-1.png" width="80%" />

---

# A first example: lasso regression

```r
final_lasso <- complete_tflow(res_lasso, metric = "rmse")

final_lasso %>%
  pull_tflow_fit() %>%
  .[['fit']] %>%
  plot(xvar = "lambda", label = TRUE)
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-18-1.png" width="90%" style="display: block; margin: auto;" />

---

# A first example: lasso regression

```r
final_lasso
```

```
══ Tidyflow [trained] ══════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: expand.grid w/ penalty = ~seq(0, 3, length.out = 300)
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.180602006688963
  mixture = 1

Computational engine: glmnet

══ Results ═════════════════════════════════════════════════════════════════════

Tuning results:

# A tibble: 5 x 4
  splits             id     .metrics           .notes          
  <list>             <chr>  <list>             <list>          
1 <split [3.3K/363]> Fold01 <tibble [600 × 4]> <tibble [0 × 1]>
2 <split [3.3K/363]> Fold02 <tibble [600 × 4]> <tibble [0 × 1]>
3 <split [3.3K/363]> Fold03 <tibble [600 × 4]> <tibble [0 × 1]>
4 <split [3.3K/363]> Fold04 <tibble [600 × 4]> <tibble [0 × 1]>
5 <split [3.3K/363]> Fold05 <tibble [600 × 4]> <tibble [0 × 1]>

... and 5 more lines.

Fitted model:

Call:  glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~1) 

  Df    %Dev  Lambda
1  0 0.00000 27.8200

... and 59 more lines.
```
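---

# A first example: lasso regression

Since lasso can shrink coefficients exactly to zero, it is worth checking how many variables it dropped. A minimal sketch using the underlying `glmnet` fit at roughly the penalty chosen above (≈ 0.18):

```r
# Coefficients of the final lasso model at (or interpolated near) the selected penalty
lasso_coefs <-
  final_lasso %>%
  pull_tflow_fit() %>%
  .[['fit']] %>%
  coef(s = 0.18)

# How many predictors were shrunk exactly to zero? (excluding the intercept)
sum(lasso_coefs[-1, 1] == 0)
```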
---

# A first example: lasso regression

```r
train_rmse_lasso <-
  final_lasso %>%
  predict_training() %>%
  rmse(math_score, .pred)

holdout_lasso <-
  final_lasso %>%
  predict_testing() %>%
  rmse(math_score, .pred)

train_rmse_lasso$type <- "training"
holdout_lasso$type <- "testing"

lasso <- as.data.frame(rbind(train_rmse_lasso, holdout_lasso))
lasso$model <- "lasso"
lasso
```

```
  .metric .estimator .estimate     type model
1    rmse   standard  76.87264 training lasso
2    rmse   standard  77.86454  testing lasso
```

---

# A first example: elastic net regression

`\(ridge = \lambda \sum_{j = 1}^p \beta_j^2\)`

`\(lasso = \lambda \sum_{j = 1}^p |\beta_j|\)`

Elastic net regularization adds both of these penalties to the RSS:

`$$RSS + lasso + ridge$$`

Explanation:

> Although lasso models perform feature selection, when two strongly correlated features are pushed towards zero, one may be pushed fully to zero while the other remains in the model. Furthermore, the process of one being in and one being out is not very systematic. In contrast, the ridge regression penalty is a little more effective in systematically handling correlated features together. Consequently, the advantage of the elastic net penalty is that it enables effective regularization via the ridge penalty with the feature selection characteristics of the lasso penalty.

---

# A first example: elastic net regression

* `tidyflow` will try several values of `mixture` ranging from 0 to 1
* Instead of fixing `mixture` at `0` (ridge) or `1` (lasso)

<br>
<br>
<br>

.center[
## **Always standardize your predictors before running a regularized regression**
]

---

# A first example: elastic net regression

```r
elnet_mod <- update(lasso_mod, mixture = tune())

tflow <-
  tflow %>%
  replace_model(elnet_mod) %>%
  replace_grid(grid_regular)

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: grid_regular w/ default args
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet
```

---

# A first example: elastic net regression

```r
res_elnet <- fit(tflow)

res_elnet %>%
  pull_tflow_fit_tuning()
```

```
#  10-fold cross-validation 
# A tibble: 10 x 4
   splits             id     .metrics          .notes          
   <list>             <chr>  <list>            <list>          
 1 <split [3.3K/363]> Fold01 <tibble [18 × 5]> <tibble [0 × 1]>
 2 <split [3.3K/363]> Fold02 <tibble [18 × 5]> <tibble [0 × 1]>
 3 <split [3.3K/363]> Fold03 <tibble [18 × 5]> <tibble [0 × 1]>
 4 <split [3.3K/363]> Fold04 <tibble [18 × 5]> <tibble [0 × 1]>
 5 <split [3.3K/363]> Fold05 <tibble [18 × 5]> <tibble [0 × 1]>
 6 <split [3.3K/363]> Fold06 <tibble [18 × 5]> <tibble [0 × 1]>
 7 <split [3.3K/363]> Fold07 <tibble [18 × 5]> <tibble [0 × 1]>
 8 <split [3.3K/363]> Fold08 <tibble [18 × 5]> <tibble [0 × 1]>
 9 <split [3.3K/363]> Fold09 <tibble [18 × 5]> <tibble [0 × 1]>
10 <split [3.3K/362]> Fold10 <tibble [18 × 5]> <tibble [0 × 1]>
```

---

# A first example: elastic net regression

```r
res_elnet %>%
  pull_tflow_fit_tuning() %>%
  autoplot()
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-23-1.png" width="80%" />
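---

# A first example: elastic net regression

With two tuning parameters it helps to see which penalty/mixture combinations did best. A minimal sketch, assuming the tuning object behaves like a standard `{tune}` result so `show_best()` applies:

```r
# The five penalty/mixture combinations with the lowest cross-validated RMSE
res_elnet %>%
  pull_tflow_fit_tuning() %>%
  show_best(metric = "rmse", n = 5)
```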
---

# A first example: elastic net regression

```r
final_elnet <- complete_tflow(res_elnet, metric = "rmse")

final_elnet %>%
  pull_tflow_fit() %>%
  .[['fit']] %>%
  plot(xvar = "lambda", label = TRUE)
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-24-1.png" width="90%" style="display: block; margin: auto;" />

---

# A first example: elastic net regression

```r
final_elnet
```

```
══ Tidyflow [trained] ══════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: grid_regular w/ default args
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 1
  mixture = 0.05

Computational engine: glmnet

══ Results ═════════════════════════════════════════════════════════════════════

Tuning results:

# A tibble: 5 x 4
  splits             id     .metrics          .notes          
  <list>             <chr>  <list>            <list>          
1 <split [3.3K/363]> Fold01 <tibble [18 × 5]> <tibble [0 × 1]>
2 <split [3.3K/363]> Fold02 <tibble [18 × 5]> <tibble [0 × 1]>
3 <split [3.3K/363]> Fold03 <tibble [18 × 5]> <tibble [0 × 1]>
4 <split [3.3K/363]> Fold04 <tibble [18 × 5]> <tibble [0 × 1]>
5 <split [3.3K/363]> Fold05 <tibble [18 × 5]> <tibble [0 × 1]>

... and 5 more lines.

Fitted model:

Call:  glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~0.05) 

  Df     %Dev Lambda
1  0 0.000000 556.40

... and 72 more lines.
```

---

# A first example: elastic net regression

```r
train_rmse_elnet <-
  final_elnet %>%
  predict_training() %>%
  rmse(math_score, .pred)

holdout_elnet <-
  final_elnet %>%
  predict_testing() %>%
  rmse(math_score, .pred)

train_rmse_elnet$type <- "training"
holdout_elnet$type <- "testing"

elnet <- as.data.frame(rbind(train_rmse_elnet, holdout_elnet))
elnet$model <- "elnet"
elnet
```

```
  .metric .estimator .estimate     type model
1    rmse   standard  76.87256 training elnet
2    rmse   standard  77.87192  testing elnet
```

---

# Exercise

.center[
https://cimentadaj.github.io/ml_socsci/regularization.html#exercises
]
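---

# Comparing the three models

As a wrap-up, the `ridge`, `lasso` and `elnet` data frames built above can be stacked to compare training and testing RMSE across the three models. A minimal sketch (the plot styling here is just one possible choice):

```r
# Stack the three results and compare training vs testing RMSE per model
models <- rbind(ridge, lasso, elnet)

models %>%
  ggplot(aes(x = model, y = .estimate, fill = type)) +
  geom_col(position = "dodge") +
  labs(x = NULL, y = "RMSE")
```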