class: center, middle, inverse, title-slide

# Machine Learning for Social Scientists
## Regularization
### Jorge Cimentada
### 2020-07-08

---
layout: true

<!-- background-image: url(./figs/upf.png) -->
background-position: 100% 0%, 100% 0%, 50% 100%
background-size: 10%, 10%, 10%

---

# What is regularization?

* Machine Learning is almost always about prediction

* **It is important to make sure that out-of-sample accuracy is high**

* Overfitting is our weak spot: including redundant or unimportant variables hurts out-of-sample accuracy

* The correct theoretical model is not always the aim

--

<br>
<br>

> How do we make sure our model makes good predictions on unseen data? We regularize how much it overfits the data. How do we do that? By forcing unimportant coefficients towards zero.

<br>

* ML parlance: reduce variance in favor of increasing bias
* SocSci parlance: make sure your model fits unseen data about as well as it fits the current data

---

# What is regularization?

Regularization is when you force your estimates towards specific values:

* Bayesian: restrict coefficients based on prior distributions

<br>

* Machine Learning: shrink coefficients towards (or all the way to) zero

<br>

--

### What is this good for? It depends on your context

* Increasing predictive power
* Including important confounders in large models
* Understanding the strength of variables
* Testing the generalization of your model

---

# A first example: ridge regression

* OLS minimizes the Residual Sum of Squares (RSS)
* Among all candidate lines, it keeps the one with the smallest RSS

`\begin{equation} RSS = \sum_{i = 1}^n (actual_i - predicted_i)^2 \end{equation}`

.center[
<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-2-1.png" width="80%" />
]

---

# A first example: ridge regression

Ridge regression adds one term:

`\begin{equation} RSS + \lambda \sum_{j = 1}^p \beta_j^2 \end{equation}`

**The regularization term** or **penalty term**

* `\(RSS\)` measures how well the model fits the data
* `\(\sum_{j = 1}^p \beta_j^2\)` limits how much you overfit the data
* `\(\lambda\)` is the weight given to the penalty term (called **lambda**): the higher the weight, the more the coefficients are shrunk towards zero

In plain words:

> We want the smallest coefficients that don't affect the fit of the line (RSS).

---

# A first example: ridge regression

Some caveats:

* Since we're penalizing coefficients, their scale *matters*.

> Suppose that you have the income of a particular person (measured in thousands per month) and the time spent with their family (measured in seconds), and you're trying to predict happiness. A one unit increase in salary could be penalized much more than a one unit increase in family time **just** because a one unit increase in salary means much more due to its metric.
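A tiny, made-up sketch of that caveat (the numbers and variable names below are invented, not from the PISA data): after standardizing, a one unit change means one standard deviation for both variables, so the penalty treats them comparably.

```r
# Made-up toy values: income in thousands per month, family time in seconds
income      <- c(2, 3, 5, 4, 6)
family_time <- c(3600, 7200, 1800, 5400, 900)

# After standardizing, both variables have mean 0 and sd 1,
# so a "one unit" change is one standard deviation for each
cbind(scale(income), scale(family_time))
```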
<br>
<br>

.center[
### **Always standardize your predictors before running a regularized regression**
]

---

# A first example: ridge regression

Download the data:

```r
library(tidymodels)
library(tidyflow)

# Read the PISA data
data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)

head(pisa, n = c(5, 8))
```

```
  CNTSCHID CNTSTUID BOOKID ST001D01T ST003D02T ST003D03T ST004D01T ST005Q01TA
1 84000001 84000250      3        11         7      2002         2          1
2 84000001 84000304     14        11         9      2002         1          1
3 84000001 84000353     15         9         6      2003         2          1
4 84000001 84000536     24        10         5      2003         2          3
5 84000001 84001240      6        10         5      2003         2          1
```

---

# A first example: ridge regression

Remember, we do everything with the **training** data and keep the **testing** data untouched until we have our best model. Adding a split:

```r
tflow <-
  pisa %>%
  tidyflow(seed = 23151) %>%
  plug_split(initial_split)

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe/Formula: None
Resample: None
Grid: None
Model: None
```

---

# A first example: ridge regression

Adding the preprocessing:

```r
rcp <-
  ~ recipe(math_score ~ MISCED + FISCED + HISEI + REPEAT + IMMIG + DURECEC + BSMJ,
           data = .x) %>%
    step_center(all_predictors()) %>%
    step_scale(all_predictors())

tflow <-
  tflow %>%
  plug_recipe(rcp)

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: None
Grid: None
Model: None
```

---

# A first example: ridge regression

Adding the resample. `vfold_cv` is the function that creates the cross-validation folds:

```r
tflow <-
  tflow %>%
  plug_resample(vfold_cv)

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: None
Model: None
```

---

# A first example: ridge regression

Adding the model:

```r
# mixture = 0 is the same as ridge regression
regularized_reg <-
  linear_reg(penalty = tune(), mixture = 0) %>%
  set_engine("glmnet")

tflow <-
  tflow %>%
  plug_model(regularized_reg)

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: None
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = 0

Computational engine: glmnet
```

---

# A first example: ridge regression

Adding the grid:

```r
tflow <-
  tflow %>%
  plug_grid(expand.grid, penalty = seq(0, 3, length.out = 300))

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: expand.grid w/ penalty = ~seq(0, 3, length.out = 300)
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = 0

Computational engine: glmnet
```

---

# A first example: ridge regression

```r
(res <- fit(tflow))
```

```
══ Tidyflow [tuned] ════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: expand.grid w/ penalty = ~seq(0, 3, length.out = 300)
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = 0

Computational engine: glmnet

══ Results ═════════════════════════════════════════════════════════════════════

Tuning results:

# A tibble: 5 x 4
  splits             id     .metrics           .notes          
  <list>             <chr>  <list>             <list>          
1 <split [3.3K/363]> Fold01 <tibble [600 × 4]> <tibble [0 × 1]>
2 <split [3.3K/363]> Fold02 <tibble [600 × 4]> <tibble [0 × 1]>
3 <split [3.3K/363]> Fold03 <tibble [600 × 4]> <tibble [0 × 1]>
4 <split [3.3K/363]> Fold04 <tibble [600 × 4]> <tibble [0 × 1]>
5 <split [3.3K/363]> Fold05 <tibble [600 × 4]> <tibble [0 × 1]>

... and 5 more lines.
```
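---

# A first example: ridge regression

Before finalizing, we can inspect the tuning results directly. A minimal sketch, assuming the object returned by `pull_tflow_fit_tuning()` behaves like a standard `{tune}` result (so `collect_metrics()` and `select_best()` apply to it):

```r
# Cross-validated metrics averaged across folds: one row per penalty value
res %>%
  pull_tflow_fit_tuning() %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean)

# The penalty with the lowest cross-validated RMSE,
# the one complete_tflow() should pick below
res %>%
  pull_tflow_fit_tuning() %>%
  select_best(metric = "rmse")
```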
---

# A first example: ridge regression

```r
res %>%
  pull_tflow_fit_tuning()
```

```
#  10-fold cross-validation 
# A tibble: 10 x 4
   splits             id     .metrics           .notes          
   <list>             <chr>  <list>             <list>          
 1 <split [3.3K/363]> Fold01 <tibble [600 × 4]> <tibble [0 × 1]>
 2 <split [3.3K/363]> Fold02 <tibble [600 × 4]> <tibble [0 × 1]>
 3 <split [3.3K/363]> Fold03 <tibble [600 × 4]> <tibble [0 × 1]>
 4 <split [3.3K/363]> Fold04 <tibble [600 × 4]> <tibble [0 × 1]>
 5 <split [3.3K/363]> Fold05 <tibble [600 × 4]> <tibble [0 × 1]>
 6 <split [3.3K/363]> Fold06 <tibble [600 × 4]> <tibble [0 × 1]>
 7 <split [3.3K/363]> Fold07 <tibble [600 × 4]> <tibble [0 × 1]>
 8 <split [3.3K/363]> Fold08 <tibble [600 × 4]> <tibble [0 × 1]>
 9 <split [3.3K/363]> Fold09 <tibble [600 × 4]> <tibble [0 × 1]>
10 <split [3.3K/362]> Fold10 <tibble [600 × 4]> <tibble [0 × 1]>
```

---

# A first example: ridge regression

```r
res %>%
  pull_tflow_fit_tuning() %>%
  autoplot()
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-11-1.png" width="80%" />

---

# A first example: ridge regression

```r
final_ridge <- complete_tflow(res, metric = "rmse")

final_ridge %>%
  pull_tflow_fit() %>%
  .[['fit']] %>%
  plot(xvar = "lambda", label = TRUE)
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-12-1.png" width="90%" style="display: block; margin: auto;" />

---

# A first example: ridge regression

```r
final_ridge
```

```
══ Tidyflow [trained] ══════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: expand.grid w/ penalty = ~seq(0, 3, length.out = 300)
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0
  mixture = 0

Computational engine: glmnet

══ Results ═════════════════════════════════════════════════════════════════════

Tuning results:

# A tibble: 5 x 4
  splits             id     .metrics           .notes          
  <list>             <chr>  <list>             <list>          
1 <split [3.3K/363]> Fold01 <tibble [600 × 4]> <tibble [0 × 1]>
2 <split [3.3K/363]> Fold02 <tibble [600 × 4]> <tibble [0 × 1]>
3 <split [3.3K/363]> Fold03 <tibble [600 × 4]> <tibble [0 × 1]>
4 <split [3.3K/363]> Fold04 <tibble [600 × 4]> <tibble [0 × 1]>
5 <split [3.3K/363]> Fold05 <tibble [600 × 4]> <tibble [0 × 1]>

... and 5 more lines.

Fitted model:

Call:  glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~0) 

  Df     %Dev  Lambda
1  7 0.000000 27820.0

... and 99 more lines.
```
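---

# A first example: ridge regression

We can also look at the shrunken coefficients themselves. A minimal sketch using `coef()` on the underlying `glmnet` fit (the penalty chosen above was 0):

```r
# Coefficients of the final ridge model at (or interpolated near) s = 0,
# the penalty selected by cross-validation above
final_ridge %>%
  pull_tflow_fit() %>%
  .[['fit']] %>%
  coef(s = 0)
```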
---

# A first example: ridge regression

```r
train_rmse_ridge <-
  final_ridge %>%
  predict_training() %>%
  rmse(math_score, .pred)

test_ridge <-
  final_ridge %>%
  predict_testing() %>%
  rmse(math_score, .pred)

train_rmse_ridge$type <- "training"
test_ridge$type <- "testing"

ridge <- as.data.frame(rbind(train_rmse_ridge, test_ridge))
ridge$model <- "ridge"
ridge
```

```
  .metric .estimator .estimate     type model
1    rmse   standard  76.87668 training ridge
2    rmse   standard  77.88607  testing ridge
```

---

# A first example: lasso regression

Lasso regression is very similar to ridge, but the penalty term is different:

`\begin{equation} RSS + \lambda \sum_{j = 1}^p |\beta_j| \end{equation}`

The same notes for ridge apply, with one caveat:

- The lasso penalty can shrink coefficients **completely to 0**, meaning that it excludes variables.

> Lasso excludes variables which are not adding anything useful to the model, whereas ridge keeps them close to 0.

---

# A first example: lasso regression

<br>
<br>
<br>

.center[
## **Always standardize your predictors before running a regularized regression**
]

---

# A first example: lasso regression

`tflow` has all our steps; just replace the model:

```r
# mixture = 1 is lasso
lasso_mod <- update(regularized_reg, mixture = 1)

tflow <-
  tflow %>%
  replace_model(lasso_mod)

res_lasso <- fit(tflow)
```

---

# A first example: lasso regression

```r
res_lasso %>%
  pull_tflow_fit_tuning()
```

```
#  10-fold cross-validation 
# A tibble: 10 x 4
   splits             id     .metrics           .notes          
   <list>             <chr>  <list>             <list>          
 1 <split [3.3K/363]> Fold01 <tibble [600 × 4]> <tibble [0 × 1]>
 2 <split [3.3K/363]> Fold02 <tibble [600 × 4]> <tibble [0 × 1]>
 3 <split [3.3K/363]> Fold03 <tibble [600 × 4]> <tibble [0 × 1]>
 4 <split [3.3K/363]> Fold04 <tibble [600 × 4]> <tibble [0 × 1]>
 5 <split [3.3K/363]> Fold05 <tibble [600 × 4]> <tibble [0 × 1]>
 6 <split [3.3K/363]> Fold06 <tibble [600 × 4]> <tibble [0 × 1]>
 7 <split [3.3K/363]> Fold07 <tibble [600 × 4]> <tibble [0 × 1]>
 8 <split [3.3K/363]> Fold08 <tibble [600 × 4]> <tibble [0 × 1]>
 9 <split [3.3K/363]> Fold09 <tibble [600 × 4]> <tibble [0 × 1]>
10 <split [3.3K/362]> Fold10 <tibble [600 × 4]> <tibble [0 × 1]>
```

---

# A first example: lasso regression

```r
res_lasso %>%
  pull_tflow_fit_tuning() %>%
  autoplot()
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-17-1.png" width="80%" />

---

# A first example: lasso regression

```r
final_lasso <- complete_tflow(res_lasso, metric = "rmse")

final_lasso %>%
  pull_tflow_fit() %>%
  .[['fit']] %>%
  plot(xvar = "lambda", label = TRUE)
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-18-1.png" width="90%" style="display: block; margin: auto;" />

---

# A first example: lasso regression

```r
final_lasso
```

```
══ Tidyflow [trained] ══════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: expand.grid w/ penalty = ~seq(0, 3, length.out = 300)
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.180602006688963
  mixture = 1

Computational engine: glmnet

══ Results ═════════════════════════════════════════════════════════════════════

Tuning results:

# A tibble: 5 x 4
  splits             id     .metrics           .notes          
  <list>             <chr>  <list>             <list>          
1 <split [3.3K/363]> Fold01 <tibble [600 × 4]> <tibble [0 × 1]>
2 <split [3.3K/363]> Fold02 <tibble [600 × 4]> <tibble [0 × 1]>
3 <split [3.3K/363]> Fold03 <tibble [600 × 4]> <tibble [0 × 1]>
4 <split [3.3K/363]> Fold04 <tibble [600 × 4]> <tibble [0 × 1]>
5 <split [3.3K/363]> Fold05 <tibble [600 × 4]> <tibble [0 × 1]>

... and 5 more lines.

Fitted model:

Call:  glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~1) 

  Df    %Dev  Lambda
1  0 0.00000 27.8200

... and 59 more lines.
```
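---

# A first example: lasso regression

Since lasso can shrink coefficients exactly to zero, it is worth checking how many variables it dropped. A minimal sketch using the underlying `glmnet` fit at roughly the penalty chosen above (≈ 0.18):

```r
# Coefficients of the final lasso model at (or interpolated near) the selected penalty
lasso_coefs <-
  final_lasso %>%
  pull_tflow_fit() %>%
  .[['fit']] %>%
  coef(s = 0.18)

# How many predictors were shrunk exactly to zero? (excluding the intercept)
sum(lasso_coefs[-1, 1] == 0)
```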
---

# A first example: lasso regression

```r
train_rmse_lasso <-
  final_lasso %>%
  predict_training() %>%
  rmse(math_score, .pred)

holdout_lasso <-
  final_lasso %>%
  predict_testing() %>%
  rmse(math_score, .pred)

train_rmse_lasso$type <- "training"
holdout_lasso$type <- "testing"

lasso <- as.data.frame(rbind(train_rmse_lasso, holdout_lasso))
lasso$model <- "lasso"
lasso
```

```
  .metric .estimator .estimate     type model
1    rmse   standard  76.87264 training lasso
2    rmse   standard  77.86454  testing lasso
```

---

# A first example: elastic net regression

`\(ridge = \lambda \sum_{j = 1}^p \beta_j^2\)`

`\(lasso = \lambda \sum_{j = 1}^p |\beta_j|\)`

Elastic net regularization adds both of these penalties to the RSS:

`$$RSS + lasso + ridge$$`

Explanation:

> Although lasso models perform feature selection, when two strongly correlated features are pushed towards zero, one may be pushed fully to zero while the other remains in the model. Furthermore, the process of one being in and one being out is not very systematic. In contrast, the ridge regression penalty is a little more effective in systematically handling correlated features together. Consequently, the advantage of the elastic net penalty is that it enables effective regularization via the ridge penalty with the feature selection characteristics of the lasso penalty.

---

# A first example: elastic net regression

* `tidyflow` will try several values of `mixture` ranging from 0 to 1
* Instead of fixing `mixture` at `0` (ridge) or `1` (lasso)

<br>
<br>
<br>

.center[
## **Always standardize your predictors before running a regularized regression**
]

---

# A first example: elastic net regression

```r
elnet_mod <- update(lasso_mod, mixture = tune())

tflow <-
  tflow %>%
  replace_model(elnet_mod) %>%
  replace_grid(grid_regular)

tflow
```

```
══ Tidyflow ════════════════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: grid_regular w/ default args
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet
```

---

# A first example: elastic net regression

```r
res_elnet <- fit(tflow)

res_elnet %>%
  pull_tflow_fit_tuning()
```

```
#  10-fold cross-validation 
# A tibble: 10 x 4
   splits             id     .metrics          .notes          
   <list>             <chr>  <list>            <list>          
 1 <split [3.3K/363]> Fold01 <tibble [18 × 5]> <tibble [0 × 1]>
 2 <split [3.3K/363]> Fold02 <tibble [18 × 5]> <tibble [0 × 1]>
 3 <split [3.3K/363]> Fold03 <tibble [18 × 5]> <tibble [0 × 1]>
 4 <split [3.3K/363]> Fold04 <tibble [18 × 5]> <tibble [0 × 1]>
 5 <split [3.3K/363]> Fold05 <tibble [18 × 5]> <tibble [0 × 1]>
 6 <split [3.3K/363]> Fold06 <tibble [18 × 5]> <tibble [0 × 1]>
 7 <split [3.3K/363]> Fold07 <tibble [18 × 5]> <tibble [0 × 1]>
 8 <split [3.3K/363]> Fold08 <tibble [18 × 5]> <tibble [0 × 1]>
 9 <split [3.3K/363]> Fold09 <tibble [18 × 5]> <tibble [0 × 1]>
10 <split [3.3K/362]> Fold10 <tibble [18 × 5]> <tibble [0 × 1]>
```

---

# A first example: elastic net regression

```r
res_elnet %>%
  pull_tflow_fit_tuning() %>%
  autoplot()
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-23-1.png" width="80%" />
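---

# A first example: elastic net regression

With two tuning parameters it helps to see which penalty/mixture combinations did best. A minimal sketch, assuming the tuning object behaves like a standard `{tune}` result so `show_best()` applies:

```r
# The five penalty/mixture combinations with the lowest cross-validated RMSE
res_elnet %>%
  pull_tflow_fit_tuning() %>%
  show_best(metric = "rmse", n = 5)
```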
---

# A first example: elastic net regression

```r
final_elnet <- complete_tflow(res_elnet, metric = "rmse")

final_elnet %>%
  pull_tflow_fit() %>%
  .[['fit']] %>%
  plot(xvar = "lambda", label = TRUE)
```

<img src="02_regularization_afternoon_files/figure-html/unnamed-chunk-24-1.png" width="90%" style="display: block; margin: auto;" />

---

# A first example: elastic net regression

```r
final_elnet
```

```
══ Tidyflow [trained] ══════════════════════════════════════════════════════════
Data: 4.84K rows x 501 columns
Split: initial_split w/ default args
Recipe: available
Resample: vfold_cv w/ default args
Grid: grid_regular w/ default args
Model:
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 1
  mixture = 0.05

Computational engine: glmnet

══ Results ═════════════════════════════════════════════════════════════════════

Tuning results:

# A tibble: 5 x 4
  splits             id     .metrics          .notes          
  <list>             <chr>  <list>            <list>          
1 <split [3.3K/363]> Fold01 <tibble [18 × 5]> <tibble [0 × 1]>
2 <split [3.3K/363]> Fold02 <tibble [18 × 5]> <tibble [0 × 1]>
3 <split [3.3K/363]> Fold03 <tibble [18 × 5]> <tibble [0 × 1]>
4 <split [3.3K/363]> Fold04 <tibble [18 × 5]> <tibble [0 × 1]>
5 <split [3.3K/363]> Fold05 <tibble [18 × 5]> <tibble [0 × 1]>

... and 5 more lines.

Fitted model:

Call:  glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~0.05) 

  Df     %Dev Lambda
1  0 0.000000 556.40

... and 72 more lines.
```

---

# A first example: elastic net regression

```r
train_rmse_elnet <-
  final_elnet %>%
  predict_training() %>%
  rmse(math_score, .pred)

holdout_elnet <-
  final_elnet %>%
  predict_testing() %>%
  rmse(math_score, .pred)

train_rmse_elnet$type <- "training"
holdout_elnet$type <- "testing"

elnet <- as.data.frame(rbind(train_rmse_elnet, holdout_elnet))
elnet$model <- "elnet"
elnet
```

```
  .metric .estimator .estimate     type model
1    rmse   standard  76.87256 training elnet
2    rmse   standard  77.87192  testing elnet
```

---

# Exercise

.center[
https://cimentadaj.github.io/ml_socsci/regularization.html#exercises
]
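---

# Comparing the three models

As a wrap-up, the `ridge`, `lasso` and `elnet` data frames built above can be stacked to compare training and testing RMSE across the three models. A minimal sketch (the plot styling here is just one possible choice):

```r
# Stack the three results and compare training vs testing RMSE per model
models <- rbind(ridge, lasso, elnet)

models %>%
  ggplot(aes(x = model, y = .estimate, fill = type)) +
  geom_col(position = "dodge") +
  labs(x = NULL, y = "RMSE")
```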