
Machine Learning for Social Scientists

Regularization

Jorge Cimentada

2022-02-18

1 / 33

What is regularization?

  • Machine Learning is almost always about prediction

  • It is important to make sure that out-of-sample accuracy is high

  • Overfitting, usually caused by including redundant or unimportant variables, is our weak spot

  • Correct theoretical model is not always the aim



How do we make sure our model makes good predictions on unseen data? We regularize how much it overfits the data. How do we do that? By forcing unimportant coefficients towards zero.


  • ML parlance: reduce variance in favor of increasing bias
  • SocSci parlance: make sure your model fits unseen data about as well as it fits this data
2 / 33

What is regularization?

Regularization is when you force your estimates towards specific values:

  • Bayesian: restrict coefficients based on prior distributions

  • Machine Learning: restrict coefficients to zero


What is this good for? It depends on your context

  • Increasing predictive power
  • Including important confounders in large models
  • Understanding the strength of variables
  • Testing the generalization of your model
3 / 33

Why regularization?

4 / 33

A first example: ridge regression

  • OLS minimizes the Residual Sum of Squares (RSS)
  • Conceptually: fit N candidate lines and keep the one with the best fit (lowest RSS)

$$RSS = \sum_{i=1}^{n} (\text{actual}_i - \text{predicted}_i)^2$$
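To make the formula concrete, here is a minimal R sketch that computes the RSS of an OLS fit by hand. It uses the built-in mtcars data purely for illustration, not the PISA data used later:

fit <- lm(mpg ~ wt + hp, data = mtcars)           # plain OLS
rss <- sum((mtcars$mpg - predict(fit, mtcars))^2) # sum of squared residuals
rss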

5 / 33

A first example: ridge regression

Ridge regression adds one term:

$$RSS + \lambda \sum_{j=1}^{p} \beta_j^2$$

The regularization term or penalty term

  • RSS measures how well the model fits the data
  • $\sum_{j=1}^{p} \beta_j^2$ limits how much you overfit the data
  • $\lambda$ is the weight given to the penalty term (called lambda): the higher the weight, the stronger the shrinkage

In layman's terms:

We want the largest coefficients that don't hurt the fit of the line (the RSS).
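As a hedged illustration of that trade-off, the sketch below fits two ridge regressions with the glmnet package on the built-in mtcars data (the data and the lambda values are only illustrative), so you can see the coefficients shrink as the penalty grows:

library(glmnet)
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # glmnet needs a predictor matrix
y <- mtcars$mpg
# alpha = 0 selects the ridge penalty; compare a weak and a strong penalty
cbind(
  small_lambda = coef(glmnet(x, y, alpha = 0, lambda = 0.1))[, 1],
  large_lambda = coef(glmnet(x, y, alpha = 0, lambda = 100))[, 1]
)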

6 / 33

Deep dive into lambda

  • Lambda is a tuning parameter: that means you try different values and keep the best one (see the sketch below)

  • Usually called a shrinkage penalty

    • When lambda is 0, the model is just classical OLS
    • Selecting a good value of lambda is critical for it to be effective
    • As lambda goes to infinity, each coefficient is shrunk closer to zero
  • Never applied to the intercept, only to variable coefficients


  • The reason ridge exists is the N < P problem
    • In layman's terms:

      When you have more predictors than observations, avoiding overfitting is crucial
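A minimal sketch of tuning lambda by cross-validation with cv.glmnet (again on mtcars purely for illustration): glmnet tries a whole grid of lambdas and reports the one with the lowest cross-validated error.

library(glmnet)
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg
cv_ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 10)  # alpha = 0 keeps the ridge penalty
cv_ridge$lambda.min                # lambda with the lowest cross-validated error
coef(cv_ridge, s = "lambda.min")   # coefficients at that lambda (intercept is not penalized)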

7 / 33

A first example: ridge regression

Some caveats:

  • Since we're penalizing coefficients, their scale matters.

Suppose you have a person's income (measured in thousands per month) and time spent with their family (measured in seconds), and you're trying to predict happiness. A one-unit increase in salary could be penalized much more than a one-unit increase in family time, simply because a one-unit change in salary means much more due to its metric.



Always standardize your predictors before running a regularized regression
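A minimal sketch of that advice, assuming the PISA data shown on the next slide lives in a data frame called pisa (the name is illustrative):

# Put every predictor on the same scale (mean 0, sd 1) before penalizing them
predictors <- setdiff(names(pisa), "math_score")
pisa[predictors] <- scale(pisa[predictors])
# Note: glmnet also standardizes internally by default (standardize = TRUE)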

8 / 33

A first example: ridge regression

A look at the data:

math_score MISCED FISCED HISEI REPEAT IMMIG DURECEC BSMJ
1 512.7125 5 4 28.60 0 1 2 77.1000
2 427.3615 4 4 59.89 0 1 2 63.0300
3 449.9545 4 6 39.02 0 1 2 67.4242
4 474.5553 2 4 26.60 0 1 2 28.5200
5 469.1545 5 6 76.65 0 1 2 50.9000
6 442.6426 4 4 29.73 0 1 2 64.4400
7 426.4296 2 4 35.34 0 1 1 81.9200
8 449.8329 6 5 65.01 0 1 2 81.9200
9 493.6453 5 4 48.66 0 1 2 51.3500
10 341.7272 6 6 68.70 1 1 2 67.4242
9 / 33

A first example: ridge regression

Next we take the usual steps that we expect in a machine learning pipeline:

  • Split into training and testing. Perform all analysis on the training set.
  • Perform any variable recoding / scaling (important for regularization)
  • Split training into a K-fold dataset for tuning parameters (a sketch follows the fold output below):
# 10-fold cross-validation
# A tibble: 10 x 2
splits id
<list> <chr>
1 <split [3.3K/363]> Fold01
2 <split [3.3K/363]> Fold02
3 <split [3.3K/363]> Fold03
4 <split [3.3K/363]> Fold04
5 <split [3.3K/363]> Fold05
6 <split [3.3K/363]> Fold06
7 <split [3.3K/363]> Fold07
8 <split [3.3K/363]> Fold08
9 <split [3.3K/363]> Fold09
10 <split [3.3K/362]> Fold10
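A hedged sketch of how these steps might look with the rsample package, assuming the PISA data is in a data frame called pisa (all names are illustrative); the last call prints a fold object like the one above:

library(rsample)
set.seed(23151)                             # arbitrary seed for reproducibility
pisa_split <- initial_split(pisa)           # default: 3/4 training, 1/4 testing
pisa_train <- training(pisa_split)
pisa_test  <- testing(pisa_split)
pisa_folds <- vfold_cv(pisa_train, v = 10)  # 10-fold CV on the training set only
pisa_folds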
10 / 33

A first example: ridge regression

11 / 33

A first example: ridge regression

12 / 33

A first example: ridge regression

  • Take your previous model, evaluate it on the testing dataset and compare with the training fit (a sketch follows the table):
.metric .estimator .estimate type model
1 rmse standard 76.87668 training ridge
2 rmse standard 77.88607 testing ridge
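A hedged sketch of that comparison with glmnet, reusing the pisa_train / pisa_test split from the earlier sketch (object names are illustrative, and the numbers will not match the table exactly):

library(glmnet)
x_train <- model.matrix(math_score ~ ., data = pisa_train)[, -1]
x_test  <- model.matrix(math_score ~ ., data = pisa_test)[, -1]
ridge_cv <- cv.glmnet(x_train, pisa_train$math_score, alpha = 0)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(pisa_train$math_score, predict(ridge_cv, x_train, s = "lambda.min"))  # training RMSE
rmse(pisa_test$math_score,  predict(ridge_cv, x_test,  s = "lambda.min"))  # testing RMSE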
13 / 33

A first example: lasso regression

Lasso regression is very similar to ridge but the penalty term is different:

$$RSS + \lambda \sum_{j=1}^{p} |\beta_j|$$

The same notes for ridge apply, with one caveat:

  • The lasso penalty can shrink coefficients completely to 0, meaning it excludes variables.

Lasso excludes variables which are not adding anything useful to the model whereas ridge keeps them close to 0.
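A minimal sketch of that difference with glmnet, reusing x_train and pisa_train from the ridge sketch (alpha = 1 switches the penalty to lasso; object names are illustrative):

lasso_cv <- cv.glmnet(x_train, pisa_train$math_score, alpha = 1)  # alpha = 1 is the lasso
coef(lasso_cv, s = "lambda.min")  # some coefficients are exactly 0: those variables are dropped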

14 / 33

A first example: lasso regression

  • The fact that lasso performs feature selection is a somewhat new concept to the SocSci world. Why is this important?

  • When you have hundreds of variables, it allows for greater explainability.

  • When you have few observations, it allows for greater flexibility by freeing up degrees of freedom
  • It dramatically decreases the risk of overfitting by removing redundant variables

15 / 33

A first example: lasso regression




Always standardize your predictors before running a regularized regression

16 / 33

A first example: lasso regression

Next we take the usual steps that we expect in a machine learning pipeline:

  • Split into training and testing. Perform all analysis on the training set.
  • Perform any variable recoding / scaling (important for regularization)
  • Split training into a K-fold dataset for tuning parameters
  • Run N models with N tuning parameters, as sketched below
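A hedged sketch of the tuning step with tidymodels, reusing the pisa_train and pisa_folds objects from the earlier sketch (all names are illustrative):

library(tidymodels)
lasso_mod <- linear_reg(penalty = tune(), mixture = 1) %>%  # mixture = 1 is the lasso
  set_engine("glmnet")
pisa_rec <- recipe(math_score ~ ., data = pisa_train) %>%
  step_normalize(all_predictors())                          # standardize predictors
lasso_wf <- workflow() %>% add_model(lasso_mod) %>% add_recipe(pisa_rec)
lasso_res <- tune_grid(
  lasso_wf,
  resamples = pisa_folds,
  grid = grid_regular(penalty(), levels = 10)               # 10 candidate penalties
)
collect_metrics(lasso_res)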
17 / 33

A first example: lasso regression

18 / 33

A first example: lasso regression

19 / 33

A first example: lasso regression

  • Take your previous model, evaluate it on the testing dataset and compare with the training fit:
.metric .estimator .estimate type model
1 rmse standard 76.87264 training lasso
2 rmse standard 77.86454 testing lasso
20 / 33

When to use ridge or lasso?

  • Both are very similar but perform differently

  • Lasso usually works well when we know there are a handful of strong coefficients and the remaining variables have very small effects

  • Ridge will usually be better when most predictors contribute effects of roughly similar size

A priori we don't know which will do better; that's why we use cross-validation: to test which model and which penalty values work best

  • Whether interpretability is important or not also plays a role
21 / 33

Regularization and bias - variance trade off

  • MSE error (pink)
  • Bias (green): the more shrinkage, the higher the bias (we give up some fit)
  • Variance (black): the more shrinkage, the lower the variance (less overfitting, more generalizability)

22 / 33

A first example: elastic net regression

$$\text{ridge} = \lambda \sum_{j=1}^{p} \beta_j^2$$

$$\text{lasso} = \lambda \sum_{j=1}^{p} |\beta_j|$$

Elastic net regularization adds these two penalties to the RSS:

$$RSS + \text{lasso} + \text{ridge}$$

Explanation:

Although lasso models perform feature selection, when two strongly correlated features are pushed towards zero, one may be pushed fully to zero while the other remains in the model. Furthermore, the process of one being in and one being out is not very systematic. In contrast, the ridge regression penalty is a little more effective in systematically handling correlated features together. Consequently, the advantage of the elastic net penalty is that it enables effective regularization via the ridge penalty with the feature selection characteristics of the lasso penalty.

Now you have two parameters to tune
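A minimal sketch with glmnet, where alpha mixes the two penalties (0 = pure ridge, 1 = pure lasso), reusing x_train and pisa_train from before; the alpha value is arbitrary and only illustrative:

# alpha controls the ridge/lasso mix, lambda the overall strength of the penalty
enet_cv <- cv.glmnet(x_train, pisa_train$math_score, alpha = 0.5)  # alpha = 0.5 is arbitrary
coef(enet_cv, s = "lambda.min")
# Note: cv.glmnet only tunes lambda for a fixed alpha, so alpha has to be tuned separately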

23 / 33

A first example: elastic net regression




Always standardize your predictors before running a regularized regression

24 / 33

Usual workflow

Next we take the usual steps that we expect in a machine learning pipeline:

  • Split into training and testing. Perform all analysis on the training set.
  • Perform any variable recoding / scaling (important for regularization)
  • Split training into a K-fold dataset for tuning parameters:

    • Fit the first model with the first ridge parameter and the first lasso parameter

    • Fit the first model with the first ridge parameter and the second lasso parameter

    • Fit the first model with the first ridge parameter and the third lasso parameter

    • Fit models with ... ridge parameter and ... lasso parameter (see the sketch below)
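A hedged sketch of that grid with tidymodels (loaded as in the lasso sketch), reusing pisa_rec and pisa_folds; a 3 x 3 regular grid gives nine penalty/mixture combinations, like the table on the next slide:

elnet_mod <- linear_reg(penalty = tune(), mixture = tune()) %>%  # tune both parameters
  set_engine("glmnet")
elnet_grid <- grid_regular(penalty(), mixture(), levels = 3)     # 3 x 3 = 9 combinations
elnet_res <- tune_grid(
  workflow() %>% add_model(elnet_mod) %>% add_recipe(pisa_rec),
  resamples = pisa_folds,
  grid = elnet_grid
)
collect_metrics(elnet_res)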
25 / 33

A first example: elastic net regression

# A tibble: 9 x 4
penalty mixture rmse rsq
<dbl> <dbl> <dbl> <dbl>
1 0.0000000001 0.05 79.5 0.207
2 0.00001 0.05 79.5 0.207
3 1 0.05 79.5 0.207
4 0.0000000001 0.525 79.5 0.207
5 0.00001 0.525 79.5 0.207
6 1 0.525 79.6 0.206
7 0.0000000001 1 79.5 0.207
8 0.00001 1 79.5 0.207
9 1 1 79.7 0.206
26 / 33

A first example: elastic net regression

27 / 33

A first example: elastic net regression

  • Run our model on the testing dataset and compare with the training model:
.metric .estimator .estimate type model
1 rmse standard 76.87256 training elnet
2 rmse standard 77.87192 testing elnet
28 / 33

Alternative: forward-selection

If you have three variables and a dependent variable, begin with:

y ~ x1 # lowest RMSE
y ~ x2
y ~ x3

next step

y ~ x1 + x2
y ~ x1 + x3 # lowest RMSE

Then CV with the best two models
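A rough sketch of the first forward-selection step, assuming an illustrative data frame df with outcome y and predictors x1, x2, x3 (all names hypothetical):

candidates <- c("x1", "x2", "x3")
rmse_by_var <- sapply(candidates, function(v) {
  fit <- lm(reformulate(v, response = "y"), data = df)  # y ~ x1, y ~ x2, y ~ x3
  sqrt(mean(residuals(fit)^2))                          # in-sample RMSE of each model
})
best <- names(which.min(rmse_by_var))  # keep the winner, then repeat adding one variable at a time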

29 / 33

Alternative: backward-selection

If you have three variables and a dependent variable, begin with:

# In CV
y ~ x2 + x3 # lowest RMSE
y ~ x1 + x3
y ~ x1 + x2

next step

y ~ x3 # lowest RMSE
y ~ x2

Then CV with the best two models

30 / 33

Comparison [1/2]

  • Ridge

    • Keeps all variables
    • Might introduce overfitting by keeping all variables
    • Assumes linearity
  • Lasso

    • Variable selection
    • Inconsistent with correlated predictors (of two highly correlated variables it may arbitrarily drop one)
    • Assumes linearity
  • Elastic Net (in reality, elastic net usually performs better)

    • Variable selection, depending on the weights given to the ridge and lasso penalties
    • Assumes linearity
31 / 33

Comparison [2/2]

  • Forward selection

    • Doesn't work well with n < p models
    • RSS is a biased selection criterion because models with more predictors will mechanically have lower RSS
    • Computationally intensive
  • Backward selection

    • Doesn't work well with n < p models
    • RSS is a biased selection criterion because models with more predictors will mechanically have lower RSS
    • Computationally intensive
32 / 33

Break

33 / 33
