Machine Learning is almost always about prediction
It is important to make sure that out-of-sample accuracy is high
Overfitting is our weak spot, often caused by including redundant or unimportant variables
Correct theoretical model is not always the aim
How do we make sure our model makes good predictions on unseen data? We regularize how much it overfits the data. How do we do that? By forcing unimportant coefficients towards zero.
Regularization is when you force your estimates towards specific values:
Bayesian: restrict coefficients based on prior distributions
Machine Learning: restrict coefficients towards zero
$$RSS = \sum_{i=1}^{n} (\text{actual}_i - \text{predicted}_i)^2$$
Ridge regression adds one term:
$$RSS + \lambda \sum_{j=1}^{p} \beta_j^2$$
The regularization term or penalty term
In layman's terms:
We keep a coefficient large only when it genuinely improves the fit of the line (lowers the RSS); otherwise the penalty shrinks it towards zero.
Lambda is a tuning parameter: that means you try different values and grab the best one
Usually called a shrinkage penalty
Never applied to the intercept, only to variable coefficients
When you have more predictors than observations, avoiding overfitting is crucial
Some caveats:
Suppose you are predicting happiness from a person's income (measured in thousands per month) and the time they spend with their family (measured in seconds). Because the size of a coefficient depends on the scale of its variable, the penalty can hit one coefficient much harder than another purely because of the units the variables are measured in, not because of their importance. That is why predictors are usually standardized before fitting regularized models.
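A minimal base-R sketch of the ridge objective, using hypothetical simulated data loosely based on the example above (names and numbers are made up for illustration):

# Toy data: income in thousands per month, family time in seconds (hypothetical)
set.seed(1)
x <- cbind(
  income      = rnorm(100, mean = 3, sd = 1),       # thousands per month
  family_time = rnorm(100, mean = 7200, sd = 1800)  # seconds per day
)
y <- 5 + 2 * x[, "income"] + 0.001 * x[, "family_time"] + rnorm(100)

# RSS plus the shrinkage penalty (applied to the slopes only, never the intercept)
ridge_objective <- function(beta, x, y, lambda) {
  rss <- sum((y - x %*% beta)^2)
  rss + lambda * sum(beta^2)
}

# Because the penalty depends on the size of the coefficients, predictors on very
# different scales are penalized unevenly -- standardize them first:
x_std <- scale(x)
ridge_objective(c(0.5, 0.5), x_std, y, lambda = 10)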
A look at the data:
   math_score MISCED FISCED HISEI REPEAT IMMIG DURECEC    BSMJ
1    512.7125      5      4 28.60      0     1       2 77.1000
2    427.3615      4      4 59.89      0     1       2 63.0300
3    449.9545      4      6 39.02      0     1       2 67.4242
4    474.5553      2      4 26.60      0     1       2 28.5200
5    469.1545      5      6 76.65      0     1       2 50.9000
6    442.6426      4      4 29.73      0     1       2 64.4400
7    426.4296      2      4 35.34      0     1       1 81.9200
8    449.8329      6      5 65.01      0     1       2 81.9200
9    493.6453      5      4 48.66      0     1       2 51.3500
10   341.7272      6      6 68.70      1     1       2 67.4242
Next we take the usual steps that we expect to have in the machine learning pipeline:
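A sketch of how that split and the cross-validation folds could be produced with rsample, assuming the data frame shown above is called pisa:

library(tidymodels)  # loads rsample, parsnip, tune, yardstick, ...

set.seed(2313)
pisa_split <- initial_split(pisa)      # pisa: the data frame above (name assumed)
pisa_train <- training(pisa_split)     # default 3/4 for training
pisa_test  <- testing(pisa_split)      # 1/4 held out for the final test

pisa_folds <- vfold_cv(pisa_train, v = 10)  # the 10-fold object printed below
pisa_folds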
# 10-fold cross-validation
# A tibble: 10 x 2
   splits             id
   <list>             <chr>
 1 <split [3.3K/363]> Fold01
 2 <split [3.3K/363]> Fold02
 3 <split [3.3K/363]> Fold03
 4 <split [3.3K/363]> Fold04
 5 <split [3.3K/363]> Fold05
 6 <split [3.3K/363]> Fold06
 7 <split [3.3K/363]> Fold07
 8 <split [3.3K/363]> Fold08
 9 <split [3.3K/363]> Fold09
10 <split [3.3K/362]> Fold10
  .metric .estimator .estimate type     model
1 rmse    standard    76.87668 training ridge
2 rmse    standard    77.88607 testing  ridge
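One way the ridge fit summarized above could look with tidymodels and the glmnet engine (a sketch, assuming the pisa_split and pisa_folds objects from the previous step and math_score as the outcome):

# Ridge: mixture = 0 keeps only the squared-coefficient penalty
ridge_mod <- linear_reg(penalty = tune(), mixture = 0) %>%
  set_engine("glmnet")

ridge_wf <- workflow() %>%
  add_formula(math_score ~ .) %>%   # all other columns as predictors (assumed)
  add_model(ridge_mod)

# Try several values of lambda on the 10 folds and keep the best one
ridge_tuned <- tune_grid(ridge_wf, resamples = pisa_folds, grid = 10)
best_pen    <- select_best(ridge_tuned, metric = "rmse")

# Refit on the full training set and evaluate once on the test split
ridge_final <- finalize_workflow(ridge_wf, best_pen) %>%
  last_fit(pisa_split)
collect_metrics(ridge_final)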
Lasso regression is very similar to ridge but the penalty term is different:
$$RSS + \lambda \sum_{j=1}^{p} |\beta_j|$$
The same notes for ridge apply, with one caveat:
Lasso excludes variables that are not adding anything useful to the model, whereas ridge keeps them close to 0.
The fact that lasso performs feature selection is a somewhat new concept to the SocSci world. Why is this important?
When you have hundreds of variables, it allows for greater explainability.
Next we take the usual steps that we expect to have in the machine learning pipeline:
  .metric .estimator .estimate type     model
1 rmse    standard    76.87264 training lasso
2 rmse    standard    77.86454 testing  lasso
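The lasso version of the fit above is almost identical; only the mixture changes, and afterwards you can inspect which coefficients were pushed exactly to zero (again a sketch, reusing the split, folds, and formula assumed in the ridge sketch):

# Lasso: mixture = 1 keeps only the absolute-value penalty
lasso_mod <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lasso_wf <- workflow() %>%
  add_formula(math_score ~ .) %>%
  add_model(lasso_mod)

lasso_tuned <- tune_grid(lasso_wf, resamples = pisa_folds, grid = 10)
best_pen    <- select_best(lasso_tuned, metric = "rmse")

# Fit the finalized workflow on the training data and look at the coefficients:
lasso_fit <- finalize_workflow(lasso_wf, best_pen) %>% fit(data = pisa_train)
coef(extract_fit_engine(lasso_fit), s = best_pen$penalty)  # zeros = dropped variables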
Both are very similar but perform differently
Lasso usually works well when we know there are a handful of strong coefficients and the remaining variables have very small effects
Ridge usually works better when many predictors contribute and their effects are of roughly similar, moderate size
A priori we don't know which scenario we are in; that's why we use cross-validation: to test which model, with which penalty, predicts better
$$\text{ridge} = \lambda \sum_{j=1}^{p} \beta_j^2$$
$$\text{lasso} = \lambda \sum_{j=1}^{p} |\beta_j|$$
Elastic net regularization adds both of these penalties to the RSS:
RSS+lasso+ridge
Explanation:
Although lasso models perform feature selection, when two strongly correlated features are pushed towards zero, one may be pushed fully to zero while the other remains in the model. Furthermore, the process of one being in and one being out is not very systematic. In contrast, the ridge regression penalty is a little more effective in systematically handling correlated features together. Consequently, the advantage of the elastic net penalty is that it enables effective regularization via the ridge penalty with the feature selection characteristics of the lasso penalty.
Now you have two parameters to tune: the overall penalty (how much shrinkage) and the mixture (how much of it is lasso versus ridge)
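A small helper showing how the two parameters enter the objective (a sketch; x and y come from the earlier toy example, and software such as glmnet scales these terms slightly differently, but the idea is the same):

# RSS + mixture * lasso penalty + (1 - mixture) * ridge penalty
elnet_objective <- function(beta, x, y, lambda, mixture) {
  rss <- sum((y - x %*% beta)^2)
  rss + lambda * (mixture * sum(abs(beta)) + (1 - mixture) * sum(beta^2))
}

# mixture = 0 is ridge, mixture = 1 is lasso, anything in between is elastic net
elnet_objective(c(0.5, 0.5), scale(x), y, lambda = 10, mixture = 0.5)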
Next we take the usual steps that we expect to have in the machine learning pipeline:
Split the training data into K folds to tune the parameters:
# A tibble: 9 x 4
       penalty mixture  rmse   rsq
         <dbl>   <dbl> <dbl> <dbl>
1 0.0000000001   0.05   79.5 0.207
2 0.00001        0.05   79.5 0.207
3 1              0.05   79.5 0.207
4 0.0000000001   0.525  79.5 0.207
5 0.00001        0.525  79.5 0.207
6 1              0.525  79.6 0.206
7 0.0000000001   1      79.5 0.207
8 0.00001        1      79.5 0.207
9 1              1      79.7 0.206
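A sketch of how such a regular 3 x 3 grid over both parameters could be built and tuned (the exact grid values depend on the parameter ranges used, so they may not match the table above exactly):

# Both the penalty and the mixture are tuned this time
elnet_mod <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

elnet_wf <- workflow() %>%
  add_formula(math_score ~ .) %>%
  add_model(elnet_mod)

# 3 levels per parameter -> 9 candidate combinations, as in the table above
elnet_grid  <- grid_regular(penalty(), mixture(), levels = 3)
elnet_tuned <- tune_grid(elnet_wf, resamples = pisa_folds, grid = elnet_grid)

best_params <- select_best(elnet_tuned, metric = "rmse")
elnet_final <- finalize_workflow(elnet_wf, best_params) %>% last_fit(pisa_split)
collect_metrics(elnet_final)   # evaluate once on the test split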
  .metric .estimator .estimate type     model
1 rmse    standard    76.87256 training elnet
2 rmse    standard    77.87192 testing  elnet
If you have three variables and a dependent variable, begin with:
y ~ x1  # lowest RMSE
y ~ x2
y ~ x3
next step
y ~ x1 + x2
y ~ x1 + x3  # lowest RMSE
Then CV with the best two models
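A minimal base-R sketch of that forward search, using simulated data and cross-validated RMSE at every step (the variable names, the data, and the stopping rule are illustrative assumptions, not the document's exact procedure):

set.seed(42)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
df$y <- 2 * df$x1 + 0.5 * df$x3 + rnorm(200)

# Cross-validated RMSE of a linear model given a formula
# (assumes the response column is called y)
cv_rmse <- function(form, data, k = 10) {
  folds <- sample(rep(1:k, length.out = nrow(data)))
  rmse  <- sapply(1:k, function(i) {
    fit  <- lm(form, data = data[folds != i, ])
    pred <- predict(fit, newdata = data[folds == i, ])
    sqrt(mean((data$y[folds == i] - pred)^2))
  })
  mean(rmse)
}

selected  <- character(0)
remaining <- c("x1", "x2", "x3")
best_rmse <- cv_rmse(y ~ 1, df)   # intercept-only baseline

while (length(remaining) > 0) {
  scores <- sapply(remaining, function(v) cv_rmse(reformulate(c(selected, v), "y"), df))
  if (min(scores) >= best_rmse) break            # no candidate improves the fit: stop
  best_rmse <- min(scores)
  selected  <- c(selected, names(which.min(scores)))  # add the variable with the lowest RMSE
  remaining <- setdiff(remaining, selected)
}
selected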
Backward selection works in reverse: with three variables and a dependent variable, begin with the full model y ~ x1 + x2 + x3 and compare the models that drop one variable at a time:
# In CV
y ~ x2 + x3  # lowest RMSE
y ~ x1 + x3
y ~ x1 + x2
next step
y ~ x3  # lowest RMSE
y ~ x2
Then CV with the best two models
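The backward counterpart, reusing cv_rmse() and df from the forward sketch, starts from the full model and keeps dropping variables only while doing so lowers the cross-validated RMSE:

selected  <- c("x1", "x2", "x3")
best_rmse <- cv_rmse(reformulate(selected, "y"), df)   # full model first

while (length(selected) > 1) {
  # RMSE of the model with each variable removed in turn
  scores <- sapply(selected, function(v) cv_rmse(reformulate(setdiff(selected, v), "y"), df))
  if (min(scores) >= best_rmse) break   # dropping anything makes things worse: stop
  best_rmse <- min(scores)
  selected  <- setdiff(selected, names(which.min(scores)))  # drop the least useful variable
}
selected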
Ridge
Lasso
Elastic Net (in reality, elastic net usually performs better)
Forward selection
Backward selection