This is the second entry, and probably the last, on model validation methods. These posts are inspired by the work of Kohavi (1995), which I totally recommend reading. This post will talk about Leave-One-Out Cross Validation (LOOCV), which is the extreme version of the K-Fold Cross Validation, and about the bootstrap for model assessment.
Let’s dive in!
The LOOCV is actually a very intuitive idea if you know how the K-Fold CV works: it is simply K-Fold CV taken to the extreme, with K equal to the number of observations. Each row takes a turn as a test set of size one while the model is trained on the remaining N - 1 rows.
This is surprisingly easy to implement in R.
library(tidyverse)
set.seed(21341)
loo_result <-
  map_lgl(1:nrow(mtcars), ~ {
    test <- mtcars[.x, ] # Pick the .x row of the iteration to be the test set
    train <- mtcars[-.x, ] # Let the training set be all the data EXCEPT that row
    train_model <- glm(am ~ mpg + cyl + disp, family = binomial(), data = train) # Fit any model
    # Since the prediction is a probability, pass it to rbinom()
    # to generate either a 1 or a 0 based on that probability
    prediction <- predict(train_model, newdata = test, type = "response") %>% rbinom(1, 1, .)
    test$am == prediction # compare whether the prediction matches the actual value
  })
summary(loo_result %>% as.numeric) # percentage of accurate results
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 0.0000  0.0000  1.0000  0.5938  1.0000  1.0000
It looks like our model had nearly 60% accuracy, which is not very good, but not entirely bad given our very small sample size.
Advantages:
Just as with the K-Fold CV, this approach is useful because it uses all the data. At some point, every row gets to be in both the test set and the training set, maximizing the information we extract from the sample.
In fact, it uses almost ALL of the original data for training, as each training set has N - 1 observations (this method uses even more of the data than the K-Fold CV).
Disadvantages:
This approach is very heavy on your computer: we need to refit the model N times (although there is a shortcut for linear regression, see here and the sketch after this list).
Given that each test set has only 1 observation, there is a lot of variance in the predictions, making the accuracy estimate less reliable than the one from the K-Fold CV.
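As a side note, here is a minimal sketch of that linear-regression shortcut (the model formula is just a hypothetical example): for ordinary least squares, the leave-one-out residuals can be recovered from a single fit through the hat (leverage) values, with no refitting at all.
fit <- lm(mpg ~ wt + hp, data = mtcars) # any linear model; this formula is illustrative
h <- hatvalues(fit) # leverage of each observation
loocv_mse <- mean((residuals(fit) / (1 - h))^2) # PRESS / N: the exact LOOCV mean squared error
loocv_mse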
The bootstrap method is a bit different. Maybe you’ve heard about the bootstrap for estimating standard errors; the idea for model assessment is very similar: resample the data with replacement many times, fit the model on each resample, and record its accuracy.
Again, the R implementation is very straightforward.
set.seed(21314)
bootstrap <-
  map_dbl(1:500, ~ {
    train <- mtcars[sample(nrow(mtcars), replace = TRUE), ] # randomly sample rows with replacement
    test <- mtcars # evaluate on the full data set
    train_model <- glm(am ~ mpg + cyl + disp, family = binomial(), data = train) # fit any model
    # Get the predicted probabilities and assign a 1 or 0 based on each probability
    prediction <- predict(train_model, newdata = test, type = "response") %>% rbinom(nrow(mtcars), 1, .)
    accuracy <- test$am == prediction # compare whether the predictions match the actual values
    mean(accuracy) # get the proportion of correct predictions
  })
summary(bootstrap)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 0.4375  0.6875  0.7500  0.7468  0.8125  0.9375
We got a better accuracy with the bootstrap (probably biased, see below) and a range of possible values going from about 0.44 to 0.94. Note that if you run these models you’ll get a bunch of warnings like “glm.fit: fitted probabilities numerically 0 or 1 occurred”, because we simply have too few observations for the covariates we are including, resulting in a lot of overfitting.
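That bias comes from evaluating each model on the full data set, which includes the very rows the model was trained on. A common remedy, which the code above does not use, is to score each bootstrap model only on its out-of-bag rows; here is a minimal sketch of that variation:
set.seed(21314)
oob <- map_dbl(1:500, ~ {
  idx <- sample(nrow(mtcars), replace = TRUE) # bootstrap indices
  train <- mtcars[idx, ]
  test <- mtcars[-unique(idx), ] # out-of-bag: rows never drawn into the training sample
  train_model <- glm(am ~ mpg + cyl + disp, family = binomial(), data = train)
  prediction <- predict(train_model, newdata = test, type = "response") %>% rbinom(nrow(test), 1, .)
  mean(test$am == prediction) # accuracy on unseen rows only
})
summary(oob)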
Advantages:
Since we can repeat the resampling as many times as we like (500 replicates here), we get a whole distribution of accuracies rather than a single number, which makes the estimate quite reliable (low variance).
Disadvantages:
The estimate is biased: because we test on the full data set, some observations end up in both the training and the test set, so the accuracy tends to be optimistic.
In the end, it’s a trade-off depending on what you’re looking for. In some instances, it’s alright to have a slightly biased estimate (either pessimistic or optimistic) as long as it’s reliable (the bootstrap). In other instances, it’s better to have an unbiased estimate even if it’s less reliable (the CV methods).
Some rules of thumb:
For large sample sizes, the variance issues become less important and the computational cost becomes more of an issue. I would still stick with repeated CV for both small and large sample sizes (a minimal sketch follows after this list). See here.
Cross validation is a good tool when deciding on the model: it helps you avoid fooling yourself into thinking that you have a good model when in fact you are overfitting. When your model is fixed, using the bootstrap to assess accuracy makes more sense (to me at least). See again here.
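Here is what that repeated CV could look like, assuming the caret package (10 folds repeated 5 times on the same logistic model; the settings are illustrative):
library(caret)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5) # 10-fold CV, repeated 5 times
cv_fit <- train(factor(am) ~ mpg + cyl + disp, data = mtcars,
                method = "glm", trControl = ctrl) # a factor outcome makes this a classification glm
cv_fit$results # accuracy averaged over all folds and repeats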
Again, this is a very crude approach, and the whole idea is to understand the inner workings of these algorithms in practice. For more thorough approaches I suggest using the cv functions from the boot package, or the caret or modelr packages. I hope this was useful. I will try to keep doing these things as they help me understand these techniques better.
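For instance, here is a quick sketch of boot::cv.glm, which wraps the CV loop we wrote by hand (setting K = nrow(mtcars) reproduces the LOOCV); the cost function counts a prediction as wrong when the fitted probability lands on the wrong side of 0.5.
library(boot)
full_model <- glm(am ~ mpg + cyl + disp, family = binomial(), data = mtcars)
cost <- function(y, prob) mean(abs(y - prob) > 0.5) # misclassification rate
cv_err <- cv.glm(mtcars, full_model, cost = cost, K = nrow(mtcars))
1 - cv_err$delta[1] # LOOCV accuracy estimate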