In a recent attempt to bring a bit of discipline into my life, I’ve been forcing myself to read papers after lunch, concentrating specifically on data science topics. The whole idea is to educate myself every day, but if I find something cool that I can implement in R, I’ll do it right away.
This blogpost is the first of a series of entries I plan to post explaining the main concepts of Kohavi (1995), which compares cross-validation methods and bootstrap methods for model selection. This first post implements K-Fold cross-validation from scratch in order to understand more deeply what’s going on behind the scenes.
Before we explain the concept of K-Fold cross validation, we need to define what the ‘Holdout’ method is.
Suppose you have a dataset of house prices with 30 rows. The whole idea is that you build a model that can predict house prices accurately. To ‘train’ your model and see how well it performs, you randomly subset 20 of those rows and fit the model on them. The second step is to predict the values of the 10 rows that were excluded and measure how good those predictions were. As a rule of thumb, experts suggest randomly sampling 80% of the data into the training set and 20% into the test set. A very quick example:
library(tidyverse)
library(modelr)
holdout <- function(repeat_times) { # empty argument for later
  n <- nrow(mtcars)
  eighty_percent <- (n * 0.8) %>% floor
  train_sample <- sample(1:n, eighty_percent) # randomly pick 80% of the rows
  test_sample <- setdiff(1:n, train_sample) # get the remaining 20% of the rows
  train <- mtcars[train_sample, ] # subset the 80% of the rows
  test <- mtcars[test_sample, ] # subset the 20% of the rows
  train_model <- lm(mpg ~ ., data = train)
  # calculate the average difference of the prediction from the actual value
  test %>%
    add_predictions(train_model) %>% # add the predicted mpg values to the test data
    summarize(average_error = (pred - mpg) %>% mean %>% round(2)) %>%
    pull(average_error)
}
set.seed(2131)
holdout()
# [1] 3.59
We can see that, on average, the model trained on the training set over-predicts the actual values in the test set by about 3.6 miles per gallon. A slightly more elaborate approach is what Kohavi (1995) calls “random subsampling”.
In a nutshell: repeat the previous procedure N times and calculate the average and standard deviation of your metric of interest.
set.seed(2134)
random_subsampling <- map_dbl(1:500, holdout)
summary(random_subsampling)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -7.3100 -0.7525 0.4000 0.4255 1.5550 9.1500
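As an aside, the random-subsampling recipe above also asks for the standard deviation of the metric, which summary() doesn’t print. A minimal sketch, reusing the random_subsampling vector we just computed:
mean(random_subsampling) # same value as the Mean column of summary() above
sd(random_subsampling) # how much the error varies from one resample to the next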
We get a mean error of 0.42, a maximum of 9.15 and a minimum of -7.31. Quite some variation, eh? It is precisely for this reason that Kohavi (1995) highlights that random subsampling has an important problem.
Each time we resample, some observations may have already appeared in previous resamples, so the estimates are not independent, and any given training set can end up unrepresentative of the original dataset.
What happens when you try to predict Y from an unrepresentative X, 500 times? Exactly what we just saw above.
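To get a feel for that overlap, here is a small sketch (not from the paper, and the seed is arbitrary) that draws two independent 80% training samples from mtcars and counts how many rows they share:
set.seed(2135) # arbitrary seed, only so this sketch is reproducible
n <- nrow(mtcars)
first_train <- sample(1:n, floor(n * 0.8))
second_train <- sample(1:n, floor(n * 0.8))
length(intersect(first_train, second_train)) # rows shared by the two training sets
# two independent 80% samples of 32 rows share 25 * 25 / 32, roughly 20 rows, on average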
Let’s move on to cross-validation. K-Fold cross-validation is a bit trickier, but here is a simple explanation: split the data into K mutually exclusive slices (or ‘folds’) of roughly equal size, train the model on K - 1 of those slices, and test it on the slice that was left out, repeating until every slice has been the test set exactly once.
We can implement this in R, starting with a function that slices the data into K folds.
k_slicer <- function(data, slices) {
  stopifnot(nrow(data) > slices) # the number of rows must be bigger than the K slices
  slice_size <- (nrow(data) / slices) %>% floor
  rows <- 1:nrow(data)
  data_list <- rep(list(list()), slices) # create an empty list of K slices
  # Randomly sample slice_size rows from the rows available, but exclude these rows
  # from the next sample of rows. This makes sure each slice has unique rows.
  for (k in 1:slices) {
    specific_rows <- sample(rows, slice_size) # sample unique rows for slice k
    rows <- setdiff(rows, specific_rows) # exclude those rows from the remaining pool
    data_list[[k]] <- data[specific_rows, ] # save slice k in the list
  }
  data_list
}
mtcars_sliced <- k_slicer(mtcars, slices = 5) # data sliced in K slices
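A quick sanity check, not strictly necessary, to confirm the slices are mutually exclusive and to see how many rows ended up in each one:
map_int(mtcars_sliced, nrow) # each slice has floor(32 / 5) = 6 rows, so 2 rows of mtcars are dropped
mtcars_sliced %>% map(rownames) %>% reduce(c) %>% anyDuplicated() # should be 0: no car appears in two slices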
All good so far? We took a dataset and split it into K mutually exclusive slices. The next step is to train the model on slices 2:5 and test it on slice 1, then train on slices c(1, 3:5) and test on slice 2, and so on for all K slices. Below we implement it in R.
k_fold_cv <-
  map_dbl(seq_along(mtcars_sliced), ~ {
    test <- mtcars_sliced[[.x]] # take the Kth fold as the test set
    # note the -.x, which excludes that fold from the training data
    train <- mtcars_sliced[-.x] %>% reduce(bind_rows) # bind the rows of all remaining folds
    lm(mpg ~ ., data = train) %>%
      rmse(test) # calculate the root mean square error of predicting the test set
  })
k_fold_cv %>%
summary
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 3.192 3.746 3.993 3.957 4.279 4.574
And we get a summary of the root mean square error, the metric we decided to use this time instead of the average prediction error. This way we can assess how accurate our model is, compare several model specifications, and choose the one that best fits the data.
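For example, we could wrap the loop above into a small helper that takes a formula and returns the mean RMSE across folds; the helper name and the alternative formula below are just made up for illustration:
k_fold_rmse <- function(model_formula, slices) {
  map_dbl(seq_along(slices), ~ {
    test <- slices[[.x]]
    train <- slices[-.x] %>% reduce(bind_rows)
    lm(model_formula, data = train) %>%
      rmse(test)
  }) %>%
    mean
}
k_fold_rmse(mpg ~ ., mtcars_sliced) # the full model we just ran
k_fold_rmse(mpg ~ wt + hp, mtcars_sliced) # an arbitrary smaller specification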
The main advantage of this approach: every observation gets used for both training and testing, since each slice serves as the test set once and as part of the training set in the remaining K - 1 iterations.
This is very interesting in contrast to the holdout method, in which we can’t make the most of our data: take data out of the test set and the error estimate becomes more uncertain; take data out of the training set and the predictions become biased.
This approach, like any other, has disadvantages.
It is computationally intensive, given that we have to fit the model K times. In this setting it’s trivial, but with more complex models this can be quite costly.
If the predictions are bad in any of the K iterations, the overall accuracy estimate will be bad as well. In other words, predictions need to be stable across all K iterations.
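A quick, informal way to eyeball that stability with the fold-level errors we already stored:
sd(k_fold_cv) # spread of the RMSE across the 5 folds; small relative to the mean suggests stable predictions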
Building on the previous point, once the model is stable, increasing the number of folds (5, 10, 20, 25, …) changes little, since the accuracy will be similar (and the variance across the different folds will be similar as well).
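As a rough sketch of this on mtcars, reusing k_slicer() and the hypothetical k_fold_rmse() helper from above (the fold counts are arbitrary, and with only 32 rows mtcars is too small to read much into the result):
set.seed(2141) # arbitrary seed for this sketch
k_fold_rmse(mpg ~ ., k_slicer(mtcars, slices = 5)) # mean RMSE with 5 folds
k_fold_rmse(mpg ~ ., k_slicer(mtcars, slices = 10)) # mean RMSE with 10 folds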
Finally, if Y consists of categories and one of them has very few observations, the best K-Fold CV can do is predict the majority class. If an observation from this rare class ends up in the test set in one of the iterations, the trained model will have very little accuracy for that category. See Kohavi (1995), page 3, example 1 for a detailed example.
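To make this concrete, here is a completely made-up toy example (not from Kohavi’s paper): 30 rows of a ‘common’ class and only 2 rows of a ‘rare’ class, sliced with the k_slicer() above. Most slices end up with no rare rows at all, so whichever folds serve as training data usually can’t learn anything about that class.
set.seed(2151) # arbitrary seed for this sketch
toy <- tibble(
  outcome = c(rep("common", 30), rep("rare", 2)), # heavily imbalanced outcome
  x = rnorm(32)
)
toy_sliced <- k_slicer(toy, slices = 5)
map_int(toy_sliced, ~ sum(.x$outcome == "rare")) # rare rows per slice: mostly zeros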
This was my first attempt at manually implementing the Holdout method and K-Fold CV. These examples are certainly flawed; for instance, rounding down the number of rows per slice (so that each K-Fold slice keeps a unique set of rows) means a few observations can be left out entirely. If anyone is interested in correcting these, please do send a pull request. For those interested in more reliable approaches, take a look at the caret and modelr packages. In the next entry I will implement the LOO method and the bootstrap (and maybe the stratified K-Fold CV).