Holdout and cross-validation

PUBLISHED ON SEP 6, 2017

In a recent attempt to bring a bit of discipline into my life, I’ve been forcing myself to read papers after lunch, specifically concentrating on data science topics. The whole idea is to educate myself every day, and if I find something cool that I can implement in R, I’ll do it right away.

This blogpost is the first in a series of entries I plan to post explaining the main concepts of Kohavi (1995), which compares cross-validation and bootstrap methods for model selection. This first post will implement K-Fold cross validation from scratch in order to understand more deeply what’s going on behind the scenes.

Before we explain the concept of K-Fold cross validation, we need to define what the ‘Holdout’ method is.

Holdout method

  • Holdout method: Imagine we have a dataset with house prices as the dependent variable and two independent variables: the square footage of the house and the number of rooms. Now, imagine this dataset has 30 rows. The whole idea is to build a model that can predict house prices accurately. To ‘train’ the model, we randomly subset 20 of those rows and fit the model on them. The second step, to see how well the model performs, is to predict the values of the 10 rows we excluded and measure how good our predictions were. As a rule of thumb, experts suggest randomly sampling 80% of the data into the training set and 20% into the test set.

A very quick example:

library(tidyverse)
library(modelr)

holdout <- function(repeat_times) { # unused argument, so we can pass this function to map_dbl later
  n <- nrow(mtcars)
  eighty_percent <- (n * 0.8) %>% floor
  train_sample <- sample(1:n, eighty_percent) # randomly pick 80% of the rows
  test_sample <- setdiff(1:n, train_sample) # get the remaining 20% of the rows
  
  train <- mtcars[train_sample, ] # subset the 80% of the rows
  test <- mtcars[test_sample, ] # subset 20% of the rows
  
  train_model <- lm(mpg ~ ., data = train)
  
  test %>%
    add_predictions(train_model) %>% # add the predicted mpg values to the test data
    summarize(average_error = (pred - mpg) %>% mean %>% round(2)) %>%
    pull(average_error)
  # calculate the average difference between the predictions and the actual values
}

set.seed(2131)
holdout()
# [1] 3.59

We can see that on average the model trained on 80% of the data over-predicts the held-out mpg values by about 3.6 points. A natural extension of this idea is what Kohavi (1995) calls “random subsampling”.

Random subsampling

In a nutshell: repeat the previous holdout procedure N times and calculate the average and standard deviation of your metric of interest.

set.seed(2134)

random_subsampling <- map_dbl(1:500, holdout)
summary(random_subsampling)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# -7.3100 -0.7525  0.4000  0.4255  1.5550  9.1500

We get a mean error of 0.42, a maximum of 9.15 and a minimum of -7.31. Quite some variation, eh? It is precisely for this reason that Kohavi (1995) highlights an important problem with random subsampling:

  • Each time we resample, some observations may have already appeared in previous resamples, so the resamples are not independent of each other and the training sets need not be representative of the original dataset (the quick check after this list makes the overlap concrete).

  • What happens when you try to predict Y from an unrepresentative sample of X, 500 times? Exactly what we just saw: widely varying error estimates.
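
To make the overlap concrete, here is a quick back-of-the-envelope check of my own (a sketch, not part of Kohavi’s analysis): draw two independent 80% training samples from mtcars and count what fraction of rows they share.

n <- nrow(mtcars)
train_size <- floor(n * 0.8)

set.seed(2131)
first_train <- sample(1:n, train_size) # one random 80% training sample
second_train <- sample(1:n, train_size) # another one, drawn independently

# fraction of the first training sample that also appears in the second;
# on average this is roughly 80%, so successive training sets are far from
# independent draws
length(intersect(first_train, second_train)) / train_size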

K-Fold cross validation

Let’s move on to cross validation. K-Fold cross validation is a bit trickier, but here is a simple explanation.

  • K-Fold cross validation: Take the house prices dataset from the previous example and divide it into 10 parts of equal size, so if the data is 30 rows long, you’ll have 10 datasets of 3 rows each. Each split contains unique rows not present in any other split. In the first iteration, take the first dataset as the test set and merge the remaining 9 datasets into the training set. Fit the model on the training data, predict on the test data and record the model’s accuracy. In the next iteration, dataset 2 is the test set and datasets 1 and 3 through 10, merged, are the training set. Repeat for all K slices.

We can implement this in R.

k_slicer <- function(data, slices) {
  stopifnot(nrow(data) > slices) # the number of rows must be bigger than the K slices
  slice_size <- (nrow(data) / slices) %>% floor
  
  rows <- 1:nrow(data)
  data_list <- rep(list(list()), slices) # create an empty list with one element per slice

  # Randomly sample slice_size from the rows available, but exclude these rows
  # from the next sample of rows. This makes sure each slice has unique rows.
  for (k in 1:slices) {
    specific_rows <- sample(rows, slice_size) # sample unique rows for K slice
    rows <- setdiff(rows, specific_rows) # exclude those rows from future samples
    data_list[[k]] <- data[specific_rows, ] # save the k-th slice in the list
  }
  
  data_list
}

mtcars_sliced <- k_slicer(mtcars, slices = 5) # data sliced in K slices

All good so far? We took a dataset and split it into K mutually exclusive datasets. The next step is to train the model on slices 2 through 5 and test it on slice 1, then train on slices 1, 3, 4 and 5 and test on slice 2, and so on for all K slices. Below we implement it in R.

k_fold_cv <-
  map_dbl(seq_along(mtcars_sliced), ~ {
    test <- mtcars_sliced[[.x]] # take the K-th slice as the test set
    
    # Note the -.x, which excludes that slice from the training data
    train <- mtcars_sliced[-.x] %>% reduce(bind_rows) # bind all remaining slices
    
    lm(mpg ~ ., data = train) %>%
      rmse(test) # calculate the root mean square error of predicting the test set
  })

k_fold_cv %>%
  summary
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   3.192   3.746   3.993   3.957   4.279   4.574

And we get a summary of the root mean square error, the metric we decided to use this time instead of the raw average error. This way we can assess how accurate our model is, compare several model specifications and choose the one that best fits the data.
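
To make the model-comparison point concrete, here is a sketch of how the same folds could score two specifications; the helper name cv_rmse and the simpler formula mpg ~ wt + hp are arbitrary choices for illustration.

cv_rmse <- function(formula, slices) {
  map_dbl(seq_along(slices), ~ {
    test <- slices[[.x]] # the K-th slice is the test set
    train <- slices[-.x] %>% reduce(bind_rows) # the rest is the training set
    lm(formula, data = train) %>% rmse(test)
  }) %>%
    mean
}

list(full = mpg ~ ., simple = mpg ~ wt + hp) %>%
  map_dbl(cv_rmse, slices = mtcars_sliced)
# the specification with the lower mean RMSE predicts the held-out data better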

The main advantage of this approach:

  • We maximize the use of data because all data is used, at some point, as test and training.

This is in sharp contrast to the holdout method, in which we can’t maximize our data! Take data out of the test set and the error estimate becomes noisier; take data out of the training set and the model (and thus the error estimate) becomes more biased.

This approach, as any other, has disadvantages.

  • It is computationally intensive, given that we have to fit the model K times. In this setting it’s trivial, but with more complex models this can be quite costly.

  • If the predictions in any of the K iterations are bad, the overall accuracy estimate will be dragged down, and the other iterations are likely to be bad as well. In other words, predictions need to be stable across all K iterations.

  • Building on the previous point, once the model is stable, increasing the number of folds (5, 10, 20, 25…) changes little, considering that the accuracy will be similar (and the variance across the different folds will be similar as well); see the sketch after this list.

  • Finally, if Y consists of categories and one of those categories is very rare, the best the model can do is predict the most common class. If an observation from the rare class ends up in the test set in one of the iterations, the trained model will have very little accuracy for that category. See Kohavi (1995), page 3, example 1 for a detailed example.
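
As a rough check of the fold-count point above, we can reuse the (illustrative) cv_rmse helper sketched earlier, slice mtcars into different numbers of folds and compare the mean RMSE of each run; the fold counts below are arbitrary.

set.seed(2131)
c(5, 8, 10) %>%
  set_names() %>%
  map_dbl(function(k) cv_rmse(mpg ~ ., k_slicer(mtcars, slices = k)))
# for a stable model the mean RMSE should move relatively little as K grows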

Conclusion

This was my first attempt at manually implementing the holdout method and K-Fold CV. These examples are certainly flawed; for example, flooring the number of rows per slice means some rows can be left out of the K-Fold slices entirely. If anyone is interested in correcting these, please do send a pull request. For those interested in using more reliable approaches, take a look at the caret and modelr packages. In the next entry I will implement the LOO method and the bootstrap (and maybe the stratified K-Fold CV).
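
As a pointer in that direction, here is a minimal sketch of the modelr route using crossv_kfold; the model and the number of folds are arbitrary choices for illustration.

set.seed(2131)
cv <- crossv_kfold(mtcars, k = 5) # data frame with train/test resample columns

cv %>%
  mutate(model = map(train, ~ lm(mpg ~ ., data = .x)),
         fold_rmse = map2_dbl(model, test, rmse)) %>%
  pull(fold_rmse) %>%
  summary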

  • Kohavi, Ron. “A study of cross-validation and bootstrap for accuracy estimation and model selection.” IJCAI, Vol. 14, No. 2, 1995.