caret is a magical package for doing machine learning in R. Look at this code for running a regularized regression:
library(caret)

inTrain <- createDataPartition(y = mtcars$mpg,
                               p = 0.75,
                               list = FALSE)

reg_mod <- train(
  mpg ~ .,
  data = mtcars[inTrain, ],
  method = "glmnet",
  tuneLength = 10,
  preProc = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10)
)
The two function calls in the expression above split the data into training and test sets, center and scale the predictors, tune the glmnet parameters over a grid of candidate values with 10-fold cross-validation, and refit the best model on the full training set.

That is a lot of modelling, optimization and computation done with almost no mental load. However, in case you didn’t know, caret is doomed to be left behind. The creator of the package has stated that he will keep maintaining it, but most active development will go to tidymodels, its successor.
tidymodels is more or less a restructuring of the caret package (as it aims to do the same thing and more) but with an interface and design philosophy that resembles the Unix philosophy. This means that instead of having one package and one function (caret and train) that do much of the work, all the operations described above are performed by different packages.
tidymodels has been in development for the past two years and the main pieces for effective modelling have been implemented (packages such as parsnip, tune, yardstick, etc.). However, there still isn’t a completely unified workflow that makes them as succinct and elegant as train. I’ve been keeping an eye on the development of the different tidymodels packages and I really want to understand the key workflow that will make modelling with tidymodels easy.
The objective of this post is to present what I think is currently the most succinct and barebones workflow that a user should need with tidymodels. I reached this workflow by looking at the machine learning tutorials from the RStudio conference and stripping out most of the details to see the link between the high-level steps in the modelling workflow and where tidymodels fits. In particular, I was curious about how tidymodels makes the workflow fit a logical set of steps without much mental load.
This post is not an introduction to the tidymodels package. It assumes you are familiar with some of the main packages and sticks to the basics of what tidymodels can do (no fancy modelling or deep learning).

In fact, I’ve always had some issues using tidymodels: it has so many functions that are hard to think of as isolated entities that remembering every step is quite difficult (unlike the tidyverse, where each package can be thought of as an entity independent of the others, and you combine them because they work well together).
This post is slightly longer than my usual posts, so here’s the TL;DR (too long; didn’t read) version of the workflow:
library(AmesHousing)
# devtools::install_github("tidymodels/tidymodels")
library(tidymodels)

ames <- make_ames()

############################# Data Partitioning ###############################
###############################################################################

ames_split <- rsample::initial_split(ames, prop = .75)
ames_train <- rsample::training(ames_split)
ames_cv <- rsample::vfold_cv(ames_train)

############################# Preprocessing ###################################
###############################################################################

mod_rec <-
  recipes::recipe(Sale_Price ~ Longitude + Latitude + Neighborhood,
                  data = ames_train) %>%
  recipes::step_log(Sale_Price, base = 10) %>%
  recipes::step_other(Neighborhood, threshold = 0.05) %>%
  recipes::step_dummy(recipes::all_nominal())

############################# Model Training/Tuning ###########################
###############################################################################

## Define a regularized regression and explicitly leave the tuning parameters
## empty for later tuning.
lm_mod <-
  parsnip::linear_reg(penalty = tune::tune(), mixture = tune::tune()) %>%
  parsnip::set_engine("glmnet")

## Construct a workflow that combines your recipe and your model
ml_wflow <-
  workflows::workflow() %>%
  workflows::add_recipe(mod_rec) %>%
  workflows::add_model(lm_mod)

# Find best tuned model
res <-
  ml_wflow %>%
  tune::tune_grid(resamples = ames_cv,
                  grid = 10,
                  metrics = yardstick::metric_set(yardstick::rmse))

############################# Validation ######################################
###############################################################################

# Select best parameters
best_params <-
  res %>%
  tune::select_best(metric = "rmse", maximize = FALSE)

# Refit using the entire training data
reg_res <-
  ml_wflow %>%
  tune::finalize_workflow(best_params) %>%
  parsnip::fit(data = ames_train)

# Predict on test data. The workflow applies the recipe to the new data,
# so the raw test set is passed directly.
ames_test <- rsample::testing(ames_split)

reg_res %>%
  predict(new_data = ames_test) %>%
  bind_cols(ames_test, .) %>%
  mutate(Sale_Price = log10(Sale_Price)) %>%
  select(Sale_Price, .pred) %>%
  yardstick::rmse(Sale_Price, .pred)
and here’s what I think it should look like in pseudocode:
############################# Pseudocode ######################################
###############################################################################

library(AmesHousing)
# devtools::install_github("tidymodels/tidymodels")
library(tidymodels)

ames <- make_ames()

ml_wflow <-
  # Original data (unsplit)
  ames %>%
  workflow() %>%
  # Split test/train
  initial_split(prop = .75) %>%
  # Specify cross-validation
  vfold_cv() %>%
  # Start preprocessing
  recipe(Sale_Price ~ Longitude + Latitude + Neighborhood) %>%
  step_log(Sale_Price, base = 10) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(recipes::all_nominal()) %>%
  # Define model
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  # Define grid of tuning parameters
  tune_grid(grid = 10)

# ml_wflow shouldn't run anything -- it's just a specification
# of all the different steps. `fit` should run everything
ml_wflow <- fit(ml_wflow)

# Plot results of tuning parameters
ml_wflow %>%
  autoplot()

# Automatically extract best parameters and fit to the training data
final_model <-
  ml_wflow %>%
  fit_best_model(metrics = metric_set(rmse))

# Predict on the test data using the last model
# Everything is bundled into a workflow object
# and everything can be extracted with separate
# functions with the same verb
final_model %>%
  holdout_error()
If you want more details on each step, continue reading :).
Let’s recycle the operations I described above from caret::train and redefine them as general principles: data partitioning, preprocessing, model training/tuning, and validation.
Before we start, let’s load the two packages and data we’ll use:
library(AmesHousing)
# devtools::install_github("tidymodels/tidymodels")
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.0.4 ──
## ✔ broom 0.5.4 ✔ recipes 0.1.9
## ✔ dials 0.0.4 ✔ rsample 0.0.5.9000
## ✔ dplyr 0.8.4 ✔ tibble 2.1.3
## ✔ ggplot2 3.2.1 ✔ tune 0.0.1
## ✔ infer 0.5.1 ✔ workflows 0.1.0.9000
## ✔ parsnip 0.0.5.9000 ✔ yardstick 0.0.5
## ✔ purrr 0.3.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ ggplot2::margin() masks dials::margin()
## ✖ recipes::step() masks stats::step()
## ✖ recipes::yj_trans() masks scales::yj_trans()
ames <- make_ames()
This step is performed by the rsample package. It allows you to do two basic things in machine learning: separate your training/test sets and create resample sets for tuning. Since nearly all machine learning modelling requires model tuning, I will create a cross-validation set in this example.
ames_split <- rsample::initial_split(ames, prop = .75)
ames_train <- rsample::training(ames_split)
ames_cv <- rsample::vfold_cv(ames_train)
I believe the code above is quite easy to understand and (even if slightly more verbose than the caret equivalent) is quite elegant. For now, there are two things to keep in mind: we have a training set (ames_train) and we have a cross-validation set (ames_cv). We can forget about the testing set altogether for now, since it is only used at the very end.
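To check what the split and the resamples actually contain, a small sketch like the following can help. It is only an illustration: the seed is arbitrary and ames_test and first_fold are names introduced here, not part of the workflow above.

```r
library(AmesHousing)
library(rsample)

set.seed(2020)  # arbitrary seed, only so the split is reproducible
ames <- make_ames()

ames_split <- initial_split(ames, prop = .75)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)
ames_cv    <- vfold_cv(ames_train)  # 10 folds by default

# Training and testing rows together make up the original data
c(train = nrow(ames_train), test = nrow(ames_test), total = nrow(ames))

# Each row of ames_cv is one fold: analysis() is the part used for fitting
# and assessment() is the held-out part used for measuring performance
first_fold <- ames_cv$splits[[1]]
c(analysis   = nrow(analysis(first_fold)),
  assessment = nrow(assessment(first_fold)))
```

Note that the folds are built from ames_train only, so the testing rows never take part in tuning.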
caret takes care of doing the preprocessing behind the scenes while the user only needs to specify which steps are needed. In tidymodels, the recipes package takes care of preprocessing and you have to perform each step explicitly:
mod_rec <-
  recipes::recipe(Sale_Price ~ Longitude + Latitude + Neighborhood,
                  data = ames_train) %>%
  recipes::step_log(Sale_Price, base = 10) %>%
  recipes::step_other(Neighborhood, threshold = 0.05) %>%
  recipes::step_dummy(recipes::all_nominal())
I find this preprocessing statement very intuitive as well. You define the formula for your analysis, provide the training dataset and then apply whatever transformations the variables need. So far the workflow is simple but growing:
Divide training set
-> Define model formula
-> Specify the data is the training set
-> Apply preprocessing
Previously, recipes was a bit confusing because it involved steps which are not easy to remember: prep the dataset and then juice or bake it depending on what you want to do (even more verbose and complex when applying this to a cross-validation set). With the workflows package, these steps have been completely eliminated from the user’s mental load.
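For reference, this is roughly what those manual steps look like when written out by hand. It is a minimal sketch rather than part of the workflow in this post: the seed and the object names mod_prepped, train_baked and test_baked are made up for the illustration.

```r
library(AmesHousing)
library(rsample)
library(recipes)

set.seed(2020)  # arbitrary seed
ames <- make_ames()
ames_split <- initial_split(ames, prop = .75)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

mod_rec <-
  recipe(Sale_Price ~ Longitude + Latitude + Neighborhood,
         data = ames_train) %>%
  step_log(Sale_Price, base = 10) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(all_nominal())

# 1. prep(): estimate whatever the steps need from the training data
mod_prepped <- prep(mod_rec, training = ames_train)

# 2. juice(): extract the preprocessed training data
train_baked <- juice(mod_prepped)

# 3. bake(): apply the same, already estimated, steps to new data
test_baked <- bake(mod_prepped, new_data = ames_test)
```

Three extra objects and three extra verbs appear before any model is even defined, which is the overhead that workflows hides.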
Model training and tuning is the step on which I think tidymodels brings in too many moving parts. This has been partially ameliorated with workflows. For this step there are three to four packages: parsnip for modelling, workflows for creating modelling workflows, tune for tuning models and yardstick for validating the results. Let’s see how they fit together:
## Define a regularized regression and explicitly leave the tuning parameters
## empty for later tuning.
lm_mod <-
  parsnip::linear_reg(penalty = tune::tune(), mixture = tune::tune()) %>%
  parsnip::set_engine("glmnet")

## Construct a workflow that combines your recipe and your model
ml_wflow <-
  workflows::workflow() %>%
  workflows::add_recipe(mod_rec) %>%
  workflows::add_model(lm_mod)
The expression above adds much more flexibility, as you can swap models just by changing linear_reg to another model. However, it also adds more complexity. tune() requires you to know about parameters() to extract the parameters needed to create the grid, and for that you have to be aware of the grid_* functions that create a grid of values. These come from the dials package rather than the tune package, and all of these moving parts return different things, so they’re not very easy to remember at first glance.
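As an illustration of those moving parts, here is one way an explicit grid could be built with dials. It is a sketch under assumptions: grid_regular() is just one of the grid_* functions, and the choice of 5 levels and the name glmnet_grid are arbitrary.

```r
library(dials)

# penalty() and mixture() are the dials parameter objects that match the
# two tune() placeholders in linear_reg(); levels = 5 gives 5 x 5 = 25 rows
glmnet_grid <- grid_regular(penalty(), mixture(), levels = 5)

glmnet_grid
```

The result is a plain tibble with one column per tuning parameter, so it can be inspected or filtered like any other data frame before it is used for tuning.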
Having said that, the actual tuning is done with tune_grid, where we specify the cross-validated set from the first step. Here tune_grid is quite elegant since it allows you to specify either a grid of values or an integer, which it will use to create a grid of parameters:
res <-
  ml_wflow %>%
  tune::tune_grid(resamples = ames_cv,
                  grid = 10,
                  metrics = yardstick::metric_set(yardstick::rmse))
And finally, you can get the summary of the metrics with collect_metrics:
res %>%
  tune::collect_metrics()
## # A tibble: 10 x 7
## penalty mixture .metric .estimator mean n std_err
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 4.99e-10 0.577 rmse standard 0.141 10 0.00327
## 2 3.11e- 9 0.655 rmse standard 0.141 10 0.00327
## 3 2.74e- 8 0.476 rmse standard 0.141 10 0.00327
## 4 1.86e- 7 0.795 rmse standard 0.141 10 0.00327
## 5 8.39e- 6 0.976 rmse standard 0.141 10 0.00327
## 6 8.47e- 5 0.177 rmse standard 0.141 10 0.00327
## 7 6.00e- 4 0.394 rmse standard 0.141 10 0.00327
## 8 4.45e- 3 0.268 rmse standard 0.141 10 0.00329
## 9 1.28e- 2 0.143 rmse standard 0.142 10 0.00331
## 10 1.66e- 1 0.863 rmse standard 0.175 10 0.00387
Or choose the best parameters with select_best:
best_params <-
  res %>%
  tune::select_best(metric = "rmse", maximize = FALSE)

best_params
## # A tibble: 1 x 2
## penalty mixture
## <dbl> <dbl>
## 1 0.0000847 0.177
The final step is to extract the best model and contrast the training and test error. Here workflows offers some convenience to replace the placeholders in the model with the best parameters and fit the complete training data with them. In caret, this step is completely automated by train, where you can extract the best model even after exploring the results of different tuning parameters.
reg_res <-
  ml_wflow %>%
  # Attach the best tuning parameters to the model
  tune::finalize_workflow(best_params) %>%
  # Fit the final model to the training data
  parsnip::fit(data = ames_train)

ames_test <- rsample::testing(ames_split)

reg_res %>%
  predict(new_data = ames_test) %>%
  bind_cols(ames_test, .) %>%
  mutate(Sale_Price = log10(Sale_Price)) %>%
  select(Sale_Price, .pred) %>%
  yardstick::rmse(Sale_Price, .pred)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.139
One of the things I don’t like about fit in this scenario is that I have to specify the training data again. I understand that the data specified in recipe could even be an empty data frame, as it is used only to detect the column names. However, in nearly all the applications I can think of, I will specify the training data at the beginning (in my recipe). So I find that having to specify the data again is a step that could be eliminated altogether if the data were in the workflow.
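One partial way around this, assuming a version of tune that provides last_fit(), is to hand the initial split to the workflow so that the fit on the training part and the evaluation on the testing part happen in a single call. The sketch below is self-contained and uses fixed, arbitrary values for penalty and mixture in place of the tuned ones; final_wflow, final_res and test_metrics are names introduced here.

```r
library(AmesHousing)
library(tidymodels)

set.seed(2020)  # arbitrary seed
ames <- make_ames()
ames_split <- initial_split(ames, prop = .75)

mod_rec <-
  recipe(Sale_Price ~ Longitude + Latitude + Neighborhood,
         data = training(ames_split)) %>%
  step_log(Sale_Price, base = 10) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(all_nominal())

# Fixed, arbitrary parameter values stand in for the tuned ones
lm_mod <-
  linear_reg(penalty = 0.001, mixture = 0.5) %>%
  set_engine("glmnet")

final_wflow <-
  workflow() %>%
  add_recipe(mod_rec) %>%
  add_model(lm_mod)

# last_fit() fits on the training part of the split and evaluates on the
# testing part, so the data is not passed again by hand
final_res <- last_fit(final_wflow, split = ames_split)

test_metrics <- collect_metrics(final_res)
test_metrics
```

The training data still appears once in the recipe, so this shortens the validation step rather than removing the repetition entirely.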
There are many things to remember from the workflow above. Below is a kind of cheatsheet:

rsample::initial_split: split your data into training/testing
rsample::training: extract the training data
rsample::vfold_cv: create a cross-validated set from the training data
recipes::recipe: define your formula with the training data
recipes::step_*: add any preprocessing steps to your data
parsnip::linear_reg: define your model. This example shows a linear regression but it could be anything else (e.g. a random forest)
tune::tune: leave the tuning parameters empty for later
parsnip::set_engine: set the engine to run the models (which package to use)
workflows::workflow: create a workflow object to hold your model/recipe
workflows::add_recipe: add the recipe to your workflow
workflows::add_model: add the model to your workflow
yardstick::metric_set: create a set of metrics
yardstick::rmse: specify the root-mean-square error as the loss function
tune::tune_grid: run the workflow across all resamples with the desired tuning parameters
tune::collect_metrics: collect the performance metrics for each set of tuning parameters
tune::select_best: select the best tuning parameters
tune::finalize_workflow: replace the empty parameters of the model with the best tuned parameters
parsnip::fit: fit the final model to the training data
rsample::testing: extract the testing data from the initial split
predict: predict with the trained model on the testing data

This is currently what I think is the simplest workflow to train models in tidymodels. This is of course a very simplified example which doesn’t create tuning grids or tune parameters in the recipes. It is supposed to be the barebones workflow that is currently available in tidymodels. Having said that, I still think there are too many steps, which makes the workflow convoluted.
tidymodels is currently being designed to be decoupled into several packages and the key steps for modelling are already implemented. This offers greater flexibility for defining models, making some of the steps in modelling less obscure and more explicit.
Having said that, there is too much to remember. dplyr::select is a function which is easy to remember because it can be thought of as an independent entity which I can use with a data.table or base R. On top of that, I know it follows the general principle of the tidyverse where it only accepts a data frame and only returns a data frame. This makes it much more memorable. Due to its simplicity, it’s easy to think of it like a hammer: I can apply it to so many different problems that I don’t have to memorize it, it becomes a general tool that represents an abstract idea.
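The data-frame-in, data-frame-out principle can be seen with a tiny example (mtcars is used only because it ships with R, and small_cars is a name made up here):

```r
library(dplyr)

# A data frame goes in, a data frame comes out
small_cars <- select(mtcars, mpg, cyl)

class(small_cars)
names(small_cars)
```

Nothing else needs to be set up beforehand, which is what makes the function easy to reason about in isolation.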
Some of the functions/packages from tidymodels are difficult to think of like that. I believe this is because they are supposed to be almost always used together; otherwise they have no practical application. tune, workflows and parsnip introduce several ideas which I think are difficult to remember (mainly because you have to memorize them: they don’t come naturally, the way an abstract concept would).
workflows seems to be a package that combines some of the steps performed by parsnip and recipes, suggesting that you can build a logical workflow with it. However, workflows is introduced after you define your preprocessing and model. My intuition tells me that the workflow should begin at the start rather than in the middle. For example, in pseudocode a logical workflow could look like this:
ml_wflow <-
  # Original data (unsplit)
  ames %>%
  # Begin workflow
  workflow() %>%
  # No need to extract training/testing, they're already in the workflow
  # This eliminates the mental load of mixing up training/testing and
  # mistakenly predict one over the other.
  initial_split(prop = .75) %>%
  # Apply directly the cross-validation to the training set. No resaving
  # the data into different names, adding more and more objects to remember
  vfold_cv() %>%
  # Introduce preprocessing
  # No need to specify the data, the training data is already inside
  # the workflow. This simplifies having to specify your training
  # data in many different places (recipes, fit, vfold_cv). The data
  # was specified at the beginning and that's it.
  recipe(Sale_Price ~ Longitude + Latitude + Neighborhood) %>%
  step_log(Sale_Price, base = 10) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(recipes::all_nominal()) %>%
  # Add your model definition and include placeholders for your tuning
  # parameters
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")
I believe the code above is much more logical than the current setup for three reasons which are very much related to each other.

First, it follows the ‘traditional’ workflow of machine learning more clearly without intermediate steps. You begin with your data and add the key modelling steps one by one. Second, it avoids creating too many intermediate objects which add mental load. Whenever I’m using tidymodels I have to remember so many things: the training data, the cross-validated set, the recipe, the tuning grid, the model, etc. I often forget what I need to add to tune_grid: is it the recipe and the resample set? Is it the workflow? Did I mistakenly add the test set to the recipe and fit the data with the training set? It’s very easy to get lost along the way. And third, I think the workflow from above fits the tidyverse philosophy much better, where you can read the steps from left to right, in a linear fashion.
The power of the pseudocode above is that the workflow object holds your whole modelling process from the beginning, meaning you can add or remove stuff from it. For example, it would be very easy to add another model to be compared:
ml_wflow <-
  # Original data (unsplit)
  ames %>%
  workflow() %>%
  initial_split(prop = .75) %>%
  vfold_cv() %>%
  recipe(Sale_Price ~ Longitude + Latitude + Neighborhood) %>%
  step_log(Sale_Price, base = 10) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(recipes::all_nominal()) %>%
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  # Adds another model
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger")
The code above could also include additional steps for adding tuning grids for each model, and then a final call to fit would fit all models/tuning parameters directly on the cross-validated set. Additionally, since the original data is in the workflow, methods for fitting the best model to the complete training data could be implemented, as well as methods for running the best tuned model on the test data. No objects lying around to remember, and everything is unified into a bundle of logical steps which begin with your data.
This workflow idea doesn’t introduce anything new programmatically in tidymodels: all the ingredients are currently implemented. The idea is to rearrange specific methods to handle a workflow in this fashion. It is just a prototype and I’m sure that many things can be improved. I do think, however, that this is the direction which would make tidymodels a truly friendly interface. At least to me, it would make it as easy to use as the tidyverse.
This post is intended to be a thought-provoking take on the current development of tidymodels. I’m a big fan of RStudio and their work and I’m looking forward to the “official release” of tidymodels. I wrote this piece with the intention of understanding the current workflow but noticed that I’m not comfortable with it, nor did it come naturally. I hope these ideas can help illustrate some of the bottlenecks that future tidymodels users may face, with the aim of improving the user experience of the tidymodels modelling framework.