library(tidymodels)
library(tidyflow)
library(ggfortify)
library(rpart.plot)

data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)
The tree-based methods we've seen so far use decision trees as their baseline models
They use ensemble approaches that average the predictions of many decision trees
Boosting also uses decision trees as the baseline model, but its ensemble strategy is fundamentally different
Manual example: let's fit a very weak decision tree
dt_tree <- decision_tree(mode = "regression", tree_depth = 1, min_n = 10) %>%
  set_engine("rpart")

pisa_tr <- training(initial_split(pisa))

tflow <- tidyflow(pisa_tr, seed = 51231) %>%
  plug_formula(math_score ~ scie_score) %>%
  plug_model(dt_tree)

mod1 <- fit(tflow)
mod1 %>% pull_tflow_fit() %>% .[['fit']] %>% rpart.plot()
Weak model with tree_depth = 1
What is the RMSE?
res_mod1 <- pisa_tr %>% cbind(., predict(mod1, new_data = .))
res_mod1 %>% rmse(math_score, .pred)
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        55.7
res_mod1 <- res_mod1 %>% mutate(.resid = math_score - .pred)

res_mod1 %>%
  ggplot(aes(scie_score, .resid)) +
  geom_point(alpha = 1/3) +
  scale_x_continuous(name = "Science scores") +
  scale_y_continuous(name = "Residuals") +
  theme_minimal()
# Convert `math_score` to be the residuals of model 1
res_mod1 <- mutate(res_mod1, math_score = .resid)

# Replace the data in our `tflow`. In the data `res_mod1`, `math_score` is
# now the residuals of the first model
mod2 <- tflow %>% replace_data(res_mod1) %>% fit()

mod2 %>% pull_tflow_fit() %>% .[['fit']] %>% rpart.plot()
res_mod2 <- pisa_tr %>%
  cbind(., predict(mod2, new_data = .)) %>%
  mutate(.resid = math_score - .pred)

res_mod2 %>%
  ggplot(aes(scie_score, .resid)) +
  geom_point(alpha = 1/3) +
  scale_x_continuous(name = "Science scores") +
  scale_y_continuous(name = "Residuals") +
  theme_minimal()
Boosting is a way for each model to boost the previous model's performance
Suppose we repeated this 20 times, so every respondent had 20 predictions: could we just take the average?
mod1_pred <- predict(mod1, new_data = pisa_tr)
names(mod1_pred) <- "pred_mod1"

mod2_pred <- predict(mod2, new_data = pisa_tr)
names(mod2_pred) <- "pred_mod2"

resid_pred <- cbind(mod1_pred, mod2_pred)
head(resid_pred)
  pred_mod1 pred_mod2
1  540.8022 -10.33072
2  406.8686 -10.33072
3  406.8686 -10.33072
4  406.8686 -10.33072
5  406.8686 -10.33072
6  540.8022 -10.33072
The first model predicts math scores on their original scale, but all the remaining models predict residuals
Final prediction is the sum of all predictions
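In symbols (standard boosting notation, not taken from these slides), the combined prediction for respondent i is ŷᵢ = f̂₁(xᵢ) + f̂₂(xᵢ) + … + f̂ₘ(xᵢ), where f̂₁ predicts the outcome itself and every later model predicts the residuals left by the models before it.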
For our small-scale example, we can do that with rowSums:
resid_pred$final_pred <- rowSums(resid_pred)
head(resid_pred)
  pred_mod1 pred_mod2 final_pred
1  540.8022 -10.33072   530.4715
2  406.8686 -10.33072   396.5379
3  406.8686 -10.33072   396.5379
4  406.8686 -10.33072   396.5379
5  406.8686 -10.33072   396.5379
6  540.8022 -10.33072   530.4715
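Before moving to xgboost, here is a minimal sketch of the same idea generalized to more iterations (illustrative only: it uses rpart directly, and the learn_rate shrinkage parameter is an addition that mimics what gradient boosting implementations such as xgboost do):

library(rpart)

n_trees <- 20
learn_rate <- 1  # set below 1 to mimic the shrinkage used by gradient boosting

# Working data: the 'outcome' column starts as math_score and is replaced
# by the running residuals after each iteration
boost_df <- data.frame(scie_score = pisa_tr$scie_score,
                       outcome = pisa_tr$math_score)
current_pred <- rep(0, nrow(boost_df))

for (i in seq_len(n_trees)) {
  # Fit a stump (depth-1 tree) to the current residuals
  fit_i <- rpart(outcome ~ scie_score, data = boost_df,
                 control = rpart.control(maxdepth = 1, minsplit = 10))
  # Add (a possibly shrunken version of) its predictions to the running prediction
  current_pred <- current_pred + learn_rate * predict(fit_i, boost_df)
  # Update the residuals for the next iteration
  boost_df$outcome <- pisa_tr$math_score - current_pred
}

# Training RMSE of the hand-rolled ensemble
sqrt(mean((pisa_tr$math_score - current_pred)^2))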
math_score regressed on all variables with xgboost
boost_mod <- boost_tree(mode = "regression", trees = 500) %>%
  set_engine("xgboost")

tflow <- pisa %>%
  tidyflow(seed = 51231) %>%
  plug_formula(math_score ~ .) %>%
  plug_split(initial_split) %>%
  plug_model(boost_mod)

boot_res <- fit(tflow)
[13:11:49] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
rmse_gb_train <- boot_res %>%
  predict_training() %>%
  rmse(math_score, .pred)

rmse_gb_train
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard    0.000679
gb_rmse <- boot_res %>%
  predict_testing() %>%
  rmse(math_score, .pred) %>%
  pull(.estimate)

c("Extreme Gradient Boosting" = gb_rmse)
Extreme Gradient Boosting 
                 27.02214 
Boosting outperforms all others considerably
Boosting and xgboost are considered to be among the best predictive models
They can achieve great accuracy even with default values
Increasing the number of trees in a boosting algorithm can increase overfitting
For the random forest, increasing the number of trees has no impact on overfitting
You might reach a point where additional trees just fit residuals that are pure noise, resulting in overfitting.
stop_iter signals that after N trees have passed without any improvement, the algorithm should stop. This approach often runs fewer trees than the user requested.
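A hedged sketch of how this could look with the tidyflow pipeline defined above (stop_iter is a parsnip argument; the validation engine argument, which holds out a fraction of the training data to monitor improvement, and tidyflow's replace_model() are assumptions worth checking against the package documentation):

# Ask for many trees but stop once 20 consecutive trees bring no improvement
boost_mod_early <- boost_tree(mode = "regression",
                              trees = 2000,
                              stop_iter = 20) %>%
  set_engine("xgboost", validation = 0.2)  # assumed engine argument

res_early <- tflow %>%
  replace_model(boost_mod_early) %>%  # assumed tidyflow helper
  fit()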
There are other tuning parameters available in boost_tree which you can use to improve your model (a short tuning sketch follows the list):
trees: the number of trees that will be run
mtry: the number of predictors sampled at each split, just as in random forests
min_n: the minimum number of observations in each node
tree_depth: how deep (complex) each tree is grown
learn_rate: controls how much we regularize each tree's contribution
loss_reduction: the amount of reduction in your loss function (for example, RMSE) required for a split in a decision tree to be made. You can see this as a cost-effectiveness check: only if the split improves the tree's prediction by at least X do we allow the tree to grow further.
sample_size: the proportion of the data used in each iteration of the decision tree. This is similar to the bagging approach, where we bootstrap the data on each iteration.
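For instance, a sketch of tuning two of these parameters with the same tidyflow machinery used elsewhere in these slides (the grid values are arbitrary illustrations, not recommendations):

boost_mod_tune <- boost_tree(mode = "regression",
                             trees = 500,
                             learn_rate = tune(),
                             tree_depth = tune()) %>%
  set_engine("xgboost")

tflow_boost_tune <- pisa %>%
  tidyflow(seed = 51231) %>%
  plug_formula(math_score ~ .) %>%
  plug_split(initial_split) %>%
  plug_resample(vfold_cv) %>%
  plug_grid(expand.grid,
            learn_rate = c(0.01, 0.1, 0.3),
            tree_depth = c(1, 3, 6)) %>%
  plug_model(boost_mod_tune)

res_boost_tune <- fit(tflow_boost_tune)

res_boost_tune %>%
  pull_tflow_fit_tuning() %>%
  collect_metrics() %>%
  filter(.metric == "rmse")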
No dependent variables
Unsupervised methods are certainly less developed: they look for similarities between observations without any dependent variable to guide them
True AI is dependent-variable-free
Humans are excellent unsupervised models
In this course: PCA and K-Means Clustering
Principal Component Analysis (PCA) summarizes many columns into a very small subset that captures the greatest variability of the original columns
PCA works by creating several components, each a normalized linear combination of the variables of interest
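In standard notation (not from the original slides): the first component is Z₁ = φ₁₁X₁ + φ₂₁X₂ + … + φ₆₁X₆, where the loadings φ are chosen to maximize the variance of Z₁ subject to the normalization φ₁₁² + … + φ₆₁² = 1.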
In the pisa data there are six variables that ask students whether they've suffered negative behaviour from other students in the past 12 months:
Other students left them out of things on purpose
Other students made fun of them
They were threatened by other students
Other students took away or destroyed things that belonged to them
They got hit or pushed around by other students
Other students spread nasty rumours about them
The scale ranges from 1 to 4; the higher the number, the more negative their response.
pisa <- rename(
  pisa,
  past12_left_out = ST038Q03NA,
  past12_madefun_of_me = ST038Q04NA,
  past12_threatened = ST038Q05NA,
  past12_destroyed_personal = ST038Q06NA,
  past12_got_hit = ST038Q07NA,
  past12_spread_rumours = ST038Q08NA
)

pisa_selected <- pisa %>% select(starts_with("past12"))
cor(pisa_selected)
                          past12_left_out past12_madefun_of_me
past12_left_out                 1.0000000            0.6073982
past12_madefun_of_me            0.6073982            1.0000000
past12_threatened               0.4454125            0.4712083
past12_destroyed_personal       0.4037351            0.4165931
past12_got_hit                  0.3918129            0.4480862
past12_spread_rumours           0.4746302            0.5069299
                          past12_threatened past12_destroyed_personal
past12_left_out                   0.4454125                 0.4037351
past12_madefun_of_me              0.4712083                 0.4165931
past12_threatened                 1.0000000                 0.5685773
past12_destroyed_personal         0.5685773                 1.0000000
past12_got_hit                    0.5807617                 0.6206485
past12_spread_rumours             0.5513099                 0.4543380
                          past12_got_hit past12_spread_rumours
past12_left_out                0.3918129             0.4746302
past12_madefun_of_me           0.4480862             0.5069299
past12_threatened              0.5807617             0.5513099
past12_destroyed_personal      0.6206485             0.4543380
past12_got_hit                 1.0000000             0.4451408
past12_spread_rumours          0.4451408             1.0000000
All of these correlations are positive, most of them between 0.4 and 0.6
PCA receives as input P variables (in this case six)
--> It calculates a normalized linear combination of the P variables
--> This new variable is the linear combination of the six variables that captures the greatest possible variance
--> PCA then continues to calculate other normalized linear combinations, each uncorrelated with the previous ones
PCA constructs as many principal components as there are variables, which together capture 100% of the variability
Each PC is assessed by how much variance it explains
We normally need to center and scale the variables before PCA; here, however, all six variables are already on the same scale
Let's pass our six variables to the function prcomp, which estimates the principal components:
pc <- prcomp(pisa_selected)
all_pcs <- as.data.frame(pc$x)
head(all_pcs)
           PC1        PC2         PC3          PC4          PC5          PC6
1 -2.836172297 -0.7549602 -1.91065434 -0.232647114 -0.368981283 -1.885607656
2 -1.478020766  0.6622561  0.94113153  0.181451711  0.149387648  0.678384471
3  1.025953306  0.1602906 -0.03806864 -0.008994148  0.009439987 -0.002391996
4 -0.002173173 -0.7902197 -0.10112894 -0.197389118  0.013521080 -0.002718289
5 -4.832075955  0.1996595  0.39221922 -0.256660522 -1.178883084  0.150399629
6 -1.132036976 -1.8534154 -0.68913950  0.914561923  0.065907346  0.087208533
The result of all of this is a data frame with six new columns
These new columns summarize the information contained in the original six variables
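To convince yourself that these columns really are the normalized linear combinations described above, a quick check (not part of the original code) is to rebuild the scores by hand from the centred variables and the loadings returned by prcomp; the result should match pc$x:

# prcomp() centres the variables by default; the scores in pc$x are the
# centred data multiplied by the loadings (rotation) matrix
manual_scores <- scale(pisa_selected, center = TRUE, scale = FALSE) %*% pc$rotation
all.equal(unname(manual_scores), unname(pc$x))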
tidy(pc, "pcs")
# A tibble: 6 x 4
     PC std.dev percent cumulative
  <dbl>   <dbl>   <dbl>      <dbl>
1     1   1.34   0.591       0.591
2     2   0.640  0.135       0.726
3     3   0.530  0.0929      0.819
4     4   0.522  0.0899      0.909
5     5   0.394  0.0513      0.960
6     6   0.347  0.0397      1    
The first principal component explains about 59% of the variance
The second principal component explains an additional 13.5%
Together they account for about 72.6% of the variance
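If you prefer to see this visually, a quick sketch (using the tidy() output shown above; ggplot2 is loaded with tidymodels) plots the proportion of variance explained by each component:

# Proportion of variance explained by each principal component
tidy(pc, "pcs") %>%
  ggplot(aes(PC, percent)) +
  geom_col() +
  scale_x_continuous(name = "Principal component", breaks = 1:6) +
  scale_y_continuous(name = "Proportion of variance explained") +
  theme_minimal()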
cor(all_pcs[c("PC1", "PC2")])
                        PC1                     PC2
PC1  1.00000000000000000000 -0.00000000000001545012
PC2 -0.00000000000001545012  1.00000000000000000000
As expected, the correlation between these two components is effectively zero.
A social scientist would want to make sure that the explanatory power of these two components is high enough for their purposes.
PCA is all about exploratory data analysis.
We might want to go further and explore how the original six variables are related to these principal components.
These two principal components are a bit of a black box at this point. Which variables do they represent? We can check that with the initial output of prcomp:
pc$rotation[, 1:2]
                                 PC1        PC2
past12_left_out           -0.4631946 -0.4189125
past12_madefun_of_me      -0.5649319 -0.5315979
past12_threatened         -0.3446963  0.4025682
past12_destroyed_personal -0.2694606  0.3405411
past12_got_hit            -0.2987481  0.3715999
past12_spread_rumours     -0.4308453  0.3546832
First PC: all correlations are negative.
Informally, we could call this variable a 'negative-peer index'.
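As a rough, hypothetical sketch (not part of the original example): since all of PC1's loadings are negative, flipping its sign yields a score where higher values correspond to more reported negative peer behaviour:

# Hypothetical 'negative-peer index': reverse the sign of PC1 so that higher
# values mean more reported negative behaviour from other students
negative_peer_index <- -all_pcs$PC1
head(negative_peer_index)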
Second PC: four of these six variables correlate positively
The second principal component tends to capture the exact opposite relationship.
This is a 'positive-peer index'
This plot shows how the variables cluster between the principal components
Both principal components have a mean of 0
set.seed(6652)autoplot(pc, loadings = TRUE, loadings.label = TRUE, loadings.label.repel = TRUE, alpha = 1/6) + theme_minimal()
The other four variables had negative loadings on PC1 and positive loadings on PC2.
This means that these variables should cluster below the average of PC1 and above the average of PC2.
You might decide not to stop at the first two principal components and explore this same plot for PC1 and PC3, or for PC2 and PC4 (a sketch follows below).
There's no clear-cut rule for the number of principal components to use.
Exploration is key
In any case, this method is inherently exploratory. It serves as a way to understand whether we can reduce correlated variables into a small subset of variables that represents them. From a social science point of view, this method is often used for reducing the number of variables. However, there is still room for using it as a clustering method to understand whether certain variables can be summarized into simpler concepts.
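For example, a sketch of the same plot for the first and third components (the x and y arguments for selecting components come from ggfortify's autoplot() method; treat them as an assumption to verify in its documentation):

# Same biplot as above, but for PC1 and PC3
autoplot(pc,
         x = 1,
         y = 3,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.label.repel = TRUE,
         alpha = 1/6) +
  theme_minimal()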
rcp <- ~ recipe(.x, math_score ~ .) %>%
  step_pca(starts_with("past12_"), num_comp = tune())

tflow <- tidyflow(pisa, seed = 25131) %>%
  plug_split(initial_split) %>%
  plug_recipe(rcp) %>%
  plug_model(set_engine(rand_forest(mode = "regression"), "ranger")) %>%
  plug_resample(vfold_cv) %>%
  plug_grid(expand.grid, num_comp = 1:3)
res_rf <- fit(tflow)

pull_tflow_fit_tuning(res_rf) %>%
  collect_metrics() %>%
  filter(.metric == "rmse")
# A tibble: 3 x 6
  num_comp .metric .estimator  mean     n std_err
     <int> <chr>   <chr>      <dbl> <int>   <dbl>
1        1 rmse    standard    40.8    10   0.402
2        2 rmse    standard    40.8    10   0.456
3        3 rmse    standard    40.9    10   0.394
Alternative approach: step_pca allows you to specify the minimum explanatory power of the principal components. As discussed in the documentation of step_pca, you specify the fraction of the total variance that should be covered by the components. For example, threshold = .75 means that step_pca should generate enough components to capture 75% of the variance.
rcp <- ~ recipe(.x, math_score ~ .) %>%
  step_pca(starts_with("past12_"), threshold = .90)

tflow <- tflow %>%
  replace_recipe(rcp) %>%
  drop_grid()

res_rf <- fit(tflow)

res_cv <- res_rf %>%
  pull_tflow_fit_tuning() %>%
  collect_metrics()

res_cv
PCA is a very useful method for summarizing information
However, it rests on the assumption that the variables of interest can be summarized well through linear combinations.
Finish up exercises from https://cimentadaj.github.io/ml_socsci/tree-based-methods.html#exercises-1
Exercises 1 and 2
at https://cimentadaj.github.io/ml_socsci/unsupervised-methods.html#exercises-2