Machine Learning for Social Scientists

Tree based methods and PCA

Jorge Cimentada

2020-07-08

1 / 34

Load the data

library(tidymodels)
library(tidyflow)
library(ggfortify)
library(rpart.plot)
data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)
2 / 34

Boosting

  • Tree based methods we've seen use decision trees as baseline models

  • They use ensemble approaches to calculate the average predictions of all decision trees

  • Boosting also uses decision trees as the baseline model but the ensemble strategy is fundamentally different

  • Manual example: let's fit a very weak decision tree

3 / 34

Boosting

dt_tree <-
  decision_tree(mode = "regression", tree_depth = 1, min_n = 10) %>%
  set_engine("rpart")

pisa_tr <- training(initial_split(pisa))

tflow <-
  tidyflow(pisa_tr, seed = 51231) %>%
  plug_formula(math_score ~ scie_score) %>%
  plug_model(dt_tree)

mod1 <- fit(tflow)
mod1 %>% pull_tflow_fit() %>% .[['fit']] %>% rpart.plot()

4 / 34

Boosting

  • Weak model with tree_depth = 1

  • What is the RMSE?

res_mod1 <-
  pisa_tr %>%
  cbind(., predict(mod1, new_data = .))

res_mod1 %>%
  rmse(math_score, .pred)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 55.7
  • Neither a good nor a robust model.
5 / 34

Boosting

  • Let's look at the residuals: we should see a very strong pattern
res_mod1 <- res_mod1 %>% mutate(.resid = math_score - .pred)

res_mod1 %>%
  ggplot(aes(scie_score, .resid)) +
  geom_point(alpha = 1/3) +
  scale_x_continuous(name = "Science scores") +
  scale_y_continuous(name = "Residuals") +
  theme_minimal()

6 / 34

Boosting

  • Boosting works by predicting the residuals of previous decision trees.
  1. Fit a first model and calculate the residuals
  2. Fit a second model, but with the residuals of the first model as the dependent variable
  3. Recursively fit N trees following this pattern
# Convert `math_score` to be the residuals of model 1
res_mod1 <- mutate(res_mod1, math_score = .resid)

# Replace the data in our `tflow`. In `res_mod1`, `math_score` is
# now the residuals of the first model
mod2 <- tflow %>% replace_data(res_mod1) %>% fit()
mod2 %>% pull_tflow_fit() %>% .[['fit']] %>% rpart.plot()

7 / 34

Boosting

  • Let's visualize the residuals from the second model:
res_mod2 <-
  pisa_tr %>%
  cbind(., predict(mod2, new_data = .)) %>%
  mutate(.resid = math_score - .pred)

res_mod2 %>%
  ggplot(aes(scie_score, .resid)) +
  geom_point(alpha = 1/3) +
  scale_x_continuous(name = "Science scores") +
  scale_y_continuous(name = "Residuals") +
  theme_minimal()

  • The pattern seems to have changed, although it's not clear that it's any closer to random
8 / 34

Boosting

  • If we repeat the same process for 20 trees, the residuals approximate randomness (a manual sketch follows):
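A minimal sketch of that loop, assuming the `pisa_tr` data and tidymodels setup from the earlier slides (this is not the slides' own code): each new stump is fit on whatever the previous trees left unexplained.

n_trees <- 20
weak_tree <- decision_tree(mode = "regression", tree_depth = 1, min_n = 10) %>%
  set_engine("rpart")

boost_df <- pisa_tr
tree_preds <- matrix(0, nrow = nrow(pisa_tr), ncol = n_trees)

for (i in seq_len(n_trees)) {
  m <- fit(weak_tree, math_score ~ scie_score, data = boost_df)
  tree_preds[, i] <- predict(m, new_data = boost_df)$.pred
  # The next tree models what is still unexplained so far
  boost_df$math_score <- boost_df$math_score - tree_preds[, i]
}

# The ensemble prediction is the sum across trees; its residuals should now look random
final_pred <- rowSums(tree_preds)
rmse_vec(pisa_tr$math_score, final_pred)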

9 / 34

Boosting

  • Boosting is a way for each model to boost the last model's performance:

    • It focuses mostly on the observations which had large residuals
  • After obtaining 20 predictions for each respondent, can you simply take the average?

mod1_pred <- predict(mod1, new_data = pisa_tr)
names(mod1_pred) <- "pred_mod1"
mod2_pred <- predict(mod2, new_data = pisa_tr)
names(mod2_pred) <- "pred_mod2"
resid_pred <- cbind(mod1_pred, mod2_pred)
head(resid_pred)
pred_mod1 pred_mod2
1 540.8022 -10.33072
2 406.8686 -10.33072
3 406.8686 -10.33072
4 406.8686 -10.33072
5 406.8686 -10.33072
6 540.8022 -10.33072
10 / 34

Boosting

  • The first model predicts math_score on its original scale, but all the remaining models predict residuals

  • Final prediction is the sum of all predictions

  • For our small-scale example, we can do that with rowSums:

resid_pred$final_pred <- rowSums(resid_pred)
head(resid_pred)
pred_mod1 pred_mod2 final_pred
1 540.8022 -10.33072 530.4715
2 406.8686 -10.33072 396.5379
3 406.8686 -10.33072 396.5379
4 406.8686 -10.33072 396.5379
5 406.8686 -10.33072 396.5379
6 540.8022 -10.33072 530.4715
  • We have a final prediction for each respondent.
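A quick check that is not on the original slides: comparing the summed prediction against the observed math_score shows whether adding the second tree improved on the first tree's training RMSE of 55.7.

pisa_tr %>%
  mutate(boost_pred = resid_pred$final_pred) %>%  # assumes the rows are still in the same order
  rmse(math_score, boost_pred)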
11 / 34

Boosting

  • Let's fit our trademark model of math_score regressed on all variables with xgboost
boost_mod <- boost_tree(mode = "regression", trees = 500) %>% set_engine("xgboost")

tflow <-
  pisa %>%
  tidyflow(seed = 51231) %>%
  plug_formula(math_score ~ .) %>%
  plug_split(initial_split) %>%
  plug_model(boost_mod)

boot_res <- fit(tflow)
[13:11:49] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
rmse_gb_train <-
  boot_res %>%
  predict_training() %>%
  rmse(math_score, .pred)

rmse_gb_train
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.000679
12 / 34

Boosting

  • Let's check how it performs on the testing data:
gb_rmse <-
  boot_res %>%
  predict_testing() %>%
  rmse(math_score, .pred) %>%
  pull(.estimate)

c("Extreme Gradient Boosting" = gb_rmse)
Extreme Gradient Boosting
27.02214
  • Boosting outperforms all others considerably

  • Boosting and xgboost are considered to be among the best predictive models

  • They can achieve great accuracy even with default values

13 / 34

Disadvantages of boosting

  • Increasing the number of trees in a boosting algorithm can increase overfitting

  • For the random forest, increasing the number of trees has no impact on overfitting

  • You might reach a point where adding more trees just tries to explain residuals that are random, resulting in overfitting.

  • stop_iter signals that after N trees have passed without any improvement, the algorithm should stop. This often means fewer trees are run than the user requested.
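A hedged sketch of what that looks like with boost_tree(): stop_iter itself is a boost_tree() argument, while passing a validation proportion to the xgboost engine (so there is something to monitor for improvement) is an assumption about recent parsnip versions.

boost_stop <-
  boost_tree(mode = "regression", trees = 500, stop_iter = 20) %>%  # stop after 20 rounds with no improvement
  set_engine("xgboost", validation = 0.2)                           # assumed engine argument: hold out 20% to monitor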

14 / 34

Boosting

There are other tuning parameters available in boost_tree which you can use to improve your model (a short tuning sketch follows this list):

  • trees: the number of trees that will be run

  • mtry: just as in random forests

  • min_n: the minimum number of observations in each node

  • tree_depth: how deep each tree is grown

  • learn_rate: controls how much we regularize each tree's contribution

  • loss_reduction: the amount of reduction in your loss function (for example, RMSE) required for a split in a decision tree to be made. You can see this as a cost-benefit check: only if the split improves the prediction by at least X do we allow the tree to grow further.

  • sample_size: controls the proportion of the data used in each iteration of the decision tree. This is similar to the bagging approach, where we take a bootstrap sample at each iteration.
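A sketch of tuning two of these with the same tidyflow machinery used elsewhere in these slides; the grid values below are illustrative assumptions, not recommendations.

boost_tune <-
  boost_tree(mode = "regression", trees = 500,
             tree_depth = tune(), learn_rate = tune()) %>%
  set_engine("xgboost")

tflow_boost <-
  tidyflow(pisa, seed = 51231) %>%
  plug_split(initial_split) %>%
  plug_formula(math_score ~ .) %>%
  plug_resample(vfold_cv) %>%
  plug_grid(expand.grid, tree_depth = c(1, 3, 6), learn_rate = c(0.01, 0.1, 0.3)) %>%
  plug_model(boost_tune)

# res_boost <- fit(tflow_boost)  # cross-validating this grid can take a while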

15 / 34

Unsupervised regression

  • No dependent variables

  • Methods are certainly less developed (finding similarities without a dependent variable is harder)

  • True AI is dependent-variable-free

  • Humans are excellent unsupervised models

  • In this course: PCA and K-Means clustering

16 / 34

PCA

  • Principal Component Analysis or PCA

  • Summarizes many columns into a very small subset that captures the greatest variability of the original columns.

PCA works by creating several components, each of which is a normalized linear combination of the variables of interest.
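For reference, a "normalized linear combination" is simply a weighted sum of the variables whose squared weights add up to one; the first component picks the weights that maximize the variance of that sum:

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p, \qquad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$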

17 / 34

PCA

In the pisa data there are six variables which ask the students whether they've suffered negative behavior from their friends in the past 12 months:

  • Other students left them out of things on purpose

  • Other students made fun of them

  • They were threatened by other students

  • Other students took away or destroyed things that belonged to them

  • They got hit or pushed around by other students

  • Other students spread nasty rumours about them

The scale ranges from 1 to 4; the higher the number, the more negative their response.

18 / 34

PCA

pisa <-
  rename(
    pisa,
    past12_left_out = ST038Q03NA,
    past12_madefun_of_me = ST038Q04NA,
    past12_threatened = ST038Q05NA,
    past12_destroyed_personal = ST038Q06NA,
    past12_got_hit = ST038Q07NA,
    past12_spread_rumours = ST038Q08NA
  )

pisa_selected <- pisa %>% select(starts_with("past12"))
cor(pisa_selected)
past12_left_out past12_madefun_of_me
past12_left_out 1.0000000 0.6073982
past12_madefun_of_me 0.6073982 1.0000000
past12_threatened 0.4454125 0.4712083
past12_destroyed_personal 0.4037351 0.4165931
past12_got_hit 0.3918129 0.4480862
past12_spread_rumours 0.4746302 0.5069299
past12_threatened past12_destroyed_personal
past12_left_out 0.4454125 0.4037351
past12_madefun_of_me 0.4712083 0.4165931
past12_threatened 1.0000000 0.5685773
past12_destroyed_personal 0.5685773 1.0000000
past12_got_hit 0.5807617 0.6206485
past12_spread_rumours 0.5513099 0.4543380
past12_got_hit past12_spread_rumours
past12_left_out 0.3918129 0.4746302
past12_madefun_of_me 0.4480862 0.5069299
past12_threatened 0.5807617 0.5513099
past12_destroyed_personal 0.6206485 0.4543380
past12_got_hit 1.0000000 0.4451408
past12_spread_rumours 0.4451408 1.0000000
  • Most correlations lie between 0.4 and 0.6
19 / 34

PCA

  • PCA works by receiving as input P variables (in this case six)

  • It then calculates the normalized linear combination of the P variables.

  • This new variable is the linear combination of the six variables that captures the greatest variance out of all of them.

  • PCA continues to calculate other normalized linear combinations, each uncorrelated with the previous ones.

  • It constructs as many principal components as possible (together they capture 100% of the variability)

  • Each PC is assessed by how much variance it explains
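Each later component has the same form as the first, with the extra constraint that it must be uncorrelated with the components before it, which is why every component adds "new" variance:

$$Z_2 = \phi_{12} X_1 + \cdots + \phi_{p2} X_p, \qquad \operatorname{Cor}(Z_1, Z_2) = 0$$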

20 / 34

PCA

  • We would normally need to center and scale the independent variables; however, our variables are already on the same scale

  • Let's pass in our six variables to the function prcomp, which estimates these principal components based on our six variables.

pc <- prcomp(pisa_selected)
all_pcs <- as.data.frame(pc$x)
head(all_pcs)
PC1 PC2 PC3 PC4 PC5 PC6
1 -2.836172297 -0.7549602 -1.91065434 -0.232647114 -0.368981283 -1.885607656
2 -1.478020766 0.6622561 0.94113153 0.181451711 0.149387648 0.678384471
3 1.025953306 0.1602906 -0.03806864 -0.008994148 0.009439987 -0.002391996
4 -0.002173173 -0.7902197 -0.10112894 -0.197389118 0.013521080 -0.002718289
5 -4.832075955 0.1996595 0.39221922 -0.256660522 -1.178883084 0.150399629
6 -1.132036976 -1.8534154 -0.68913950 0.914561923 0.065907346 0.087208533
  • The result of all of this is a data frame with six new columns.

  • These are new variables that summarize the relationship between the original six variables.
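A small sketch (assuming prcomp's default centering) to demystify that output: the first column of pc$x is literally the centered data multiplied by the first column of weights.

centered <- scale(pisa_selected, center = TRUE, scale = FALSE)  # prcomp centers by default
pc1_manual <- as.numeric(centered %*% pc$rotation[, 1])
all.equal(pc1_manual, unname(pc$x[, 1]))                        # should be TRUE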

21 / 34

PCA

  • We judge each 'component' by how much variance it explains
tidy(pc, "pcs")
# A tibble: 6 x 4
PC std.dev percent cumulative
<dbl> <dbl> <dbl> <dbl>
1 1 1.34 0.591 0.591
2 2 0.640 0.135 0.726
3 3 0.530 0.0929 0.819
4 4 0.522 0.0899 0.909
5 5 0.394 0.0513 0.960
6 6 0.347 0.0397 1
  • The first principal component explains about 59% of the variance

  • The second principal component explains an additional 13.5%

  • Together they account for 72.6%
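Those percentages can be recovered directly from the standard deviations prcomp stores, which is a handy sanity check:

prop_var <- pc$sdev^2 / sum(pc$sdev^2)  # each component's variance over the total variance
round(cumsum(prop_var), 3)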

22 / 34

PCA

23 / 34

PCA

  • They are supposed to be uncorrelated
cor(all_pcs[c("PC1", "PC2")])
PC1 PC2
PC1 1.00000000000000000000 -0.00000000000001545012
PC2 -0.00000000000001545012 1.00000000000000000000
  • As expected, the correlation between these two variables is effectively 0.

  • A social scientist would make sure that the explanatory power of these two components is high enough.

  • If it is, they would include these two columns in their statistical models instead of the six variables.
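A sketch of that workflow (hypothetical model; it assumes the rows of all_pcs still line up with pisa, i.e. no rows were dropped):

pisa_pcs <- cbind(pisa, all_pcs[, c("PC1", "PC2")])
summary(lm(math_score ~ PC1 + PC2, data = pisa_pcs))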
24 / 34

PCA

  • PCA is all about exploratory data analysis.

  • We might want to go further and explore how the original six variables are related to these principal components.

  • These two principal components are a bit of a black box at this point. Which variables do they represent? We can check that with the initial output of prcomp:

pc$rotation[, 1:2]
PC1 PC2
past12_left_out -0.4631946 -0.4189125
past12_madefun_of_me -0.5649319 -0.5315979
past12_threatened -0.3446963 0.4025682
past12_destroyed_personal -0.2694606 0.3405411
past12_got_hit -0.2987481 0.3715999
past12_spread_rumours -0.4308453 0.3546832
  • First PC: all loadings are negative.

  • Informally, we could call this variable a 'negative-peer index'.

25 / 34

PCA

pc$rotation[, 1:2]
PC1 PC2
past12_left_out -0.4631946 -0.4189125
past12_madefun_of_me -0.5649319 -0.5315979
past12_threatened -0.3446963 0.4025682
past12_destroyed_personal -0.2694606 0.3405411
past12_got_hit -0.2987481 0.3715999
past12_spread_rumours -0.4308453 0.3546832
  • Second PC: four of the six variables load positively

  • For these variables, the second principal component captures the exact opposite relationship to the first.

  • This is a 'positive-peer index'

26 / 34

PCA

  • This plot shows how the variables cluster across the two principal components

  • Both principal components have mean 0

set.seed(6652)
autoplot(pc,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.label.repel = TRUE,
         alpha = 1/6) +
  theme_minimal()

27 / 34

PCA

  • The two variables (past12_left_out and past12_madefun_of_me) are located in the bottom left of the plot, showing that both are associated with lower values of PC1 and PC2:

28 / 34

PCA

  • The other four variables showed negative loadings on PC1 and positive loadings on PC2.

  • This means that these variables should cluster below the average of PC1 and above the average of PC2.

29 / 34

PCA

  • The remaining four variables cluster at lower values of PC1 and at higher values of PC2:

30 / 34

PCA

  • You might decide not to limit yourself to the first two principal components and explore this same plot for PC1 and PC3, or PC2 and PC4.

  • There's no clear-cut rule for the number of principal components to use.

  • Exploration is key

In any case, this method is inherently exploratory. It serves as a way to understand whether we can reduce correlated variables into a small subset of variables that represent them. From a social science point of view, this method is often used for reducing the number of variables. However, there is still room for using it as a clustering method to understand whether certain variables can help us summarize our understanding into simpler concepts.

31 / 34

PCA

  • Grid search of number of components using a random forest:
rcp <- ~ recipe(.x, math_score ~ .) %>% step_pca(starts_with("past12_"), num_comp = tune())

tflow <-
  tidyflow(pisa, seed = 25131) %>%
  plug_split(initial_split) %>%
  plug_recipe(rcp) %>%
  plug_model(set_engine(rand_forest(mode = "regression"), "ranger")) %>%
  plug_resample(vfold_cv) %>%
  plug_grid(expand.grid, num_comp = 1:3)

res_rf <- fit(tflow)

pull_tflow_fit_tuning(res_rf) %>%
  collect_metrics() %>%
  filter(.metric == "rmse")
# A tibble: 3 x 6
num_comp .metric .estimator mean n std_err
<int> <chr> <chr> <dbl> <int> <dbl>
1 1 rmse standard 40.8 10 0.402
2 2 rmse standard 40.8 10 0.456
3 3 rmse standard 40.9 10 0.394
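From here one would normally extract the best-performing number of components; a sketch using tune's helper (here all three settings are nearly tied, so the simplest option is a reasonable choice):

pull_tflow_fit_tuning(res_rf) %>%
  select_best(metric = "rmse")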
32 / 34

PCA

  • Alternative approach:

    • step_pca allows you to specify the minimum explanatory power of the principal components.

    As discussed in the documentation of step_pca, you specify the fraction of the total variance that should be covered by the components. For example, threshold = .75 means that step_pca should generate enough components to capture 75% of the variance.

rcp <- ~ recipe(.x, math_score ~ .) %>% step_pca(starts_with("past12_"), threshold = .90)
tflow <- tflow %>% replace_recipe(rcp) %>% drop_grid()
res_rf <- fit(tflow)
res_cv <- res_rf %>% pull_tflow_fit_tuning() %>% collect_metrics()
res_cv
  • PCA is a very useful method for summarizing information

  • However, it is based on the notion that the variables to be summarized are best summarized through a linear combination.

33 / 34
