class: title-slide, center <span class="fa-stack fa-4x"> <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i> <strong class="fa-stack-1x" style="color:#009FB7;">11</strong> </span> # Ensembling ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).] <div style = "position:fixed; visibility: hidden"> `$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$` `$$\require{color}\definecolor{light_blue}{rgb}{0.0392156862745098, 0.870588235294118, 1}$$` `$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$` `$$\require{color}\definecolor{dark_yellow}{rgb}{0.635294117647059, 0.47843137254902, 0.00392156862745098}$$` `$$\require{color}\definecolor{pink}{rgb}{0.796078431372549, 0.16078431372549, 0.482352941176471}$$` `$$\require{color}\definecolor{light_pink}{rgb}{1, 0.552941176470588, 0.776470588235294}$$` `$$\require{color}\definecolor{grey}{rgb}{0.411764705882353, 0.403921568627451, 0.450980392156863}$$` </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { blue: ["{\\color{blue}{#1}}", 1], light_blue: ["{\\color{light_blue}{#1}}", 1], yellow: ["{\\color{yellow}{#1}}", 1], dark_yellow: ["{\\color{dark_yellow}{#1}}", 1], pink: ["{\\color{pink}{#1}}", 1], light_pink: ["{\\color{light_pink}{#1}}", 1], grey: ["{\\color{grey}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> --- class: your-turn # Your Turn 0 .big[ * Open the R Notebook **materials/exercises/11-ensembling.Rmd** * Run the setup chunk ]
--- class: center background-image: url("images/ensembling/tm-hex0.png") background-position: center 70% background-size: 90% # tidymodels --- class: center background-image: url("images/ensembling/tm-hex1.png") background-position: center 70% background-size: 90% # tidymodels --- class: center background-image: url("images/ensembling/tm-hex2.png") background-position: center 70% background-size: 90% # tidymodels --- class: center background-image: url("images/ensembling/tm-hex3.png") background-position: center 70% background-size: 90% # tidymodels --- class: center background-image: url("images/ensembling/tm-hex4.png") background-position: center 70% background-size: 90% # tidymodels --- class: center middle, frame # To specify a model with parsnip 1\. Pick a .display[model] 2\. Set the .display[engine] 3\. Set the .display[mode] (if needed) --- class: middle, frame # .center[To specify a classification tree with parsnip] ```r decision_tree() %>% set_engine("rpart") %>% set_mode("classification") ``` --- class: your-turn # Your turn 1 Here is our very-vanilla parsnip model specification for a decision tree (also in your Rmd)... ```r vanilla_tree_spec <- decision_tree() %>% set_engine("rpart") %>% set_mode("classification") ``` --- class: your-turn # Your turn 1 Fill in the blanks to return the accuracy and ROC AUC for this model.
--- class: your-turn ```r set.seed(100) so_folds <- vfold_cv(so_train, strata = remote) dt_mod <- fit_resamples(vanilla_tree_spec, remote ~ ., resamples = so_folds) dt_preds <- dt_mod %>% collect_metrics() ``` --- # `args()` .big[Print the arguments for **parsnip** model specification.] ```r args(decision_tree) ``` --- # `decision_tree()` .big[Specifies a decision tree model] ```r decision_tree(tree_depth = 30, min_n = 20, cost_complexity = .01) ``` -- *either* mode works! --- # `decision_tree()` .big[Specifies a decision tree model] ```r decision_tree( tree_depth = 30, # max tree depth min_n = 20, # smallest node allowed cost_complexity = .01 # 0 > cp > 0.1 ) ``` --- # `set_args()` .big[Change the arguments for a **parsnip** model specification.] ```r _spec %>% set_args(tree_depth = 3) ``` --- ```r decision_tree() %>% set_engine("rpart") %>% set_mode("classification") %>% * set_args(tree_depth = 3) #> Decision Tree Model Specification (classification) #> #> Main Arguments: #> tree_depth = 3 #> #> Computational engine: rpart ``` --- ```r *decision_tree(tree_depth = 3) %>% set_engine("rpart") %>% set_mode("classification") #> Decision Tree Model Specification (classification) #> #> Main Arguments: #> tree_depth = 3 #> #> Computational engine: rpart ``` --- # `tree_depth` .big[ Cap the maximum tree depth. A method to stop the tree early. Used to prevent overfitting. ] ```r vanilla_tree_spec %>% set_args(tree_depth = 30) #> Decision Tree Model Specification (classification) #> #> Main Arguments: #> tree_depth = 30 #> #> Computational engine: rpart ``` --- exclude: true --- <img src="images/ensembling/plots/splits-train-error-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/ensembling/plots/cp-train-error-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/ensembling/plots/cp-test-error-1.png" width="90%" style="display: block; margin: auto;" /> --- # `min_n` .big[ Set minimum `n` to split at any node. Another early stopping method. Used to prevent overfitting. ] ```r vanilla_tree_spec %>% set_args(min_n = 20) ``` --- class: pop-quiz # Pop quiz! .big[What value of `min_n` would lead to the *most overfit* tree?] -- `min_n` = 1 --- class: middle, center, frame # Recap: early stopping | `parsnip` arg | `rpart` arg | default | overfit? | |---------------|-------------|:-------:|:--------:| | `tree_depth` | `maxdepth` | 30 |⬆️| | `min_n` | `minsplit` | 20 |⬇️| --- # `cost_complexity` .big[ Adds a cost or penalty to error rates of more complex trees. A way to prune a tree. Used to prevent overfitting. ] ```r vanilla_tree_spec %>% set_args(cost_complexity = .01) #> Decision Tree Model Specification (classification) #> #> Main Arguments: #> cost_complexity = 0.01 #> #> Computational engine: rpart ``` -- .center[ Closer to zero ➡️ larger trees. Higher penalty ➡️ smaller trees. ] --- <img src="images/ensembling/plots/cp-test-error2-1.png" width="90%" style="display: block; margin: auto;" /> --- name: bonsai background-image: url(images/ensembling/kari-shea-AVqh83jStMA-unsplash.jpg) background-position: left background-size: contain class: middle --- template: bonsai .pull-right[ # Consider the bonsai 1. Small pot 1. Strong shears ] --- template: bonsai .pull-right[ # Consider the bonsai 1. ~~Small pot~~ .display[Early stopping] 1. ~~Strong shears~~ .display[Pruning] ] --- class: middle, center, frame # Recap: early stopping & pruning | `parsnip` arg | `rpart` arg | default | overfit? 
| |---------------|-------------|:-------:|:--------:| | `tree_depth` | `maxdepth` | 30 |⬆️| | `min_n` | `minsplit` | 20 |⬇️| | `cost_complexity` | `cp` | .01 |⬇️| --- class: middle, center <table> <thead> <tr> <th style="text-align:left;"> engine </th> <th style="text-align:left;"> parsnip </th> <th style="text-align:left;"> original </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> rpart </td> <td style="text-align:left;"> tree_depth </td> <td style="text-align:left;"> maxdepth </td> </tr> <tr> <td style="text-align:left;"> rpart </td> <td style="text-align:left;"> min_n </td> <td style="text-align:left;"> minsplit </td> </tr> <tr> <td style="text-align:left;"> rpart </td> <td style="text-align:left;"> cost_complexity </td> <td style="text-align:left;"> cp </td> </tr> </tbody> </table> <https://rdrr.io/cran/rpart/man/rpart.control.html> --- class: your-turn # Your turn 2 Create a new classification tree model spec; call it `big_tree_spec`. Set the cost complexity to `0`, and the minimum number of data points in a node to split to be `1`. Compare the metrics of the big tree to the vanilla tree. Which one predicts the test set better?
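---

# `translate()`

.big[Not needed for Your Turn 2, but a quick sketch of how to see the parsnip-to-rpart argument mapping from the previous table directly from a model spec:]

```r
# prints the spec plus the underlying rpart::rpart() call template,
# with tree_depth, min_n, and cost_complexity mapped to maxdepth, minsplit, and cp
decision_tree(tree_depth = 30, min_n = 20, cost_complexity = .01) %>%
  set_engine("rpart") %>%
  set_mode("classification") %>%
  translate()
```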
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r big_tree_spec <- * decision_tree(min_n = 1, cost_complexity = 0) %>% set_engine("rpart") %>% set_mode("classification") set.seed(100) # Important! big_dt_mod <- fit_resamples(big_tree_spec, remote ~ ., resamples = so_folds) big_dt_preds <- big_dt_mod %>% collect_metrics() ``` ] .panel[.panel-name[Metrics] ```r big_dt_preds #> # A tibble: 2 x 6 #> .metric .estimator mean n std_err .config #> <chr> <chr> <dbl> <int> <dbl> <chr> #> 1 accuracy binary 0.590 10 0.0139 Preprocessor1_Model1 #> 2 roc_auc binary 0.590 10 0.0139 Preprocessor1_Model1 ``` Compare to `vanilla`: accuracy = 0.64; ROC AUC = 0.66 ] ] --- exclude: true --- # The trouble with trees? <img src="images/ensembling/plots/diff-trees-1.png" width="32%" /><img src="images/ensembling/plots/diff-trees-2.png" width="32%" /><img src="images/ensembling/plots/diff-trees-3.png" width="32%" /> --- # Bootstrapping + decision trees .big[Back to rainbows and unicorns!] --- background-image: url(images/ensembling/ensemble/ensemble.001.jpeg) background-size: cover --- background-image: url(images/ensembling/ensemble/ensemble.002.jpeg) background-size: contain --- background-image: url(images/ensembling/ensemble/ensemble.003.jpeg) background-size: contain --- background-image: url(images/ensembling/ensemble/ensemble.004.jpeg) background-size: contain --- background-image: url(images/ensembling/ensemble/ensemble.005.jpeg) background-size: contain --- background-image: url(images/ensembling/ensemble/ensemble.006.jpeg) background-size: contain --- background-image: url(images/ensembling/ensemble/ensemble.007.jpeg) background-size: contain --- background-image: url(images/ensembling/ensemble/ensemble.008.jpeg) background-size: contain --- background-image: url(images/ensembling/ensemble/ensemble.009.jpeg) background-size: contain --- class: middle, frame, center # Axiom There is an inverse relationship between model *accuracy* and model *interpretability*. 
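---

# Bagging by hand

.big[A minimal sketch of what the pictures show: resample the training data with replacement, grow one tree per resample, then average the predictions. Assumes `so_train` and `so_test` from the earlier split, and that the event level of `remote` is `"Remote"`.]

```r
set.seed(100)
so_boots <- bootstraps(so_train, times = 25)

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

bagged <- so_boots %>%
  mutate(
    # one fitted tree per bootstrap resample
    tree_fit = map(splits, ~ fit(tree_spec, remote ~ ., data = analysis(.x))),
    # each tree's predicted class probabilities for the test set
    probs    = map(tree_fit, ~ predict(.x, new_data = so_test, type = "prob"))
  )

# the bagged prediction is the average of the 25 per-tree probabilities
bagged_prob <- bagged$probs %>%
  map(".pred_Remote") %>%
  reduce(`+`) / length(bagged$probs)
```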
--- exclude: true ```r plot_tree_resample <- function(rset, id = "Bootstrap01", title = "Sample Variation") { lm_train <- function(rset) { lm(rainbows ~ unicorns, analysis(rset)) } rt_train <- function(rset) { rpart::rpart(rainbows ~ unicorns, data = analysis(rset)) } preds <- rset %>% mutate(model = map(splits, lm_train)) %>% mutate(tree = map(splits, rt_train)) %>% mutate(augmented = map(model, augment)) %>% mutate(.fitted_tree = map(tree, predict)) %>% unnest(c(augmented, .fitted_tree)) ggplot(preds, aes(x = unicorns, y = rainbows)) + geom_point(size = 3, color = "gray80", alpha = .2) + geom_count(data = filter(preds, id == {{ id }}), color = blue) + geom_line(data = filter(preds, id == {{ id }}), aes(x = unicorns, y = .fitted_tree), colour = pink, size = 2) + coord_cartesian(ylim = c(-5, 5), xlim = c(-4, 4)) + theme(axis.text.x = element_blank(), axis.text.y = element_blank(), plot.title = element_text(hjust = 0.5)) + labs(title = title) + scale_size_area(max_size = 7, guide = "none") } plot_tree_resamples <- function(rset, title = "Sample Variation") { lm_train <- function(rset) { lm(rainbows ~ unicorns, analysis(rset)) } rt_train <- function(rset) { rpart::rpart(rainbows ~ unicorns, data = analysis(rset)) } rset %>% mutate(model = map(splits, lm_train)) %>% mutate(tree = map(splits, rt_train)) %>% mutate(augmented = map(model, augment)) %>% mutate(.fitted_tree = map(tree, predict)) %>% unnest(c(augmented, .fitted_tree)) %>% ggplot(aes(unicorns, rainbows)) + geom_point(size = 3, color = "gray80") + geom_line(aes(y = .fitted_tree, group = id), colour = pink, alpha=.5, size = 2) + coord_cartesian(ylim = c(-5, 5), xlim = c(-4, 4)) + theme(axis.text = element_blank(), plot.title = element_text(hjust = 0.5)) + labs(title = title) } get_training <- function(rset, resample = 1) { rset %>% pluck("splits", resample) %>% analysis() } # make zero correlation variables set.seed(100) x <- rnorm(500) # shuffle x to get y set.seed(100) y <- sample(x, size = 500) # linear combos of x + y unicorns <- x + y rainbows <- x - y cor(unicorns, rainbows) #> [1] 0.00000000000000000002792251 uni <- tibble(unicorns = unicorns, rainbows = rainbows) set.seed(1) sample_1 <- sample_n(uni, 30) set.seed(1) boots <- bootstraps(sample_1, times = 25) set.seed(1) big_samples <- mc_cv(uni, prop = 0.6, times = 25) set.seed(1) big_boots <- bootstraps(get_training(big_samples, 1), times = 25) ``` --- <img src="images/ensembling/plots/tree-boot-1-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/ensembling/plots/tree-boot-2-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/ensembling/plots/tree-boot-3-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/ensembling/plots/tree-boot-4-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/ensembling/plots/tree-boot-5-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/ensembling/plots/tree-boot-all-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/ensembling/plots/tree-bag-all-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/ensembling/plots/tree-boot-big-all-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/ensembling/plots/tree-bag-big-all-1.png" width="90%" style="display: block; margin: auto;" /> --- # `rand_forest()` .big[Specifies a random forest model] ```r rand_forest(mtry = 4, trees = 500, min_n = 1) ``` -- *either* mode works! 
--- # `rand_forest()` .big[Specifies a random forest model] ```r rand_forest( mtry = 4, # predictors seen at each node trees = 500, # trees per forest min_n = 1, # smallest node allowed ) ``` --- class: your-turn # Your turn 3 Create a new model spec called `rf_spec`, which will learn an ensemble of classification trees from our training data using the **ranger** package. Compare the metrics of the random forest to your two single tree models (vanilla and big)- which predicts the test set better? *Hint: you'll need https://www.tidymodels.org/find/parsnip/*
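---

# Finding an engine

.big[In addition to the website in the hint, **parsnip** can list the available engines for a model directly (not required for the exercise):]

```r
# returns a tibble of engines and modes registered for rand_forest()
show_engines("rand_forest")
```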
--- class: your-turn .panelset[ .panel[.panel-name[Random Forest] ```r rf_spec <- rand_forest() %>% set_engine("ranger") %>% set_mode("classification") set.seed(100) rf_mod <- fit_resamples(rf_spec, remote ~ ., resamples = so_folds) rf_preds <- rf_mod %>% collect_metrics() ``` ] .panel[.panel-name[Performance] ```r rf_preds #> # A tibble: 2 x 6 #> .metric .estimator mean n std_err .config #> <chr> <chr> <dbl> <int> <dbl> <chr> #> 1 accuracy binary 0.644 10 0.0132 Preprocessor1_Model1 #> 2 roc_auc binary 0.704 10 0.0151 Preprocessor1_Model1 ``` ] .panel[.panel-name[Comparison] .pull-left[ .big[**Vanilla Decision Tree**] ``` #> # A tibble: 2 x 3 #> .metric .estimator mean #> <chr> <chr> <dbl> #> 1 accuracy binary 0.642 #> 2 roc_auc binary 0.657 ``` .big[**Big Decision Tree**] ``` #> # A tibble: 2 x 3 #> .metric .estimator mean #> <chr> <chr> <dbl> #> 1 accuracy binary 0.590 #> 2 roc_auc binary 0.590 ``` ] .pull-right[ .big[**Random Forest**] ``` #> # A tibble: 2 x 3 #> .metric .estimator mean #> <chr> <chr> <dbl> #> 1 accuracy binary 0.644 #> 2 roc_auc binary 0.704 ``` ] ] ] --- # `mtry` .big[The number of predictors that will be randomly sampled at each split when creating the tree models.] ```r rand_forest(mtry = 4) ``` **ranger** default = `floor(sqrt(num_predictors))` --- class: your-turn # Your turn 4 .big[Challenge: make 4 more random forest model specs, each using 4, 8, 12, and 20 variables at each split. Which value maximizes the area under the ROC curve?]
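---

# One spec per `mtry`, programmatically

.big[A sketch of how you might fit all four values without copying the spec four times (not the official solution; assumes `rf_spec` and `so_folds` from Your Turn 3):]

```r
mtry_values <- c(4, 8, 12, 20)

mtry_results <- map_dfr(mtry_values, function(m) {
  set.seed(100)
  fit_resamples(set_args(rf_spec, mtry = m), remote ~ ., resamples = so_folds) %>%
    collect_metrics() %>%
    mutate(mtry = m)
})

# which mtry gives the largest ROC AUC?
mtry_results %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
```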
--- class: your-turn .panelset[ .panel[.panel-name[mtry = 4] ```r rf4_spec <- rf_spec %>% set_args(mtry = 4) set.seed(100) fit_resamples(rf4_spec, remote ~ ., resample = so_folds) %>% collect_metrics() #> # A tibble: 2 x 6 #> .metric .estimator mean n std_err .config #> <chr> <chr> <dbl> <int> <dbl> <chr> #> 1 accuracy binary 0.644 10 0.0132 Preprocessor1_Model1 #> 2 roc_auc binary 0.704 10 0.0151 Preprocessor1_Model1 ``` ] .panel[.panel-name[mtry = 8] ```r rf8_spec <- rf_spec %>% set_args(mtry = 8) set.seed(100) fit_resamples(rf8_spec, remote ~ ., resample = so_folds) %>% collect_metrics() #> # A tibble: 2 x 6 #> .metric .estimator mean n std_err .config #> <chr> <chr> <dbl> <int> <dbl> <chr> #> 1 accuracy binary 0.622 10 0.0146 Preprocessor1_Model1 #> 2 roc_auc binary 0.696 10 0.0138 Preprocessor1_Model1 ``` ] .panel[.panel-name[mtry = 12] ```r rf12_spec <- rf_spec %>% set_args(mtry = 12) set.seed(100) fit_resamples(rf12_spec, remote ~ ., resample = so_folds) %>% collect_metrics() #> # A tibble: 2 x 6 #> .metric .estimator mean n std_err .config #> <chr> <chr> <dbl> <int> <dbl> <chr> #> 1 accuracy binary 0.632 10 0.0116 Preprocessor1_Model1 #> 2 roc_auc binary 0.690 10 0.0140 Preprocessor1_Model1 ``` ] .panel[.panel-name[mtry = 20] ```r rf20_spec <- rf_spec %>% set_args(mtry = 20) set.seed(100) fit_resamples(rf20_spec, remote ~ ., resample = so_folds) %>% collect_metrics() #> # A tibble: 2 x 6 #> .metric .estimator mean n std_err .config #> <chr> <chr> <dbl> <int> <dbl> <chr> #> 1 accuracy binary 0.620 10 0.0139 Preprocessor1_Model1 #> 2 roc_auc binary 0.680 10 0.0142 Preprocessor1_Model1 ``` ] ] --- class: middle, center <img src="images/ensembling/plots/mtry-tune-1.png" width="90%" style="display: block; margin: auto;" /> --- ```r treebag_spec <- * rand_forest(mtry = .preds()) %>% set_engine("ranger") %>% set_mode("classification") set.seed(100) fit_resamples(treebag_spec, remote ~ ., resamples = so_folds) %>% collect_metrics() #> # A tibble: 2 x 6 #> .metric .estimator mean n std_err .config #> <chr> <chr> <dbl> <int> <dbl> <chr> #> 1 accuracy binary 0.620 10 0.0139 Preprocessor1_Model1 #> 2 roc_auc binary 0.680 10 0.0142 Preprocessor1_Model1 ``` --- # `.preds()` .big[The number of columns in the data set that are associated with the predictors prior to dummy variable creation.] ```r rand_forest(mtry = .preds()) ``` --- .pull-left[ .big[**Vanilla Decision Tree**] ``` #> # A tibble: 2 x 3 #> .metric .estimator mean #> <chr> <chr> <dbl> #> 1 accuracy binary 0.642 #> 2 roc_auc binary 0.657 ``` .big[**Big Decision Tree**] ``` #> # A tibble: 2 x 3 #> .metric .estimator mean #> <chr> <chr> <dbl> #> 1 accuracy binary 0.590 #> 2 roc_auc binary 0.590 ``` ] .pull-right[ .big[**Random Forest**] ``` #> # A tibble: 2 x 3 #> .metric .estimator mean #> <chr> <chr> <dbl> #> 1 accuracy binary 0.644 #> 2 roc_auc binary 0.704 ``` .big[**Bagging**] ``` #> # A tibble: 2 x 6 #> .metric .estimator mean n std_err .config #> <chr> <chr> <dbl> <int> <dbl> <chr> #> 1 accuracy binary 0.620 10 0.0139 Preprocessor1_Model1 #> 2 roc_auc binary 0.680 10 0.0142 Preprocessor1_Model1 ``` ] --- class: middle, frame # .center[To specify a model with parsnip] .right-column[ .fade[ 1\. Pick a .display[model] ] 2\. Set the .display[engine] .fade[ 3\. Set the .display[mode] (if needed) ] ] --- # `set_engine()` Adds to a model an R package to train the model. ```r spec %>% set_engine(engine = "ranger", ...) ``` --- # `set_engine()` Adds to a model an R package to train the model. 
```r spec %>% set_engine( engine = "ranger", # package name in quotes ... # optional arguments to pass to function ) ``` --- .fade[ # `set_engine()` Adds to a model an R package to train the model. ] ```r rf_imp_spec <- rand_forest(mtry = 4) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification") ``` --- ```r rf_imp_spec <- rand_forest(mtry = 4) %>% set_engine("ranger", importance = 'impurity') %>% set_mode("classification") imp_fit <- fit(rf_imp_spec, remote ~ ., data = so_train) ``` --- ```r imp_fit #> parsnip model object #> #> Fit time: 252ms #> Ranger result #> #> Call: #> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~4, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) #> #> Type: Probability estimation #> Number of trees: 500 #> Sample size: 864 #> Number of independent variables: 20 #> Mtry: 4 #> Target node size: 10 #> Variable importance mode: impurity #> Splitrule: gini #> OOB prediction error (Brier s.): 0.2204509 ``` --- # `vip` .big[Plot variable importance] .center[ <iframe src="https://koalaverse.github.io/vip/index.html" width="80%" height="400px"></iframe> ] --- # `vip()` .big[Plot variable importance scores for the predictors in a model.] ```r vip(object, geom = "point", ...) ``` --- # `vip()` .big[Plot variable importance scores for the predictors in a model.] ```r vip(object, # fitted model object geom = "point", # one of "col", "point", "boxplot", "violin" ... ) ``` --- ```r vip(imp_fit, geom = "point") ``` <img src="images/ensembling/plots/imp-plot-1.png" width="80%" style="display: block; margin: auto;" /> --- class: your-turn # Your turn 5 Make a new model spec called `treebag_imp_spec` to fit a bagged classification tree model. Set the variable `importance` mode to "permutation". Plot the variable importance- which variable was the most important?
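---

# Raw importance scores

.big[Not needed for Your Turn 5: if you want the numbers behind the plot, the underlying **ranger** fit stores them (a sketch, assuming `imp_fit` from the previous slides):]

```r
# the engine object lives in $fit; ranger stores the scores in $variable.importance
imp_fit$fit$variable.importance %>%
  sort(decreasing = TRUE) %>%
  head()
```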
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r treebag_imp_spec <- rand_forest(mtry = .preds()) %>% set_engine("ranger", importance = "permutation") %>% set_mode("classification") imp_fit <- fit(treebag_imp_spec, remote ~ ., data = so_train) vip(imp_fit) ``` ] .panel[.panel-name[Plot] <img src="images/ensembling/plots/treebag-vip-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- class: title-slide, center # Ensembling <img src="images/ensembling/hex-group.png" width="18%" style="display: block; margin: auto;" /> ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]