Case Study

class: title-slide, center

# Case Study

## Tidy Data Science with the Tidyverse and Tidymodels

### W. Jake Thompson

#### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) &#183; [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021)

.footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]

<div style = "position:fixed; visibility: hidden">
  `$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$`
  `$$\require{color}\definecolor{light_blue}{rgb}{0.0392156862745098, 0.870588235294118, 1}$$`
  `$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$`
  `$$\require{color}\definecolor{dark_yellow}{rgb}{0.635294117647059, 0.47843137254902, 0.00392156862745098}$$`
  `$$\require{color}\definecolor{pink}{rgb}{0.796078431372549, 0.16078431372549, 0.482352941176471}$$`
  `$$\require{color}\definecolor{light_pink}{rgb}{1, 0.552941176470588, 0.776470588235294}$$`
  `$$\require{color}\definecolor{grey}{rgb}{0.411764705882353, 0.403921568627451, 0.450980392156863}$$`
</div>
  
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    TeX: {
      Macros: {
        blue: ["{\\color{blue}{#1}}", 1],
        light_blue: ["{\\color{light_blue}{#1}}", 1],
        yellow: ["{\\color{yellow}{#1}}", 1],
        dark_yellow: ["{\\color{dark_yellow}{#1}}", 1],
        pink: ["{\\color{pink}{#1}}", 1],
        light_pink: ["{\\color{light_pink}{#1}}", 1],
        grey: ["{\\color{grey}{#1}}", 1]
      },
      loader: {load: ['[tex]/color']},
      tex: {packages: {'[+]': ['color']}}
    }
  });
</script>

---
background-image: url(images/case-study-2/all-hex.png)
background-position: center middle
background-size: 70%

---
class: your-turn

# Your Turn 0

.big[
* Open the R Notebook **materials/exercises/15-case-study-2.Rmd**
* Run the setup chunk
]

---
background-image: url(images/case-study-2/tm-process/tmprocess.001.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.002.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.003.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.004.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.005.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.006.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.007.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.008.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.009.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.010.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.011.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.012.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.013.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.014.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.015.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.016.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.017.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.018.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.019.png)
background-size: cover

---
# Hotel bookings

* Data from [Antonio, Almeida, & Nunes (2019)](https://doi.org/10.1016/j.dib.2018.11.126)

* We're using a slightly modified version that includes only *hotel stays*

.big[
* **Which hotel stays included children and/or babies?**
]

* Two methods:
    * Logistic regression
    * Random forest

.footnote[
Based on tidymodels [case study](https://www.tidymodels.org/start/case-study/).
]

???

In paper authors caution that distributions of some variables were different for canceled vs. not canceled. This makes sense because much of that information is gathered (or gathered again more accurately) when guests check in for their stay, so canceled bookings are likely to have more missing data than non-canceled bookings, and/or to have different characteristics when data is not missing

---
# `hotels`

```r
glimpse(hotels)
#> Rows: 50,000
#> Columns: 23
#> $ hotel                          <fct> City_Hotel, City_Hotel, Resort_Hotel, R…
#> $ lead_time                      <dbl> 217, 2, 95, 143, 136, 67, 47, 56, 80, 6…
#> $ stays_in_weekend_nights        <dbl> 1, 0, 2, 2, 1, 2, 0, 0, 0, 2, 1, 0, 1, …
#> $ stays_in_week_nights           <dbl> 3, 1, 5, 6, 4, 2, 2, 3, 4, 2, 2, 1, 2, …
#> $ adults                         <dbl> 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, 2, …
#> $ children                       <fct> none, none, none, none, none, none, chi…
#> $ meal                           <fct> BB, BB, BB, HB, HB, SC, BB, BB, BB, BB,…
#> $ country                        <fct> DEU, PRT, GBR, ROU, PRT, GBR, ESP, ESP,…
#> $ market_segment                 <fct> Offline_TA/TO, Direct, Online_TA, Onlin…
#> $ distribution_channel           <fct> TA/TO, Direct, TA/TO, TA/TO, Direct, TA…
#> $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ reserved_room_type             <fct> A, D, A, A, F, A, C, B, D, A, A, D, A, …
#> $ assigned_room_type             <fct> A, K, A, A, F, A, C, A, D, A, D, D, A, …
#> $ booking_changes                <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ deposit_type                   <fct> No_Deposit, No_Deposit, No_Deposit, No_…
#> $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ customer_type                  <fct> Transient-Party, Transient, Transient, …
#> $ average_daily_rate             <dbl> 80.75, 170.00, 8.00, 81.00, 157.60, 49.…
#> $ required_car_parking_spaces    <fct> none, none, none, none, none, none, non…
#> $ total_of_special_requests      <dbl> 1, 3, 2, 1, 4, 1, 1, 1, 1, 1, 0, 1, 0, …
#> $ arrival_date                   <date> 2016-09-01, 2017-08-25, 2016-11-19, 20…
```

---
class: your-turn

# Your turn 1

Look at our outcome variable, `children`.

* What are the levels?

* What proportion of cases are in each level?

---
class: your-turn

```r
hotels %>%
  count(children) %>%
  mutate(prop = n / sum(n))
#> # A tibble: 2 x 3
#>   children     n   prop
#>   <fct>    <int>  <dbl>
#> 1 children  4038 0.0808
#> 2 none     45962 0.919
```

???

With such a large class imbalance, we may consider upsampling or downsampling. For now, let's use the data as is.

---
class: your-turn

# Your turn 2

Create an initial split of our data to create a training and a testing set. Call the split `splits`.

Because of the class imbalance, stratify the split by our outcome variable, `children`.

Extract the training data as `hotel_other` and the testing data as `hotel_test`.

Keep `set.seed(123)` in your code!

---
class: your-turn

```r
set.seed(123)

splits <- initial_split(hotels, strata = children)
splits
#> <Analysis/Assess/Total>
#> <37500/12500/50000>

hotel_other <- training(splits)
hotel_test <- testing(splits)
```

---

```r
hotel_other %>%
  count(children) %>%
  mutate(prop = n / sum(n))
#> # A tibble: 2 x 3
#>   children     n   prop
#>   <fct>    <int>  <dbl>
#> 1 children  3048 0.0813
#> 2 none     34452 0.919

hotel_test %>%
  count(children) %>%
  mutate(prop = n / sum(n))
#> # A tibble: 2 x 3
#>   children     n   prop
#>   <fct>    <int>  <dbl>
#> 1 children   990 0.0792
#> 2 none     11510 0.921
```

---
class: your-turn

# Your turn 3

How to choose a prediction model? Let's use resampling to evaluate our potential models.

Create a 10-fold cross validation, stratified by `children`. Call it `folds`.

---
class: your-turn

```r
set.seed(789)

folds <- vfold_cv(hotel_other, v = 10, strata = children)
folds
#> #  10-fold cross-validation using stratification 
#> # A tibble: 10 x 2
#>    splits               id    
#>    <list>               <chr> 
#>  1 <split [33750/3750]> Fold01
#>  2 <split [33750/3750]> Fold02
#>  3 <split [33750/3750]> Fold03
#>  4 <split [33750/3750]> Fold04
#>  5 <split [33750/3750]> Fold05
#>  6 <split [33750/3750]> Fold06
#>  7 <split [33750/3750]> Fold07
#>  8 <split [33750/3750]> Fold08
#>  9 <split [33750/3750]> Fold09
#> 10 <split [33750/3750]> Fold10
```

---
# Model 1: Penalized logistic regression

* Can be estimated in R using the {glmnet} package.

* A *penalty* is applied to slope parameters to pull less relevant predictors toward zero.

* What size penalty should be used?

???

Similar to Lasso regression, where slopes can actually be set to 0 if there is a large enough penalty.

---
class: your-turn

# Your turn 4

Build a model specification for the penalized logistic regression, called `lr_mod`.

Define the model so that the `penalty` for the model can be tuned.

Use `glmnet` as the model engine.

---
class: your-turn

```r
lr_mod <- 
  logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")
```

???

Mixture = 1 means that the glmnet model will potentially remove irrelevant predictors and choose a simpler model.

---
# Model 1 recipe

---
class: pop-quiz

# Pop quiz!

What date-based predictors might be useful related to arrival date?

Year, month, day

Holidays

---
# Model 1 recipe

* Date-based predictors
    * [`step_date()`](https://recipes.tidymodels.org/reference/step_date.html)
    * [`step_holiday()`](https://recipes.tidymodels.org/reference/step_holiday.html)

* Convert factors to dummy coded variables

* Remove variables that have only 1 value

* Center and scale numeric variables

---
class: your-turn

# Your turn 5

Using the provided list of holidays, create a recipe, called `lr_recipe`, for the logistic regression model that:

1. Creates `year`, `month`, and `day` variables from `arrival_date`.
2. Creates holiday indicators based on the `arrival_date`.
3. Remove `arrival_date` (not necessary with other features).
4. Create dummy variables from all predictors that are factors.
5. Remove all variables have only one value.
6. Center and scale all predictors.

---
class: your-turn

```r
holidays <- c("AllSouls", "AshWednesday", "ChristmasEve", "Easter", 
              "ChristmasDay", "GoodFriday", "NewYearsDay", "PalmSunday")

lr_recipe <- 
  recipe(children ~ ., data = hotel_other) %>% 
  step_date(arrival_date) %>% 
  step_holiday(arrival_date, holidays = holidays) %>% 
  step_rm(arrival_date) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors())
```

---
class: your-turn

# Your turn 6

Create an `lr_workflow` from the model specification and recipe you just created.

---
class: your-turn

```r
lr_workflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(lr_recipe)
```

---
class: your-turn

# Your turn 7

Tune the logistic regression workflow. Use the provided grid of tuning parameter values.

Which penalty value provides the best area under the ROC curve?

Create a plot showing `penalty` on the x-axis and area under the ROC curve on the y-axis.

---
class: your-turn

.panelset[
.panel[.panel-name[Tune Workflow]

```r
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))

lr_res <-
  lr_workflow %>%
  tune_grid(folds,
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))

lr_res %>%
  show_best()
#> # A tibble: 5 x 7
#>    penalty .metric .estimator  mean     n std_err .config              
#>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1 0.00137  roc_auc binary     0.875    10 0.00303 Preprocessor1_Model12
#> 2 0.00174  roc_auc binary     0.875    10 0.00300 Preprocessor1_Model13
#> 3 0.00108  roc_auc binary     0.875    10 0.00310 Preprocessor1_Model11
#> 4 0.000853 roc_auc binary     0.875    10 0.00315 Preprocessor1_Model10
#> 5 0.00221  roc_auc binary     0.875    10 0.00299 Preprocessor1_Model14
```
]

.panel[.panel-name[Create Plot]

```r
lr_res %>%
  collect_metrics() %>%
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  geom_line() +
  labs(y = "Area under the ROC Curve") +
  scale_x_log10(labels = scales::label_number())
```

]

.panel[.panel-name[Plot]
<img src="images/case-study-2/plots/show-lr-roc-1.png" width="80%" style="display: block; margin: auto;" />
]
]

???

What does this show us? Performance is better at the smaller penalty values, suggesting that the majority of the predictors are important. We also see a steep drop in the area under the ROC curve towards the highest penalty values. This happens because a large enough penalty will remove all predictors from the model, and not surprisingly predictive accuracy plummets with no predictors in the model (recall that an ROC AUC value of 0.50 means that the model does no better than chance at predicting the correct class).

---

```r
top_models <- lr_res %>%
  show_best("roc_auc", n = 15) %>%
  arrange(penalty)
top_models
#> # A tibble: 15 x 7
#>     penalty .metric .estimator  mean     n std_err .config              
#>       <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#>  1 0.0001   roc_auc binary     0.873    10 0.00322 Preprocessor1_Model01
#>  2 0.000127 roc_auc binary     0.873    10 0.00326 Preprocessor1_Model02
#>  3 0.000161 roc_auc binary     0.873    10 0.00329 Preprocessor1_Model03
#>  4 0.000204 roc_auc binary     0.874    10 0.00330 Preprocessor1_Model04
#>  5 0.000259 roc_auc binary     0.874    10 0.00331 Preprocessor1_Model05
#>  6 0.000329 roc_auc binary     0.874    10 0.00328 Preprocessor1_Model06
#>  7 0.000418 roc_auc binary     0.874    10 0.00324 Preprocessor1_Model07
#>  8 0.000530 roc_auc binary     0.875    10 0.00322 Preprocessor1_Model08
#>  9 0.000672 roc_auc binary     0.875    10 0.00319 Preprocessor1_Model09
#> 10 0.000853 roc_auc binary     0.875    10 0.00315 Preprocessor1_Model10
#> 11 0.00108  roc_auc binary     0.875    10 0.00310 Preprocessor1_Model11
#> 12 0.00137  roc_auc binary     0.875    10 0.00303 Preprocessor1_Model12
#> 13 0.00174  roc_auc binary     0.875    10 0.00300 Preprocessor1_Model13
#> 14 0.00221  roc_auc binary     0.875    10 0.00299 Preprocessor1_Model14
#> 15 0.00281  roc_auc binary     0.874    10 0.00299 Preprocessor1_Model15
```

???

Our model performance seems to plateau at the smaller penalty values, so going by the `roc_auc` metric alone could lead us to multiple options for the “best” value for this hyperparameter.

---
# Which is best?

```r
lr_best <- lr_res %>%
  select_best()
lr_best
#> # A tibble: 1 x 2
#>   penalty .config              
#>     <dbl> <chr>                
#> 1 0.00137 Preprocessor1_Model12
```

---
<img src="images/case-study-2/plots/lr-select-best-1.png" width="90%" style="display: block; margin: auto;" />

---

???

We may want to choose a penalty value further along the x-axis, closer to where we start to see the decline in model performance. For example, candidate model 14 with a penalty value of 0.00221 has effectively the same performance as the numerically best model, but might eliminate more predictors. This penalty value is marked by the solid line above. In general, fewer irrelevant predictors is better. If performance is about the same, we’d prefer to choose a higher penalty value.

---

```r
lr_best <- lr_res %>%
  collect_metrics() %>%
  arrange(penalty) %>%
  slice(14)

lr_best
#> # A tibble: 1 x 7
#>   penalty .metric .estimator  mean     n std_err .config              
#>     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1 0.00221 roc_auc binary     0.875    10 0.00299 Preprocessor1_Model14
```

---
class: your-turn

# Your turn 8

Create an ROC curve for the selected `penalty` value.

1. Use `collect_predictions()` with `parameters = lr_best` to only get predictions for our selected penalty value.

2. Use the predictions and `roc_curve()` to make the data for the curve.

3. Add an additional variable called `model` that has the value `"Logistic Regression"`. We'll need this later.

4. Plot the ROC curve.

---
class: your-turn

.panelset[
.panel[.panel-name[Code]

```r
lr_auc <- lr_res %>%
  collect_predictions(parameters = lr_best) %>%
  roc_curve(truth = children, estimate = .pred_children) %>%
  mutate(model = "Logistic Regression")

autoplot(lr_auc)
```
]

.panel[.panel-name[Plot]
<img src="images/case-study-2/plots/unnamed-chunk-22-1.png" width="50%" style="display: block; margin: auto;" />
]
]

???

ROC of 0.8747951 is good, but maybe we can do better with a non-liner model. Let's try a random forest!

---
# Model 2: Random forest

* Several packages, but we'll use {ranger}.

* More flexible than logistic regression.

* Ensemble of many decision trees.

---
class: your-turn

# Your turn 9

Build a model specification for the random forest model, called `rf_mod`.

Define the model so that the `mtry` and `min_n` for the model can be tuned.

Use `ranger` as the model engine.

Remember to set the mode!

---
class: your-turn

```r
rf_mod <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")
```

---
# Model 2 recipe

* Random forests do not require dummy or normalized predictors.

* We still want our date-based features.

---
class: your-turn

# Your turn 10

Using the same list of holidays, create a recipe, called `rf_recipe`, for the random forest model that:

1. Creates `year`, `month`, and `day` variables from `arrival_date`.
2. Creates holiday indicators based on the `arrival_date`.
3. Remove `arrival_date` (not necessary with other features).

---
class: your-turn

```r
rf_recipe <-
  recipe(children ~ ., data = hotel_other) %>%
  step_date(arrival_date) %>% 
  step_holiday(arrival_date, holidays = holidays) %>% 
  step_rm(arrival_date)
```

---
class: your-turn

# Your turn 11

Create an `rf_workflow` from the model sepcification and the recipe you just created.

---
class: your-turn

```r
rf_workflow <- 
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_recipe)
```

---
# Tune the workflow

---
class: center middle inverse

# ⚠️ Warning ⚠️

Tuning this model will take a **long** time (> 1 hour)

---
# Tune the workflow

```r
set.seed(345)

rf_res <- rf_workflow %>%
  tune_grid(folds,
            grid = 25,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))
```

```r
rf_res <- read_rds(here("materials", "data", "hotels-rf-tune.rds"))
```

---
class: your-turn

# Your turn 12

Which `mtry` and `min_n` values provide the best area under the ROC curve?

---
class: your-turn

.panelset[
.panel[.panel-name[Tune Workflow]

```r
all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

set.seed(345)

rf_res <- rf_workflow %>%
  tune_grid(folds,
            grid = 25,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))
```
]

.panel[.panel-name[Results]

```r
rf_res %>%
  show_best(metric = "roc_auc")
#> # A tibble: 5 x 8
#>    mtry min_n .metric .estimator  mean     n std_err .config              
#>   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1     7    11 roc_auc binary     0.925    10 0.00192 Preprocessor1_Model09
#> 2     4     5 roc_auc binary     0.925    10 0.00176 Preprocessor1_Model22
#> 3     9    10 roc_auc binary     0.924    10 0.00189 Preprocessor1_Model24
#> 4     7    15 roc_auc binary     0.924    10 0.00181 Preprocessor1_Model07
#> 5    12     6 roc_auc binary     0.923    10 0.00203 Preprocessor1_Model23
```
]

.panel[.panel-name[Plot]
<img src="images/case-study-2/plots/rf-tune-plot-1.png" width="80%" style="display: block; margin: auto;" />
]
]

???

Plotting the results of the tuning process highlights that both mtry (number of predictors at each node) and min_n (minimum number of data points required to keep splitting) should be fairly small to optimize performance. However, the range of the y-axis indicates that the model is very robust to the choice of these parameter values---all but one of the ROC AUC values are greater than 0.90.

---

```r
rf_best <- rf_res %>%
  select_best(metric = "roc_auc")

rf_best
#> # A tibble: 1 x 3
#>    mtry min_n .config              
#>   <int> <int> <chr>                
#> 1     7    11 Preprocessor1_Model09
```

---
class: your-turn

# Your turn 13

Create an ROC curve for the selected `mtry` and `min_n` values.

1. Use `collect_predictions()` with `parameters = rf_best` to only get predictions for our selected penalty value.

2. Use the predictions and `roc_curve()` to make the data for the curve.

3. Add an additional variable called `model` that has the value `"Random Forest"`. We'll need this later.

4. Plot the ROC curve.

---
class: your-turn

.panelset[
.panel[.panel-name[Code]

```r
rf_auc <- rf_res %>%
  collect_predictions(parameters = rf_best) %>%
  roc_curve(truth = children, estimate = .pred_children) %>%
  mutate(model = "Random Forest")

autoplot(rf_auc)
```
]

.panel[.panel-name[Plot]
<img src="images/case-study-2/plots/unnamed-chunk-35-1.png" width="50%" style="display: block; margin: auto;" />
]
]

---
class: your-turn

# Your turn 14

Compare the area under the ROC curves for our selected logistic regression and random forest models.

1. Which model provides the best ROC AUC?

2. Plot both ROC curves together. Is one model uniformly better?

Which model should we select as our final model?

---
class: your-turn

.panelset[
.panel[.panel-name[Code]

```r
lr_best
#> # A tibble: 1 x 7
#>   penalty .metric .estimator  mean     n std_err .config              
#>     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1 0.00221 roc_auc binary     0.875    10 0.00299 Preprocessor1_Model14

show_best(rf_res, n = 1)
#> # A tibble: 1 x 8
#>    mtry min_n .metric .estimator  mean     n std_err .config              
#>   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1     7    11 roc_auc binary     0.925    10 0.00192 Preprocessor1_Model09

bind_rows(lr_auc, rf_auc) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_line(size = 1.5, alpha = 0.8) +
  geom_abline(linetype = "dotted") +
  scale_color_viridis_d(option = "plasma", end = 0.6) +
  coord_equal()
```
]

.panel[.panel-name[Plot]
<img src="images/case-study-2/plots/rocauc-compare-1.png" width="50%" style="display: block; margin: auto;" />
]
]

---
class: center middle inverse

# Now what?

---
background-image: url(images/case-study-2/tm-process/tmprocess.007.png)
background-size: cover

---
background-image: url(images/case-study-2/tm-process/tmprocess.010.png)
background-size: cover

---
class: your-turn

# Your turn 15

Create a new random forest model specification that uses our tuned values of `mtry` and `min_n`. Use `ranger` as the engine, with `importance = "impurity"`.

Create a new workflow, called `last_rf_workflow`, that is based on our original random forest workflow, but updated with the new model specification.

---
class: your-turn

```r
rf_best
#> # A tibble: 1 x 3
#>    mtry min_n .config              
#>   <int> <int> <chr>                
#> 1     7    11 Preprocessor1_Model09

last_rf_mod <-
  rand_forest(mtry = 7, min_n = 11, trees = 1000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

last_rf_workflow <-
  rf_workflow %>%
  update_model(last_rf_mod)
```

---
class: your-turn

# Your turn 16

Using our final workflow, fit the random forest model to our entire training set. Then using predictions from the test set, find the area under the ROC curve for our final model.

---
class: your-turn

```r
set.seed(345)

last_rf_fit <-
  last_rf_workflow %>%
* last_fit(splits)

last_rf_fit %>%
  collect_metrics()
#> # A tibble: 2 x 4
#>   .metric  .estimator .estimate .config             
#>   <chr>    <chr>          <dbl> <chr>               
#> 1 accuracy binary         0.945 Preprocessor1_Model1
#> 2 roc_auc  binary         0.923 Preprocessor1_Model1
```

---
class: center middle inverse

# final looks

---
# Performance Comparison

```r
show_best(rf_res, metric = "roc_auc", n = 1)
#> # A tibble: 1 x 8
#>    mtry min_n .metric .estimator  mean     n std_err .config              
#>   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1     7    11 roc_auc binary     0.925    10 0.00192 Preprocessor1_Model09

collect_metrics(last_rf_fit)
#> # A tibble: 2 x 4
#>   .metric  .estimator .estimate .config             
#>   <chr>    <chr>          <dbl> <chr>               
#> 1 accuracy binary         0.945 Preprocessor1_Model1
#> 2 roc_auc  binary         0.923 Preprocessor1_Model1
```

---
.panelset[
.panel[.panel-name[ROC Curve]

```r
last_rf_fit %>%
  collect_predictions() %>%
  roc_curve(children, .pred_children) %>%
  mutate(model = "Final") %>%
  bind_rows(rf_auc, lr_auc) %>%
  mutate(model = factor(model, levels = c("Logistic Regression", "Random Forest", "Final"))) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_line(size = 1.5, alpha = 0.8) +
  geom_abline(linetype = "dotted") +
  scale_color_manual(values = c("Random Forest" = light_pink,
                                "Logistic Regression" = light_blue,
                                "Final" = pink)) +
  coord_equal()
```

]

.panel[.panel-name[Plot]

]
]

---
.panelset[
.panel[.panel-name[Variable Importance]

```r
last_rf_fit %>%
  pluck(".workflow", 1) %>%
  pull_workflow_fit() %>%
  vip(num_features = 20)
```

]

.panel[.panel-name[Plot]

]
]

---
class: title-slide, center

# Case Study

## Tidy Data Science with the Tidyverse and Tidymodels

### W. Jake Thompson

#### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) &#183; [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021)

.footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]