class: title-slide, center <span class="fa-stack fa-4x"> <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i> <strong class="fa-stack-1x" style="color:#009FB7;">9</strong> </span> # Classification ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).] <div style = "position:fixed; visibility: hidden"> `$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$` `$$\require{color}\definecolor{light_blue}{rgb}{0.0392156862745098, 0.870588235294118, 1}$$` `$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$` `$$\require{color}\definecolor{dark_yellow}{rgb}{0.635294117647059, 0.47843137254902, 0.00392156862745098}$$` `$$\require{color}\definecolor{pink}{rgb}{0.796078431372549, 0.16078431372549, 0.482352941176471}$$` `$$\require{color}\definecolor{light_pink}{rgb}{1, 0.552941176470588, 0.776470588235294}$$` `$$\require{color}\definecolor{grey}{rgb}{0.411764705882353, 0.403921568627451, 0.450980392156863}$$` </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { blue: ["{\\color{blue}{#1}}", 1], light_blue: ["{\\color{light_blue}{#1}}", 1], yellow: ["{\\color{yellow}{#1}}", 1], dark_yellow: ["{\\color{dark_yellow}{#1}}", 1], pink: ["{\\color{pink}{#1}}", 1], light_pink: ["{\\color{light_pink}{#1}}", 1], grey: ["{\\color{grey}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> --- class: your-turn # Your Turn 0 .big[ * Open the R Notebook **materials/exercises/09-classification.Rmd** * Run the setup chunk ]
--- class: middle, center, frame # Goal of Machine Learning -- ## 🔨 construct .display[models] that -- ## 🎯 generate .display[accurate predictions] -- ## 🆕 for .display[future, yet-to-be-seen data] -- .footnote[Max Kuhn & Kjell Johnson, http://www.feat.engineering/] --- class: inverse, middle, center A model doesn't have to be a straight line... <img src="images/classification/plots/lm-fig-1.svg" width="50%" style="display: block; margin: auto;" /> --- class: inverse, middle, center .pull-left[ <img src="images/classification/plots/lm-fig-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/classification/plots/poly-fig-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # Decision Trees .big[ To predict the outcome of a new data point: ] * Use rules learned from splits * Each split maximizes information gain --- class: middle, center ![](https://media.giphy.com/media/gj4ZruUQUnpug/source.gif) --- <img src="images/classification/plots/rt-splits-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/rt-split-smooth-1.png" width="55%" style="display: block; margin: auto;" /> --- class: pop-quiz # Consider .big[How do we assess predictions here?] -- RMSE? 
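---

# RMSE, by hand

The tree's predictions can be scored the same way as the linear model's. As a reminder of what RMSE measures, here is a minimal sketch in base R; the sale prices below are made up for illustration and are not from the Ames data:

```r
# Hypothetical observed sale prices and model predictions
truth <- c(105000, 126000, 118000)
pred  <- c(110000, 120000, 115000)

# Root mean squared error: square the misses, average them,
# then take the square root to get back to the price scale
rmse_by_hand <- sqrt(mean((truth - pred)^2))
rmse_by_hand
```

This is the same quantity that `yardstick::rmse()` computes from a data frame with truth and estimate columns.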
--- <img src="images/classification/plots/rt-split-rmse-1.png" width="55%" style="display: block; margin: auto;" /> --- .pull-left[ ### LM RMSE = 53884.78 <img src="images/classification/plots/lm-test-resid-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ ### Tree RMSE = 61687.24 <img src="images/classification/plots/print-rt-split-rmse-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: inverse, middle, center .pull-left[ <img src="images/classification/plots/print-lm-fit-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/classification/plots/dt-fig-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: middle, center, inverse # What is a model? --- # K Nearest Neighbors (KNN) .big[ To predict the outcome of a new data point: ] * Find the K most similar old data points * Take the average/mode/etc. outcome --- ```r library(kknn) knn_spec <- nearest_neighbor(neighbors = 5) %>% set_engine("kknn") %>% set_mode("regression") set.seed(100) knn_fit <- fit(knn_spec, Sale_Price ~ ., data = ames_train) knn_pred <- knn_fit %>% predict(new_data = ames_test) %>% mutate(price_truth = ames_test$Sale_Price) rmse(knn_pred, truth = price_truth, estimate = .pred) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 rmse standard 35870. 
rsq(knn_pred, truth = price_truth, estimate = .pred) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 rsq standard 0.812 ``` --- exclude: true --- <img src="images/classification/plots/knn-home1-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/knn-home2-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/knn-home2-10-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/knn-home2-25-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/knn-home2-50-1.png" width="55%" style="display: block; margin: auto;" /> --- .pull-left[ <img src="images/classification/plots/knn-home2-1-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/classification/plots/underfit-knn-1.png" width="100%" style="display: block; margin: auto;" /> ] --- .pull-left[ <img src="images/classification/plots/knn-home2-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/classification/plots/fit-knn-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: pop-quiz # Pop quiz! 
[Why is logistic regression considered a linear model?](https://sebastianraschka.com/faq/docs/logistic_regression_linear.html) .center[ <img src="images/classification/plots/lr-fig-1.svg" width="40%" style="display: block; margin: auto;" /> ] --- class: middle, center <img src="https://raw.githubusercontent.com/EmilHvitfeldt/blog/master/static/blog/2019-08-09-authorship-classification-with-tidymodels-and-textrecipes_files/figure-html/unnamed-chunk-18-1.png" width="70%" style="display: block; margin: auto;" /> https://www.hvitfeldt.me/blog/authorship-classification-with-tidymodels-and-textrecipes/ --- class: middle, center <img src="https://www.kaylinpavlik.com/content/images/2019/12/dt-1.png" width="50%" style="display: block; margin: auto;" /> https://www.kaylinpavlik.com/classifying-songs-genres/ --- class: middle, center <img src="images/classification/sing-tree.png" width="607" style="display: block; margin: auto;" /> [The Science of Singing Along](http://www.doc.gold.ac.uk/~mas03dm/papers/PawleyMullensiefen_Singalong_2012.pdf) --- class: middle, center <img src="https://a3.typepad.com/6a0105360ba1c6970c01b7c95c61fb970b-pi" width="40%" style="display: block; margin: auto;" /> .footnote[[tweetbotornot2](https://github.com/mkearney/tweetbotornot2)] --- name: guess-the-animal class: middle, center, inverse <img src="http://www.atarimania.com/8bit/screens/guess_the_animal.gif" width="90%" style="display: block; margin: auto;" /> --- # What makes a good guesser? -- .big[High information gain per question (can it fly?)] -- .big[Clear features (feathers vs. is it "small"?)] -- .big[Order matters] --- class: inverse, middle, center # Congratulations! 
You just built a decision tree 🎉 --- background-image: url(images/classification/aus-standard-animals.png) background-size: cover .footnote[[Australian Computing Academy](https://aca.edu.au/resources/decision-trees-classifying-animals/)] --- background-image: url(images/classification/annotated-tree-00.png) background-size: 80% .footnote[[Australian Computing Academy](https://aca.edu.au/resources/decision-trees-classifying-animals/)] --- background-image: url(images/classification/annotated-tree-01.png) background-size: 80% --- background-image: url(images/classification/annotated-tree-02.png) background-size: 80% --- background-image: url(images/classification/annotated-tree-03.png) background-size: 80% --- background-image: url(images/classification/annotated-tree-04.png) background-size: 80% --- background-image: url(images/classification/annotated-tree-05.png) background-size: 80% --- background-image: url(images/classification/bonsai-anatomy.jpg) background-size: cover --- background-image: url(images/classification/bonsai-anatomy-flip.jpg) background-size: cover --- class: pop-quiz # Pop quiz! .big[Name that variable type!] <img src="images/classification/vartypes_quiz.png" width="60%" style="display: block; margin: auto;" />
--- class: pop-quiz <img src="images/classification/vartypes_answers.png" width="80%" style="display: block; margin: auto;" /> --- class: pop-quiz <img src="images/classification/vartypes_unicorn.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center, middle # Show of hands How many people have .display[fit] a logistic regression model with `glm()`? --- exclude: true --- class: middle, center, inverse .pull-left[ <img src="images/classification/plots/show-unicorn-1.png" width="1695" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/classification/plots/show-horse-1.png" width="1695" style="display: block; margin: auto;" /> ] --- .pull-left[ ```r uni_train %>% count(unicorn) #> unicorn n #> 1 0 100 #> 2 1 50 ``` ] .pull-right[ <img src="images/classification/plots/unicorn-box-1.png" width="100%" style="display: block; margin: auto;" /> ] --- <img src="images/classification/plots/unicorn-density-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/unicorn-lr-1.png" width="55%" style="display: block; margin: auto;" /> ??? Logistic regression model --- <img src="images/classification/plots/unicorn-probs-1.png" width="55%" style="display: block; margin: auto;" /> ??? The probability that each observation is a unicorn --- <img src="images/classification/plots/unicorn-pred-class-1.png" width="55%" style="display: block; margin: auto;" /> ??? 
Predicted class of each observation --- class: middle, center <img src="images/classification/plots/unicorn-clusters-1.png" width="55%" style="display: block; margin: auto;" /> --- ``` #> parsnip model object #> #> Fit time: 2ms #> n= 150 #> #> node), split, n, loss, yval, (yprob) #> * denotes terminal node #> #> 1) root 150 50 0 (0.6666667 0.3333333) #> 2) n_butterflies>=29.5 93 16 0 (0.8279570 0.1720430) * #> 3) n_butterflies< 29.5 57 23 1 (0.4035088 0.5964912) #> 6) n_kittens>=62.5 18 6 0 (0.6666667 0.3333333) * #> 7) n_kittens< 62.5 39 11 1 (0.2820513 0.7179487) * ``` --- <img src="images/classification/plots/uni-tree-partykit-1.png" width="720" style="display: block; margin: auto;" /> --- ``` #> nn ..y 0 1 cover #> 2 0 [.83 .17] when n_butterflies >= 30 62% #> 6 0 [.67 .33] when n_butterflies < 30 & n_kittens >= 63 12% #> 7 1 [.28 .72] when n_butterflies < 30 & n_kittens < 63 26% ``` --- .pull-left[ <img src="images/classification/plots/og-data-no-div-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="images/classification/plots/pred-data-no-div-1.png" width="100%" style="display: block; margin: auto;" /> ] --- .pull-left[ <img src="images/classification/plots/kitten-div-1-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="images/classification/plots/butterfly-div-1-1.png" width="100%" style="display: block; margin: auto;" /> ] -- ### .center[🦋 split wins] --- .pull-left[ <img src="images/classification/plots/kitten-div-2-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="images/classification/plots/butterfly-div-2-1.png" width="100%" style="display: block; margin: auto;" /> ] -- ### .center[🐱 split wins] --- class: middle, center # Sadly, we are not classifying unicorns today <img src="images/classification/sad_unicorn.png" width="20%" style="display: block; margin: auto;" /> --- background-image: 
url(images/classification/copyingandpasting-big.png) background-size: contain background-position: center class: middle, center --- background-image: url(images/classification/so-dev-survey.png) background-size: contain background-position: center class: middle, center --- <img src="https://github.com/juliasilge/supervised-ML-case-studies-course/blob/master/img/remote_size.png?raw=true" width="80%" style="display: block; margin: auto;" /> .footnote[[Julia Silge](https://supervised-ml-course.netlify.app/)] ??? Notes: The specific question we are going to address is what makes a developer more likely to work remotely. Developers can work in their company offices or they can work remotely, and it turns out that there are specific characteristics of developers, such as the size of the company that they work for, how much experience they have, or where in the world they live, that affect how likely they are to be a remote developer. --- # StackOverflow Data ```r glimpse(stackoverflow) #> Rows: 1,150 #> Columns: 21 #> $ country <fct> United States, United States, Uni… #> $ salary <dbl> 63750.00, 93000.00, 40625.00, 450… #> $ years_coded_job <int> 4, 9, 8, 3, 8, 12, 20, 17, 20, 4,… #> $ open_source <dbl> 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, … #> $ hobby <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, … #> $ company_size_number <dbl> 20, 1000, 10000, 1, 10, 100, 20, … #> $ remote <fct> Remote, Remote, Remote, Remote, R… #> $ career_satisfaction <int> 8, 8, 5, 10, 8, 10, 9, 7, 8, 7, 9… #> $ data_scientist <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ database_administrator <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, … #> $ desktop_applications_developer <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ developer_with_stats_math_background <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, … #> $ dev_ops <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, … #> $ embedded_developer <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, … #> $ graphic_designer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ graphics_programming <dbl> 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ machine_learning_specialist <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ mobile_developer <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, … #> $ quality_assurance_engineer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ systems_administrator <dbl> 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, … #> $ web_developer <dbl> 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, … ``` --- # `initial_split()` .big["Splits" data randomly into a single testing and a single training set; extract `training` and `testing` sets from an rsplit] ```r set.seed(100) # Important! so_split <- initial_split(stackoverflow, strata = remote) so_train <- training(so_split) so_test <- testing(so_split) ``` --- class: your-turn # Your turn 1 .big[Using the `so_train` and `so_test` data sets, how many individuals in our training set are remote? How about in the testing set?]
--- class: your-turn ```r so_train %>% count(remote) #> # A tibble: 2 x 2 #> remote n #> <fct> <int> #> 1 Remote 432 #> 2 Not remote 432 so_test %>% count(remote) #> # A tibble: 2 x 2 #> remote n #> <fct> <int> #> 1 Remote 143 #> 2 Not remote 143 ``` --- .pull-left[ ```r so_train %>% count(remote) #> # A tibble: 2 x 2 #> remote n #> <fct> <int> #> 1 Remote 432 #> 2 Not remote 432 so_test %>% count(remote) #> # A tibble: 2 x 2 #> remote n #> <fct> <int> #> 1 Remote 143 #> 2 Not remote 143 ``` ] .pull-right[ <img src="images/classification/plots/so-box-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: inverse # How would we fit a tree with parsnip? <img src="images/hex/parsnip.png" width="30%" style="display: block; margin: auto;" /> --- class: middle, frame # .center[To specify a model with parsnip] .right-column[ 1\. Pick a .display[model] 2\. Set the .display[engine] 3\. Set the .display[mode] (if needed) ] --- # 1\. Pick a .display[model] All available models are listed at <https://www.tidymodels.org/find/parsnip/> .center[ <iframe src="https://www.tidymodels.org/find/parsnip/" width="80%" height="400px"></iframe> ] --- # 2\. Set the .display[engine] We'll use `rpart` for building `C`lassification `A`nd `R`egression `T`rees ```r set_engine("rpart") ``` --- # 3\. Set the .display[mode] A character string for the model type (e.g. "classification" or "regression") ```r set_mode("classification") ``` --- class: middle, frame # .center[To specify a model with parsnip] ```r decision_tree() %>% set_engine("rpart") %>% set_mode("classification") ``` --- class: your-turn # Your turn 2 Fill in the blanks. Use the `tree_spec` model provided and `fit()` to: 1. Train a CART-based model with the formula = `remote ~ years_coded_job + salary`. 1. Remind yourself what the output looks like! 1. Predict remote status with the testing data. 1. Keep `set.seed(100)` at the start of your code.
--- class: your-turn .panelset[ .panel[.panel-name[Fit Model] ```r tree_spec <- decision_tree() %>% set_engine("rpart") %>% set_mode("classification") set.seed(100) # Important! tree_fit <- fit(tree_spec, remote ~ years_coded_job + salary, data = so_train) ``` ] .panel[.panel-name[View Model] ```r tree_fit #> parsnip model object #> #> Fit time: 7ms #> n= 864 #> #> node), split, n, loss, yval, (yprob) #> * denotes terminal node #> #> 1) root 864 432 Remote (0.5000000 0.5000000) #> 2) salary>=89196.97 329 103 Remote (0.6869301 0.3130699) * #> 3) salary< 89196.97 535 206 Not remote (0.3850467 0.6149533) #> 6) salary< 6423.433 40 16 Remote (0.6000000 0.4000000) * #> 7) salary>=6423.433 495 182 Not remote (0.3676768 0.6323232) * ``` ] .panel[.panel-name[Make Predictions] ```r predict(tree_fit, new_data = so_test) #> # A tibble: 286 x 1 #> .pred_class #> <fct> #> 1 Remote #> 2 Remote #> 3 Not remote #> 4 Not remote #> 5 Remote #> 6 Not remote #> 7 Remote #> 8 Not remote #> 9 Remote #> 10 Not remote #> # … with 276 more rows ``` ] ] --- class: middle, center, frame # Goal of Machine Learning ## 🔨 construct .display[models] that .fade[ ## 🔮 generate accurate .display[predictions] ## 🆕 for .display[future, yet-to-be-seen data] ] --- class: middle, center, frame # Goal of Machine Learning .fade[ ## 🔨 construct .display[models] that ## 🔮 generate accurate .display[predictions] ] ## 🆕 for .display[future, yet-to-be-seen data] --- class: middle, center, frame # Goal of Machine Learning .fade[ ## 🔨 construct .display[models] that ] ## 🔮 generate accurate .display[predictions] .fade[ ## 🆕 for .display[future, yet-to-be-seen data] ] --- class: middle, center, frame # Goal of Machine Learning .fade[ ## 🔨 construct .display[models] that ] ## 🎯 generate .display[accurate predictions] .fade[ ## 🆕 for .display[future, yet-to-be-seen data] ] --- class: your-turn # Your turn 3 Create a data frame of the observed and predicted remote status for the `so_test` data. 
Then use `count()` to count the number of individuals (i.e., rows) by their true and predicted remote status. Answer the following questions: 1. How many predictions did we make? 1. How many times is "remote" status predicted? 1. How many respondents are actually remote? 1. How many predictions did we get right? *Hint: You can create a 2x2 table using* `count(var1, var2)`
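---

# Hint: `count()` with two variables

A sketch of the 2x2 pattern on a small made-up data frame (the labels here are hypothetical and unrelated to the StackOverflow data):

```r
library(dplyr)

# Toy observed and predicted labels, made up for illustration
toy <- tibble(
  truth     = c("cat", "cat", "dog", "dog", "dog"),
  predicted = c("cat", "dog", "dog", "dog", "cat")
)

# One row per truth/prediction combination, with counts in `n`
toy %>% count(truth, predicted)
```

Counting two variables at once gives one row per combination, which is exactly the set of cells in a 2x2 truth table.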
--- class: your-turn .panelset[ .panel[.panel-name[Create Predictions] ```r tree_predict <- predict(tree_fit, new_data = so_test) all_preds <- so_test %>% select(remote) %>% bind_cols(tree_predict) all_preds #> # A tibble: 286 x 2 #> remote .pred_class #> <fct> <fct> #> 1 Remote Remote #> 2 Remote Remote #> 3 Remote Not remote #> 4 Remote Not remote #> 5 Remote Remote #> 6 Remote Not remote #> 7 Remote Remote #> 8 Remote Not remote #> 9 Remote Remote #> 10 Remote Not remote #> # … with 276 more rows ``` ] .panel[.panel-name[Evaluate Predictions] ```r all_preds %>% count(.pred_class, truth = remote) #> # A tibble: 4 x 3 #> .pred_class truth n #> <fct> <fct> <int> #> 1 Remote Remote 89 #> 2 Remote Not remote 40 #> 3 Not remote Remote 54 #> 4 Not remote Not remote 103 ``` ] ] --- # `conf_mat()` .big[ Creates a confusion matrix, or truth table, from a data frame with observed and predicted classes. ] ```r conf_mat(data, truth = remote, estimate = .pred_class) ``` --- ```r all_preds %>% conf_mat(truth = remote, estimate = .pred_class) #> Truth #> Prediction Remote Not remote #> Remote 89 40 #> Not remote 54 103 ``` --- ```r all_preds %>% conf_mat(truth = remote, estimate = .pred_class) %>% autoplot(type = "heatmap") ``` <img src="images/classification/plots/conf-map-heat-1.png" width="40%" style="display: block; margin: auto;" /> --- background-image: url(images/classification/metrics/metrics-01.png) background-position: 50% 90% background-size: 80% # Confusion matrix --- background-image: url(images/classification/metrics/metrics-02.png) background-position: 50% 90% background-size: 80% # Confusion matrix --- background-image: url(images/classification/metrics/metrics-03.png) background-position: 50% 90% background-size: 80% # Confusion matrix --- background-image: url(images/classification/metrics/metrics-04.png) background-position: 50% 90% background-size: 80% # Confusion matrix --- background-image: url(images/classification/metrics/metrics-05.png) 
background-position: 50% 90% background-size: 80% # Accuracy --- background-image: url(images/classification/metrics/metrics-06.png) background-position: 50% 90% background-size: 80% # Accuracy --- background-image: url(images/classification/metrics/metrics-07.png) background-position: 50% 90% background-size: 80% # Accuracy --- background-image: url(images/classification/metrics/metrics-01.png) background-position: 50% 90% background-size: 80% # Sensitivity vs. Specificity --- background-image: url(images/classification/metrics/metrics-08.png) background-position: 50% 90% background-size: 80% # Sensitivity --- background-image: url(images/classification/metrics/metrics-09.png) background-position: 50% 90% background-size: 80% # Sensitivity --- background-image: url(images/classification/metrics/metrics-10.png) background-position: 50% 90% background-size: 80% # Specificity --- background-image: url(images/classification/metrics/metrics-11.png) background-position: 50% 90% background-size: 80% # Specificity --- # Metrics All available metrics are listed at <https://yardstick.tidymodels.org/articles/metric-types.html#metrics> .center[ <iframe src="https://yardstick.tidymodels.org/articles/metric-types.html#metrics" width="80%" height="400px"></iframe> ] --- # Calculating metrics ```r accuracy(all_preds, truth = remote, estimate = .pred_class) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.671 sensitivity(all_preds, truth = remote, estimate = .pred_class) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 sens binary 0.622 specificity(all_preds, truth = remote, estimate = .pred_class) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 spec binary 0.720 ``` --- # `metric_set()` .big[Combines multiple metric functions into a single function that calculates all of them at once.] 
```r so_metrics <- metric_set(accuracy, sensitivity, specificity) so_metrics(all_preds, truth = remote, estimate = .pred_class) #> # A tibble: 3 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.671 #> 2 sens binary 0.622 #> 3 spec binary 0.720 ``` --- # `roc_curve()` .big[ Takes predicted class probabilities, returns a tibble of sensitivity and specificity at each probability threshold. ] ```r roc_curve(all_preds, truth = remote, estimate = .pred_Remote) ``` .big[ Truth = the .display[observed] class Estimate = the .display[probability] of the target response ] ??? We don't have `.pred_Remote`. How do we get that? --- ```r all_preds <- so_test %>% select(remote) %>% bind_cols(predict(tree_fit, new_data = so_test)) %>% bind_cols(predict(tree_fit, new_data = so_test, type = "prob")) all_preds #> # A tibble: 286 x 4 #> remote .pred_class .pred_Remote `.pred_Not remote` #> <fct> <fct> <dbl> <dbl> #> 1 Remote Remote 0.687 0.313 #> 2 Remote Remote 0.687 0.313 #> 3 Remote Not remote 0.368 0.632 #> 4 Remote Not remote 0.368 0.632 #> 5 Remote Remote 0.687 0.313 #> 6 Remote Not remote 0.368 0.632 #> 7 Remote Remote 0.687 0.313 #> 8 Remote Not remote 0.368 0.632 #> 9 Remote Remote 0.687 0.313 #> 10 Remote Not remote 0.368 0.632 #> # … with 276 more rows ``` --- # `roc_curve()` ```r roc_curve(all_preds, truth = remote, estimate = .pred_Remote) #> # A tibble: 5 x 3 #> .threshold specificity sensitivity #> <dbl> <dbl> <dbl> #> 1 -Inf 0 1 #> 2 0.368 0 1 #> 3 0.6 0.720 0.622 #> 4 0.687 0.762 0.573 #> 5 Inf 1 0 ``` ??? `.threshold` = probability threshold needed to place an individual in the class. --- class: your-turn # Your turn 4 .big[ Build the necessary data frame, and use `roc_curve()` to calculate the data needed to construct the full ROC curve. What is the necessary threshold for achieving specificity > .75? ]
--- class: your-turn ```r all_preds <- so_test %>% select(remote) %>% bind_cols(predict(tree_fit, new_data = so_test)) %>% bind_cols(predict(tree_fit, new_data = so_test, type = "prob")) roc_curve(all_preds, truth = remote, estimate = .pred_Remote) #> # A tibble: 5 x 3 #> .threshold specificity sensitivity #> <dbl> <dbl> <dbl> #> 1 -Inf 0 1 #> 2 0.368 0 1 #> 3 0.6 0.720 0.622 #> 4 0.687 0.762 0.573 #> 5 Inf 1 0 ``` ??? For specificity of .75, we need a threshold of .687. --- .panelset[ .panel[.panel-name[Plot Code] ```r roc_curve(all_preds, truth = remote, estimate = .pred_Remote) %>% ggplot(mapping = aes(x = 1 - specificity, y = sensitivity)) + geom_line(color = "midnightblue", size = 1.5) + geom_abline(lty = 2, alpha = 0.5, color = "gray50", size = 1.2) ``` ] .panel[.panel-name[Plot] <img src="images/classification/plots/yt-roccurve-plot-1.png" width="50%" style="display: block; margin: auto;" /> ] ] --- ```r roc_curve(all_preds, truth = remote, estimate = .pred_Remote) %>% autoplot() ``` <img src="images/classification/plots/roccurve-autoplot-1.png" width="50%" style="display: block; margin: auto;" /> --- # Area under the curve .pull-left[ <img src="images/classification/plots/good-rocauc-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ * AUC = 0.5: random guessing * AUC = 1: perfect classifier * In general, an AUC above 0.8 is considered "good" * {yardstick} metric: `roc_auc()` ] --- # ROC curve: Guessing <img src="images/classification/plots/guess-rocauc-1.png" width="70%" style="display: block; margin: auto;" /> --- # ROC curve: Perfect <img src="images/classification/plots/perfect-rocauc-1.png" width="70%" style="display: block; margin: auto;" /> --- # ROC curve: Poor <img src="images/classification/plots/poor-rocauc-1.png" width="70%" style="display: block; margin: auto;" /> --- # ROC curve: OK <img src="images/classification/plots/ok-rocauc-1.png" width="70%" style="display: block; margin: auto;" /> --- # ROC curve: Good <img 
src="images/classification/plots/print-good-rocauc-1.png" width="70%" style="display: block; margin: auto;" /> --- class: your-turn # Your turn 5 .big[ Use `roc_auc()` to calculate the area under the ROC curve. Then plot the ROC curve using `autoplot()`. ]
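---

# What "area under the curve" means

The area is an integral over the curve's (1 - specificity, sensitivity) points, and one common way to approximate it is the trapezoid rule. A minimal base R sketch on made-up coordinates (these are not the StackOverflow model's values):

```r
# Hypothetical ROC coordinates: x = 1 - specificity, y = sensitivity
x <- c(0, 0.2, 0.5, 1)
y <- c(0, 0.6, 0.8, 1)

# Trapezoid rule: area of each vertical strip between consecutive points
auc_by_hand <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc_by_hand
```

In practice, use `roc_auc()`, which handles ties and curve direction for you.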
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r roc_auc(all_preds, truth = remote, estimate = .pred_Remote) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 roc_auc binary 0.678 roc_curve(all_preds, truth = remote, estimate = .pred_Remote) %>% autoplot() ``` ] .panel[.panel-name[Plot] <img src="images/classification/plots/yt-rocauc-sol-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- class: title-slide, center # Classification <img src="images/classification/pred-hex.png" width="40%" style="display: block; margin: auto;" /> ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]