class: title-slide, center <span class="fa-stack fa-4x"> <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i> <strong class="fa-stack-1x" style="color:#009FB7;">9</strong> </span> # Classification ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).] <div style = "position:fixed; visibility: hidden"> `$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$` `$$\require{color}\definecolor{light_blue}{rgb}{0.0392156862745098, 0.870588235294118, 1}$$` `$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$` `$$\require{color}\definecolor{dark_yellow}{rgb}{0.635294117647059, 0.47843137254902, 0.00392156862745098}$$` `$$\require{color}\definecolor{pink}{rgb}{0.796078431372549, 0.16078431372549, 0.482352941176471}$$` `$$\require{color}\definecolor{light_pink}{rgb}{1, 0.552941176470588, 0.776470588235294}$$` `$$\require{color}\definecolor{grey}{rgb}{0.411764705882353, 0.403921568627451, 0.450980392156863}$$` </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { blue: ["{\\color{blue}{#1}}", 1], light_blue: ["{\\color{light_blue}{#1}}", 1], yellow: ["{\\color{yellow}{#1}}", 1], dark_yellow: ["{\\color{dark_yellow}{#1}}", 1], pink: ["{\\color{pink}{#1}}", 1], light_pink: ["{\\color{light_pink}{#1}}", 1], grey: ["{\\color{grey}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> --- class: your-turn # Your Turn 0 .big[ * Open the R Notebook **materials/exercises/09-classification.Rmd** * Run the setup chunk ]
--- class: middle, center, frame # Goal of Machine Learning -- ## 🔨 construct .display[models] that -- ## 🎯 generate .display[accurate predictions] -- ## 🆕 for .display[future, yet-to-be-seen data] -- .footnote[Max Kuhn & Kjell Johnson, http://www.feat.engineering/] --- class: inverse, middle, center A model doesn't have to be a straight line... <img src="images/classification/plots/lm-fig-1.svg" width="50%" style="display: block; margin: auto;" /> --- class: inverse, middle, center .pull-left[ <img src="images/classification/plots/lm-fig-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/classification/plots/poly-fig-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # Decision Trees .big[ To predict the outcome of a new data point: ] * Use rules learned from splits * Each split maximizes information gain --- class: middle, center ![](https://media.giphy.com/media/gj4ZruUQUnpug/source.gif) --- <img src="images/classification/plots/rt-splits-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/rt-split-smooth-1.png" width="55%" style="display: block; margin: auto;" /> --- class: pop-quiz # Consider .big[How do we assess predictions here?] -- RMSE? 
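---

# RMSE, by hand

The tree's predictions can be scored the same way as the linear model's. As a reminder of what RMSE measures, here is a minimal sketch in base R; the sale prices below are made up for illustration and are not from the Ames data:

```r
# Hypothetical observed sale prices and model predictions
truth <- c(105000, 126000, 118000)
pred  <- c(110000, 120000, 115000)

# Root mean squared error: square the misses, average them,
# then take the square root to get back to the price scale
rmse_by_hand <- sqrt(mean((truth - pred)^2))
rmse_by_hand
```

This is the same quantity that `yardstick::rmse()` computes from a data frame with truth and estimate columns.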
--- <img src="images/classification/plots/rt-split-rmse-1.png" width="55%" style="display: block; margin: auto;" /> --- .pull-left[ ### LM RMSE = 53884.78 <img src="images/classification/plots/lm-test-resid-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ ### Tree RMSE = 61687.24 <img src="images/classification/plots/print-rt-split-rmse-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: inverse, middle, center .pull-left[ <img src="images/classification/plots/print-lm-fit-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/classification/plots/dt-fig-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: middle, center, inverse # What is a model? --- # K Nearest Neighbors (KNN) .big[ To predict the outcome of a new data point: ] * Find the K most similar old data points * Take the average/mode/etc. outcome --- ```r library(kknn) knn_spec <- nearest_neighbor(neighbors = 5) %>% set_engine("kknn") %>% set_mode("regression") set.seed(100) knn_fit <- fit(knn_spec, Sale_Price ~ ., data = ames_train) knn_pred <- knn_fit %>% predict(new_data = ames_test) %>% mutate(price_truth = ames_test$Sale_Price) rmse(knn_pred, truth = price_truth, estimate = .pred) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 rmse standard 35870. 
rsq(knn_pred, truth = price_truth, estimate = .pred) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 rsq standard 0.812 ``` --- exclude: true --- <img src="images/classification/plots/knn-home1-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/knn-home2-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/knn-home2-10-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/knn-home2-25-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/knn-home2-50-1.png" width="55%" style="display: block; margin: auto;" /> --- .pull-left[ <img src="images/classification/plots/knn-home2-1-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/classification/plots/underfit-knn-1.png" width="100%" style="display: block; margin: auto;" /> ] --- .pull-left[ <img src="images/classification/plots/knn-home2-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/classification/plots/fit-knn-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: pop-quiz # Pop quiz! 
[Why is logistic regression considered a linear model?](https://sebastianraschka.com/faq/docs/logistic_regression_linear.html) .center[ <img src="images/classification/plots/lr-fig-1.svg" width="40%" style="display: block; margin: auto;" /> ] --- class: middle, center <img src="https://raw.githubusercontent.com/EmilHvitfeldt/blog/master/static/blog/2019-08-09-authorship-classification-with-tidymodels-and-textrecipes_files/figure-html/unnamed-chunk-18-1.png" width="70%" style="display: block; margin: auto;" /> https://www.hvitfeldt.me/blog/authorship-classification-with-tidymodels-and-textrecipes/ --- class: middle, center <img src="https://www.kaylinpavlik.com/content/images/2019/12/dt-1.png" width="50%" style="display: block; margin: auto;" /> https://www.kaylinpavlik.com/classifying-songs-genres/ --- class: middle, center <img src="images/classification/sing-tree.png" width="607" style="display: block; margin: auto;" /> [The Science of Singing Along](http://www.doc.gold.ac.uk/~mas03dm/papers/PawleyMullensiefen_Singalong_2012.pdf) --- class: middle, center <img src="https://a3.typepad.com/6a0105360ba1c6970c01b7c95c61fb970b-pi" width="40%" style="display: block; margin: auto;" /> .footnote[[tweetbotornot2](https://github.com/mkearney/tweetbotornot2)] --- name: guess-the-animal class: middle, center, inverse <img src="http://www.atarimania.com/8bit/screens/guess_the_animal.gif" width="90%" style="display: block; margin: auto;" /> --- # What makes a good guesser? -- .big[High information gain per question (can it fly?)] -- .big[Clear features (feathers vs. is it "small"?)] -- .big[Order matters] --- class: inverse, middle, center # Congratulations! 
You just built a decision tree 🎉 --- background-image: url(images/classification/aus-standard-animals.png) background-size: cover .footnote[[Australian Computing Academy](https://aca.edu.au/resources/decision-trees-classifying-animals/)] --- background-image: url(images/classification/annotated-tree-00.png) background-size: 80% .footnote[[Australian Computing Academy](https://aca.edu.au/resources/decision-trees-classifying-animals/)] --- background-image: url(images/classification/annotated-tree-01.png) background-size: 80% --- background-image: url(images/classification/annotated-tree-02.png) background-size: 80% --- background-image: url(images/classification/annotated-tree-03.png) background-size: 80% --- background-image: url(images/classification/annotated-tree-04.png) background-size: 80% --- background-image: url(images/classification/annotated-tree-05.png) background-size: 80% --- background-image: url(images/classification/bonsai-anatomy.jpg) background-size: cover --- background-image: url(images/classification/bonsai-anatomy-flip.jpg) background-size: cover --- class: pop-quiz # Pop quiz! .big[Name that variable type!] <img src="images/classification/vartypes_quiz.png" width="60%" style="display: block; margin: auto;" />
--- class: pop-quiz <img src="images/classification/vartypes_answers.png" width="80%" style="display: block; margin: auto;" /> --- class: pop-quiz <img src="images/classification/vartypes_unicorn.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center, middle # Show of hands How many people have .display[fit] a logistic regression model with `glm()`? --- exclude: true --- class: middle, center, inverse .pull-left[ <img src="images/classification/plots/show-unicorn-1.png" width="1695" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/classification/plots/show-horse-1.png" width="1695" style="display: block; margin: auto;" /> ] --- .pull-left[ ```r uni_train %>% count(unicorn) #> unicorn n #> 1 0 100 #> 2 1 50 ``` ] .pull-right[ <img src="images/classification/plots/unicorn-box-1.png" width="100%" style="display: block; margin: auto;" /> ] --- <img src="images/classification/plots/unicorn-density-1.png" width="55%" style="display: block; margin: auto;" /> --- <img src="images/classification/plots/unicorn-lr-1.png" width="55%" style="display: block; margin: auto;" /> ??? Logistic regression model --- <img src="images/classification/plots/unicorn-probs-1.png" width="55%" style="display: block; margin: auto;" /> ??? The probability that each observation is a unicorn --- <img src="images/classification/plots/unicorn-pred-class-1.png" width="55%" style="display: block; margin: auto;" /> ??? 
Predicted class of each observation --- class: middle, center <img src="images/classification/plots/unicorn-clusters-1.png" width="55%" style="display: block; margin: auto;" /> --- ``` #> parsnip model object #> #> Fit time: 2ms #> n= 150 #> #> node), split, n, loss, yval, (yprob) #> * denotes terminal node #> #> 1) root 150 50 0 (0.6666667 0.3333333) #> 2) n_butterflies>=29.5 93 16 0 (0.8279570 0.1720430) * #> 3) n_butterflies< 29.5 57 23 1 (0.4035088 0.5964912) #> 6) n_kittens>=62.5 18 6 0 (0.6666667 0.3333333) * #> 7) n_kittens< 62.5 39 11 1 (0.2820513 0.7179487) * ``` --- <img src="images/classification/plots/uni-tree-partykit-1.png" width="720" style="display: block; margin: auto;" /> --- ``` #> nn ..y 0 1 cover #> 2 0 [.83 .17] when n_butterflies >= 30 62% #> 6 0 [.67 .33] when n_butterflies < 30 & n_kittens >= 63 12% #> 7 1 [.28 .72] when n_butterflies < 30 & n_kittens < 63 26% ``` --- .pull-left[ <img src="images/classification/plots/og-data-no-div-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="images/classification/plots/pred-data-no-div-1.png" width="100%" style="display: block; margin: auto;" /> ] --- .pull-left[ <img src="images/classification/plots/kitten-div-1-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="images/classification/plots/butterfly-div-1-1.png" width="100%" style="display: block; margin: auto;" /> ] -- ### .center[🦋 split wins] --- .pull-left[ <img src="images/classification/plots/kitten-div-2-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="images/classification/plots/butterfly-div-2-1.png" width="100%" style="display: block; margin: auto;" /> ] -- ### .center[🐱 split wins] --- class: middle, center # Sadly, we are not classifying unicorns today <img src="images/classification/sad_unicorn.png" width="20%" style="display: block; margin: auto;" /> --- background-image: 
url(images/classification/copyingandpasting-big.png) background-size: contain background-position: center class: middle, center --- background-image: url(images/classification/so-dev-survey.png) background-size: contain background-position: center class: middle, center --- <img src="https://github.com/juliasilge/supervised-ML-case-studies-course/blob/master/img/remote_size.png?raw=true" width="80%" style="display: block; margin: auto;" /> .footnote[[Julia Silge](https://supervised-ml-course.netlify.app/)] ??? Notes: The specific question we are going to address is what makes a developer more likely to work remotely. Developers can work in their company offices or they can work remotely, and it turns out that there are specific characteristics of developers, such as the size of the company that they work for, how much experience they have, or where in the world they live, that affect how likely they are to be a remote developer. --- # StackOverflow Data ```r glimpse(stackoverflow) #> Rows: 1,150 #> Columns: 21 #> $ country <fct> United States, United States, Uni… #> $ salary <dbl> 63750.00, 93000.00, 40625.00, 450… #> $ years_coded_job <int> 4, 9, 8, 3, 8, 12, 20, 17, 20, 4,… #> $ open_source <dbl> 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, … #> $ hobby <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, … #> $ company_size_number <dbl> 20, 1000, 10000, 1, 10, 100, 20, … #> $ remote <fct> Remote, Remote, Remote, Remote, R… #> $ career_satisfaction <int> 8, 8, 5, 10, 8, 10, 9, 7, 8, 7, 9… #> $ data_scientist <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ database_administrator <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, … #> $ desktop_applications_developer <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ developer_with_stats_math_background <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, … #> $ dev_ops <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, … #> $ embedded_developer <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, … #> $ graphic_designer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ graphics_programming <dbl> 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ machine_learning_specialist <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ mobile_developer <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, … #> $ quality_assurance_engineer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … #> $ systems_administrator <dbl> 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, … #> $ web_developer <dbl> 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, … ``` --- # `initial_split()` .big["Splits" data randomly into a single testing and a single training set; extract `training` and `testing` sets from an rsplit] ```r set.seed(100) # Important! so_split <- initial_split(stackoverflow, strata = remote) so_train <- training(so_split) so_test <- testing(so_split) ``` --- class: your-turn # Your turn 1 .big[Using the `so_train` and `so_test` data sets, how many individuals in our training set are remote? How about in the testing set?]
--- class: your-turn ```r so_train %>% count(remote) #> # A tibble: 2 x 2 #> remote n #> <fct> <int> #> 1 Remote 432 #> 2 Not remote 432 so_test %>% count(remote) #> # A tibble: 2 x 2 #> remote n #> <fct> <int> #> 1 Remote 143 #> 2 Not remote 143 ``` --- .pull-left[ ```r so_train %>% count(remote) #> # A tibble: 2 x 2 #> remote n #> <fct> <int> #> 1 Remote 432 #> 2 Not remote 432 so_test %>% count(remote) #> # A tibble: 2 x 2 #> remote n #> <fct> <int> #> 1 Remote 143 #> 2 Not remote 143 ``` ] .pull-right[ <img src="images/classification/plots/so-box-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: inverse # How would we fit a tree with parsnip? <img src="images/hex/parsnip.png" width="30%" style="display: block; margin: auto;" /> --- class: middle, frame # .center[To specify a model with parsnip] .right-column[ 1\. Pick a .display[model] 2\. Set the .display[engine] 3\. Set the .display[mode] (if needed) ] --- # 1\. Pick a .display[model] All available models are listed at <https://www.tidymodels.org/find/parsnip/> .center[ <iframe src="https://www.tidymodels.org/find/parsnip/" width="80%" height="400px"></iframe> ] --- # 2\. Set the .display[engine] We'll use `rpart` for building `C`lassification `A`nd `R`egression `T`rees ```r set_engine("rpart") ``` --- # 3\. Set the .display[mode] A character string for the model type (e.g. "classification" or "regression") ```r set_mode("classification") ``` --- class: middle, frame # .center[To specify a model with parsnip] ```r decision_tree() %>% set_engine("rpart") %>% set_mode("classification") ``` --- class: your-turn # Your turn 2 Fill in the blanks. Use the `tree_spec` model provided and `fit()` to: 1. Train a CART-based model with the formula = `remote ~ years_coded_job + salary`. 1. Remind yourself what the output looks like! 1. Predict remote status with the testing data. 1. Keep `set.seed(100)` at the start of your code.
--- class: your-turn .panelset[ .panel[.panel-name[Fit Model] ```r tree_spec <- decision_tree() %>% set_engine("rpart") %>% set_mode("classification") set.seed(100) # Important! tree_fit <- fit(tree_spec, remote ~ years_coded_job + salary, data = so_train) ``` ] .panel[.panel-name[View Model] ```r tree_fit #> parsnip model object #> #> Fit time: 7ms #> n= 864 #> #> node), split, n, loss, yval, (yprob) #> * denotes terminal node #> #> 1) root 864 432 Remote (0.5000000 0.5000000) #> 2) salary>=89196.97 329 103 Remote (0.6869301 0.3130699) * #> 3) salary< 89196.97 535 206 Not remote (0.3850467 0.6149533) #> 6) salary< 6423.433 40 16 Remote (0.6000000 0.4000000) * #> 7) salary>=6423.433 495 182 Not remote (0.3676768 0.6323232) * ``` ] .panel[.panel-name[Make Predictions] ```r predict(tree_fit, new_data = so_test) #> # A tibble: 286 x 1 #> .pred_class #> <fct> #> 1 Remote #> 2 Remote #> 3 Not remote #> 4 Not remote #> 5 Remote #> 6 Not remote #> 7 Remote #> 8 Not remote #> 9 Remote #> 10 Not remote #> # … with 276 more rows ``` ] ] --- class: middle, center, frame # Goal of Machine Learning ## 🔨 construct .display[models] that .fade[ ## 🔮 generate accurate .display[predictions] ## 🆕 for .display[future, yet-to-be-seen data] ] --- class: middle, center, frame # Goal of Machine Learning .fade[ ## 🔨 construct .display[models] that ## 🔮 generate accurate .display[predictions] ] ## 🆕 for .display[future, yet-to-be-seen data] --- class: middle, center, frame # Goal of Machine Learning .fade[ ## 🔨 construct .display[models] that ] ## 🔮 generate accurate .display[predictions] .fade[ ## 🆕 for .display[future, yet-to-be-seen data] ] --- class: middle, center, frame # Goal of Machine Learning .fade[ ## 🔨 construct .display[models] that ] ## 🎯 generate .display[accurate predictions] .fade[ ## 🆕 for .display[future, yet-to-be-seen data] ] --- class: your-turn # Your turn 3 Create a data frame of the observed and predicted remote status for the `so_test` data. 
Then use `count()` to count the number of individuals (i.e., rows) by their true and predicted remote status. Answer the following questions: 1. How many predictions did we make? 1. How many times is "remote" status predicted? 1. How many respondents are actually remote? 1. How many predictions did we get right? *Hint: You can create a 2x2 table using* `count(var1, var2)`
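---

# Hint: `count()` with two variables

A sketch of the 2x2 pattern on a small made-up data frame (the labels here are hypothetical and unrelated to the StackOverflow data):

```r
library(dplyr)

# Toy observed and predicted labels, made up for illustration
toy <- tibble(
  truth     = c("cat", "cat", "dog", "dog", "dog"),
  predicted = c("cat", "dog", "dog", "dog", "cat")
)

# One row per truth/prediction combination, with counts in `n`
toy %>% count(truth, predicted)
```

Counting two variables at once gives one row per combination, which is exactly the set of cells in a 2x2 truth table.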
--- class: your-turn .panelset[ .panel[.panel-name[Create Predictions] ```r tree_predict <- predict(tree_fit, new_data = so_test) all_preds <- so_test %>% select(remote) %>% bind_cols(tree_predict) all_preds #> # A tibble: 286 x 2 #> remote .pred_class #> <fct> <fct> #> 1 Remote Remote #> 2 Remote Remote #> 3 Remote Not remote #> 4 Remote Not remote #> 5 Remote Remote #> 6 Remote Not remote #> 7 Remote Remote #> 8 Remote Not remote #> 9 Remote Remote #> 10 Remote Not remote #> # … with 276 more rows ``` ] .panel[.panel-name[Evaluate Predictions] ```r all_preds %>% count(.pred_class, truth = remote) #> # A tibble: 4 x 3 #> .pred_class truth n #> <fct> <fct> <int> #> 1 Remote Remote 89 #> 2 Remote Not remote 40 #> 3 Not remote Remote 54 #> 4 Not remote Not remote 103 ``` ] ] --- # `conf_mat()` .big[ Creates a confusion matrix, or truth table, from a data frame with observed and predicted classes. ] ```r conf_mat(data, truth = remote, estimate = .pred_class) ``` --- ```r all_preds %>% conf_mat(truth = remote, estimate = .pred_class) #> Truth #> Prediction Remote Not remote #> Remote 89 40 #> Not remote 54 103 ``` --- ```r all_preds %>% conf_mat(truth = remote, estimate = .pred_class) %>% autoplot(type = "heatmap") ``` <img src="images/classification/plots/conf-map-heat-1.png" width="40%" style="display: block; margin: auto;" /> --- background-image: url(images/classification/metrics/metrics-01.png) background-position: 50% 90% background-size: 80% # Confusion matrix --- background-image: url(images/classification/metrics/metrics-02.png) background-position: 50% 90% background-size: 80% # Confusion matrix --- background-image: url(images/classification/metrics/metrics-03.png) background-position: 50% 90% background-size: 80% # Confusion matrix --- background-image: url(images/classification/metrics/metrics-04.png) background-position: 50% 90% background-size: 80% # Confusion matrix --- background-image: url(images/classification/metrics/metrics-05.png) 
background-position: 50% 90% background-size: 80% # Accuracy --- background-image: url(images/classification/metrics/metrics-06.png) background-position: 50% 90% background-size: 80% # Accuracy --- background-image: url(images/classification/metrics/metrics-07.png) background-position: 50% 90% background-size: 80% # Accuracy --- background-image: url(images/classification/metrics/metrics-01.png) background-position: 50% 90% background-size: 80% # Sensitivity vs. Specificity --- background-image: url(images/classification/metrics/metrics-08.png) background-position: 50% 90% background-size: 80% # Sensitivity --- background-image: url(images/classification/metrics/metrics-09.png) background-position: 50% 90% background-size: 80% # Sensitivity --- background-image: url(images/classification/metrics/metrics-10.png) background-position: 50% 90% background-size: 80% # Specificity --- background-image: url(images/classification/metrics/metrics-11.png) background-position: 50% 90% background-size: 80% # Specificity --- # Metrics All available metrics are listed at <https://yardstick.tidymodels.org/articles/metric-types.html#metrics> .center[ <iframe src="https://yardstick.tidymodels.org/articles/metric-types.html#metrics" width="80%" height="400px"></iframe> ] --- # Calculating metrics ```r accuracy(all_preds, truth = remote, estimate = .pred_class) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.671 sensitivity(all_preds, truth = remote, estimate = .pred_class) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 sens binary 0.622 specificity(all_preds, truth = remote, estimate = .pred_class) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 spec binary 0.720 ``` --- # `metric_set()` .big[Combines multiple metric functions into a single function that calculates all of them at once.] 
```r so_metrics <- metric_set(accuracy, sensitivity, specificity) so_metrics(all_preds, truth = remote, estimate = .pred_class) #> # A tibble: 3 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.671 #> 2 sens binary 0.622 #> 3 spec binary 0.720 ``` --- # `roc_curve()` .big[ Takes predicted class probabilities, returns a tibble of sensitivity and specificity at each probability threshold. ] ```r roc_curve(all_preds, truth = remote, estimate = .pred_Remote) ``` .big[ Truth = the .display[observed] class Estimate = the .display[probability] of the target response ] ??? We don't have `.pred_Remote`. How do we get that? --- ```r all_preds <- so_test %>% select(remote) %>% bind_cols(predict(tree_fit, new_data = so_test)) %>% bind_cols(predict(tree_fit, new_data = so_test, type = "prob")) all_preds #> # A tibble: 286 x 4 #> remote .pred_class .pred_Remote `.pred_Not remote` #> <fct> <fct> <dbl> <dbl> #> 1 Remote Remote 0.687 0.313 #> 2 Remote Remote 0.687 0.313 #> 3 Remote Not remote 0.368 0.632 #> 4 Remote Not remote 0.368 0.632 #> 5 Remote Remote 0.687 0.313 #> 6 Remote Not remote 0.368 0.632 #> 7 Remote Remote 0.687 0.313 #> 8 Remote Not remote 0.368 0.632 #> 9 Remote Remote 0.687 0.313 #> 10 Remote Not remote 0.368 0.632 #> # … with 276 more rows ``` --- # `roc_curve()` ```r roc_curve(all_preds, truth = remote, estimate = .pred_Remote) #> # A tibble: 5 x 3 #> .threshold specificity sensitivity #> <dbl> <dbl> <dbl> #> 1 -Inf 0 1 #> 2 0.368 0 1 #> 3 0.6 0.720 0.622 #> 4 0.687 0.762 0.573 #> 5 Inf 1 0 ``` ??? `.threshold` = probability threshold needed to place an individual in the class. --- class: your-turn # Your turn 4 .big[ Build the necessary data frame, and use `roc_curve()` to calculate the data needed to construct the full ROC curve. What is the necessary threshold for achieving specificity > .75? ]
--- class: your-turn ```r all_preds <- so_test %>% select(remote) %>% bind_cols(predict(tree_fit, new_data = so_test)) %>% bind_cols(predict(tree_fit, new_data = so_test, type = "prob")) roc_curve(all_preds, truth = remote, estimate = .pred_Remote) #> # A tibble: 5 x 3 #> .threshold specificity sensitivity #> <dbl> <dbl> <dbl> #> 1 -Inf 0 1 #> 2 0.368 0 1 #> 3 0.6 0.720 0.622 #> 4 0.687 0.762 0.573 #> 5 Inf 1 0 ``` ??? For specificity of .75, we need a threshold of .687. --- .panelset[ .panel[.panel-name[Plot Code] ```r roc_curve(all_preds, truth = remote, estimate = .pred_Remote) %>% ggplot(mapping = aes(x = 1 - specificity, y = sensitivity)) + geom_line(color = "midnightblue", size = 1.5) + geom_abline(lty = 2, alpha = 0.5, color = "gray50", size = 1.2) ``` ] .panel[.panel-name[Plot] <img src="images/classification/plots/yt-roccurve-plot-1.png" width="50%" style="display: block; margin: auto;" /> ] ] --- ```r roc_curve(all_preds, truth = remote, estimate = .pred_Remote) %>% autoplot() ``` <img src="images/classification/plots/roccurve-autoplot-1.png" width="50%" style="display: block; margin: auto;" /> --- # Area under the curve .pull-left[ <img src="images/classification/plots/good-rocauc-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ * AUC = 0.5: random guessing * AUC = 1: perfect classifier * In general, an AUC above 0.8 is considered "good" * {yardstick} metric: `roc_auc()` ] --- # ROC curve: Guessing <img src="images/classification/plots/guess-rocauc-1.png" width="70%" style="display: block; margin: auto;" /> --- # ROC curve: Perfect <img src="images/classification/plots/perfect-rocauc-1.png" width="70%" style="display: block; margin: auto;" /> --- # ROC curve: Poor <img src="images/classification/plots/poor-rocauc-1.png" width="70%" style="display: block; margin: auto;" /> --- # ROC curve: OK <img src="images/classification/plots/ok-rocauc-1.png" width="70%" style="display: block; margin: auto;" /> --- # ROC curve: Good <img 
src="images/classification/plots/print-good-rocauc-1.png" width="70%" style="display: block; margin: auto;" /> --- class: your-turn # Your turn 5 .big[ Use `roc_auc()` to calculate the area under the ROC curve. Then plot the ROC curve using `autoplot()`. ]
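---

# What "area under the curve" means

The area is an integral over the curve's (1 - specificity, sensitivity) points, and one common way to approximate it is the trapezoid rule. A minimal base R sketch on made-up coordinates (these are not the StackOverflow model's values):

```r
# Hypothetical ROC coordinates: x = 1 - specificity, y = sensitivity
x <- c(0, 0.2, 0.5, 1)
y <- c(0, 0.6, 0.8, 1)

# Trapezoid rule: area of each vertical strip between consecutive points
auc_by_hand <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc_by_hand
```

In practice, use `roc_auc()`, which handles ties and curve direction for you.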
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r roc_auc(all_preds, truth = remote, estimate = .pred_Remote) #> # A tibble: 1 x 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 roc_auc binary 0.678 roc_curve(all_preds, truth = remote, estimate = .pred_Remote) %>% autoplot() ``` ] .panel[.panel-name[Plot] <img src="images/classification/plots/yt-rocauc-sol-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- class: title-slide, center # Classification <img src="images/classification/pred-hex.png" width="40%" style="display: block; margin: auto;" /> ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]