class: title-slide, center

<span class="fa-stack fa-4x">
  <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i>
  <strong class="fa-stack-1x" style="color:#009FB7;">12</strong>
</span>

# Workflows

## Tidy Data Science with the Tidyverse and Tidymodels

### W. Jake Thompson

#### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021)

.footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]

<div style = "position:fixed; visibility: hidden">
`$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$`
`$$\require{color}\definecolor{light_blue}{rgb}{0.0392156862745098, 0.870588235294118, 1}$$`
`$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$`
`$$\require{color}\definecolor{dark_yellow}{rgb}{0.635294117647059, 0.47843137254902, 0.00392156862745098}$$`
`$$\require{color}\definecolor{pink}{rgb}{0.796078431372549, 0.16078431372549, 0.482352941176471}$$`
`$$\require{color}\definecolor{light_pink}{rgb}{1, 0.552941176470588, 0.776470588235294}$$`
`$$\require{color}\definecolor{grey}{rgb}{0.411764705882353, 0.403921568627451, 0.450980392156863}$$`
</div>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: {
    Macros: {
      blue: ["{\\color{blue}{#1}}", 1],
      light_blue: ["{\\color{light_blue}{#1}}", 1],
      yellow: ["{\\color{yellow}{#1}}", 1],
      dark_yellow: ["{\\color{dark_yellow}{#1}}", 1],
      pink: ["{\\color{pink}{#1}}", 1],
      light_pink: ["{\\color{light_pink}{#1}}", 1],
      grey: ["{\\color{grey}{#1}}", 1]
    },
    loader: {load: ['[tex]/color']},
    tex: {packages: {'[+]': ['color']}}
  }
});
</script>

---
class: your-turn

# Your Turn 0

.big[
* Open the R Notebook **materials/exercises/12-workflows.Rmd**
* Run the setup chunk
]
01:00
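---

# Setup sketch

The slides that follow use `ames`, `ames_split`, `ames_train`, `ames_test`, and `lm_spec` without defining them. Below is a minimal sketch of a setup that would create those objects. It assumes the `ames` data from the modeldata package, and the seed is arbitrary; the notebook's actual setup chunk may differ.

```r
library(tidymodels)  # attaches rsample, parsnip, workflows, tune, yardstick, dplyr, ...

# Assumption: `ames` is the housing data shipped with the modeldata package
data(ames, package = "modeldata")

set.seed(100)  # hypothetical seed, only so the split is reproducible

# Hold out a testing set before doing anything else with the data
ames_split <- initial_split(ames)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# A parsnip specification for ordinary linear regression
lm_spec <- linear_reg() %>%
  set_engine("lm")
```

Splitting first means every later step can be limited to `ames_train`, which is the habit the rest of this deck relies on.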
---
background-image: url(images/workflows/daan-mooij-91LGCVN5SAI-unsplash.jpg)
background-size: cover

???

Data analysis as a pipeline. Just as with water pipelines there is a threat of leakage, in data analysis there can be...

---
class: middle, center, inverse

# ⚠️ Data Leakage ⚠️

---
class: pop-quiz

# Pop quiz!

What will this code do?

```r
ames_zsplit <- ames %>%
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price)) %>%
  initial_split()
```

--

```
#> # A tibble: 2,198 x 2
#>    Sale_Price  z_price
#>         <int>    <dbl>
#>  1     105000 -0.949
#>  2     172000 -0.110
#>  3     244000  0.791
#>  4     213500  0.409
#>  5     191500  0.134
#>  6     236500  0.697
#>  7     189000  0.103
#>  8     175900 -0.0613
#>  9     185000  0.0526
#> 10     180400 -0.00496
#> # … with 2,188 more rows
```

---
class: pop-quiz

# Pop quiz!

What could go wrong?

1. Take the `mean` and `sd` of `Sale_Price`

1. Transform all sale prices in `ames`

1. Train with training set

1. Predict sale prices with testing set

???

Training and testing data are not independent!

---
class: pop-quiz

# What (else) could go wrong?

```r
ames_train <- training(ames_split) %>%
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price))

ames_test <- testing(ames_split) %>%
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price))

lm_fit <- fit(lm_spec,
              Sale_Price ~ Gr_Liv_Area,
              data = ames_train)

price_pred <- lm_fit %>%
  predict(new_data = ames_test) %>%
  mutate(price_truth = ames_test$Sale_Price)

rmse(price_pred, truth = price_truth, estimate = .pred)
```

---

# Better

1. Split the data

1. Transform training set sale prices based on `mean` and `sd` of `Sale_Price` of the training set

1. Train with training set

1. Transform testing set sale prices based on `mean` and `sd` of `Sale_Price` of the **training set**

1. Predict sale prices with testing set

---
class: middle, center, frame

# Data Leakage

"When the data you are using to train a machine learning algorithm happens to have the information you are trying to predict."
.footnote[Daniel Gutierrez, [Ask a Data Scientist: Data Leakage](http://insidebigdata.com/2014/11/26/ask-data-scientist-data-leakage/)]

---
class: middle, center, frame

# Axiom

Your learner is more than a model.

---
class: middle, center, frame

# Lemma #1

Your learner is more than a model.

Your learner is only as good as your data.

---
class: middle, center, frame

# Lemma #2

Your learner is more than a model.

Your learner is only as good as your data.

Your data is only as good as your workflow.

---
class: middle, center, frame

# **Revised** Goal of Machine Learning

--

## 🛠 build reliable .display[workflows] that

--

## 🎯 generate .display[accurate predictions]

--

## 🆕 for .display[future, yet-to-be-seen data]

---
class: pop-quiz

# Pop quiz!

.big[What does .display[GIGO] stand for?]

--

.big[Garbage in, garbage out]

---
class: center middle frame

# Axiom

Feature engineering and modeling are two halves of a single predictive workflow.

---
background-image: url(images/workflows/model-wf/wf-01.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
background-image: url(images/workflows/model-wf/wf-02.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
background-image: url(images/workflows/model-wf/wf-03.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
background-image: url(images/workflows/model-wf/wf-04.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
background-image: url(images/workflows/model-wf/wf-05.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
background-image: url(images/workflows/model-wf/wf-06.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
background-image: url(images/workflows/model-wf/wf-07.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
background-image: url(images/workflows/model-wf/wf-08.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
background-image: url(images/workflows/model-wf/wf-09.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
background-image: url(images/workflows/model-wf/wf-10.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
background-image: url(images/workflows/model-wf/wf-11.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
background-image: url(images/workflows/model-wf/wf-12.png)
background-position: center 85%
background-size: 70%

# Machine Learning

---
class: center middle inverse

# Workflows

---

<div class="hex-book">
  <a href="https://workflows.tidymodels.org/">
    <img class="hex" src="images/hex/workflows.png">
  </a>
  <a href="https://www.tmwr.org/workflows.html">
    <img class="book" src="images/books/tmwr-workflows.png">
  </a>
</div>

---

# `workflow()`

Creates an empty workflow to which a model (and more) can be added

```r
workflow()
```

---

# `add_formula()`

Adds a formula to a workflow `*`

```r
workflow() %>% add_formula(Sale_Price ~ Year)
```

.footnote[`*` If you do not plan to do your own preprocessing]

---

# `add_model()`

Adds a parsnip model spec to a workflow

```r
workflow() %>% add_model(lm_spec)
```

---
background-image: url(images/workflows/zestimate.png)
background-position: center
background-size: contain

---
class: your-turn

# Your Turn 1

Build a workflow that uses a linear model to predict `Sale_Price` with `Bedroom_AbvGr`, `Full_Bath` and `Half_Bath` in `ames`. Save it as `bb_wf`.
03:00
---
class: your-turn

```r
lm_spec <- linear_reg() %>%
  set_engine("lm")

bb_wf <- workflow() %>%
  add_formula(Sale_Price ~ Bedroom_AbvGr + Full_Bath + Half_Bath) %>%
  add_model(lm_spec)
```

---

```r
bb_wf
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Sale_Price ~ Bedroom_AbvGr + Full_Bath + Half_Bath
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm
```

---

`fit()` and `fit_resamples()` also use workflows. Pass a workflow in place of a formula and model.

.pull-left[
```r
fit(
* lm_spec,
* Sale_Price ~ Bedroom_AbvGr +
*   Full_Bath + Half_Bath,
  data = ames_train
)
```
]

.pull-right[
```r
fit(
* bb_wf,
  data = ames_train
)
```
]

---

# `update_formula()`

Removes the formula, then replaces it with the new one.

```r
workflow() %>% update_formula(Sale_Price ~ Bedroom_AbvGr)
#> Warning: The workflow has no formula preprocessor to remove.
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: None
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Sale_Price ~ Bedroom_AbvGr
```

---
class: your-turn

# Your Turn 2

Test the linear model that predicts `Sale_Price` with everything else in `ames`. Use cross validation to estimate the RMSE.

1. Create a new workflow by updating `bb_wf`.

1. Use `vfold_cv()` to create a 10-fold cross validation of `ames_train`.

1. Fit the workflow to the folds.

1. Use `collect_metrics()` to estimate the RMSE.
05:00
---
class: your-turn

```r
all_wf <- bb_wf %>%
  update_formula(Sale_Price ~ .)

ames_folds <- vfold_cv(ames_train, v = 10)

fit_resamples(all_wf, resamples = ames_folds) %>%
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator      mean     n   std_err .config
#>   <chr>   <chr>          <dbl> <int>     <dbl> <chr>
#> 1 rmse    standard   40999.       10 5005.     Preprocessor1_Model1
#> 2 rsq     standard       0.759    10    0.0442 Preprocessor1_Model1
```

---

# `update_model()`

Removes the model spec, then replaces it with the new one.

```r
workflow() %>% update_model(knn_spec)
```

---
class: your-turn

# Your Turn 3

Fill in the blanks to test the regression tree model that predicts `Sale_Price` with _everything else in `ames`_ on `ames_folds`. What RMSE do you get?

Hint: Create a new workflow by updating `all_wf`.
05:00
---
class: your-turn

```r
rt_spec <- decision_tree() %>%
  set_engine(engine = "rpart") %>%
  set_mode("regression")

rt_wf <- all_wf %>%
  update_model(rt_spec)

fit_resamples(rt_wf, resamples = ames_folds) %>%
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator      mean     n   std_err .config
#>   <chr>   <chr>          <dbl> <int>     <dbl> <chr>
#> 1 rmse    standard   41498.       10 1726.     Preprocessor1_Model1
#> 2 rsq     standard       0.726    10    0.0230 Preprocessor1_Model1
```

---
class: your-turn

# Your Turn 4

But what about the predictions of our model?

Save the fitted object from your regression tree, and use `collect_predictions()` to see the predictions generated for the held-out data in each fold.
03:00
---
class: your-turn

```r
all_fitwf <- fit_resamples(
  rt_wf,
  resamples = ames_folds,
* control = control_resamples(save_pred = TRUE)
)

all_fitwf %>%
  collect_predictions()
#> # A tibble: 2,198 x 5
#>    id       .pred  .row Sale_Price .config
#>    <chr>    <dbl> <int>      <int> <chr>
#>  1 Fold01 166955.    10     171500 Preprocessor1_Model1
#>  2 Fold01 166955.    13     141000 Preprocessor1_Model1
#>  3 Fold01 130970.    18     149000 Preprocessor1_Model1
#>  4 Fold01  98159.    21     115000 Preprocessor1_Model1
#>  5 Fold01 385912.    31     395192 Preprocessor1_Model1
#>  6 Fold01 238020.    42     205000 Preprocessor1_Model1
#>  7 Fold01 238020.    57     221500 Preprocessor1_Model1
#>  8 Fold01 160724.    75     169000 Preprocessor1_Model1
#>  9 Fold01 231861.    96     205000 Preprocessor1_Model1
#> 10 Fold01 130970.   108     172500 Preprocessor1_Model1
#> # … with 2,188 more rows
```

---
class: title-slide, center

# Workflows

<img src="images/hex/workflows.png" width="20%" style="display: block; margin: auto;" />

## Tidy Data Science with the Tidyverse and Tidymodels

### W. Jake Thompson

#### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021)

.footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]
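---

# Appendix sketch: scaling without leakage

The steps on the "Better" slide could be written out roughly as below. This is a sketch, not the notebook's code: it assumes the `ames` data from the modeldata package, the seed is arbitrary, and the names `train_mean` and `train_sd` are made up for illustration.

```r
library(tidymodels)

# Assumption: `ames` is the housing data shipped with the modeldata package
data(ames, package = "modeldata")

set.seed(100)  # hypothetical seed

# 1. Split the data before computing any statistics
ames_split <- initial_split(ames)

# 2. Learn the scaling constants from the training set only
train_mean <- mean(training(ames_split)$Sale_Price)
train_sd   <- sd(training(ames_split)$Sale_Price)

ames_train <- training(ames_split) %>%
  mutate(z_price = (Sale_Price - train_mean) / train_sd)

# 3. (Train a model on ames_train here)

# 4. Scale the testing set with the *training* mean and sd
ames_test <- testing(ames_split) %>%
  mutate(z_price = (Sale_Price - train_mean) / train_sd)

# 5. (Predict on ames_test here)
```

Both sets are scaled with numbers learned from the training set alone, so nothing about the testing set reaches the model before prediction.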