class: title-slide, center

# Workflows

## Tidy Data Science with the Tidyverse and Tidymodels

### W. Jake Thompson

#### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) &#183; [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021)

.footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]

<div style = "position:fixed; visibility: hidden">
  `$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$`
  `$$\require{color}\definecolor{light_blue}{rgb}{0.0392156862745098, 0.870588235294118, 1}$$`
  `$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$`
  `$$\require{color}\definecolor{dark_yellow}{rgb}{0.635294117647059, 0.47843137254902, 0.00392156862745098}$$`
  `$$\require{color}\definecolor{pink}{rgb}{0.796078431372549, 0.16078431372549, 0.482352941176471}$$`
  `$$\require{color}\definecolor{light_pink}{rgb}{1, 0.552941176470588, 0.776470588235294}$$`
  `$$\require{color}\definecolor{grey}{rgb}{0.411764705882353, 0.403921568627451, 0.450980392156863}$$`
</div>
  
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    TeX: {
      Macros: {
        blue: ["{\\color{blue}{#1}}", 1],
        light_blue: ["{\\color{light_blue}{#1}}", 1],
        yellow: ["{\\color{yellow}{#1}}", 1],
        dark_yellow: ["{\\color{dark_yellow}{#1}}", 1],
        pink: ["{\\color{pink}{#1}}", 1],
        light_pink: ["{\\color{light_pink}{#1}}", 1],
        grey: ["{\\color{grey}{#1}}", 1]
      },
      loader: {load: ['[tex]/color']},
      tex: {packages: {'[+]': ['color']}}
    }
  });
</script>

---
class: your-turn

# Your Turn 0

.big[
* Open the R Notebook **materials/exercises/12-workflows.Rmd**
* Run the setup chunk
]

---
background-image: url(images/workflows/daan-mooij-91LGCVN5SAI-unsplash.jpg)
background-size: cover

???

Data analysis as a pipeline. Just as water pipelines can leak, in data analysis there can be...

---
class: middle, center, inverse

# ⚠️ Data Leakage ⚠️

---
class: pop-quiz

# Pop quiz!

What will this code do?

```r
ames_zsplit <- ames %>% 
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price)) %>% 
  initial_split()
```

```
#> # A tibble: 2,198 x 2
#>    Sale_Price  z_price
#>         <int>    <dbl>
#>  1     105000 -0.949  
#>  2     172000 -0.110  
#>  3     244000  0.791  
#>  4     213500  0.409  
#>  5     191500  0.134  
#>  6     236500  0.697  
#>  7     189000  0.103  
#>  8     175900 -0.0613 
#>  9     185000  0.0526 
#> 10     180400 -0.00496
#> # … with 2,188 more rows
```

---
class: pop-quiz

# Pop quiz!

What could go wrong?

1. Take the `mean` and `sd` of `Sale_Price`

1. Transform all sale prices in `ames`

1. Train with training set

1. Predict sale prices with testing set

???

Training and testing data are not independent!

---
class: pop-quiz

# What (else) could go wrong?

```r
ames_train <- training(ames_split) %>% 
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price))

ames_test <- testing(ames_split) %>% 
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price))

lm_fit <- fit(lm_spec,
              Sale_Price ~ Gr_Liv_Area, 
              data = ames_train)

price_pred <- lm_fit %>% 
  predict(new_data = ames_test) %>% 
  mutate(price_truth = ames_test$Sale_Price)

rmse(price_pred, truth = price_truth, estimate = .pred)
```

---

# Better

1. Split the data

1. Transform training set sale prices based on `mean` and `sd` of `Sale_Price` of the training set

1. Train with training set

1. Transform testing set sale prices based on `mean` and `sd` of `Sale_Price` of the **training set**

1. Predict sale prices with testing set
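
---

# Better, sketched in code

A minimal sketch of the five steps, not a definitive implementation. It assumes `ames` is the copy shipped with the modeldata package and that the model is a plain linear regression on `Gr_Liv_Area`. The seed and the names `train_mean`, `train_sd`, and `z_truth` are illustrative.

```r
library(tidymodels)

# Assumption: `ames` comes from the modeldata package
data(ames, package = "modeldata")

# 1. Split the data (the seed is arbitrary, only for reproducibility)
set.seed(100)
ames_split <- initial_split(ames)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# 2. Transform the training set using training-set statistics
train_mean <- mean(ames_train$Sale_Price)
train_sd   <- sd(ames_train$Sale_Price)

ames_train <- ames_train %>% 
  mutate(z_price = (Sale_Price - train_mean) / train_sd)

# 3. Train with the training set
lm_spec <- linear_reg() %>% 
  set_engine("lm")

lm_fit <- fit(lm_spec,
              z_price ~ Gr_Liv_Area, 
              data = ames_train)

# 4. Transform the testing set with the *training-set* mean and sd
ames_test <- ames_test %>% 
  mutate(z_price = (Sale_Price - train_mean) / train_sd)

# 5. Predict with the testing set and score on the z scale
price_pred <- lm_fit %>% 
  predict(new_data = ames_test) %>% 
  mutate(z_truth = ames_test$z_price)

rmse(price_pred, truth = z_truth, estimate = .pred)
```

Because `train_mean` and `train_sd` are computed before the testing set is touched, no information from the testing set reaches the model.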