class: title-slide, center

# Resampling

## Tidy Data Science with the Tidyverse and Tidymodels

### W. Jake Thompson

#### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) &#183; [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021)

.footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]

<div style = "position:fixed; visibility: hidden">
  `$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$`
  `$$\require{color}\definecolor{light_blue}{rgb}{0.0392156862745098, 0.870588235294118, 1}$$`
  `$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$`
  `$$\require{color}\definecolor{dark_yellow}{rgb}{0.635294117647059, 0.47843137254902, 0.00392156862745098}$$`
  `$$\require{color}\definecolor{pink}{rgb}{0.796078431372549, 0.16078431372549, 0.482352941176471}$$`
  `$$\require{color}\definecolor{light_pink}{rgb}{1, 0.552941176470588, 0.776470588235294}$$`
  `$$\require{color}\definecolor{grey}{rgb}{0.411764705882353, 0.403921568627451, 0.450980392156863}$$`
</div>
  
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    TeX: {
      Macros: {
        blue: ["{\\color{blue}{#1}}", 1],
        light_blue: ["{\\color{light_blue}{#1}}", 1],
        yellow: ["{\\color{yellow}{#1}}", 1],
        dark_yellow: ["{\\color{dark_yellow}{#1}}", 1],
        pink: ["{\\color{pink}{#1}}", 1],
        light_pink: ["{\\color{light_pink}{#1}}", 1],
        grey: ["{\\color{grey}{#1}}", 1]
      },
      loader: {load: ['[tex]/color']},
      tex: {packages: {'[+]': ['color']}}
    }
  });
</script>

---
exclude: true

---
class: your-turn

# Your turn 0

.big[
* Open the R Notebook **materials/exercises/10-resampling.Rmd**
* Run the setup chunk
]

---
background-image: url(images/resampling/unicorns-rainbows/joshua-hoehne-wnHeb_pRJBo-unsplash.jpg)
background-size: cover

---
background-image: url(images/resampling/unicorns-rainbows/unicorns.001.jpeg)
background-size: cover

---
background-image: url(images/resampling/unicorns-rainbows/unicorns.002.jpeg)
background-size: cover

---
background-image: url(images/resampling/unicorns-rainbows/unicorns.003.jpeg)
background-size: cover

---
background-image: url(images/resampling/unicorns-rainbows/unicorns.004.jpeg)
background-size: cover

---
background-image: url(images/resampling/unicorns-rainbows/unicorns.005.jpeg)
background-size: cover

---
class: frame, middle, center

# Hypothesis

.big[
As the number of 🦄 increases, so does the number of 🌈.
]

---

---

---

---

---
class: inverse, middle, center

# The Challenge

---

---

---

---

---
class: middle, center, frame

# The Solution

Random Sampling

---

---

---

---

---

---

---
class: middle, center, frame

# The New Challenge

Sample Variation

---

---

---

---

---
<img src="images/resampling/plots/plot-many-big-resamples-1.png" width="90%" style="display: block; margin: auto;" />

---

---

---

---

---

---
<img src="images/resampling/plots/plot-many-poly-1.png" width="90%" style="display: block; margin: auto;" />

---
class: middle, center, frame

# The good news

You don't have to collect more data.

You don't have to sacrifice fit for flexibility.

---

---

---

---

---

---

---

---

---
class: middle

.pull-left[
<img src="images/resampling/plots/many-samples-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="images/resampling/plots/many-bootstraps-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: middle

.pull-left[
<img src="images/resampling/plots/many-big-samples-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="images/resampling/plots/many-big-bootstraps-1.png" width="100%" style="display: block; margin: auto;" />
]

---

---
<div class="hex-book">
  <a href="https://rsample.tidymodels.org/">
    <img class="hex" src="images/hex/rsample.png">
  </a>
  <a href="https://www.tmwr.org/resampling.html">
    <img class="book" src="images/books/tmwr-rsample.png">
  </a>
</div>

---
class: center middle

.large-left[
# [Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/)
]

.small-right[
<img src="images/hex/palmerpenguins.png" width="100%" style="display: block; margin: auto;" />
]

---
background-image: url(images/resampling/penguins.png)
background-size: cover

.footnote[Artwork by [@allison_horst](https://twitter.com/allison_horst)]

???
[Ah]-dell-ee

---
.big[What is the correlation between bill length and bill depth?]

```r
library(palmerpenguins)
penguins
#> # A tibble: 344 x 8
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torgersen           39.1          18.7               181        3750
#>  2 Adelie  Torgersen           39.5          17.4               186        3800
#>  3 Adelie  Torgersen           40.3          18                 195        3250
#>  4 Adelie  Torgersen           NA            NA                  NA          NA
#>  5 Adelie  Torgersen           36.7          19.3               193        3450
#>  6 Adelie  Torgersen           39.3          20.6               190        3650
#>  7 Adelie  Torgersen           38.9          17.8               181        3625
#>  8 Adelie  Torgersen           39.2          19.6               195        4675
#>  9 Adelie  Torgersen           34.1          18.1               193        3475
#> 10 Adelie  Torgersen           42            20.2               190        4250
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
```

---
class: plain-white
background-image: url(images/resampling/culmen_depth.png)
background-size: 90%
background-position: center middle

.footnote[Artwork by [@allison_horst](https://twitter.com/allison_horst)]

---
# `bootstraps()`

.big[Create bootstrap samples from a data set.]

```r
library(rsample) # provides bootstraps(); also loaded by library(tidymodels)

set.seed(100) # Important!
penguin_boot <- bootstraps(penguins, times = 25)
```
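
Why the seed matters, as a minimal base-R sketch: resampling draws random row indices, so resetting the seed before a draw reproduces that draw. The vectors `draw_a`, `draw_b`, and `draw_c` are illustrative names, not part of the exercises.

```r
# Same seed, same random draw
set.seed(100)
draw_a <- sample(344, size = 5, replace = TRUE)

set.seed(100)
draw_b <- sample(344, size = 5, replace = TRUE)

# A different seed starts a different random sequence
set.seed(101)
draw_c <- sample(344, size = 5, replace = TRUE)

identical(draw_a, draw_b)
#> [1] TRUE
```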

---
class: your-turn

# Your turn 1

.big[
Use `bootstraps()` to create 100 bootstrap samples of the `penguins` data.

Save the bootstrap samples as `penguin_boot`.

Keep `set.seed(100)`!
]

---
class: your-turn

```r
set.seed(100)
penguin_boot <- bootstraps(penguins, times = 100)
penguin_boot
#> # Bootstrap sampling 
#> # A tibble: 100 x 2
#>    splits            id          
#>    <list>            <chr>       
#>  1 <split [344/134]> Bootstrap001
#>  2 <split [344/126]> Bootstrap002
#>  3 <split [344/124]> Bootstrap003
#>  4 <split [344/129]> Bootstrap004
#>  5 <split [344/127]> Bootstrap005
#>  6 <split [344/124]> Bootstrap006
#>  7 <split [344/127]> Bootstrap007
#>  8 <split [344/125]> Bootstrap008
#>  9 <split [344/128]> Bootstrap009
#> 10 <split [344/123]> Bootstrap010
#> # … with 90 more rows
```

???

What is a `<list>`?

---

```r
penguin_boot
#> # Bootstrap sampling 
#> # A tibble: 100 x 2
#>    splits            id          
#>    <list>            <chr>       
#>  1 <split [344/134]> Bootstrap001
#>  2 <split [344/126]> Bootstrap002
#>  3 <split [344/124]> Bootstrap003
#>  4 <split [344/129]> Bootstrap004
#>  5 <split [344/127]> Bootstrap005
#>  6 <split [344/124]> Bootstrap006
#>  7 <split [344/127]> Bootstrap007
#>  8 <split [344/125]> Bootstrap008
#>  9 <split [344/128]> Bootstrap009
#> 10 <split [344/123]> Bootstrap010
#> # … with 90 more rows
```

<code class ='r hljs remark-code'>penguin_boot$splits<span style="background-color:#FED766;color:#009FB7">[[1]]</span><br>#> <Analysis/Assess/Total><br>#> <344/134/344></code>

---
# Anatomy of a split

.big[`<344/134/344>`]

.big[`<`.yellow-highlight[`344`]`/134/344>` >>> Size of resample (analysis set)]

.big[`<344/`.yellow-highlight[`134`]`/344>` >>> Size of holdout/unused data (assessment set)]

.big[`<344/134/`.yellow-highlight[`344`]`>` >>> Total size of data set]

---
# Split data

.smaller[

```r
boot1 <- penguin_boot$splits[[1]]
boot1
#> <Analysis/Assess/Total>
#> <344/134/344>
```
]

.pull-left[
### `analysis()`

.smaller[

```r
analysis(boot1)
#> # A tibble: 344 x 8
#>    species   island    bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>     <fct>              <dbl>         <dbl>            <int>       <int>
#>  1 Gentoo    Biscoe              45.2          15.8              215        5300
#>  2 Adelie    Biscoe              45.6          20.3              191        4600
#>  3 Gentoo    Biscoe              50.1          15                225        5000
#>  4 Adelie    Torgersen           NA            NA                 NA          NA
#>  5 Chinstrap Dream               49.7          18.6              195        3600
#>  6 Chinstrap Dream               49.8          17.3              198        3675
#>  7 Adelie    Dream               40.3          18.5              196        4350
#>  8 Adelie    Torgersen           38.9          17.8              181        3625
#>  9 Gentoo    Biscoe              47.3          15.3              222        5250
#> 10 Chinstrap Dream               43.2          16.6              187        2900
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
```
]
]

.pull-right[
### `assessment()`

.smaller[

```r
assessment(boot1)
#> # A tibble: 134 x 8
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torgersen           36.7          19.3               193        3450
#>  2 Adelie  Torgersen           39.3          20.6               190        3650
#>  3 Adelie  Torgersen           39.2          19.6               195        4675
#>  4 Adelie  Torgersen           34.1          18.1               193        3475
#>  5 Adelie  Torgersen           42            20.2               190        4250
#>  6 Adelie  Torgersen           38.7          19                 195        3450
#>  7 Adelie  Torgersen           42.5          20.7               197        4500
#>  8 Adelie  Biscoe              37.7          18.7               180        3600
#>  9 Adelie  Biscoe              38.2          18.1               185        3950
#> 10 Adelie  Biscoe              40.6          18.6               183        3550
#> # … with 124 more rows, and 2 more variables: sex <fct>, year <int>
```
]
]

---
class: pop-quiz

# Pop quiz!

Why is the assessment set a different size for each bootstrap resample?

```
#> # Bootstrap sampling 
#> # A tibble: 100 x 2
#>    splits            id          
#>    <list>            <chr>       
#>  1 <split [344/134]> Bootstrap001
#>  2 <split [344/126]> Bootstrap002
#>  3 <split [344/124]> Bootstrap003
#>  4 <split [344/129]> Bootstrap004
#>  5 <split [344/127]> Bootstrap005
#>  6 <split [344/124]> Bootstrap006
#>  7 <split [344/127]> Bootstrap007
#>  8 <split [344/125]> Bootstrap008
#>  9 <split [344/128]> Bootstrap009
#> 10 <split [344/123]> Bootstrap010
#> # … with 90 more rows
```
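
???

One way to explore the answer, as a base-R sketch that assumes 344 rows (the size of `penguins`): each bootstrap resample draws rows with replacement, so the number of rows never drawn, which become the assessment set, changes from draw to draw. The names below are illustrative, and the simulated count will not match the `bootstraps()` output exactly.

```r
n <- 344  # rows in penguins

set.seed(100)
# Draw n row indices with replacement, as a bootstrap resample does
drawn <- sample(n, size = n, replace = TRUE)

# Rows that were never drawn form the assessment (out-of-bag) set
n_assess <- n - length(unique(drawn))

# Expected number of rows left out: n * (1 - 1/n)^n, roughly 37% of n
expected_assess <- n * (1 - 1 / n)^n
round(expected_assess)
#> [1] 126
```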

---
# Correlation

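To estimate the correlation for a data set, use the `cor()` function. Because `penguins` contains missing values, `cor()` returns `NA` unless incomplete rows are dropped with `use = "complete.obs"`.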

```r
boot1 <- penguin_boot$splits[[1]]
boot_sample <- analysis(boot1)

cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm)
#> [1] NA
```

```r
cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,
    use = "complete.obs")
#> [1] -0.1249339
```

---
class: your-turn

# Your turn 2

.big[Complete the code to calculate the correlation for the fifth bootstrap sample.]

<code class ='r hljs remark-code'>boot5 <- <span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><br>boot_sample <- <span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>(boot5)<br><br>cor(boot_sample$<span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>, boot_sample$<span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>,<br>&nbsp;&nbsp;&nbsp;&nbsp;use = <span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>)</code>

---
class: your-turn

```r
boot5 <- penguin_boot$splits[[5]]
boot_sample <- analysis(boot5)

cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,
    use = "complete.obs")
#> [1] -0.2450851
```

---
class: center middle inverse

# Automation

---
<div class="hex-book">
  <a href="https://purrr.tidyverse.org/">
    <img class="hex" src="images/hex/purrr.png">
  </a>
  <a href="https://r4ds.had.co.nz/iteration.html">
    <img class="book" src="images/books/r4ds-iteration.png">
  </a>
</div>

---
background-image: url(images/resampling/applied-ds-program.png)
background-position: center 60%
background-size: 85%

# .nobold[(Applied)] Data Science

---
# `map()` and friends

Applies a function to every element of a list.

```r
map(.x, .f, ...)
```

What type of output do you expect? Choose the variant that returns it:

`map_chr()`, `map_dbl()`, `map_int()`, `map_lgl()`, `map_df()`, or the general `map()`
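
A small sketch of how the variants differ, using a made-up list `x`: `map()` always returns a list, while `map_dbl()` returns a double vector and errors if any result is not a single number.

```r
library(purrr)

x <- list(a = 1:3, b = 4:6)  # toy input, not from the exercises

means_list <- map(x, mean)     # list with elements a and b
means_dbl <- map_dbl(x, mean)  # named double vector

means_dbl
#> a b 
#> 2 5 
```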

---
# Building our `map()`

<code class ='r hljs remark-code'>map(penguin_boot$splits, <span style="background-color:#FED766;color:#009FB7">.f</span>)</code>

---
# Custom functions

We need a function that:

1\. Takes in a split

2\. Pulls out the analysis set

3\. Calculates the correlation of the analysis set

---
# Step 1: Create the code

Use our code from earlier

<code class ='r hljs remark-code'><span style="background-color:white"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><br>&nbsp;&nbsp;boot_sample <- analysis(split)<br>&nbsp;&nbsp;<br>&nbsp;&nbsp;cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;use = "complete.obs")<br><span style="background-color:white"> </span></code>

---
# Step 1: Create the code

Use our code from earlier

<code class ='r hljs remark-code'><span style="background-color:white"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><br>&nbsp;&nbsp;boot_sample <- analysis(<span style="background-color:#FED766;color:#009FB7">split</span>)<br>&nbsp;&nbsp;<br>&nbsp;&nbsp;cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;use = "complete.obs")<br><span style="background-color:white"> </span></code>

---
# Step 2: Wrap inside of `function()`

The `function()` function defines a new function

<code class ='r hljs remark-code'>penguin_cor <- function(split) {<br>&nbsp;&nbsp;boot_sample <- analysis(split)<br>&nbsp;&nbsp;<br>&nbsp;&nbsp;cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;use = "complete.obs")<br>}</code>

---
# Step 2: Wrap inside of `function()`

The code gets wrapped inside of the curly braces

<code class ='r hljs remark-code'>penguin_cor <- function(split) <span style="background-color:#FED766;color:#009FB7">{</span><br>&nbsp;&nbsp;boot_sample <- analysis(split)<br>&nbsp;&nbsp;<br>&nbsp;&nbsp;cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;use = "complete.obs")<br><span style="background-color:#FED766;color:#009FB7">}</span></code>

---
# Step 2: Wrap inside of `function()`

Give the function a name

<code class ='r hljs remark-code'><span style="background-color:#FED766;color:#009FB7">penguin_cor</span> <- function(split) {<br>&nbsp;&nbsp;boot_sample <- analysis(split)<br>&nbsp;&nbsp;<br>&nbsp;&nbsp;cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;use = "complete.obs")<br>}</code>

---
# Step 3: Verify it works as expected

```r
boot1 <- penguin_boot$splits[[1]]
boot1_sample <- analysis(boot1)
cor(boot1_sample$bill_length_mm, boot1_sample$bill_depth_mm,
    use = "complete.obs")
#> [1] -0.1249339
```

```r
penguin_cor(boot1)
#> [1] -0.1249339
```

---
# Building our `map()`

```r
map(penguin_boot$splits, penguin_cor)
#> [[1]]
#> [1] -0.1249339
#> 
#> [[2]]
#> [1] -0.2541791
#> 
#> [[3]]
#> [1] -0.1991897
#> 
#> [[4]]
#> [1] -0.2478776
#> 
#> [[5]]
#> [1] -0.2450851
#> 
#> [[6]]
#> [1] -0.2792748
#> 
#> [[7]]
#> [1] -0.2776204
#> 
#> [[8]]
#> [1] -0.2045329
#> 
#> [[9]]
#> [1] -0.2431401
#> 
#> [[10]]
#> [1] -0.2745247
#> 
#> [[11]]
#> [1] -0.3171123
#> 
#> [[12]]
#> [1] -0.2259223
#> 
#> [[13]]
#> [1] -0.2044941
#> 
#> [[14]]
#> [1] -0.1027711
#> 
#> [[15]]
#> [1] -0.2506249
#> 
#> [[16]]
#> [1] -0.225205
#> 
#> [[17]]
#> [1] -0.2336472
#> 
#> [[18]]
#> [1] -0.2458328
#> 
#> [[19]]
#> [1] -0.2302427
#> 
#> [[20]]
#> [1] -0.2119384
#> 
#> [[21]]
#> [1] -0.1615742
#> 
#> [[22]]
#> [1] -0.308794
#> 
#> [[23]]
#> [1] -0.08155008
#> 
#> [[24]]
#> [1] -0.2548912
#> 
#> [[25]]
#> [1] -0.244613
#> 
#> [[26]]
#> [1] -0.2485528
#> 
#> [[27]]
#> [1] -0.23714
#> 
#> [[28]]
#> [1] -0.2015473
#> 
#> [[29]]
#> [1] -0.2691933
#> 
#> [[30]]
#> [1] -0.2407836
#> 
#> [[31]]
#> [1] -0.2307748
#> 
#> [[32]]
#> [1] -0.254897
#> 
#> [[33]]
#> [1] -0.2027794
#> 
#> [[34]]
#> [1] -0.2285699
#> 
#> [[35]]
#> [1] -0.164203
#> 
#> [[36]]
#> [1] -0.2912028
#> 
#> [[37]]
#> [1] -0.2936487
#> 
#> [[38]]
#> [1] -0.2951136
#> 
#> [[39]]
#> [1] -0.1986556
#> 
#> [[40]]
#> [1] -0.2981575
#> 
#> [[41]]
#> [1] -0.2325189
#> 
#> [[42]]
#> [1] -0.3074985
#> 
#> [[43]]
#> [1] -0.2465553
#> 
#> [[44]]
#> [1] -0.1778199
#> 
#> [[45]]
#> [1] -0.253044
#> 
#> [[46]]
#> [1] -0.2090783
#> 
#> [[47]]
#> [1] -0.2371913
#> 
#> [[48]]
#> [1] -0.2546507
#> 
#> [[49]]
#> [1] -0.2835118
#> 
#> [[50]]
#> [1] -0.2080868
#> 
#> [[51]]
#> [1] -0.2627895
#> 
#> [[52]]
#> [1] -0.198087
#> 
#> [[53]]
#> [1] -0.214096
#> 
#> [[54]]
#> [1] -0.2348333
#> 
#> [[55]]
#> [1] -0.2161979
#> 
#> [[56]]
#> [1] -0.2515138
#> 
#> [[57]]
#> [1] -0.1813075
#> 
#> [[58]]
#> [1] -0.2178123
#> 
#> [[59]]
#> [1] -0.3049228
#> 
#> [[60]]
#> [1] -0.2239921
#> 
#> [[61]]
#> [1] -0.1970032
#> 
#> [[62]]
#> [1] -0.2332788
#> 
#> [[63]]
#> [1] -0.2436859
#> 
#> [[64]]
#> [1] -0.2748855
#> 
#> [[65]]
#> [1] -0.2315254
#> 
#> [[66]]
#> [1] -0.2064697
#> 
#> [[67]]
#> [1] -0.3404718
#> 
#> [[68]]
#> [1] -0.1539548
#> 
#> [[69]]
#> [1] -0.3189776
#> 
#> [[70]]
#> [1] -0.2708725
#> 
#> [[71]]
#> [1] -0.2243308
#> 
#> [[72]]
#> [1] -0.2800821
#> 
#> [[73]]
#> [1] -0.2363227
#> 
#> [[74]]
#> [1] -0.2387688
#> 
#> [[75]]
#> [1] -0.2112919
#> 
#> [[76]]
#> [1] -0.2377676
#> 
#> [[77]]
#> [1] -0.1885858
#> 
#> [[78]]
#> [1] -0.2639419
#> 
#> [[79]]
#> [1] -0.3930653
#> 
#> [[80]]
#> [1] -0.2140946
#> 
#> [[81]]
#> [1] -0.1766409
#> 
#> [[82]]
#> [1] -0.1848252
#> 
#> [[83]]
#> [1] -0.1868983
#> 
#> [[84]]
#> [1] -0.3049031
#> 
#> [[85]]
#> [1] -0.2336121
#> 
#> [[86]]
#> [1] -0.2459616
#> 
#> [[87]]
#> [1] -0.2671155
#> 
#> [[88]]
#> [1] -0.3290882
#> 
#> [[89]]
#> [1] -0.2258902
#> 
#> [[90]]
#> [1] -0.2425831
#> 
#> [[91]]
#> [1] -0.2154876
#> 
#> [[92]]
#> [1] -0.2990681
#> 
#> [[93]]
#> [1] -0.1735009
#> 
#> [[94]]
#> [1] -0.2217043
#> 
#> [[95]]
#> [1] -0.2614308
#> 
#> [[96]]
#> [1] -0.2955381
#> 
#> [[97]]
#> [1] -0.2524951
#> 
#> [[98]]
#> [1] -0.2194297
#> 
#> [[99]]
#> [1] -0.2214609
#> 
#> [[100]]
#> [1] -0.1783059
```

---
# `map()` + `mutate()`

```r
penguin_boot %>%
  mutate(corr = map(splits, penguin_cor))
#> # Bootstrap sampling 
#> # A tibble: 100 x 3
#>    splits            id           corr     
#>    <list>            <chr>        <list>   
#>  1 <split [344/134]> Bootstrap001 <dbl [1]>
#>  2 <split [344/126]> Bootstrap002 <dbl [1]>
#>  3 <split [344/124]> Bootstrap003 <dbl [1]>
#>  4 <split [344/129]> Bootstrap004 <dbl [1]>
#>  5 <split [344/127]> Bootstrap005 <dbl [1]>
#>  6 <split [344/124]> Bootstrap006 <dbl [1]>
#>  7 <split [344/127]> Bootstrap007 <dbl [1]>
#>  8 <split [344/125]> Bootstrap008 <dbl [1]>
#>  9 <split [344/128]> Bootstrap009 <dbl [1]>
#> 10 <split [344/123]> Bootstrap010 <dbl [1]>
#> # … with 90 more rows
```

???

What's wrong with this? What could be improved?

---
# `map_dbl()` + `mutate()`

```r
penguin_boot %>%
  mutate(corr = map_dbl(splits, penguin_cor))
#> # Bootstrap sampling 
#> # A tibble: 100 x 3
#>    splits            id             corr
#>    <list>            <chr>         <dbl>
#>  1 <split [344/134]> Bootstrap001 -0.125
#>  2 <split [344/126]> Bootstrap002 -0.254
#>  3 <split [344/124]> Bootstrap003 -0.199
#>  4 <split [344/129]> Bootstrap004 -0.248
#>  5 <split [344/127]> Bootstrap005 -0.245
#>  6 <split [344/124]> Bootstrap006 -0.279
#>  7 <split [344/127]> Bootstrap007 -0.278
#>  8 <split [344/125]> Bootstrap008 -0.205
#>  9 <split [344/128]> Bootstrap009 -0.243
#> 10 <split [344/123]> Bootstrap010 -0.275
#> # … with 90 more rows
```

---
class: your-turn

# Your turn 3

.big[
Use the mapping functions and `mutate()` to calculate the correlation between bill length and bill depth.

1\. Write a function to calculate the correlation from the analysis set of a split.

2\. Apply that function to every bootstrap sample using `mutate()` and a mapping function.

3\. Make a histogram of the bootstrapped correlations.
]

---
class: your-turn

.panelset[
.panel[.panel-name[Function]

```r
penguin_cor <- function(split) {
  boot_sample <- analysis(split)
  
  cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,
      use = "complete.obs")
}
```
]

.panel[.panel-name[Code]

```r
penguin_boot %>%
  mutate(corr = map_dbl(splits, penguin_cor)) %>%
  ggplot(mapping = aes(x = corr)) +
  geom_histogram()
```
]

.panel[.panel-name[Plot]
<img src="images/resampling/plots/yt-boot-cor-sol-1.png" width="80%" style="display: block; margin: auto;" />
]
]

---
# Bootstrap summaries

```r
penguin_boot %>%
  mutate(corr = map_dbl(splits, penguin_cor)) %>%
  summarize(avg_corr = mean(corr))
#> # A tibble: 1 x 1
#>   avg_corr
#>      <dbl>
#> 1   -0.237
```

---
# Bootstrap summaries

```r
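penguin_boot %>%
  mutate(corr = map_dbl(splits, penguin_cor)) %>%
  # dplyr >= 1.1.0 deprecates multi-row results from summarize();
  # reframe() is the replacement there
  summarize(interval_95 = quantile(corr, probs = c(0.025, 0.975)),
            quantile = c(0.025, 0.975))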
#> # A tibble: 2 x 2
#>   interval_95 quantile
#>         <dbl>    <dbl>
#> 1      -0.324    0.025
#> 2      -0.139    0.975
```

---
.panelset[
.panel[.panel-name[Overall]
<img src="images/resampling/plots/overall-simpson-1.png" width="70%" style="display: block; margin: auto;" />

.center[
*r* =  &minus;.24
]
]

.panel[.panel-name[By Group]
<img src="images/resampling/plots/group-simpson-1.png" width="70%" style="display: block; margin: auto;" />

.center[
Adelie *r* = .39; Chinstrap *r* = .65; Gentoo *r* = .64
]
]
]
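
???

The numbers under the two panels can be checked with a sketch like the following, assuming dplyr and palmerpenguins are installed. The object names `overall_r` and `species_r` are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Correlation across all penguins, ignoring species
overall_r <- cor(penguins$bill_length_mm, penguins$bill_depth_mm,
                 use = "complete.obs")

# Correlation within each species
species_r <- penguins %>%
  group_by(species) %>%
  summarize(r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs"))

overall_r
species_r
```

The overall correlation is negative while each within-species correlation is positive, an instance of Simpson's paradox.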

---
name: cv
class: center middle inverse

# Cross-validation

---
class: your-turn

# Your turn 4

.big[
1\. Use `initial_split()` to create a training and testing set of the penguins data.

2\. Write a parsnip specification to fit a linear model that uses flipper length to predict bill length.

3\. Use the testing data to calculate the RMSE.
]

---
class: your-turn

```r
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_model <- fit(lm_spec,
                bill_length_mm ~ flipper_length_mm,
                data = penguin_train)

lm_preds <- predict(lm_model, new_data = penguin_test) %>%
  mutate(.truth = penguin_test$bill_length_mm)

rmse(lm_preds, truth = .truth, estimate = .pred)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        4.69
```

---
class: your-turn

# Your turn 5

.big[
What would happen if you repeated this process? Would you get the same answers?

Rerun the code chunk from the last exercise. Do you get the same answer?
]

---
.pull-left[

```
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        3.66
```

```
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        3.91
```

```
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        4.61
```

]

.pull-right[

```
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        4.16
```

```
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        4.01
```

```
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        3.88
```

]

---
class: pop-quiz

# Pop quiz!

.big[Why is the new estimate different?]

---
exclude: true

---
class: center middle

---

.right[Mean RMSE]

---

```r
rmses %>% enframe(name = "rmse")
#> # A tibble: 10 x 2
#>     rmse value
#>    <int> <dbl>
#>  1     1  3.72
#>  2     2  3.65
#>  3     3  4.70
#>  4     4  4.31
#>  5     5  3.82
#>  6     6  4.39
#>  7     7  4.08
#>  8     8  4.85
#>  9     9  4.57
#> 10    10  4.08

mean(rmses)
#> [1] 4.216917
```

---
class: pop-quiz

# Consider

.big[Which do you think is more accurate, the best result or the mean of the results? Why?]

---
# There has to be a better way...

```r
rmses <- vector(length = 10, mode = "double")
for (i in seq_along(rmses)) {
  new_split <- initial_split(penguins)
  penguin_train <- training(new_split)
  penguin_test <- testing(new_split)
  
  lm_model <- fit(lm_spec,
                  bill_length_mm ~ flipper_length_mm,
                  data = penguin_train)
  
  lm_preds <- predict(lm_model,
                      new_data = penguin_test) %>%
    mutate(.truth = penguin_test$bill_length_mm)
  
  rmses[i] <- rmse(lm_preds, truth = .truth, estimate = .pred) %>%
    pull(.estimate)
}
```

---
class: center middle

# V-fold cross-validation

```r
vfold_cv(data, v = 10, ...)
```

---
<img src="images/resampling/plots/cv-gif-1.gif" style="display: block; margin: auto;" />

---
# Guess

How many times does an observation/row appear in the assessment set?
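
???

A sketch for checking the guess, assuming rsample and palmerpenguins are installed. The `row_id` column and the object names are helpers added here for illustration; they are not part of the exercises.

```r
library(rsample)
library(palmerpenguins)

# Tag each row so it can be tracked across folds
penguins_id <- penguins
penguins_id$row_id <- seq_len(nrow(penguins_id))

set.seed(100)
folds <- vfold_cv(penguins_id, v = 10)

# Collect the row ids that land in each fold's assessment set
assess_ids <- unlist(lapply(folds$splits, function(s) assessment(s)$row_id))

# Every row is accounted for, and none is assessed more than once
length(assess_ids) == nrow(penguins_id)
all(table(assess_ids) == 1)
```

If both checks are `TRUE`, each observation is held out in exactly one fold.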

---

---
class: pop-quiz

# Pop quiz!

.big[If we use 10 folds, what percent of our data will end up in the training set and what percent in the testing set for each fold?]

---
class: pop-quiz

# Pop quiz!

.big[If we use 10 folds, what percent of our data will end up in the training set and what percent in the testing set for each fold?]

.big[
**90% training**

**10% testing**
]

---
class: your-turn

# Your turn 6

Run the code below. What does it return?

```r
set.seed(100)
cv_folds <- vfold_cv(penguins, v = 10, strata = species)

cv_folds
```

---
class: your-turn

```r
cv_folds
#> #  10-fold cross-validation using stratification 
#> # A tibble: 10 x 2
#>    splits           id    
#>    <list>           <chr> 
#>  1 <split [308/36]> Fold01
#>  2 <split [308/36]> Fold02
#>  3 <split [309/35]> Fold03
#>  4 <split [309/35]> Fold04
#>  5 <split [310/34]> Fold05
#>  6 <split [310/34]> Fold06
#>  7 <split [310/34]> Fold07
#>  8 <split [310/34]> Fold08
#>  9 <split [311/33]> Fold09
#> 10 <split [311/33]> Fold10
```

???

How does this help us?

---
class: center middle inverse

# `fit_resamples()`

---
# `fit_resamples()`

Trains and tests a model with cross-validation.

```r
fit_resamples(lm_spec,
              bill_length_mm ~ flipper_length_mm,
              resamples = cv_folds)
```

---

```r
cv_results <- fit_resamples(lm_spec,
                            bill_length_mm ~ flipper_length_mm,
                            resamples = cv_folds)
cv_results
#> # Resampling results
#> # 10-fold cross-validation using stratification 
#> # A tibble: 10 x 4
#>    splits           id     .metrics             .notes              
#>    <list>           <chr>  <list>               <list>              
#>  1 <split [308/36]> Fold01 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  2 <split [308/36]> Fold02 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  3 <split [309/35]> Fold03 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  4 <split [309/35]> Fold04 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  5 <split [310/34]> Fold05 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  6 <split [310/34]> Fold06 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  7 <split [310/34]> Fold07 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  8 <split [310/34]> Fold08 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  9 <split [311/33]> Fold09 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#> 10 <split [311/33]> Fold10 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
```

---
# `collect_metrics()`

Collects metrics from cross-validation results.

.pull-left[

```r
cv_results %>%
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   4.11     10  0.158  Preprocessor1_Model1
#> 2 rsq     standard   0.444    10  0.0422 Preprocessor1_Model1
```
]

.pull-right[

```r
cv_results %>%
  collect_metrics(summarize = FALSE)
#> # A tibble: 20 x 5
#>    id     .metric .estimator .estimate .config             
#>    <chr>  <chr>   <chr>          <dbl> <chr>               
#>  1 Fold01 rmse    standard       3.77  Preprocessor1_Model1
#>  2 Fold01 rsq     standard       0.451 Preprocessor1_Model1
#>  3 Fold02 rmse    standard       4.43  Preprocessor1_Model1
#>  4 Fold02 rsq     standard       0.295 Preprocessor1_Model1
#>  5 Fold03 rmse    standard       3.83  Preprocessor1_Model1
#>  6 Fold03 rsq     standard       0.315 Preprocessor1_Model1
#>  7 Fold04 rmse    standard       4.16  Preprocessor1_Model1
#>  8 Fold04 rsq     standard       0.496 Preprocessor1_Model1
#>  9 Fold05 rmse    standard       4.16  Preprocessor1_Model1
#> 10 Fold05 rsq     standard       0.560 Preprocessor1_Model1
#> 11 Fold06 rmse    standard       5.11  Preprocessor1_Model1
#> 12 Fold06 rsq     standard       0.220 Preprocessor1_Model1
#> 13 Fold07 rmse    standard       4.24  Preprocessor1_Model1
#> 14 Fold07 rsq     standard       0.594 Preprocessor1_Model1
#> 15 Fold08 rmse    standard       4.45  Preprocessor1_Model1
#> 16 Fold08 rsq     standard       0.387 Preprocessor1_Model1
#> 17 Fold09 rmse    standard       3.56  Preprocessor1_Model1
#> 18 Fold09 rsq     standard       0.593 Preprocessor1_Model1
#> 19 Fold10 rmse    standard       3.39  Preprocessor1_Model1
#> 20 Fold10 rsq     standard       0.528 Preprocessor1_Model1
```
]

---
# `metric_set()`

Specify which metrics you want to get back.

```r
fit_resamples(lm_spec,
              bill_length_mm ~ flipper_length_mm,
              resamples = cv_folds,
              metrics = metric_set(rsq)) %>%
  collect_metrics()
#> # A tibble: 1 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rsq     standard   0.444    10  0.0422 Preprocessor1_Model1
```
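
`metric_set()` can also bundle several metrics into one function. A minimal sketch on a made-up data frame of predictions, assuming yardstick is installed (`preds` and `reg_metrics` are illustrative names):

```r
library(yardstick)

# Toy predictions, not from the penguins model
preds <- data.frame(truth = c(1, 2, 3, 4),
                    estimate = c(1.5, 2, 2.5, 4))

# Bundle two regression metrics into a single function
reg_metrics <- metric_set(rmse, mae)

# Returns one row per metric with .metric, .estimator, and .estimate columns
metric_results <- reg_metrics(preds, truth = truth, estimate = estimate)
metric_results
```

The same bundle can be passed to the `metrics` argument of `fit_resamples()`.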

---
class: your-turn

# Your turn 7

Modify the code below to estimate our model on each of the folds and calculate the average RMSE for our penguin model.

```r
fit(lm_spec,
    bill_length_mm ~ flipper_length_mm,
    data = penguins)
```

---
class: your-turn

```r
fit_resamples(lm_spec,
              bill_length_mm ~ flipper_length_mm,
              resamples = cv_folds,
              metrics = metric_set(rmse)) %>%
  collect_metrics()
#> # A tibble: 1 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard    4.11    10   0.158 Preprocessor1_Model1
```

---
class: center middle inverse

# Comparing Models

---
class: your-turn

# Your turn 8

.big[
Use `fit_resamples()` and `cv_folds` to estimate two models to predict bill length.

1\. `bill_length_mm ~ flipper_length_mm`

2\. `bill_length_mm ~ species + sex`

Compare the performance of each.
]

---
class: your-turn

```r
fit_resamples(lm_spec,
              bill_length_mm ~ flipper_length_mm,
              resamples = cv_folds) %>%
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   4.11     10  0.158  Preprocessor1_Model1
#> 2 rsq     standard   0.444    10  0.0422 Preprocessor1_Model1

fit_resamples(lm_spec,
              bill_length_mm ~ species + sex,
              resamples = cv_folds) %>%
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   2.33     10  0.118  Preprocessor1_Model1
#> 2 rsq     standard   0.833    10  0.0130 Preprocessor1_Model1
```

---
class: pop-quiz

# Pop quiz!

.big[Why should you use the same data splits to compare each model?]

.big[🍎 to 🍎]

---
class: pop-quiz

# Pop quiz!

.big[Does cross-validation measure just the accuracy of your model, or your entire workflow?]

.big[Your entire workflow]

---
class: middle, center, inverse

# Other types of cross-validation

---
class: middle, center

# `vfold_cv()` - V-fold cross-validation

---
class: middle, center

# `loo_cv()` - Leave-one-out CV

---
class: middle, center

# `mc_cv()` - Monte Carlo (random) CV

(Test sets sampled without replacement)

---
class: middle, center

# `bootstraps()`

(Analysis sets sampled with replacement; the rows never drawn form the assessment set)
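
???

A sketch that builds one object with each of the four resampling functions, assuming rsample and palmerpenguins are installed. The `prop` and `times` values are arbitrary choices for illustration.

```r
library(rsample)
library(palmerpenguins)

set.seed(100)
resamples <- list(
  vfold = vfold_cv(penguins, v = 10),               # 10 folds
  loo   = loo_cv(penguins),                         # one split per row
  mc    = mc_cv(penguins, prop = 0.9, times = 10),  # 10 random 90/10 splits
  boot  = bootstraps(penguins, times = 10)          # 10 bootstrap resamples
)

# Number of splits produced by each method
n_resamples <- vapply(resamples, nrow, integer(1))
n_resamples
```

Leave-one-out creates as many splits as there are rows, which is why it is rarely the first choice for larger data sets.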
