class: title-slide, center <span class="fa-stack fa-4x"> <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i> <strong class="fa-stack-1x" style="color:#009FB7;">3</strong> </span> # Data Transformation ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).] <div style = "position:fixed; visibility: hidden"> `$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$` `$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$` </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { yellow: ["{\\color{yellow}{#1}}", 1], blue: ["{\\color{blue}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> <style> .yellow {color: #FED766;} .blue {color: #009FB7;} </style> --- <div class="hex-book"> <a href="https://dplyr.tidyverse.org"> <img class="hex" src="images/hex/dplyr.png"> </a> <a href="https://r4ds.had.co.nz/transform.html"> <img class="book" src="images/books/r4ds-data-trans.png"> </a> </div> --- background-image: url(images/transform/applied-ds-trans.png) background-position: center 60% background-size: 85% # .nobold[(Applied)] Data Science --- # Example Data: `babynames` ```r library(babynames) babynames #> # A tibble: 1,924,665 x 5 #> year sex name n prop #> <dbl> <chr> <chr> <int> <dbl> #> 1 1880 F Mary 7065 0.0724 #> 2 1880 F Anna 2604 0.0267 #> 3 1880 F Emma 2003 0.0205 #> 4 1880 F Elizabeth 1939 0.0199 #> 5 1880 F Minnie 1746 0.0179 #> 6 1880 F Margaret 1578 0.0162 #> 7 1880 F Ida 1472 0.0151 #> 8 1880 F Alice 1414 0.0145 #> 9 1880 F Bertha 1320 0.0135 #> 10 1880 F Sarah 1288 0.0132 #> # … with 1,924,655 more rows ``` ??? Year, Sex assigned at birth, Name, Number, and Proportion (n / sum(n |year,gender)) --- class: middle .left-column[ <a href="https://docs.ropensci.org/skimr/"> <img src="images/hex/skimr.png"> </a> ] .right-column[ # skimr ```r library(skimr) ``` * Skims your data * Summarizes variable types, distributions, etc. ```r skim(babynames) ``` ] --- class: your-turn # Your turn 1 .big[ * Open the R Notebook **materials/exercises/03-transform.Rmd** * Let's look at the `babynames` data set * Run this code to view a summary of the data ] ```r skim(babynames) ```
02
:
00
--- class: your-turn wide .panelset[ .panel[.panel-name[Code] ```r skim(babynames) ``` ] .panel[.panel-name[Output] .smaller[ ``` #> ── Data Summary ──────────────────────── #> Values #> Name babynames #> Number of rows 1924665 #> Number of columns 5 #> _______________________ #> Column type frequency: #> character 2 #> numeric 3 #> ________________________ #> Group variables None #> #> ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── #> skim_variable n_missing complete_rate min max empty n_unique whitespace #> 1 sex 0 1 1 1 0 2 0 #> 2 name 0 1 2 15 0 97310 0 #> #> ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── #> skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist #> 1 year 0 1 1975. 34.0 1880 1951 1985 2003 2017 ▁▂▃▅▇ #> 2 n 0 1 181. 1533. 5 7 12 32 99686 ▇▁▁▁▁ #> 3 prop 0 1 0.000136 0.00115 0.00000226 0.00000387 0.0000073 0.0000229 0.0815 ▇▁▁▁▁ ``` ] ] ] --- # Isolating Data .pull-left[ <img src="images/transform/select.png" height="75px"> ] .pull-right[ .big[Extract variables with **`select()`**] ] --- # Isolating Data .pull-left[ <img src="images/transform/select.png" height="75px"> </br> </br> </br> <img src="images/transform/filter.png" height="75px"> ] .pull-right[ .big[Extract variables with **`select()`**] </br> </br> </br> </br> .big[Extract cases with **`filter()`**] ] --- # Isolating Data .pull-left[ <img src="images/transform/select.png" height="75px"> </br> </br> </br> <img src="images/transform/filter.png" height="75px"> </br> </br> </br> <img src="images/transform/arrange.png" height="75px"> ] .pull-right[ .big[Extract variables with **`select()`**] </br> </br> </br> </br> .big[Extract cases with **`filter()`**] </br> </br> </br> .big[Arrange cases with **`arrange()`**] ] --- name: select class: center middle # `select()` --- class: middle <code class ='r hljs remark-code'>select(.data, ...)</code> --- class: middle <code class ='r hljs remark-code'>select(<span style="background-color:#FED766;color:#009FB7">.data</span>, ...)</code> ??? Data to transform --- class: middle <code class ='r hljs remark-code'>select(.data, <span style="background-color:#FED766;color:#009FB7">...</span>)</code> ??? Name of column(s) to select, or select helper function --- ```r select(babynames, name, prop) ``` .pull-left[
year
sex
name
n
prop
1880
F
Mary
7065
0.072
1880
F
Anna
2604
0.027
1880
F
Emma
2003
0.021
1880
F
Elizabeth
1939
0.020
1880
F
Minnie
1746
0.018
1880
F
Margaret
1578
0.016
1880
F
Ida
1472
0.015
1880
F
Alice
1414
0.014
1880
F
Bertha
1320
0.014
1880
F
Sarah
1288
0.013
] -- .pull-right[
name
prop
Mary
0.072
Anna
0.027
Emma
0.021
Elizabeth
0.020
Minnie
0.018
Margaret
0.016
Ida
0.015
Alice
0.014
Bertha
0.014
Sarah
0.013
] --- class: your-turn # Your turn 2 .big[ * Alter the code to select just the **`n`** column ] ```r select(babynames, name, prop) ```
02
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r select(babynames, n) ``` ] .panel[.panel-name[Result] ``` # A tibble: 1,924,665 x 1 n <int> 1 7065 2 2604 3 2003 4 1939 5 1746 6 1578 7 1472 8 1414 9 1320 10 1288 # … with 1,924,655 more rows ``` ] ] --- # Example Data: `storms` ```r storms #> # A tibble: 10,010 x 13 #> name year month day hour lat long status category wind pressure #> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int> #> 1 Amy 1975 6 27 0 27.5 -79 tropical d… -1 25 1013 #> 2 Amy 1975 6 27 6 28.5 -79 tropical d… -1 25 1013 #> 3 Amy 1975 6 27 12 29.5 -79 tropical d… -1 25 1013 #> 4 Amy 1975 6 27 18 30.5 -79 tropical d… -1 25 1013 #> 5 Amy 1975 6 28 0 31.5 -78.8 tropical d… -1 25 1012 #> 6 Amy 1975 6 28 6 32.4 -78.7 tropical d… -1 25 1012 #> 7 Amy 1975 6 28 12 33.3 -78 tropical d… -1 25 1011 #> 8 Amy 1975 6 28 18 34 -77 tropical d… -1 30 1006 #> 9 Amy 1975 6 29 0 34.4 -75.8 tropical s… 0 35 1004 #> 10 Amy 1975 6 29 6 34 -74.8 tropical s… 0 40 1002 #> # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>, #> # hu_diameter <dbl> ``` ??? Hurricane data; year, month, day, hour of report; position (lat/long); classification; category; wind speed; air pressure; diameter of the area with tropical storm winds; diameter of area with hurricane winds --- # `select()` helpers **`:`** selects a range of columns ```r select(storms, name:pressure) ``` **`-`** selects every column but ```r select(storms, -c(name, pressure)) ``` **`starts_with()`**/**`ends_with()`** selects based on start/end ```r select(storms, starts_with("w")) select(storms, ends_with("e")) ``` --- # `select()` helpers **`contains()`** selects based on anywhere ```r select(storms, contains("d")) ``` **`matches()`** selects based on expressions ```r select(storms, matches("^.{4}$")) ``` **`any_of()`**/**`all_of()`** selects a set ```r select(storms, any_of(c("name", "names", "Name"))) ``` ??? regex: name starts, has any character 4 times, then ends --- background-image: url(images/transform/cheatsheet-select.png) background-position: center middle background-size: 85% --- class: pop-quiz # Consider Which of these is NOT a way to select the **`name`** and **`n`** columns together? ```r select(babynames, -c(year, sex, prop)) ``` ```r select(babynames, name:n) ``` ```r select(babynames, starts_with("n")) ``` ```r select(babynames, ends_with("n")) ```
01
:
00
--- class: pop-quiz .panelset[ .panel[.panel-name[Option 1] ```r select(babynames, -c(year, sex, prop)) #> # A tibble: 1,924,665 x 2 #> name n #> <chr> <int> #> 1 Mary 7065 #> 2 Anna 2604 #> 3 Emma 2003 #> 4 Elizabeth 1939 #> 5 Minnie 1746 #> 6 Margaret 1578 #> 7 Ida 1472 #> 8 Alice 1414 #> 9 Bertha 1320 #> 10 Sarah 1288 #> # … with 1,924,655 more rows ``` ] .panel[.panel-name[Option 2] ```r select(babynames, name:n) #> # A tibble: 1,924,665 x 2 #> name n #> <chr> <int> #> 1 Mary 7065 #> 2 Anna 2604 #> 3 Emma 2003 #> 4 Elizabeth 1939 #> 5 Minnie 1746 #> 6 Margaret 1578 #> 7 Ida 1472 #> 8 Alice 1414 #> 9 Bertha 1320 #> 10 Sarah 1288 #> # … with 1,924,655 more rows ``` ] .panel[.panel-name[Option 3] ```r select(babynames, starts_with("n")) #> # A tibble: 1,924,665 x 2 #> name n #> <chr> <int> #> 1 Mary 7065 #> 2 Anna 2604 #> 3 Emma 2003 #> 4 Elizabeth 1939 #> 5 Minnie 1746 #> 6 Margaret 1578 #> 7 Ida 1472 #> 8 Alice 1414 #> 9 Bertha 1320 #> 10 Sarah 1288 #> # … with 1,924,655 more rows ``` ] .panel[.panel-name[Option 4] ```r select(babynames, ends_with("n")) #> # A tibble: 1,924,665 x 1 #> n #> <int> #> 1 7065 #> 2 2604 #> 3 2003 #> 4 1939 #> 5 1746 #> 6 1578 #> 7 1472 #> 8 1414 #> 9 1320 #> 10 1288 #> # … with 1,924,655 more rows ``` ] ] --- name: filter class: center middle # `filter()` --- class: middle <code class ='r hljs remark-code'>filter(.data, ...)</code> --- class: middle <code class ='r hljs remark-code'>filter(<span style="background-color:#FED766;color:#009FB7">.data</span>, ...)</code> ??? Data to transform --- class: middle <code class ='r hljs remark-code'>filter(.data, <span style="background-color:#FED766;color:#009FB7">...</span>)</code> ??? One or more logical tests. Filter returns each row where the test is TRUE --- ```r filter(babynames, name == "Ida") ``` .pull-left[
year
sex
name
n
prop
1880
F
Mary
7065
0.072
1880
F
Anna
2604
0.027
1880
F
Emma
2003
0.021
1880
F
Elizabeth
1939
0.020
1880
F
Minnie
1746
0.018
1880
F
Margaret
1578
0.016
1880
F
Ida
1472
0.015
1880
F
Alice
1414
0.014
1880
F
Bertha
1320
0.014
1880
F
Sarah
1288
0.013
] -- .pull-right[
year
sex
name
n
prop
1880
F
Ida
1472
0.015
1880
M
Ida
8
0.000
1881
F
Ida
1439
0.015
1881
M
Ida
5
0.000
1882
F
Ida
1673
0.014
1882
M
Ida
5
0.000
1883
F
Ida
1634
0.014
1883
M
Ida
5
0.000
1884
F
Ida
1882
0.014
1884
M
Ida
8
0.000
] --- class: middle <code class ='r hljs remark-code'>filter(babynames, name <span style="background-color:#FED766;color:#009FB7">==</span> "Ida")</code> --- # Logical Tests .center[ `?Comparison` <table> <colgroup> <col span="1" style="width: 175px;"> <col span="1" style="width: 225px;"> </colgroup> <tbody> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code code-fade">x </code> <code class="r,css,yaml remark-inline-code"><strong> < </strong></code> <code class="r,css,yaml remark-inline-code code-fade">y</code> </td> <td style="text-align: left"> Less than </td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code code-fade">x </code> <code class="r,css,yaml remark-inline-code"><strong> > </strong></code> <code class="r,css,yaml remark-inline-code code-fade">y</code> </td> <td style="text-align: left"> Greater than </td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code code-fade">x </code> <code class="r,css,yaml remark-inline-code"><strong> == </strong></code> <code class="r,css,yaml remark-inline-code code-fade">y</code> </td> <td style="text-align: left"> Equal to </td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code code-fade">x </code> <code class="r,css,yaml remark-inline-code"><strong> <= </strong></code> <code class="r,css,yaml remark-inline-code code-fade">y</code> </td> <td style="text-align: left"> Less than or equal to </td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code code-fade">x </code> <code class="r,css,yaml remark-inline-code"><strong> >= </strong></code> <code class="r,css,yaml remark-inline-code code-fade">y</code> </td> <td style="text-align: left"> Greater than or equal to</td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code code-fade">x </code> <code class="r,css,yaml remark-inline-code"><strong> != </strong></code> <code class="r,css,yaml remark-inline-code code-fade">y</code> </td> <td style="text-align: left"> Not equal to</td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code code-fade">x </code> <code class="r,css,yaml remark-inline-code"><strong> %in% </strong></code> <code class="r,css,yaml remark-inline-code code-fade">y</code> </td> <td style="text-align: left"> Group membership</td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code"><strong>is.na(</strong></code> <code class="r,css,yaml remark-inline-code code-fade">x</code> <code class="r,css,yaml remark-inline-code"><strong>)</strong></code> </td> <td style="text-align: left"> Is missing (<code class="r,css,yaml remark-inline-code code-dark">NA</code>)</td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code"><strong>!is.na(</strong></code> <code class="r,css,yaml remark-inline-code code-fade">x</code> <code class="r,css,yaml remark-inline-code"><strong>)</strong></code> </td> <td style="text-align: left"> Is not missing (not <code class="r,css,yaml remark-inline-code code-dark">NA</code>)</td> </tr> </tbody> </table> ] --- class: your-turn # Your turn 3 .big[ Use the logical operators to manipulate the `babynames` data to show: 1. All of the names where **`prop`** is greater than .08 2. All of the children named "Daenerys" 3. All of the names that have a missing value for **`n`** .tiny[(Hint: this should return an empty data set)] ]
05
:
00
--- class: your-turn .panelset[ .panel[.panel-name[`prop` > .08] ```r filter(babynames, prop > .08) #> # A tibble: 3 x 5 #> year sex name n prop #> <dbl> <chr> <chr> <int> <dbl> #> 1 1880 M John 9655 0.0815 #> 2 1880 M William 9532 0.0805 #> 3 1881 M John 8769 0.0810 ``` ] .panel[.panel-name[Daenerys] ```r filter(babynames, name == "Daenerys") #> # A tibble: 6 x 5 #> year sex name n prop #> <dbl> <chr> <chr> <int> <dbl> #> 1 2012 F Daenerys 21 0.0000108 #> 2 2013 F Daenerys 68 0.0000354 #> 3 2014 F Daenerys 86 0.0000441 #> 4 2015 F Daenerys 82 0.0000422 #> 5 2016 F Daenerys 101 0.0000524 #> 6 2017 F Daenerys 110 0.0000587 ``` ] .panel[.panel-name[Missing] ```r filter(babynames, is.na(n)) #> # A tibble: 0 x 5 #> # … with 5 variables: year <dbl>, sex <chr>, name <chr>, n <int>, prop <dbl> ``` ] ] --- ```r filter(babynames, name == "Ida", year == 1880) ``` .pull-left[
year
sex
name
n
prop
1880
F
Mary
7065
0.072
1880
F
Anna
2604
0.027
1880
F
Emma
2003
0.021
1880
F
Elizabeth
1939
0.020
1880
F
Minnie
1746
0.018
1880
F
Margaret
1578
0.016
1880
F
Ida
1472
0.015
1880
F
Alice
1414
0.014
1880
F
Bertha
1320
0.014
1880
F
Sarah
1288
0.013
] -- .pull-right[
year
sex
name
n
prop
1880
F
Ida
1472
0.015
1880
M
Ida
8
0.000
] ??? Specify multiple logical conditions that must be met. Listing comma separated values is equivalent to "AND"/& --- # Boolean operators .center[ `?base::Logic` <table> <colgroup> <col span="1" style="width: 225px;"> <col span="1" style="width: 175px;"> </colgroup> <tbody> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code code-fade">a </code> <code class="r,css,yaml remark-inline-code"><strong> & </strong></code> <code class="r,css,yaml remark-inline-code code-fade">b</code> </td> <td style="text-align: left"> and</td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code code-fade">a </code> <code class="r,css,yaml remark-inline-code"><strong> | </strong></code> <code class="r,css,yaml remark-inline-code code-fade">b</code> </td> <td style="text-align: left"> or</td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code"><strong>xor(</strong></code> <code class="r,css,yaml remark-inline-code code-fade">a, b</code> <code class="r,css,yaml remark-inline-code"><strong>)</strong></code> </td> <td style="text-align: left"> Exactly or</td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code"><strong>!</strong></code> <code class="r,css,yaml remark-inline-code code-fade">a</code> </td> <td style="text-align: left"> not</td> </tr> <tr> <td style="text-align: center"> <code class="r,css,yaml remark-inline-code code-fade">a </code> <code class="r,css,yaml remark-inline-code"><strong> %in% </strong></code> <code class="r,css,yaml remark-inline-code code-fade">c(a, b)</code> </td> <td style="text-align: left"> One of (in)</td> </tr> </tbody> </table> ] ??? xor -> one is true and one is false --- class: your-turn # Your turn 4 .big[ Use the Boolean operators to manipulate the `babynames` data to show: 1. Girls named Sea 2. Names that were used by exactly 5 or 6 children in 1880 3. Names that are one of Acura, Lexus, or Yugo ]
05
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Sea] ```r filter(babynames, name == "Sea", sex == "F") #> # A tibble: 2 x 5 #> year sex name n prop #> <dbl> <chr> <chr> <int> <dbl> #> 1 1982 F Sea 5 0.00000276 #> 2 1998 F Sea 5 0.00000258 ``` ] .panel[.panel-name[Sea Alternate] ```r filter(babynames, name == "Sea" & sex == "F") #> # A tibble: 2 x 5 #> year sex name n prop #> <dbl> <chr> <chr> <int> <dbl> #> 1 1982 F Sea 5 0.00000276 #> 2 1998 F Sea 5 0.00000258 ``` ] .panel[.panel-name[1880] ```r filter(babynames, n %in% c(5, 6), year == 1880) #> # A tibble: 455 x 5 #> year sex name n prop #> <dbl> <chr> <chr> <int> <dbl> #> 1 1880 F Abby 6 0.0000615 #> 2 1880 F Aileen 6 0.0000615 #> 3 1880 F Alba 6 0.0000615 #> 4 1880 F Alda 6 0.0000615 #> 5 1880 F Alla 6 0.0000615 #> 6 1880 F Alverta 6 0.0000615 #> 7 1880 F Ara 6 0.0000615 #> 8 1880 F Ardelia 6 0.0000615 #> 9 1880 F Ardella 6 0.0000615 #> 10 1880 F Arrie 6 0.0000615 #> # … with 445 more rows ``` ] .panel[.panel-name[1880 Alternate] ```r filter(babynames, (n == 5 | n == 6) & year == 1880) #> # A tibble: 455 x 5 #> year sex name n prop #> <dbl> <chr> <chr> <int> <dbl> #> 1 1880 F Abby 6 0.0000615 #> 2 1880 F Aileen 6 0.0000615 #> 3 1880 F Alba 6 0.0000615 #> 4 1880 F Alda 6 0.0000615 #> 5 1880 F Alla 6 0.0000615 #> 6 1880 F Alverta 6 0.0000615 #> 7 1880 F Ara 6 0.0000615 #> 8 1880 F Ardelia 6 0.0000615 #> 9 1880 F Ardella 6 0.0000615 #> 10 1880 F Arrie 6 0.0000615 #> # … with 445 more rows ``` ] .panel[.panel-name[Cars] ```r filter(babynames, name %in% c("Acura", "Lexus", "Yugo")) #> # A tibble: 57 x 5 #> year sex name n prop #> <dbl> <chr> <chr> <int> <dbl> #> 1 1990 F Lexus 36 0.0000175 #> 2 1990 M Lexus 12 0.00000558 #> 3 1991 F Lexus 102 0.0000502 #> 4 1991 M Lexus 16 0.00000755 #> 5 1992 F Lexus 193 0.0000963 #> 6 1992 M Lexus 25 0.0000119 #> 7 1993 F Lexus 285 0.000145 #> 8 1993 M Lexus 30 0.0000145 #> 9 1994 F Lexus 381 0.000195 #> 10 1994 F Acura 6 0.00000308 #> # … with 47 more rows ``` ] .panel[.panel-name[Cars Alternate] ```r filter(babynames, name == "Acura" | name == "Lexus" | name == "Yugo") #> # A tibble: 57 x 5 #> year sex name n prop #> <dbl> <chr> <chr> <int> <dbl> #> 1 1990 F Lexus 36 0.0000175 #> 2 1990 M Lexus 12 0.00000558 #> 3 1991 F Lexus 102 0.0000502 #> 4 1991 M Lexus 16 0.00000755 #> 5 1992 F Lexus 193 0.0000963 #> 6 1992 M Lexus 25 0.0000119 #> 7 1993 F Lexus 285 0.000145 #> 8 1993 M Lexus 30 0.0000145 #> 9 1994 F Lexus 381 0.000195 #> 10 1994 F Acura 6 0.00000308 #> # … with 47 more rows ``` ] ] ??? Common mistakes: * collapsing multiple tests into one (i.e., `10 < n < 20` instead of `10 < n, n < 20`) * Stringing together many tests when you could use %in% --- name: arrange class: center middle # `arrange()` --- class: middle <code class ='r hljs remark-code'>arrange(.data, ...)</code> --- class: middle <code class ='r hljs remark-code'>arrange(<span style="background-color:#FED766;color:#009FB7">.data</span>, ...)</code> ??? Data to transform --- class: middle <code class ='r hljs remark-code'>arrange(.data, <span style="background-color:#FED766;color:#009FB7">...</span>)</code> ??? One or more columns to order by. Additional columns are used to break ties. --- ```r arrange(babynames, n) ``` .pull-left[
year
sex
name
n
prop
1965
M
Russell
5,649
0.003
1935
M
Jimmy
4,173
0.004
1891
M
William
6,763
0.062
2000
M
Kevin
12,667
0.006
1956
M
Johnny
6,640
0.003
1918
F
Gladys
8,735
0.007
1971
M
Stephen
13,105
0.007
1898
M
James
5,321
0.040
1994
F
Kaitlyn
6,686
0.003
1933
F
Margaret
15,239
0.015
] -- .pull-right[
year
sex
name
n
prop
1935
M
Jimmy
4,173
0.004
1898
M
James
5,321
0.040
1965
M
Russell
5,649
0.003
1956
M
Johnny
6,640
0.003
1994
F
Kaitlyn
6,686
0.003
1891
M
William
6,763
0.062
1918
F
Gladys
8,735
0.007
2000
M
Kevin
12,667
0.006
1971
M
Stephen
13,105
0.007
1933
F
Margaret
15,239
0.015
] --- class: your-turn # Your turn 5 .big[ * Arrange `babynames` by **`n`**. Add **`prop`** as a second (tiebreaking) variable to arrange by. * What is the smallest value of **`n`**? ]
02
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r arrange(babynames, n, prop) ``` ] .panel[.panel-name[Result] ``` # A tibble: 1,924,665 x 5 year sex name n prop <dbl> <chr> <chr> <int> <dbl> 1 2007 M Aaban 5 0.00000226 2 2007 M Aareon 5 0.00000226 3 2007 M Aaris 5 0.00000226 4 2007 M Abd 5 0.00000226 5 2007 M Abdulazeez 5 0.00000226 6 2007 M Abdulhadi 5 0.00000226 7 2007 M Abdulhamid 5 0.00000226 8 2007 M Abdulkadir 5 0.00000226 9 2007 M Abdulraheem 5 0.00000226 10 2007 M Abdulrahim 5 0.00000226 # … with 1,924,655 more rows ``` ] ] --- # Descending Order <code class ='r hljs remark-code'>arrange(babynames, <span style="background-color:#FED766;color:#009FB7">desc(</span>n<span style="background-color:#FED766;color:#009FB7">)</span>)</code> .pull-left[
year
sex
name
n
prop
1965
M
Russell
5,649
0.003
1935
M
Jimmy
4,173
0.004
1891
M
William
6,763
0.062
2000
M
Kevin
12,667
0.006
1956
M
Johnny
6,640
0.003
1918
F
Gladys
8,735
0.007
1971
M
Stephen
13,105
0.007
1898
M
James
5,321
0.040
1994
F
Kaitlyn
6,686
0.003
1933
F
Margaret
15,239
0.015
] -- .pull-right[
year
sex
name
n
prop
1933
F
Margaret
15,239
0.015
1971
M
Stephen
13,105
0.007
2000
M
Kevin
12,667
0.006
1918
F
Gladys
8,735
0.007
1891
M
William
6,763
0.062
1994
F
Kaitlyn
6,686
0.003
1956
M
Johnny
6,640
0.003
1965
M
Russell
5,649
0.003
1898
M
James
5,321
0.040
1935
M
Jimmy
4,173
0.004
] --- class: your-turn # Your turn 6 .big[ * Use **`desc()`** to find the names with the highest **`prop`**. * Which names have the largest values of **`n`**? ]
02
:
00
--- class: your-turn .panelset[ .panel[.panel-name[prop] ```r arrange(babynames, desc(prop)) # A tibble: 1,924,665 x 5 year sex name n prop <dbl> <chr> <chr> <int> <dbl> 1 1880 M John 9655 0.0815 2 1881 M John 8769 0.0810 3 1880 M William 9532 0.0805 4 1883 M John 8894 0.0791 5 1881 M William 8524 0.0787 6 1882 M John 9557 0.0783 7 1884 M John 9388 0.0765 8 1882 M William 9298 0.0762 9 1886 M John 9026 0.0758 10 1885 M John 8756 0.0755 # … with 1,924,655 more rows ``` ] .panel[.panel-name[n] ```r arrange(babynames, desc(n)) # A tibble: 1,924,665 x 5 year sex name n prop <dbl> <chr> <chr> <int> <dbl> 1 1947 F Linda 99686 0.0548 2 1948 F Linda 96209 0.0552 3 1947 M James 94756 0.0510 4 1957 M Michael 92695 0.0424 5 1947 M Robert 91642 0.0493 6 1949 F Linda 91016 0.0518 7 1956 M Michael 90620 0.0423 8 1958 M Michael 90520 0.0420 9 1948 M James 88588 0.0497 10 1954 M Michael 88514 0.0428 # … with 1,924,655 more rows ``` ] ] --- name: pipe class: center middle # `%>%` --- class: pop-quiz # Consider .pull-left[ How would you do the following to the `babynames` data: 1. Filter to only the names Sansa and Arya; 2. Arrange by **`name`** and then **`year`**; and 3. Remove the **`prop`** variable ] .pull-right[ ``` # A tibble: 67 x 4 year sex name n <dbl> <chr> <chr> <int> 1 1982 M Arya 12 2 1983 M Arya 7 3 1984 M Arya 9 4 1985 M Arya 20 5 1986 F Arya 5 6 1986 M Arya 18 7 1987 M Arya 22 8 1988 F Arya 5 9 1988 M Arya 22 10 1989 M Arya 30 # … with 57 more rows ``` ]
02
:
00
--- class: pop-quiz wide # Nesting <code class ='r hljs remark-code'>babynames</code> -- <code class ='r hljs remark-code'>filter(<span style="background-color:#FED766;color:#009FB7">babynames</span>, name %in% c("Sansa", "Arya"))</code> -- <code class ='r hljs remark-code'>arrange(<span style="background-color:#FED766;color:#009FB7">filter(babynames, name %in% c("Sansa", "Arya"))</span>, name, year)</code> -- <code class ='r hljs remark-code'>select(<span style="background-color:#FED766;color:#009FB7">arrange(filter(babynames, name %in% c("Sansa", "Arya")), name, year)</span>, -prop)</code> -- <code class ='r hljs remark-code'>select(arrange(filter(babynames, name %in% c("Sansa", "Arya")), name, year), -prop)</code> ??? Start with babynames Then filter to Sansa and Arya Then arrange by name and year Finally remove prop --- class: pop-quiz wide # Intermediate objects <code class ='r hljs remark-code'>bn1 <- babynames</code> -- <code class ='r hljs remark-code'>bn2 <- filter(<span style="background-color:#FED766;color:#009FB7">bn1</span>, name %in% c("Sansa", "Arya"))</code> -- <code class ='r hljs remark-code'>bn3 <- arrange(<span style="background-color:#FED766;color:#009FB7">bn2</span>, name, year)</code> -- <code class ='r hljs remark-code'>bn4 <- select(<span style="background-color:#FED766;color:#009FB7">bn3</span>, -prop)</code> -- .big[What are all these objects in my environment window??] --- class: center middle .big[Your code should tell the story of what you are doing to the data] --- class: center middle <iframe width="889" height="500" src="https://www.youtube.com/embed/UbxUSsFXYo4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> --- # Let's tell a story I: 1. **Tumble** out of bed -- 2. **Stumble** to the kitchen -- 3. **Pour** myself a cup of ambition -- 4. **Yawn** -- 5. **Stretch** -- 6. **Try** to come to live --- # What story does this tell? ```r try(come_to_live( stretch(yawn(pour(stumble(tumble(I, out_of = "bed"), to = "the kitchen"), who = "myself", unit = "cup", what = "ambition"))) )) ``` --- # Turn the story into code ```r I ``` --- # Turn the story into code ```r tumble(I, out_of = "bed") ``` --- # Turn the story into code ```r stumble(tumble(I, out_of = "bed"), to = "the kitchen") ``` --- # Turn the story into code ```r pour(stumble(tumble(I, out_of = "bed"), to = "the kitchen"), who = "myself", unit = "cup", what = "ambition") ``` --- # Turn the story into code ```r yawn(pour(stumble(tumble(I, out_of = "bed"), to = "the kitchen"), who = "myself", unit = "cup", what = "ambition")) ``` --- # Turn the story into code ```r stretch(yawn(pour(stumble(tumble(I, out_of = "bed"), to = "the kitchen"), who = "myself", unit = "cup", what = "ambition"))) ``` --- # Turn the story into code ```r try(come_to_live( stretch(yawn(pour(stumble(tumble(I, out_of = "bed"), to = "the kitchen"), who = "myself", unit = "cup", what = "ambition"))) )) ``` ??? We understand because we wrote it, but can others? Remember before we built this up. --- class: center middle .huge[**`%>%`**] -- Pass the output of one function to the first argument of the next --- class: middle <code class ='r hljs remark-code'><span style="background-color:#FED766;color:#009FB7">me1</span> <- "I"<br>me2 <- tumble(<span style="background-color:#FED766;color:#009FB7">me1</span>, out_of = "bed")<br>me3 <- stumble(me2, to = "the kitchen")</code> -- <code class ='r hljs remark-code'>me1 <- "I"<br><span style="background-color:#FED766;color:#009FB7">me2</span> <- tumble(me1, out_of = "bed")<br>me3 <- stumble(<span style="background-color:#FED766;color:#009FB7">me2</span>, to = "the kitchen")</code> -- <code class ='r hljs remark-code'>I <span style="background-color:#FED766;color:#009FB7">%>%</span><br> tumble(out_of = "bed") <span style="background-color:#FED766;color:#009FB7">%>%</span><br> stumble(to = "the kitchen")</code> --- # Tell a story ```r I %>% tumble(out_of = "bed") %>% stumble(to = "the kitchen") %>% pour(who = "myself", unit = "cup", what = "ambition") %>% yawn() %>% stretch() %>% try(come_to_live()) ``` --- # Piping in the tidyverse Functions take a data frame as their first argument, and return a data frame. <code class ='r hljs remark-code'>select(<span style="background-color:#FED766;color:#009FB7">.data</span>, ...)</code> <code class ='r hljs remark-code'>filter(<span style="background-color:#FED766;color:#009FB7">.data</span>, ...)</code> <code class ='r hljs remark-code'>arrange(<span style="background-color:#FED766;color:#009FB7">.data</span>, ...)</code> --- class: pop-quiz # Reconsider .pull-left[ How would you do the following to the `babynames` data: 1. Filter to only the names Sansa and Arya; 2. Arrange by **`name`** and then **`year`**; and 3. Remove the **`prop`** variable ] .pull-right[ ``` # A tibble: 67 x 4 year sex name n <dbl> <chr> <chr> <int> 1 1982 M Arya 12 2 1983 M Arya 7 3 1984 M Arya 9 4 1985 M Arya 20 5 1986 F Arya 5 6 1986 M Arya 18 7 1987 M Arya 22 8 1988 F Arya 5 9 1988 M Arya 22 10 1989 M Arya 30 # … with 57 more rows ``` ] --- class: pop-quiz ```r babynames %>% filter(name %in% c("Sansa", "Arya")) %>% arrange(name, year) %>% select(-prop) #> # A tibble: 67 x 4 #> year sex name n #> <dbl> <chr> <chr> <int> #> 1 1982 M Arya 12 #> 2 1983 M Arya 7 #> 3 1984 M Arya 9 #> 4 1985 M Arya 20 #> 5 1986 F Arya 5 #> 6 1986 M Arya 18 #> 7 1987 M Arya 22 #> 8 1988 F Arya 5 #> 9 1988 M Arya 22 #> 10 1989 M Arya 30 #> # … with 57 more rows ``` --- class: your-turn # Your turn 7 .big[ Use **`%>%`** to write a sequence of functions that: 1. Filter `babynames` to just the females that were born in 2015. 2. Select the **`name`** and **`n`** columns. 3. Arrange the results so that the most popular names are at the top. ]
04
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r babynames %>% filter(sex == "F", year == 2015) %>% select(name, n) %>% arrange(desc(n)) ``` ] .panel[.panel-name[Results] ``` # A tibble: 19,074 x 2 name n <chr> <int> 1 Emma 20435 2 Olivia 19669 3 Sophia 17402 4 Ava 16361 5 Isabella 15594 6 Mia 14892 7 Abigail 12390 8 Emily 11780 9 Charlotte 11390 10 Harper 10291 # … with 19,064 more rows ``` ] ] --- class: your-turn # Your turn 8 .big[ Combine your new data skills to make a plot showing the popularity of your name over time. 1. Trim `babynames` to just the rows that contain your **`name`**. 2. Trim the result to just the columns that are needed for your plot. 3. Plot the results as a line graph with **`year`** on the x-axis and **`prop`** on the y-axis, colored **`sex`**. ]
04
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r babynames %>% filter(name == "Jake") %>% select(year, prop, sex) %>% ggplot(mapping = aes(x = year, y = prop)) + geom_line(mapping = aes(color = sex)) ``` ] .panel[.panel-name[Plot] <img src="images/transform/plots/yt-name-plot-sol-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- class: center middle # What names are the most popular? --- # What is popularity? We can assess popularity through: -- 1. **Sums** - a large number of children have the name when we sum across years. -- 2. **Ranks** - the name consistently ranks among the top names from year to year. --- class: pop-quiz # Consider .big[ Do we have enough information to: 1. Calculate the total number of children with each name? 2. Rank the names within each year? ] --- # Deriving information .pull-left[ <img src="images/transform/summarize.png" height="75px"> ] .pull-right[ .big[Compile data with **`summarize()`**] ] --- # Deriving information .pull-left[ <img src="images/transform/summarize.png" height="75px"> </br> </br> </br> <img src="images/transform/group_by.png" height="75px"> ] .pull-right[ .big[Compile data with **`summarize()`**] </br> </br> </br> </br> .big[Analyze groups with **`group_by()`**] ] --- # Deriving information .pull-left[ <img src="images/transform/summarize.png" height="75px"> </br> </br> </br> <img src="images/transform/group_by.png" height="75px"> </br> </br> </br> <img src="images/transform/mutate.png" height="75px"> ] .pull-right[ .big[Compile data with **`summarize()`**] </br> </br> </br> </br> .big[Analyze groups with **`group_by()`**] </br> </br> </br> .big[Make new variables with **`mutate()`**] ] --- name: summarize class: center middle # `summarize()` --- class: middle <code class ='r hljs remark-code'>summarize(.data, ...)</code> --- class: middle <code class ='r hljs remark-code'>summarize(<span style="background-color:#FED766;color:#009FB7">.data</span>, ...)</code> ??? Data to summarize --- class: middle <code class ='r hljs remark-code'>summarize(.data, <span style="background-color:#FED766;color:#009FB7">...</span>)</code> ??? Summaries to calculate --- ```r summarize(babynames, total = sum(n), max = max(n)) ``` .pull-left[
year
sex
name
n
prop
1880
F
Mary
7065
0.072
1880
F
Anna
2604
0.027
1880
F
Emma
2003
0.021
1880
F
Elizabeth
1939
0.020
1880
F
Minnie
1746
0.018
1880
F
Margaret
1578
0.016
1880
F
Ida
1472
0.015
1880
F
Alice
1414
0.014
1880
F
Bertha
1320
0.014
1880
F
Sarah
1288
0.013
] -- .pull-right[
total
max
348120517
99686
] --- background-image: url(images/transform/cheatsheet-summary.png) background-position: center middle background-size: 85% --- class: your-turn # Your turn 9 .big[ Use **`summarize()`** to compute three statistics *about the data*: 1. The first (minimum) year in the data set 2. The last (maximum) year in the data set 3. The total number of unique names in the data. ] Hint: Use the functions `min()`, `max()`, and `n_distinct()`.
03
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r babynames %>% summarize(first_year = min(year), last_year = max(year), num_names = n_distinct(name)) ``` ] .panel[.panel-name[Result] ``` # A tibble: 1 x 3 first_year last_year num_names <dbl> <dbl> <int> 1 1880 2017 97310 ``` ] ] --- class: your-turn # Your turn 10 .big[ Extract the rows for children named **Khaleesi**. 1. How many children have been named Khaleesi? 2. What was the first year Khaleesi appeared in the data? ]
04
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r babynames %>% filter(name == "Khaleesi") %>% summarize(total_kids = sum(n), first_year = min(year)) ``` ] .panel[.panel-name[Result] ``` # A tibble: 1 x 2 total_kids first_year <int> <dbl> 1 1964 2011 ``` ] ] --- name: group class: center middle # `group_by()` --- class: middle <code class ='r hljs remark-code'>group_by(.data, ...)</code> --- class: middle <code class ='r hljs remark-code'>group_by(<span style="background-color:#FED766;color:#009FB7">.data</span>, ...)</code> ??? Data to group --- class: middle <code class ='r hljs remark-code'>group_by(.data, <span style="background-color:#FED766;color:#009FB7">...</span>)</code> ??? Variables to group by --- ```r group_by(babynames, sex) #> # A tibble: 1,924,665 x 5 #> # Groups: sex [2] #> year sex name n prop #> <dbl> <chr> <chr> <int> <dbl> #> 1 1880 F Mary 7065 0.0724 #> 2 1880 F Anna 2604 0.0267 #> 3 1880 F Emma 2003 0.0205 #> 4 1880 F Elizabeth 1939 0.0199 #> 5 1880 F Minnie 1746 0.0179 #> 6 1880 F Margaret 1578 0.0162 #> 7 1880 F Ida 1472 0.0151 #> 8 1880 F Alice 1414 0.0145 #> 9 1880 F Bertha 1320 0.0135 #> 10 1880 F Sarah 1288 0.0132 #> # … with 1,924,655 more rows ``` ??? What has changed? Additional meta data being printed. --- # Grouped summaries ```r babynames %>% group_by(sex) %>% summarize(num_names = n_distinct(name)) #> # A tibble: 2 x 2 #> sex num_names #> <chr> <int> #> 1 F 67046 #> 2 M 40927 ``` --- class: your-turn # Your turn 11 .big[ Calculate popularity by determining the total number of children given each name. 1. Use **`group_by()`** and **`summarize()`** to calculate the total number of children that have been given each name, by sex. 2. Arrange the results to show the most popular names first. Bonus: 3. Create a bar plot of the 10 most popular names, with **`name`** on the x-axis and total children on the y-axis. ]
05
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r babynames %>% group_by(name, sex) %>% summarize(total = sum(n)) %>% arrange(desc(total)) ``` ] .panel[.panel-name[Result] ``` # A tibble: 107,973 x 3 # Groups: name [97,310] name sex total <chr> <chr> <int> 1 James M 5150472 2 John M 5115466 3 Robert M 4814815 4 Michael M 4350824 5 Mary F 4123200 6 William M 4102604 7 David M 3611329 8 Joseph M 2603445 9 Richard M 2563082 10 Charles M 2386048 # … with 107,963 more rows ``` ] ] --- class: your-turn ```r babynames %>% group_by(name, sex) %>% summarize(total = sum(n), .groups = "drop") %>% arrange(desc(total)) %>% * slice_max(total, n = 10) %>% * ggplot(mapping = aes(x = fct_reorder(name, desc(total)), y = total)) + * geom_col(mapping = aes(fill = sex)) + * scale_fill_brewer() + * labs(x = "Name", y = "Total Children") + * theme_bw() ``` --- class: your-turn <img src="images/transform/plots/yt-group-by-plot-sol-1.png" width="90%" style="display: block; margin: auto;" /> --- class: your-turn # Your turn 12 .big[ Use grouping to calculate the **number of children born each year**. Plot the results as a line graph. ]
05
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r babynames %>% group_by(year) %>% summarize(n_children = sum(n)) %>% ggplot(mapping = aes(x = year, y = n_children)) + geom_line() ``` ] .panel[.panel-name[Plot] <img src="images/transform/plots/yt-total-kids-sol-1.png" width="80%" style="display: block; margin: auto;" /> ] ] ??? What does this affect our measure of popularity? --- name: mutate class: center middle # `mutate()` --- class: middle <code class ='r hljs remark-code'>mutate(.data, ...)</code> --- class: middle <code class ='r hljs remark-code'>mutate(<span style="background-color:#FED766;color:#009FB7">.data</span>, ...)</code> ??? Data to mutate --- class: middle <code class ='r hljs remark-code'>mutate(.data, <span style="background-color:#FED766;color:#009FB7">...</span>)</code> ??? Additional variables to calculate --- ```r mutate(babynames, percent = round(prop * 100, digits = 2)) ``` .pull-left[
year
sex
name
n
prop
1880
F
Mary
7065
0.072
1880
F
Anna
2604
0.027
1880
F
Emma
2003
0.021
1880
F
Elizabeth
1939
0.020
1880
F
Minnie
1746
0.018
1880
F
Margaret
1578
0.016
1880
F
Ida
1472
0.015
1880
F
Alice
1414
0.014
1880
F
Bertha
1320
0.014
1880
F
Sarah
1288
0.013
] -- .pull-right[
year
sex
name
n
prop
percent
1880
F
Mary
7065
0.072
7.24
1880
F
Anna
2604
0.027
2.67
1880
F
Emma
2003
0.021
2.05
1880
F
Elizabeth
1939
0.020
1.99
1880
F
Minnie
1746
0.018
1.79
1880
F
Margaret
1578
0.016
1.62
1880
F
Ida
1472
0.015
1.51
1880
F
Alice
1414
0.014
1.45
1880
F
Bertha
1320
0.014
1.35
1880
F
Sarah
1288
0.013
1.32
] --- background-image: url(images/transform/cheatsheet-mutate.png) background-position: center middle background-size: 70% --- # `min_rank()` Lowest value get the lowest rank, i.e., `1`. ```r min_rank(c(50, 40, 60, 75, 50)) #> [1] 2 1 4 5 2 ``` -- To give highest values the rank of `1`, use `desc()` ```r min_rank(desc(c(50, 40, 60, 75, 50))) #> [1] 3 5 2 1 3 ``` --- class: your-turn # Your turn 13 .big[ Use **`mutate()`** and **`min_rank()`** to rank each row in `babynames` from largest **`prop`** to lowest **`prop`**. The highest **`prop`** should have a rank of `1`. ]
03
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r babynames %>% mutate(rank = min_rank(desc(prop))) ``` ] .panel[.panel-name[Result] ``` # A tibble: 1,924,665 x 6 year sex name n prop rank <dbl> <chr> <chr> <int> <dbl> <int> 1 1880 F Mary 7065 0.0724 14 2 1880 F Anna 2604 0.0267 709 3 1880 F Emma 2003 0.0205 1131 4 1880 F Elizabeth 1939 0.0199 1192 5 1880 F Minnie 1746 0.0179 1427 6 1880 F Margaret 1578 0.0162 1683 7 1880 F Ida 1472 0.0151 1897 8 1880 F Alice 1414 0.0145 2039 9 1880 F Bertha 1320 0.0135 2279 10 1880 F Sarah 1288 0.0132 2387 # … with 1,924,655 more rows ``` ] ] --- class: your-turn # Your turn 14 .big[ **What is each name's median rank?** 1. Compute each **`name`**'s rank with each **`year`** and **`sex`**. 2. Compute the median rank across years for each **`name`**, within each **`sex`**. 3. Sort the results from highest median rank to lowest. ]
05
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r babynames %>% group_by(year, sex) %>% mutate(rank = min_rank(desc(prop))) %>% group_by(name, sex) %>% summarize(score = median(rank)) %>% arrange(score) ``` ] .panel[.panel-name[Result] ``` # A tibble: 107,973 x 3 # Groups: name [97,310] name sex score <chr> <chr> <dbl> 1 Mary F 1 2 James M 3 3 John M 3 4 William M 4 5 Robert M 6 6 Michael M 7.5 7 Charles M 9 8 Elizabeth F 10 9 Joseph M 10 10 Thomas M 11 # … with 107,963 more rows ``` ] ] --- # Recap: Single table verbs .pull-left[ <img src="images/transform/select.png" height="50px"> </br> </br> <img src="images/transform/filter.png" height="50px"> </br> </br> <img src="images/transform/arrange.png" height="50px"> </br> </br> <img src="images/transform/summarize.png" height="50px"> </br> </br> <img src="images/transform/mutate.png" height="50px"> ] .pull-right.wide-line[ Extract variables with **`select()`** Extract cases with **`filter()`** Arrange cases with **`arrange()`** Compile data with **`summarize()`** Make new variables with **`mutate()`** ] --- name: joins class: center middle # Joining Data Sets <a href="https://r4ds.had.co.nz/relational-data.html"> <img class="norm-book" src="images/books/r4ds-data-join.png"> </a> --- # Example Data: `nycflights13` ```r library(nycflights13) flights #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 544 545 -1 1004 1022 #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> 7 2013 1 1 555 600 -5 913 854 #> 8 2013 1 1 557 600 -3 709 723 #> 9 2013 1 1 557 600 -3 838 846 #> 10 2013 1 1 558 600 -2 753 745 #> # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- background-image: url(images/transform/nycflights.png) background-position: center middle background-size: 65% --- background-image: url(images/transform/nycflights-highlight.png) background-position: center middle background-size: 65% .big[Which airlines had the longest delays?] --- .pull-left[ ```r distinct(flights, carrier) #> # A tibble: 16 x 1 #> carrier #> <chr> #> 1 UA #> 2 AA #> 3 B6 #> 4 DL #> 5 EV #> 6 MQ #> 7 US #> 8 WN #> 9 VX #> 10 FL #> 11 AS #> 12 9E #> 13 F9 #> 14 HA #> 15 YV #> 16 OO ``` ] .pull-right[ ```r airlines #> # A tibble: 16 x 2 #> carrier name #> <chr> <chr> #> 1 9E Endeavor Air Inc. #> 2 AA American Airlines Inc. #> 3 AS Alaska Airlines Inc. #> 4 B6 JetBlue Airways #> 5 DL Delta Air Lines Inc. #> 6 EV ExpressJet Airlines Inc. #> 7 F9 Frontier Airlines Inc. #> 8 FL AirTran Airways Corporation #> 9 HA Hawaiian Airlines Inc. #> 10 MQ Envoy Air #> 11 OO SkyWest Airlines Inc. #> 12 UA United Air Lines Inc. #> 13 US US Airways Inc. #> 14 VX Virgin America #> 15 WN Southwest Airlines Co. #> 16 YV Mesa Airlines Inc. ``` ] --- class: center middle **Mutating joins** use information from one data set to **add variables to another data set** (like `mutate()`) -- **Filtering joins** use information from one data set to **extract cases** from another data set (like `filter()`) --- background-image: url(images/transform/join-data.png) background-position: center middle background-size: 85% # Toy data for practice --- background-image: url(images/transform/left-join.png) background-position: center middle background-size: 90% # `left_join()` ```r left_join(heros, homes, by = "hero") ``` --- background-image: url(images/transform/right-join.png) background-position: center middle background-size: 90% # `right_join()` ```r right_join(heros, homes, by = "hero") ``` --- background-image: url(images/transform/full-join.png) background-position: center middle background-size: 90% # `full_join()` ```r full_join(heros, homes, by = "hero") ``` --- background-image: url(images/transform/inner-join.png) background-position: center middle background-size: 90% # `inner_join()` ```r inner_join(heros, homes, by = "hero") ``` --- class: your-turn # Your turn 15 .big[ Which airlines had the largest arrival delays (**`arr_delay`**)? Complete the code below. ] <code class ='r hljs remark-code'>flights %>%<br> drop_na(arr_delay) %>%<br> <span style="background-color:#FED766"> </span> %>% # 1. Join airlines to flights<br> group_by(<span style="background-color:#FED766"> </span>) %>%<br> <span style="background-color:#FED766"> </span> %>% # 2. Compute the average arrival delay<br> arrange(desc(<span style="color:#FED766;background-color:#FED766">delay</span>))</code>
06
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r flights %>% drop_na(arr_delay) %>% left_join(airlines, by = "carrier") %>% group_by(name) %>% summarize(delay = mean(arr_delay)) %>% arrange(desc(delay)) ``` ] .panel[.panel-name[Result] ``` #> # A tibble: 16 x 2 #> name delay #> <chr> <dbl> #> 1 Frontier Airlines Inc. 21.9 #> 2 AirTran Airways Corporation 20.1 #> 3 ExpressJet Airlines Inc. 15.8 #> 4 Mesa Airlines Inc. 15.6 #> 5 SkyWest Airlines Inc. 11.9 #> 6 Envoy Air 10.8 #> 7 Southwest Airlines Co. 9.65 #> 8 JetBlue Airways 9.46 #> 9 Endeavor Air Inc. 7.38 #> 10 United Air Lines Inc. 3.56 #> 11 US Airways Inc. 2.13 #> 12 Virgin America 1.76 #> 13 Delta Air Lines Inc. 1.64 #> 14 American Airlines Inc. 0.364 #> 15 Hawaiian Airlines Inc. -6.92 #> 16 Alaska Airlines Inc. -9.93 ``` ] ] --- background-image: url(images/transform/named-join.png) background-position: center 65% background-size: 90% # Non-matching names Use a named vector to match on variables with different names. ```r heroes %>% left_join(enemies, by = c("hero" = "avenger")) ``` --- class: center middle **Mutating joins** use information from one data set to **add variables to another data set** (like `mutate()`) **Filtering joins** use information from one data set to **extract cases** from another data set (like `filter()`) --- background-image: url(images/transform/semi-join.png) background-position: center middle background-size: 90% # `semi_join()` ```r semi_join(heros, homes, by = "hero") ``` --- background-image: url(images/transform/anti-join.png) background-position: center middle background-size: 90% # `anti_join()` ```r anti_join(heros, homes, by = "hero") ``` --- class: your-turn # Your turn 16 .big[ How many airports in the **`airports`** data were serviced by flights originating in New York? Notice that the column to join on is named **`faa`** in the **`airports`** data set and **`dest`** in the **`flights`** data set. ]
05
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r airports %>% semi_join(flights, by = c("faa" = "dest")) %>% distinct() ``` ] .panel[.panel-name[Result] ``` #> # A tibble: 101 x 8 #> faa name lat lon alt tz dst tzone #> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> #> 1 ABQ Albuquerque International… 35.0 -107. 5355 -7 A America/Denv… #> 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A America/New_… #> 3 ALB Albany Intl 42.7 -73.8 285 -5 A America/New_… #> 4 ANC Ted Stevens Anchorage Intl 61.2 -150. 152 -9 A America/Anch… #> 5 ATL Hartsfield Jackson Atlant… 33.6 -84.4 1026 -5 A America/New_… #> 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A America/Chic… #> 7 AVL Asheville Regional Airport 35.4 -82.5 2165 -5 A America/New_… #> 8 BDL Bradley Intl 41.9 -72.7 173 -5 A America/New_… #> 9 BGR Bangor Intl 44.8 -68.8 192 -5 A America/New_… #> 10 BHM Birmingham Intl 33.6 -86.8 644 -6 A America/Chic… #> # … with 91 more rows ``` ] ] --- background-image: url(images/transform/cheatsheet-join.png) background-position: center middle background-size: 85% --- class: title-slide, center # Data Transformation <img src="images/hex/dplyr.png" width="20%" style="display: block; margin: auto;" /> ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]