class: title-slide, center <span class="fa-stack fa-4x"> <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i> <strong class="fa-stack-1x" style="color:#009FB7;">6</strong> </span> # Tidy Data ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).] <div style = "position:fixed; visibility: hidden"> `$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$` `$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$` </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { yellow: ["{\\color{yellow}{#1}}", 1], blue: ["{\\color{blue}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> <style> .yellow {color: #FED766;} .blue {color: #009FB7;} </style> --- class: your-turn # Your turn 0 .big[ * Open the R Notebook **materials/exercises/06-tidy.Rmd** * Run the setup chunk ]
--- background-image: url(images/tidy/applied-ds-tidy.png) background-position: center 60% background-size: 85% # .nobold[(Applied)] Data Science --- name: tidy-tools class: center middle # tidy tools --- # Tidy tools Functions are easiest to use when they are: 1\. **Simple**: They do one thing, and they do it well. -- 2\. **Composable**: They can be combined with other functions for multi-step operations. -- 3\. **Smart**: They can use R objects as input. -- Tidy functions do these things in a specific way. --- # Tidy tools Functions are easiest to use when they are: 1\. **Simple**: They do one thing, and they do it well. 2\. **Composable**: They can be combined with other functions for multi-step operations. .fade[ 3\. **Smart**: They can use R objects as input. ] Tidy functions do these things in a specific way. --- # Simple .big[ They do one thing, and they do it well. **`filter()`** – extract cases **`arrange()`** – reorder cases **`group_by()`** – group cases **`select()`** – extract variables **`mutate()`** – create new variables **`summarize()`** – summarize variables ] --- # Composable .big[ They can be combined with other functions for multi-step operations. <img src="images/tidy/pipe-example.png" width="80%" style="display: block; margin: auto;" /> Each `dplyr` function takes a data frame as its first argument and returns a data frame. As a result, you can directly pipe the output of one function into the next. ] --- name: tidy-data class: center middle # tidy data --- class: center middle <blockquote> .huge[Data are not just numbers, they are numbers with a context.] .right[ <cite> – [George Cobb and David Moore (1997)](https://doi.org/10.2307/2975286) </cite>] </blockquote> --- class: pop-quiz # Consider What are the variables in this data set? <img src="images/tidy/table1-tidy.png" width="80%" style="display: block; margin: auto;" /> --- class: pop-quiz # Consider What are the variables in this data set? 
<img src="images/tidy/table1-tidy-vars.png" width="80%" style="display: block; margin: auto;" /> --- class: pop-quiz # Consider What are the variables in this data set? <img src="images/tidy/table1-untidy.png" width="80%" style="display: block; margin: auto;" /> --- class: pop-quiz # Consider What are the variables in this data set? <img src="images/tidy/table1-untidy-vars.png" width="80%" style="display: block; margin: auto;" /> --- # This isn't tidy What are the variables in this data set? <img src="images/tidy/table1-population.png" width="80%" style="display: block; margin: auto;" /> --- class: center middle <blockquote> .huge[Data comes in many formats, but R prefers just one: tidy data.] .right[ <cite> – Garrett Grolemund </cite>] </blockquote> --- # Tidy data .pull-left[ <img src="images/tidy/tidy-data.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ A data set is **tidy** if: 1. Each .blue[**variable**] is in its own .blue[**column**]. 2. Each .dark-yellow[**case**] is in its own .dark-yellow[**row**]. 3. Each .pink[**value**] is in its own .pink[**cell**]. ] --- class: your-turn # Your turn 1 .big[ Is **`bp_systolic`** tidy? What are the variables? ]
--- class: your-turn .panelset[ .panel[.panel-name[`bp_systolic`] ```r bp_systolic #> # A tibble: 3 x 4 #> subject_id time_1 time_2 time_3 #> <dbl> <dbl> <dbl> <dbl> #> 1 1 120 118 121 #> 2 2 125 131 NA #> 3 3 141 NA NA ``` ] .panel[.panel-name[`bp_systolic2`] ```r bp_systolic2 #> # A tibble: 6 x 3 #> subject_id time systolic #> <dbl> <dbl> <dbl> #> 1 1 1 120 #> 2 1 2 118 #> 3 1 3 121 #> 4 2 1 125 #> 5 2 2 131 #> 6 3 1 141 ``` ] ] ??? Three variables: subject, blood pressure, time point --- class: your-turn # Your turn 2 .big[ Using **`bp_systolic2`** with `group_by()` and `summarize()`: 1. Find the average systolic blood pressure for each subject. 2. Find the last time each subject was measured. ]
--- class: your-turn ```r bp_systolic2 %>% group_by(subject_id) %>% summarize(avg_sys = mean(systolic), last_measurement = max(time)) #> # A tibble: 3 x 3 #> subject_id avg_sys last_measurement #> <dbl> <dbl> <dbl> #> 1 1 120. 3 #> 2 2 128 2 #> 3 3 141 1 ``` --- class: center middle <blockquote> .huge[Tidy datasets are all alike, but every messy dataset is messy in its own way.] .right[ <cite> – [Hadley Wickham (2014)](https://doi.org/10.18637/jss.v059.i10) </cite>] </blockquote> --- <div class="hex-book"> <a href="https://tidyr.tidyverse.org"> <img class="hex" src="images/hex/tidyr.png"> </a> <a href="https://r4ds.had.co.nz/tidy-data.html"> <img class="book" src="images/books/r4ds-tidy-data.png"> </a> </div> --- # `tidyr` verbs .left-column[ <img src="images/tidy/pivot-longer.png" height="70px"> ] .right-column.center[ Move column names into values with **`pivot_longer()`** ] --- # `tidyr` verbs .left-column[ <img src="images/tidy/pivot-longer.png" height="70px"> </br> </br> </br> <img src="images/tidy/pivot-wider.png" height="70px"> ] .right-column.center[ Move column names into values with **`pivot_longer()`** </br> </br> </br> </br> Move values into column names with **`pivot_wider()`** ] --- # `tidyr` verbs .left-column[ <img src="images/tidy/pivot-longer.png" height="70px"> </br> </br> </br> <img src="images/tidy/pivot-wider.png" height="70px"> </br> </br> </br> <img src="images/tidy/separate.png" height="70px"> ] .right-column.center[ Move column names into values with **`pivot_longer()`** </br> </br> </br> </br> Move values into column names with **`pivot_wider()`** </br> </br> </br> </br> </br> Split a column with **`separate()`** ] --- # `tidyr` verbs .left-column[ <img src="images/tidy/pivot-longer.png" height="70px"> </br> </br> </br> <img src="images/tidy/pivot-wider.png" height="70px"> </br> </br> </br> <img src="images/tidy/separate.png" height="70px"> </br> </br> </br> <img src="images/tidy/unite.png" height="70px"> ] .right-column.center[ Move 
column names into values with **`pivot_longer()`** </br> </br> </br> </br> Move values into column names with **`pivot_wider()`** </br> </br> </br> </br> </br> Split a column with **`separate()`** </br> </br> </br> </br> Unite columns with **`unite()`** ] --- name: pivot-longer class: center middle # `pivot_longer()` --- # Toy data for practice ```r cases #> # A tibble: 3 x 4 #> country `2011` `2012` `2013` #> <chr> <dbl> <dbl> <dbl> #> 1 FR 7000 6900 7000 #> 2 DE 5800 6000 6200 #> 3 US 15000 14000 13000 ``` --- class: pop-quiz # Consider .big[ Discuss in the chat: What are the variables in the **`cases`** data set? ] <img src="images/tidy/pivot-longer/cases.png" width="60%" style="display: block; margin: auto;" />
--- class: pop-quiz # Consider .big[ What are the variables in the **`cases`** data set? ] <img src="images/tidy/pivot-longer/cases1.png" width="60%" style="display: block; margin: auto;" /> ??? Variable 1: Country --- class: pop-quiz # Consider .big[ What are the variables in the **`cases`** data set? ] <img src="images/tidy/pivot-longer/cases2.png" width="60%" style="display: block; margin: auto;" /> ??? Variable 2: Year --- class: pop-quiz # Consider .big[ What are the variables in the **`cases`** data set? ] <img src="images/tidy/pivot-longer/cases3.png" width="60%" style="display: block; margin: auto;" /> ??? Variable 3: cases --- class: your-turn # Your turn 3 .big[ On a sheet of paper, draw how the **`cases`** data set would look if it had the same values grouped into three columns: **`country`**, **`year`**, and **`n`**. ] <img src="images/tidy/pivot-longer/cases.png" width="60%" style="display: block; margin: auto;" />
--- background-image: url(images/tidy/pivot-longer/pl00.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl01.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl02.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl03.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl04.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl05.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl06.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl07.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl08.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl09.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl10.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl11.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl12.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl13.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-longer/pl14.png) background-position: center middle background-size: 85% ??? The `names_to` argument specifies the new variable name for what was formerly the column headings. --- background-image: url(images/tidy/pivot-longer/pl15.png) background-position: center middle background-size: 85% ??? 
The `values_to` argument specifies the new variable name for what were formerly the spread out values. --- # `pivot_longer()` <code class ='r hljs remark-code'>cases %>%<br> pivot_longer(cols = -country,<br> names_to = "year",<br> values_to = "n")</code> --- # `pivot_longer()` <code class ='r hljs remark-code'><span style="background-color:#FED766;color:#009FB7">cases</span> %>%<br> pivot_longer(cols = -country,<br> names_to = "year",<br> values_to = "n")</code> ??? The data to reshape --- # `pivot_longer()` <code class ='r hljs remark-code'>cases %>%<br> pivot_longer(<span style="background-color:#FED766;color:#009FB7">cols = -country</span>,<br> names_to = "year",<br> values_to = "n")</code> ??? `cols` specifies which columns need to be pivoted --- # `pivot_longer()` <code class ='r hljs remark-code'>cases %>%<br> pivot_longer(cols = -country,<br> <span style="background-color:#FED766;color:#009FB7">names_to = "year"</span>,<br> values_to = "n")</code> ??? Name of the column the column names will go to. --- # `pivot_longer()` <code class ='r hljs remark-code'>cases %>%<br> pivot_longer(cols = -country,<br> names_to = "year",<br> <span style="background-color:#FED766;color:#009FB7">values_to = "n"</span>)</code> ??? Name of the column the values will go to. --- class: your-turn # Your turn 4 .big[ Use **`pivot_longer()`** to reorganize **`table4a`** into three columns: `country`, `year`, and `n`. ] ```r table4a #> # A tibble: 3 x 3 #> country `1999` `2000` #> * <chr> <int> <int> #> 1 Afghanistan 745 2666 #> 2 Brazil 37737 80488 #> 3 China 212258 213766 ```
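---

# `pivot_longer()` on the toy data

The call walked through on the previous slides can be run end to end. This is a minimal sketch that assumes `cases` can be rebuilt by hand with `tribble()` from the printed toy data; the course materials may create it differently, and `cases_long` is a name made up here.

```r
library(tidyr)
library(tibble)

# Assumption: hand-built copy of the `cases` toy data printed earlier.
cases <- tribble(
  ~country, ~`2011`, ~`2012`, ~`2013`,
  "FR",        7000,    6900,    7000,
  "DE",        5800,    6000,    6200,
  "US",       15000,   14000,   13000
)

# Pivot every column except `country`: the old column names land in `year`,
# the old cell values land in `n`.
cases_long <- cases %>%
  pivot_longer(cols = -country,
               names_to = "year",
               values_to = "n")

# 3 countries x 3 year columns = 9 rows
nrow(cases_long)
```

The same three arguments carry over to any data set whose column names are really values of a variable.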
--- class: your-turn .panelset[ .panel[.panel-name[Solution] ```r table4a %>% pivot_longer(cols = -country, names_to = "year", values_to = "n") #> # A tibble: 6 x 3 #> country year n #> <chr> <chr> <int> #> 1 Afghanistan 1999 745 #> 2 Afghanistan 2000 2666 #> 3 Brazil 1999 37737 #> 4 Brazil 2000 80488 #> 5 China 1999 212258 #> 6 China 2000 213766 ``` ] .panel[.panel-name[Better] ```r table4a %>% pivot_longer(cols = -country, names_to = "year", values_to = "n", names_transform = list(year = as.integer)) #> # A tibble: 6 x 3 #> country year n #> <chr> <int> <int> #> 1 Afghanistan 1999 745 #> 2 Afghanistan 2000 2666 #> 3 Brazil 1999 37737 #> 4 Brazil 2000 80488 #> 5 China 1999 212258 #> 6 China 2000 213766 ``` ] ] --- name: pivot-wider class: center middle # `pivot_wider()` --- # Toy data for practice ```r pollution #> # A tibble: 6 x 3 #> city size amount #> <chr> <chr> <dbl> #> 1 New York large 23 #> 2 New York small 14 #> 3 London large 22 #> 4 London small 16 #> 5 Beijing large 121 #> 6 Beijing small 56 ``` --- class: pop-quiz # Consider .big[ Discuss in the chat: What are the variables in the **`pollution`** data set? ] <img src="images/tidy/pivot-wider/pollution.png" width="40%" style="display: block; margin: auto;" />
--- class: pop-quiz # Consider .big[ What are the variables in the **`pollution`** data set? ] <img src="images/tidy/pivot-wider/pollution1.png" width="40%" style="display: block; margin: auto;" /> ??? Variable 1: City --- class: pop-quiz # Consider .big[ What are the variables in the **`pollution`** data set? ] <img src="images/tidy/pivot-wider/pollution2.png" width="40%" style="display: block; margin: auto;" /> ??? Variable 2: amount of large particulate --- class: pop-quiz # Consider .big[ What are the variables in the **`pollution`** data set? ] <img src="images/tidy/pivot-wider/pollution3.png" width="40%" style="display: block; margin: auto;" /> ??? Variable 3: amount of small particulate --- class: your-turn # Your turn 5 .big[ On a sheet of paper, draw how the **`pollution`** data set would look if it had the same values grouped into three columns: **`city`**, **`large`**, and **`small`**. ] <img src="images/tidy/pivot-wider/pollution.png" width="35%" style="display: block; margin: auto;" />
--- background-image: url(images/tidy/pivot-wider/pw00.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-wider/pw01.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-wider/pw02.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-wider/pw03.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-wider/pw04.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-wider/pw05.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-wider/pw06.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-wider/pw07.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-wider/pw08.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-wider/pw09.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-wider/pw10.png) background-position: center middle background-size: 85% --- background-image: url(images/tidy/pivot-wider/pw11.png) background-position: center middle background-size: 85% ??? The `names_from` argument specifies the variable the new column names will come from. --- background-image: url(images/tidy/pivot-wider/pw12.png) background-position: center middle background-size: 85% ??? The `values_from` argument specifies the variable the new values will come from. 
--- # `pivot_wider()` <code class ='r hljs remark-code'>pollution %>%<br> pivot_wider(names_from = "size",<br> values_from = "amount")</code> --- # `pivot_wider()` <code class ='r hljs remark-code'><span style="background-color:#FED766;color:#009FB7">pollution</span> %>%<br> pivot_wider(names_from = "size",<br> values_from = "amount")</code> ??? The data to reshape --- # `pivot_wider()` <code class ='r hljs remark-code'>pollution %>%<br> pivot_wider(<span style="background-color:#FED766;color:#009FB7">names_from = "size"</span>,<br> values_from = "amount")</code> ??? Name of the column that will become column names. --- # `pivot_wider()` <code class ='r hljs remark-code'>pollution %>%<br> pivot_wider(names_from = "size",<br> <span style="background-color:#FED766;color:#009FB7">values_from = "amount"</span>)</code> ??? Name of the column that will make up the values of the new columns. --- class: your-turn # Your turn 6 .big[ Use **`pivot_wider()`** to reorganize **`table2`** into four columns: `country`, `year`, `cases`, and `population`. ] .smallish[ ```r table2 #> # A tibble: 12 x 4 #> country year type count #> <chr> <int> <chr> <int> #> 1 Afghanistan 1999 cases 745 #> 2 Afghanistan 1999 population 19987071 #> 3 Afghanistan 2000 cases 2666 #> 4 Afghanistan 2000 population 20595360 #> 5 Brazil 1999 cases 37737 #> 6 Brazil 1999 population 172006362 #> 7 Brazil 2000 cases 80488 #> 8 Brazil 2000 population 174504898 #> 9 China 1999 cases 212258 #> 10 China 1999 population 1272915272 #> 11 China 2000 cases 213766 #> 12 China 2000 population 1280428583 ``` ]
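---

# `pivot_wider()` on the toy data

The `pollution` call from the previous slides can also be run as a self-contained sketch. It assumes `pollution` can be rebuilt by hand with `tribble()` from the printed toy data; `pollution_wide` and `pollution_back` are names made up here. The last step pivots back to the long layout to illustrate that the two verbs undo each other.

```r
library(tidyr)
library(tibble)

# Assumption: hand-built copy of the `pollution` toy data printed earlier.
pollution <- tribble(
  ~city,      ~size,   ~amount,
  "New York", "large",      23,
  "New York", "small",      14,
  "London",   "large",      22,
  "London",   "small",      16,
  "Beijing",  "large",     121,
  "Beijing",  "small",      56
)

# Values of `size` become column names; values of `amount` fill the cells.
pollution_wide <- pollution %>%
  pivot_wider(names_from = "size",
              values_from = "amount")

# Pivoting back gives one row per city/size pair again.
pollution_back <- pollution_wide %>%
  pivot_longer(cols = -city,
               names_to = "size",
               values_to = "amount")
```

`pollution_wide` has one row per city, with `large` and `small` as separate columns.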
--- class: your-turn ```r table2 %>% pivot_wider(names_from = "type", values_from = "count") #> # A tibble: 6 x 4 #> country year cases population #> <chr> <int> <int> <int> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> 4 Brazil 2000 80488 174504898 #> 5 China 1999 212258 1272915272 #> 6 China 2000 213766 1280428583 ``` --- name: sep-unite class: center middle # `separate()` and `unite()` --- # Toy data for practice ```r scores #> # A tibble: 12 x 3 #> name house score #> <chr> <chr> <dbl> #> 1 Ronald Weasley Gryffindor 78 #> 2 Harry Potter Gryffindor 85 #> 3 Hermione Granger Gryffindor 100 #> 4 Justin Finch-Fletchley Hufflepuff 87 #> 5 Hannah Abbot Hufflepuff 92 #> 6 Susan Bones Hufflepuff 93 #> 7 Anthony Goldstein Ravenclaw 84 #> 8 Michael Corner Ravenclaw 93 #> 9 Padma Patil Ravenclaw 97 #> 10 Vincent Crabbe Slytherin 61 #> 11 Gregory Goyle Slytherin 61 #> 12 Draco Malfoy Slytherin 92 ``` --- class: pop-quiz # Consider .big[ Discuss in the chat: What are the variables in the `scores` data set? ] <img src="images/tidy/hogwarts.png" width="55%" style="display: block; margin: auto;" />
--- class: pop-quiz # Consider .big[ One variable or two? What is "tidy" will depend on your purpose. ] <img src="images/tidy/hogwarts2.png" width="55%" style="display: block; margin: auto;" /> --- # `separate()` <code class ='r hljs remark-code'>separate(data, col, into, sep = "[^[:alnum:]]+", ...)</code> --- # `separate()` <code class ='r hljs remark-code'>separate(<span style="background-color:#FED766;color:#009FB7">data</span>, col, into, sep = "[^[:alnum:]]+", ...)</code> ??? Data frame to tidy --- # `separate()` <code class ='r hljs remark-code'>separate(data, <span style="background-color:#FED766;color:#009FB7">col</span>, into, sep = "[^[:alnum:]]+", ...)</code> ??? Column to separate --- # `separate()` <code class ='r hljs remark-code'>separate(data, col, <span style="background-color:#FED766;color:#009FB7">into</span>, sep = "[^[:alnum:]]+", ...)</code> ??? New columns to be created --- # `separate()` <code class ='r hljs remark-code'>separate(data, col, into, <span style="background-color:#FED766;color:#009FB7">sep = "[^[:alnum:]]+"</span>, ...)</code> ??? What divides the pieces of information --- ```r scores %>% separate(col = name, into = c("first", "last")) #> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [4]. #> # A tibble: 12 x 4 #> first last house score #> <chr> <chr> <chr> <dbl> #> 1 Ronald Weasley Gryffindor 78 #> 2 Harry Potter Gryffindor 85 #> 3 Hermione Granger Gryffindor 100 #> 4 Justin Finch Hufflepuff 87 #> 5 Hannah Abbot Hufflepuff 92 #> 6 Susan Bones Hufflepuff 93 #> 7 Anthony Goldstein Ravenclaw 84 #> 8 Michael Corner Ravenclaw 93 #> 9 Padma Patil Ravenclaw 97 #> 10 Vincent Crabbe Slytherin 61 #> 11 Gregory Goyle Slytherin 61 #> 12 Draco Malfoy Slytherin 92 ``` --- <code class ='r hljs remark-code'>scores %>%<br> separate(col = name, into = c("first", "last"))<br><span style="color:red">#> Warning: Expected 2 pieces. 
Additional pieces discarded in 1 rows [4].</span><br>#> # A tibble: 12 x 4<br>#> first last house score<br>#> <chr> <chr> <chr> <dbl><br>#> 1 Ronald Weasley Gryffindor 78<br>#> 2 Harry Potter Gryffindor 85<br>#> 3 Hermione Granger Gryffindor 100<br>#> 4 Justin Finch Hufflepuff 87<br>#> 5 Hannah Abbot Hufflepuff 92<br>#> 6 Susan Bones Hufflepuff 93<br>#> 7 Anthony Goldstein Ravenclaw 84<br>#> 8 Michael Corner Ravenclaw 93<br>#> 9 Padma Patil Ravenclaw 97<br>#> 10 Vincent Crabbe Slytherin 61<br>#> 11 Gregory Goyle Slytherin 61<br>#> 12 Draco Malfoy Slytherin 92</code> --- <code class ='r hljs remark-code'>scores %>%<br> separate(col = name, into = c("first", "last"))<br><span style="color:red">#> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [4].</span><br>#> # A tibble: 12 x 4<br>#> first last house score<br>#> <chr> <chr> <chr> <dbl><br>#> 1 Ronald Weasley Gryffindor 78<br>#> 2 Harry Potter Gryffindor 85<br>#> 3 Hermione Granger Gryffindor 100<br>#> 4 Justin <span style="background-color:#FED766;color:#009FB7">Finch</span> Hufflepuff 87<br>#> 5 Hannah Abbot Hufflepuff 92<br>#> 6 Susan Bones Hufflepuff 93<br>#> 7 Anthony Goldstein Ravenclaw 84<br>#> 8 Michael Corner Ravenclaw 93<br>#> 9 Padma Patil Ravenclaw 97<br>#> 10 Vincent Crabbe Slytherin 61<br>#> 11 Gregory Goyle Slytherin 61<br>#> 12 Draco Malfoy Slytherin 92</code> --- <code class ='r hljs remark-code'>scores %>%<br> separate(col = name, into = c("first", "last"), <span style="background-color:#FED766;color:#009FB7">sep = " "</span>)<br>#> # A tibble: 12 x 4<br>#> first last house score<br>#> <chr> <chr> <chr> <dbl><br>#> 1 Ronald Weasley Gryffindor 78<br>#> 2 Harry Potter Gryffindor 85<br>#> 3 Hermione Granger Gryffindor 100<br>#> 4 Justin Finch-Fletchley Hufflepuff 87<br>#> 5 Hannah Abbot Hufflepuff 92<br>#> 6 Susan Bones Hufflepuff 93<br>#> 7 Anthony Goldstein Ravenclaw 84<br>#> 8 Michael Corner Ravenclaw 93<br>#> 9 Padma Patil Ravenclaw 97<br>#> 10 Vincent Crabbe Slytherin 61<br>#> 
11 Gregory Goyle Slytherin 61<br>#> 12 Draco Malfoy Slytherin 92</code> --- <code class ='r hljs remark-code'>scores %>%<br> separate(col = name, into = c("first", "last"), <span style="background-color:#FED766;color:#009FB7">sep = " "</span>)<br>#> # A tibble: 12 x 4<br>#> first last house score<br>#> <chr> <chr> <chr> <dbl><br>#> 1 Ronald Weasley Gryffindor 78<br>#> 2 Harry Potter Gryffindor 85<br>#> 3 Hermione Granger Gryffindor 100<br>#> 4 Justin <span style="background-color:#FED766;color:#009FB7">Finch-Fletchley</span> Hufflepuff 87<br>#> 5 Hannah Abbot Hufflepuff 92<br>#> 6 Susan Bones Hufflepuff 93<br>#> 7 Anthony Goldstein Ravenclaw 84<br>#> 8 Michael Corner Ravenclaw 93<br>#> 9 Padma Patil Ravenclaw 97<br>#> 10 Vincent Crabbe Slytherin 61<br>#> 11 Gregory Goyle Slytherin 61<br>#> 12 Draco Malfoy Slytherin 92</code> --- # `separate()` <code class ='r hljs remark-code'>separate(data, col, into, sep = "[^[:alnum:]]+", ...,<br> <span style="background-color:#FED766;color:#009FB7">extra = "warn"</span>, fill = "warn")</code> -- - **`warn`**: emit warning, drop extra values - **`drop`**: drop extra values without warning - **`merge`**: merge extra values into final column of `into` --- # `separate()` <code class ='r hljs remark-code'>separate(data, col, into, sep = "[^[:alnum:]]+", ...,<br> extra = "warn", <span style="background-color:#FED766;color:#009FB7">fill = "warn"</span>)</code> -- - **`warn`**: emit warning, fill with `NA` from right - **`right`**: fill with `NA` from right without warning - **`left`**: fill with `NA` from left without warning --- # `unite()` .big[ Merges columns together; opposite of **`separate()`** ] <code class ='r hljs remark-code'>unite(data, col, ..., sep = "_")</code> --- # `unite()` .big[ Merges columns together; opposite of **`separate()`** ] <code class ='r hljs remark-code'>unite(<span style="background-color:#FED766;color:#009FB7">data</span>, col, ..., sep = "_")</code> ??? 
Data frame to tidy --- # `unite()` .big[ Merges columns together; opposite of **`separate()`** ] <code class ='r hljs remark-code'>unite(data, <span style="background-color:#FED766;color:#009FB7">col</span>, ..., sep = "_")</code> ??? Name of new column to be created --- # `unite()` .big[ Merges columns together; opposite of **`separate()`** ] <code class ='r hljs remark-code'>unite(data, col, <span style="background-color:#FED766;color:#009FB7">...</span>, sep = "_")</code> ??? Columns to be merged --- # `unite()` .big[ Merges columns together; opposite of **`separate()`** ] <code class ='r hljs remark-code'>unite(data, col, ..., <span style="background-color:#FED766;color:#009FB7">sep = "_"</span>)</code> ??? What divides the pieces of information --- .panelset[ .panel[.panel-name[Separate Names] ```r sep_scores <- scores %>% separate(name, into = c("first", "last"), sep = " ") sep_scores #> # A tibble: 12 x 4 #> first last house score #> <chr> <chr> <chr> <dbl> #> 1 Ronald Weasley Gryffindor 78 #> 2 Harry Potter Gryffindor 85 #> 3 Hermione Granger Gryffindor 100 #> 4 Justin Finch-Fletchley Hufflepuff 87 #> 5 Hannah Abbot Hufflepuff 92 #> 6 Susan Bones Hufflepuff 93 #> 7 Anthony Goldstein Ravenclaw 84 #> 8 Michael Corner Ravenclaw 93 #> 9 Padma Patil Ravenclaw 97 #> 10 Vincent Crabbe Slytherin 61 #> 11 Gregory Goyle Slytherin 61 #> 12 Draco Malfoy Slytherin 92 ``` ] .panel[.panel-name[Unite Names] ```r sep_scores %>% unite("full_name", first, last) #> # A tibble: 12 x 3 #> full_name house score #> <chr> <chr> <dbl> #> 1 Ronald_Weasley Gryffindor 78 #> 2 Harry_Potter Gryffindor 85 #> 3 Hermione_Granger Gryffindor 100 #> 4 Justin_Finch-Fletchley Hufflepuff 87 #> 5 Hannah_Abbot Hufflepuff 92 #> 6 Susan_Bones Hufflepuff 93 #> 7 Anthony_Goldstein Ravenclaw 84 #> 8 Michael_Corner Ravenclaw 93 #> 9 Padma_Patil Ravenclaw 97 #> 10 Vincent_Crabbe Slytherin 61 #> 11 Gregory_Goyle Slytherin 61 #> 12 Draco_Malfoy Slytherin 92 ``` ] .panel[.panel-name[Better Unite] ```r 
sep_scores %>% unite("full_name", first, last, sep = " ") #> # A tibble: 12 x 3 #> full_name house score #> <chr> <chr> <dbl> #> 1 Ronald Weasley Gryffindor 78 #> 2 Harry Potter Gryffindor 85 #> 3 Hermione Granger Gryffindor 100 #> 4 Justin Finch-Fletchley Hufflepuff 87 #> 5 Hannah Abbot Hufflepuff 92 #> 6 Susan Bones Hufflepuff 93 #> 7 Anthony Goldstein Ravenclaw 84 #> 8 Michael Corner Ravenclaw 93 #> 9 Padma Patil Ravenclaw 97 #> 10 Vincent Crabbe Slytherin 61 #> 11 Gregory Goyle Slytherin 61 #> 12 Draco Malfoy Slytherin 92 ``` ] ] --- # Recap: tidyr verbs .left-column[ <img src="images/tidy/pivot-longer.png" height="70px"> </br> </br> </br> <img src="images/tidy/pivot-wider.png" height="70px"> </br> </br> </br> <img src="images/tidy/separate.png" height="70px"> </br> </br> </br> <img src="images/tidy/unite.png" height="70px"> ] .right-column.center[ Move column names into values with **`pivot_longer()`** </br> </br> </br> </br> Move values into column names with **`pivot_wider()`** </br> </br> </br> </br> </br> Split a column with **`separate()`** </br> </br> </br> </br> Unite columns with **`unite()`** ] --- class: title-slide, center # Tidy Data <img src="images/hex/tidyr.png" width="20%" style="display: block; margin: auto;" /> ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]
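---

# `separate()`: `extra` and `fill` in action

The `extra` and `fill` options listed earlier are easiest to see on rows with too many or too few pieces. This is a minimal sketch on a hypothetical `people` tibble (these rows are not part of `scores`); `default_split` and `tuned_split` are names made up here.

```r
library(tidyr)
library(tibble)

# Hypothetical rows: one name with three space-separated pieces,
# one name with only a single piece.
people <- tibble(name = c("Harry Potter", "Nearly Headless Nick", "Hagrid"))

# Defaults (extra = "warn", fill = "warn"): the surplus piece "Nick" is
# dropped and the missing piece for "Hagrid" is filled with NA on the right.
# suppressWarnings() is only here to keep the sketch quiet.
default_split <- suppressWarnings(
  people %>%
    separate(col = name, into = c("first", "last"), sep = " ")
)

# extra = "merge" keeps surplus pieces in the final column;
# fill = "left" puts the NA in the leftmost column instead.
tuned_split <- people %>%
  separate(col = name, into = c("first", "last"), sep = " ",
           extra = "merge", fill = "left")
```

Which combination is appropriate depends on what the pieces mean in the data at hand; the defaults warn so that unexpected rows are not silently altered.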