
class: title-slide, center

# Data Types

## Tidy Data Science with the Tidyverse and Tidymodels

### W. Jake Thompson

#### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) &#183; [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021)

.footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]

<div style = "position:fixed; visibility: hidden">
`$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$`
`$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$`
</div>

---
class: pop-quiz

# Pop quiz!

What types of data are in this data set?

```
# A tibble: 336,776 x 6
   time_hour           name          air_time             distance day   delayed
   <dttm>              <chr>         <Duration>              <dbl> <ord> <lgl>  
 1 2013-01-01 05:00:00 United Air L… 13620s (~3.78 hours)     1400 Tues… TRUE   
 2 2013-01-01 05:00:00 United Air L… 13620s (~3.78 hours)     1416 Tues… TRUE   
 3 2013-01-01 05:00:00 American Air… 9600s (~2.67 hours)      1089 Tues… TRUE   
 4 2013-01-01 05:00:00 JetBlue Airw… 10980s (~3.05 hours)     1576 Tues… FALSE  
 5 2013-01-01 06:00:00 Delta Air Li… 6960s (~1.93 hours)       762 Tues… FALSE  
 6 2013-01-01 05:00:00 United Air L… 9000s (~2.5 hours)        719 Tues… TRUE   
 7 2013-01-01 06:00:00 JetBlue Airw… 9480s (~2.63 hours)      1065 Tues… TRUE   
 8 2013-01-01 06:00:00 ExpressJet A… 3180s (~53 minutes)       229 Tues… FALSE  
 9 2013-01-01 06:00:00 JetBlue Airw… 8400s (~2.33 hours)       944 Tues… FALSE  
10 2013-01-01 06:00:00 American Air… 8280s (~2.3 hours)        733 Tues… TRUE   
# … with 336,766 more rows
```

---
background-image: url(images/data-types/applied-ds-prog.png)
background-position: center 60%
background-size: 85%

# .nobold[(Applied)] Data Science

---
background-image: url(images/data-types/applied-ds-trans-prog.png)
background-position: center 60%
background-size: 85%

# .nobold[(Applied)] Data Science

---
name: logicals
class: center middle

# logicals

---
# Logicals

R's data type for Boolean values (i.e., `TRUE` and `FALSE`)

```r
typeof(TRUE)
#> [1] "logical"

typeof(FALSE)
#> [1] "logical"

typeof(c(TRUE, TRUE, FALSE))
#> [1] "logical"
```

---

```r
flights %>%
  mutate(delayed = arr_delay > 0) %>%
  select(arr_delay, delayed)
#> # A tibble: 336,776 x 2
#>    arr_delay delayed
#>        <dbl> <lgl>  
#>  1        11 TRUE   
#>  2        20 TRUE   
#>  3        33 TRUE   
#>  4       -18 FALSE  
#>  5       -25 FALSE  
#>  6        12 TRUE   
#>  7        19 TRUE   
#>  8       -14 FALSE  
#>  9        -8 FALSE  
#> 10         8 TRUE   
#> # … with 336,766 more rows
```

Can we compute the proportion of flights that arrived late?

---
# Most useful skills

* Math with logicals
  * When you do math with logicals, `TRUE` becomes **1** and `FALSE` becomes **0**

* The **sum** of a logical vector is the **count of `TRUE`s**

```r
sum(c(TRUE, FALSE, TRUE, TRUE))
#> [1] 3
```

* The **mean** of a logical vector is the **proportion of `TRUE`s**

```r
mean(c(1, 2, 3, 4) < 4)
#> [1] 0.75
```
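
---
# Math with logicals

Missing values propagate through this arithmetic, which is why it helps to drop or ignore `NA`s before summarizing. A minimal sketch, assuming a made-up `arr_delay` vector (hypothetical values, not rows from `flights`):

```r
# Hypothetical arrival delays in minutes; NA marks a missing value
arr_delay <- c(11, -18, NA, 33, -25)
delayed <- arr_delay > 0

sum(delayed)                  # a single NA makes the sum NA
#> [1] NA

sum(delayed, na.rm = TRUE)    # count of TRUEs among non-missing values
#> [1] 2

mean(delayed, na.rm = TRUE)   # proportion of TRUEs among non-missing values
#> [1] 0.5
```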

---
class: your-turn

# Your turn 1

.big[
Use the `flights` data set to create a new variable, **`delayed`**, that indicates whether the flight was delayed (**`arr_delay > 0`**).

Then, remove all rows that contain an `NA` in the **`delayed`** variable.

Finally, create a summary table that shows:

1. How many flights were delayed?
2. What proportion of flights were delayed?
]

---
class: your-turn

```r
flights %>%
  mutate(delayed = arr_delay > 0) %>%
  drop_na(delayed) %>%
  summarize(total = sum(delayed), prop = mean(delayed))
#> # A tibble: 1 x 2
#>    total  prop
#>    <int> <dbl>
#> 1 133004 0.406
```

---
name: strings
class: center middle

# strings

---
# .nobold[(character)] strings

Anything surrounded by double quotes (`"`) or single quotes (`'`)

```r
> "one"
> "1"
> "one's"
> '"Hello World"'
> "foo
+
+
+ oops. I'm stuck in a string."
```
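
---
# .nobold[(character)] strings

Either quote style produces the same string; the choice mostly matters when the string itself contains quotes. A small base R sketch (the variable names `single` and `double` are only illustrative):

```r
single <- 'She said "hi"'
double <- "She said \"hi\""   # escape inner double quotes with a backslash

identical(single, double)
#> [1] TRUE

writeLines(double)            # writeLines() shows the string without escapes
#> She said "hi"
```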

---
class: pop-quiz

# Consider

Discuss in the chat: Are boys' or girls' names more likely to end in a vowel?

---
class: pop-quiz

How can we calculate the proportion of boys' and girls' names that end in a vowel?

```r
babynames
#> # A tibble: 1,924,665 x 5
#>     year sex   name          n   prop
#>    <dbl> <chr> <chr>     <int>  <dbl>
#>  1  1880 F     Mary       7065 0.0724
#>  2  1880 F     Anna       2604 0.0267
#>  3  1880 F     Emma       2003 0.0205
#>  4  1880 F     Elizabeth  1939 0.0199
#>  5  1880 F     Minnie     1746 0.0179
#>  6  1880 F     Margaret   1578 0.0162
#>  7  1880 F     Ida        1472 0.0151
#>  8  1880 F     Alice      1414 0.0145
#>  9  1880 F     Bertha     1320 0.0135
#> 10  1880 F     Sarah      1288 0.0132
#> # … with 1,924,655 more rows
```

---
# Most useful skills

1\. How to extract or replace substrings.

.fade[
2\. How to find matches for patterns.

3\. Regular expressions.
]

---
<div class="hex-book">
  <a href="https://stringr.tidyverse.org">
    <img class="hex" src="images/hex/stringr.png">
  </a>
  <a href="https://r4ds.had.co.nz/strings.html">
    <img class="book" src="images/books/r4ds-strings.png">
  </a>
</div>

---
# `str_sub()`

Extract or replace portions of a string with **`str_sub()`**

<code class ='r hljs remark-code'>str_sub(string, start = 1, end = -1)</code>

---
# `str_sub()`

Extract or replace portions of a string with **`str_sub()`**

<code class ='r hljs remark-code'>str_sub(<span style="background-color:#FED766;color:#009FB7">string</span>, start = 1, end = -1)</code>

???

string(s) to manipulate

---
# `str_sub()`

Extract or replace portions of a string with **`str_sub()`**

<code class ='r hljs remark-code'>str_sub(string, start = <span style="background-color:#FED766;color:#009FB7">1</span>, end = -1)</code>

???

position of the first character to extract within each string

---
# `str_sub()`

Extract or replace portions of a string with **`str_sub()`**

<code class ='r hljs remark-code'>str_sub(string, start = 1, end = <span style="background-color:#FED766;color:#009FB7">-1</span>)</code>

???

position of the last character to extract within each string

---
class: pop-quiz

# Pop quiz!

What will this return?

```r
str_sub("Mephisto", 1, 2)
```

```
#> [1] "Me"
```

---
class: pop-quiz

# Pop quiz!

What will this return?

```r
str_sub("Mephisto", 2)
```

```
#> [1] "ephisto"
```

---
class: pop-quiz

# Pop quiz!

What will this return?

```r
str_sub("Mephisto", -3)
```

```
#> [1] "sto"
```

---
class: pop-quiz

# Pop quiz!

What will this return?

```r
m <- "Mephisto"
str_sub(m, -3) <- "--Agatha!"
m
```

```
#> [1] "Mephi--Agatha!"
```
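
---
# `str_sub()`

`str_sub()` is vectorized, so the same positions are applied to every string in a vector. A short sketch, assuming a hypothetical vector `baby` built from a few names in the `babynames` preview:

```r
library(stringr)

# A few names, used only for illustration
baby <- c("Mary", "Anna", "Elizabeth")

str_sub(baby, 1, 3)     # first three characters
#> [1] "Mar" "Ann" "Eli"

str_sub(baby, -1)       # last character
#> [1] "y" "a" "h"

str_sub(baby, -2, -1)   # last two characters
#> [1] "ry" "na" "th"
```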

---
class: your-turn

# Your turn 2

Complete the following code to:

1. Isolate the last letter of every **`name`**.
2. Create a variable that indicates whether the last letter is one of "a", "e", "i", "o", "u", or "y".
3. Calculate the proportion of children whose name ends in a vowel, by **`year`** and **`sex`**.
4. Display the results as a line plot.

<code class ='r hljs remark-code'>babynames %>%<br>&nbsp;&nbsp;<span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>(last = <span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;vowel = <span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>) %>%<br>&nbsp;&nbsp;group_by(<span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>) %>%<br>&nbsp;&nbsp;<span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>(prop_vowel = weighted.mean(vowel, w = n)) %>%<br>&nbsp;&nbsp;ggplot(mapping = aes(x = <span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;</span>, y = <span style="background-color:#FED766;color:#FED766">prop_vowel</span>)) +<br>&nbsp;&nbsp;<span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>(mapping = <span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>)</code>

---
class: your-turn

.panelset[
.panel[.panel-name[Code]

```r
babynames %>%
  mutate(last = str_sub(name, start = -1, end = -1),
         vowel = last %in% c("a", "e", "i", "o", "u", "y")) %>%
  group_by(year, sex) %>%
  summarize(prop_vowel = weighted.mean(vowel, w = n)) %>%
  ggplot(mapping = aes(x = year, y = prop_vowel)) +
  geom_line(mapping = aes(color = sex))
```

]

.panel[.panel-name[Plot]

]
]
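
---
# Why `weighted.mean()`?

Each row of `babynames` is a name, not a child, so the solution weights each name by `n`. A sketch with two made-up names (the counts are hypothetical):

```r
# Hypothetical data: a common name ending in a vowel, a rare name that does not
vowel <- c(TRUE, FALSE)
n     <- c(90, 10)

mean(vowel)                    # proportion of names
#> [1] 0.5

weighted.mean(vowel, w = n)    # proportion of children
#> [1] 0.9
```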

---
name: factors
class: center middle

# factors

---
# Factors

R's representation of categorical data. Consists of:

1. A set of **values**
2. An ordered set of **valid levels**

```r
eyes <- factor(x = c("blue", "green", "green"),
               levels = c("blue", "brown", "green", "hazel"))

eyes
#> [1] blue  green green
#> Levels: blue brown green hazel
```

---
# Factors

Stored internally as an integer vector with a levels attribute

```r
unclass(eyes)
#> [1] 1 3 3
#> attr(,"levels")
#> [1] "blue"  "brown" "green" "hazel"
```
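
---
# Factors

One consequence of the fixed set of levels: a value that is not a valid level cannot be stored. A minimal base R sketch, where `"purple"` is a deliberately invalid value added for illustration:

```r
eyes <- factor(x = c("blue", "green", "purple"),
               levels = c("blue", "brown", "green", "hazel"))

is.na(eyes)          # "purple" is not a valid level, so it becomes NA
#> [1] FALSE FALSE  TRUE

as.integer(eyes)     # position of each value within the levels
#> [1]  1  3 NA

nlevels(eyes)        # unused levels ("brown", "hazel") are still kept
#> [1] 4
```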

---
<div class="hex-book">
  <a href="https://forcats.tidyverse.org">
    <img class="hex" src="images/hex/forcats.png">
  </a>
  <a href="https://r4ds.had.co.nz/factors.html">
    <img class="book" src="images/books/r4ds-factors.png">
  </a>
</div>

---
class: pop-quiz

# Consider

Discuss in the chat: Do married people watch more or less TV than single people?

---
# Example data: `gss_cat`

```r
gss_cat
#> # A tibble: 21,483 x 9
#>     year marital     age race  rincome    partyid     relig     denom    tvhours
#>    <int> <fct>     <int> <fct> <fct>      <fct>       <fct>     <fct>      <int>
#>  1  2000 Never ma…    26 White $8000 to … Ind,near r… Protesta… Souther…      12
#>  2  2000 Divorced     48 White $8000 to … Not str re… Protesta… Baptist…      NA
#>  3  2000 Widowed      67 White Not appli… Independent Protesta… No deno…       2
#>  4  2000 Never ma…    39 White Not appli… Ind,near r… Orthodox… Not app…       4
#>  5  2000 Divorced     25 White Not appli… Not str de… None      Not app…       1
#>  6  2000 Married      25 White $20000 - … Strong dem… Protesta… Souther…      NA
#>  7  2000 Never ma…    36 White $25000 or… Not str re… Christian Not app…       3
#>  8  2000 Divorced     44 White $7000 to … Ind,near d… Protesta… Luthera…      NA
#>  9  2000 Married      44 White $25000 or… Not str de… Protesta… Other          0
#> 10  2000 Married      47 White $25000 or… Strong rep… Protesta… Souther…       3
#> # … with 21,473 more rows
```

???

A sample of data from the General Social Survey, a long-running US survey conducted by NORC at the University of Chicago.

---
# Which religions watch the least TV?

```r
gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = relig)) +
  geom_point()
```

---
# Which plot do you prefer?

.pull-left[
<img src="images/data-types/plots/religion-tv-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="images/data-types/plots/religion-tv-order-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: center middle

???

Why is the y-axis in this order?

---
# `levels()`

Use **`levels()`** to access a factor's levels

```r
levels(gss_cat$relig)
#>  [1] "No answer"               "Don't know"             
#>  [3] "Inter-nondenominational" "Native american"        
#>  [5] "Christian"               "Orthodox-christian"     
#>  [7] "Moslem/islam"            "Other eastern"          
#>  [9] "Hinduism"                "Buddhism"               
#> [11] "Other"                   "None"                   
#> [13] "Jewish"                  "Catholic"               
#> [15] "Protestant"              "Not applicable"
```

---

.smallish[

```r
levels(gss_cat$relig)
#>  [1] "No answer"               "Don't know"              "Inter-nondenominational"
#>  [4] "Native american"         "Christian"               "Orthodox-christian"     
#>  [7] "Moslem/islam"            "Other eastern"           "Hinduism"               
#> [10] "Buddhism"                "Other"                   "None"                   
#> [13] "Jewish"                  "Catholic"                "Protestant"             
#> [16] "Not applicable"
```
]

---
# Most useful skills

1. Reorder the levels
2. Recode the levels
3. Collapse levels

---
class: center middle

.large-left[
# Reordering levels
]

.small-right[
<img src="images/hex/forcats.png" width="100%" style="display: block; margin: auto;" />
]

---
# `fct_reorder()`

Reorders the levels of a factor based on the result of **`fun(x)`** applied to each group of cases (grouped by level).

<code class ='r hljs remark-code'>fct_reorder(f, x, fun = median, ..., .desc = FALSE)</code>

---
# `fct_reorder()`

Reorders the levels of a factor based on the result of **`fun(x)`** applied to each group of cases (grouped by level).

<code class ='r hljs remark-code'>fct_reorder(<span style="background-color:#FED766;color:#009FB7">f</span>, x, fun = median, ..., .desc = FALSE)</code>

???

Factor to reorder

---
# `fct_reorder()`

Reorders the levels of a factor based on the result of **`fun(x)`** applied to each group of cases (grouped by level).

<code class ='r hljs remark-code'>fct_reorder(f, <span style="background-color:#FED766;color:#009FB7">x</span>, fun = median, ..., .desc = FALSE)</code>

???

variable to reorder by (in conjunction with `fun`)

---
# `fct_reorder()`

Reorders the levels of a factor based on the result of **`fun(x)`** applied to each group of cases (grouped by level).

<code class ='r hljs remark-code'>fct_reorder(f, x, <span style="background-color:#FED766;color:#009FB7">fun = median</span>, ..., .desc = FALSE)</code>

???

function to reorder by (in conjunction with `x`)

---
# `fct_reorder()`

Reorders the levels of a factor based on the result of **`fun(x)`** applied to each group of cases (grouped by level).

<code class ='r hljs remark-code'>fct_reorder(f, x, fun = median, ..., <span style="background-color:#FED766;color:#009FB7">.desc = FALSE</span>)</code>

???

put in descending order?
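
---
# `fct_reorder()`

A toy sketch of how the summary function changes the ordering. The factor `f` and values `x` below are made up, with one large outlier in level `"a"`, and arguments are passed by position:

```r
library(forcats)

# Made-up data: level "a" has one large outlier
f <- factor(c("a", "a", "a", "b", "b", "b"))
x <- c(1, 2, 30, 4, 5, 6)

levels(fct_reorder(f, x))                 # medians: a = 2, b = 5
#> [1] "a" "b"

levels(fct_reorder(f, x, mean))           # means: a = 11, b = 5
#> [1] "b" "a"

levels(fct_reorder(f, x, .desc = TRUE))   # medians, largest first
#> [1] "b" "a"
```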

---

.panelset[
.panel[.panel-name[Code]

<code class ='r hljs remark-code'>gss_cat %>%<br>&nbsp;&nbsp;drop_na(tvhours) %>%<br>&nbsp;&nbsp;group_by(relig) %>%<br>&nbsp;&nbsp;summarize(tvhours = mean(tvhours)) %>%<br>&nbsp;&nbsp;ggplot(mapping = aes(x = tvhours, y = <span style="background-color:#FED766;color:#009FB7">fct_reorder(relig, tvhours, mean)</span>)) +<br>&nbsp;&nbsp;geom_point()</code>
]

.panel[.panel-name[Plot]
<img src="images/data-types/plots/religion-tv-order-1.png" width="80%" style="display: block; margin: auto;" />
]
]

---
class: your-turn

# Your turn 3

Complete the following code to:

1. Calculate the average number of **`tvhours`**, by marital status.
2. Create a sensible plot of average TV consumption by marital status.

<code class ='r hljs remark-code'>gss_cat %>%<br>&nbsp;&nbsp;drop_na(tvhours) %>%<br>&nbsp;&nbsp;group_by(<span style="background-color:#FED766;color:#FED766">marital</span>) %>%<br>&nbsp;&nbsp;summarize(<span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>) %>%<br>&nbsp;&nbsp;ggplot(mapping = <span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>) +<br>&nbsp;&nbsp;geom_col()</code>

---
class: your-turn

.panelset[
.panel[.panel-name[Code]

```r
gss_cat %>%
  drop_na(tvhours) %>%
  group_by(marital) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = fct_reorder(marital, tvhours, mean))) +
  geom_col()
```

]

.panel[.panel-name[Plot]

]
]

---
# `fct_infreq()`

```r
ggplot(gss_cat, mapping = aes(x = fct_infreq(marital))) +
  geom_bar()
```

.pull-left[
<img src="images/data-types/plots/infreq-default-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="images/data-types/plots/infreq-order-1.png" width="100%" style="display: block; margin: auto;" />
]

---
# `fct_rev()`

```r
ggplot(gss_cat, mapping = aes(x = fct_rev(fct_infreq(marital)))) +
  geom_bar()
```

.pull-left[
<img src="images/data-types/plots/infreq-order-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="images/data-types/plots/rev-order-1.png" width="100%" style="display: block; margin: auto;" />
]
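
---
# `fct_infreq()` and `fct_rev()`

Both functions change only the order of the levels, not the values. A small sketch on a made-up factor `f`:

```r
library(forcats)

# Made-up factor: counts are x = 2, y = 1, z = 3
f <- factor(c("x", "x", "y", "z", "z", "z"))

levels(f)                        # default: alphabetical
#> [1] "x" "y" "z"

levels(fct_infreq(f))            # most frequent first
#> [1] "z" "x" "y"

levels(fct_rev(fct_infreq(f)))   # reversed: least frequent first
#> [1] "y" "x" "z"
```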

---
class: center middle

.large-left[
# Recoding levels
]

.small-right[
<img src="images/hex/forcats.png" width="100%" style="display: block; margin: auto;" />
]

---
class: your-turn

# Your turn 4

Do liberals or conservatives watch more TV?

Compute the average TV consumption by party identification, and plot the results.

---
class: your-turn

.panelset[
.panel[.panel-name[Code]

```r
gss_cat %>%
  drop_na(tvhours) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = fct_reorder(partyid, tvhours, mean))) +
  geom_col() +
  labs(y = "partyid")
```

]

.panel[.panel-name[Plot]

]
]

???

How can we improve these labels?

---
# `fct_recode()`

Change the values of the levels for a factor

<code class ='r hljs remark-code'>fct_recode(f, ...)</code>

---
# `fct_recode()`

Change the values of the levels for a factor

<code class ='r hljs remark-code'>fct_recode(<span style="background-color:#FED766;color:#009FB7">f</span>, ...)</code>

???

factor variable

---
# `fct_recode()`

Change the values of the levels for a factor

<code class ='r hljs remark-code'>fct_recode(f, <span style="background-color:#FED766;color:#009FB7">...</span>)</code>

???

new level = old level pairs
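
---
# `fct_recode()`

A minimal sketch of the `new = old` pairs on a small, hand-built factor (the vector `party` is illustrative and uses a few labels from `partyid`):

```r
library(forcats)

# A few partyid labels, typed by hand for illustration
party <- factor(c("Strong republican", "Ind,near dem", "Independent"))

recoded <- fct_recode(party,
                      "Republican, strong" = "Strong republican",
                      "Democrat, lean"     = "Ind,near dem")

as.character(recoded)   # levels that are not mentioned are left as is
#> [1] "Republican, strong" "Democrat, lean"     "Independent"
```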

---
.panelset[
.panel[.panel-name[Code]

```r
gss_cat %>%
  drop_na(tvhours) %>%
  mutate(partyid = fct_recode(partyid,
                              "Republican, strong" = "Strong republican",
                              "Republican, weak"   = "Not str republican",
                              "Republican, lean"   = "Ind,near rep",
                              "Democrat, lean"     = "Ind,near dem",
                              "Democrat, weak"     = "Not str democrat",
                              "Democrat, strong"   = "Strong democrat")) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = fct_reorder(partyid, tvhours, mean))) +
  geom_col() +
  labs(y = "partyid")
```

]

.panel[.panel-name[Plot]

]
]

---
<img src="images/data-types/plots/party-recode-1.png" width="90%" style="display: block; margin: auto;" />

---
<img src="images/data-types/plots/red-party-1.png" width="90%" style="display: block; margin: auto;" />

---
<img src="images/data-types/plots/blue-party-1.png" width="90%" style="display: block; margin: auto;" />

???

How can we combine these groups?

---
class: center middle

.large-left[
# Collapsing levels
]

.small-right[
<img src="images/hex/forcats.png" width="100%" style="display: block; margin: auto;" />
]

---
# `fct_collapse()`

Changes multiple levels into a single level.

<code class ='r hljs remark-code'>fct_collapse(f, ...)</code>

---
# `fct_collapse()`

Changes multiple levels into a single level.

<code class ='r hljs remark-code'>fct_collapse(<span style="background-color:#FED766;color:#009FB7">f</span>, ...)</code>

???

Factor with levels

---
# `fct_collapse()`

Changes multiple levels into a single level.

<code class ='r hljs remark-code'>fct_collapse(f, <span style="background-color:#FED766;color:#009FB7">...</span>)</code>

???

Named levels to be collapsed
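
---
# `fct_collapse()`

A small sketch of collapsing on a hand-built factor. The vector `party` is made up for illustration; each argument name becomes the new level, and each value lists the old levels to merge:

```r
library(forcats)

# A few partyid labels, typed by hand for illustration
party <- factor(c("Strong republican", "Not str republican",
                  "Strong democrat", "Independent"))

collapsed <- fct_collapse(party,
                          Conservative = c("Strong republican",
                                           "Not str republican"),
                          Liberal = "Strong democrat")

as.character(collapsed)   # unmentioned levels are kept
#> [1] "Conservative" "Conservative" "Liberal"      "Independent"

nlevels(collapsed)        # four levels became three
#> [1] 3
```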

---
.panelset[
.panel[.panel-name[Code]

```r
gss_cat %>%
  drop_na(tvhours) %>%
  mutate(new_party = fct_collapse(partyid,
                                  Conservative = c("Strong republican",
                                                   "Not str republican",
                                                   "Ind,near rep"),
                                  Liberal = c("Strong democrat",
                                              "Not str democrat",
                                              "Ind,near dem"))) %>%
  ggplot(mapping = aes(x = fct_infreq(new_party))) +
  geom_bar() +
  labs(x = "party")
```

]

.panel[.panel-name[Plot]

]
]

---
<img src="images/data-types/plots/need-lump-1.png" width="90%" style="display: block; margin: auto;" />

---
# `fct_lump_n()`

Lumps together the least frequent levels: keeps the `n` most frequent levels and collapses the rest into a single level.

<code class ='r hljs remark-code'>fct_lump_n(f, n, other_level = "Other", ...)</code>

---
# `fct_lump_n()`

Lumps together the least frequent levels: keeps the `n` most frequent levels and collapses the rest into a single level.

<code class ='r hljs remark-code'>fct_lump_n(<span style="background-color:#FED766;color:#009FB7">f</span>, n, other_level = "Other", ...)</code>

???

Factor with levels

---
# `fct_lump_n()`

Lumps together the least frequent levels: keeps the `n` most frequent levels and collapses the rest into a single level.

<code class ='r hljs remark-code'>fct_lump_<span style="background-color:#FED766;color:#009FB7">n</span>(f, <span style="background-color:#FED766;color:#009FB7">n</span>, other_level = "Other", ...)</code>

???

number of levels to keep (the `n` most frequent); all other levels are lumped together

---
# `fct_lump_n()`

Lumps together the least frequent levels: keeps the `n` most frequent levels and collapses the rest into a single level.

<code class ='r hljs remark-code'>fct_lump_n(f, n, <span style="background-color:#FED766;color:#009FB7">other_level = "Other"</span>, ...)</code>

???

name of new level
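
---
# `fct_lump_n()`

A toy sketch, assuming a made-up factor `f` with counts a = 4, b = 3, c = 2, d = 1. With `n = 2`, the two most frequent levels are kept and the remaining levels are lumped into `"Other"`:

```r
library(forcats)

# Made-up factor: counts are a = 4, b = 3, c = 2, d = 1
f <- factor(c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"))

lumped <- fct_lump_n(f, n = 2)

levels(lumped)            # "c" and "d" were lumped together
#> [1] "a"     "b"     "Other"

sum(lumped == "Other")    # 2 values of "c" plus 1 value of "d"
#> [1] 3
```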

---
.panelset[
.panel[.panel-name[Code]

```r
gss_cat %>%
  drop_na(tvhours) %>%
  mutate(new_party = fct_collapse(partyid,
                                  Conservative = c("Strong republican",
                                                   "Not str republican",
                                                   "Ind,near rep"),
                                  Liberal = c("Strong democrat",
                                              "Not str democrat",
                                              "Ind,near dem")),
         new_party = fct_lump_n(new_party, other_level = "Other", n = 3))  %>%
  group_by(new_party) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(x = tvhours, y = fct_reorder(new_party, tvhours, mean))) +
  geom_col() +
  labs(y = "partyid")
```

]

.panel[.panel-name[Plot]

]
]

---
name: dates
class: center middle

# dates and times

---
class: pop-quiz

# Pop quiz!

Does every year have 365 days?

Does every day have 24 hours?

Does every minute have 60 seconds?

What does a month measure?

---
# Most useful skills

1\. Creating dates/times (i.e., parsing)

2\. Access and change parts of a date

.fade[
3\. Deal with time zones

4\. Do math with instants and time spans
]

---
class: pop-quiz

# Consider

Discuss in the chat:

* What is the best time of day to fly?

* What is the best day of the week to fly?

---

```r
flights %>%
  select(year, month, day, hour, minute, sched_dep_time, time_hour)
#> # A tibble: 336,776 x 7
#>     year month   day  hour minute sched_dep_time time_hour          
#>    <int> <int> <int> <dbl>  <dbl>          <int> <dttm>             
#>  1  2013     1     1     5     15            515 2013-01-01 05:00:00
#>  2  2013     1     1     5     29            529 2013-01-01 05:00:00
#>  3  2013     1     1     5     40            540 2013-01-01 05:00:00
#>  4  2013     1     1     5     45            545 2013-01-01 05:00:00
#>  5  2013     1     1     6      0            600 2013-01-01 06:00:00
#>  6  2013     1     1     5     58            558 2013-01-01 05:00:00
#>  7  2013     1     1     6      0            600 2013-01-01 06:00:00
#>  8  2013     1     1     6      0            600 2013-01-01 06:00:00
#>  9  2013     1     1     6      0            600 2013-01-01 06:00:00
#> 10  2013     1     1     6      0            600 2013-01-01 06:00:00
#> # … with 336,766 more rows
```

???

Let's focus briefly on the relationship between the scheduled departure time, `sched_dep_time`, and the average flight delay.

---
.panelset[
.panel[.panel-name[Code]

```r
flights %>%
  ggplot(mapping = aes(x = sched_dep_time, y = arr_delay)) +
  geom_point(alpha = 0.2) +
  geom_smooth()
```

]

.panel[.panel-name[Plot]

]
]

???

We might start by making a plot of departure time vs. arrival delay. What's wrong with this? Minutes stop at 59, so there are gaps from 60-99 for every "hour". Time counting is not the same as our normal number line.

---
class: center middle

.large-left[
# Creating dates and times
]

.small-right[
<img src="images/hex/hms.png" width="100%" style="display: block; margin: auto;" />
]

---
# hms

.center[
.fade[2021-05-04] .blue-highlight[**14:52:34**]
]

Stored as the number of seconds since 00:00:00.<sup>*</sup>

```r
library(hms)
hms(seconds = 34, minutes = 52, hours = 14)
#> 14:52:34

unclass(hms(34, 52, 14))
#> [1] 53554
#> attr(,"units")
#> [1] "secs"
```

.footnote[*On a typical day.]
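
---
# hms

Times written as `HHMM` integers, like `sched_dep_time`, can be split into hours and minutes before building an `hms`. A minimal sketch, assuming a hypothetical vector of three departure times:

```r
library(hms)

# Hypothetical scheduled departure times in HHMM form
sched_dep_time <- c(515, 600, 1359)

time <- hms(hours   = sched_dep_time %/% 100,   # integer division: 5, 6, 13
            minutes = sched_dep_time %% 100)    # remainder: 15, 0, 59

as.numeric(time)   # seconds since midnight
#> [1] 18900 21600 50340
```

Unlike the raw `HHMM` values, these are evenly spaced: one minute is always 60 seconds apart.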

---
class: your-turn

# Your turn 5

.big[
What is the best time of day to fly?

1. Use the **`hour`** and **`minute`** variables in `flights` to compute the time of day for each flight as an `hms`.
2. Use a smooth line to plot the relationship between time of day and **`arr_delay`**.
]

---
class: your-turn

.panelset[
.panel[.panel-name[Code]

```r
flights %>%
  mutate(time = hms(hours = hour, minutes = minute)) %>%
  ggplot(mapping = aes(x = time, y = arr_delay)) +
  geom_point(alpha = 0.2) +
  geom_smooth()
```
]

.panel[.panel-name[Plot]
<img src="images/data-types/plots/tod-sol-1.png" width="80%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Improved]
<img src="images/data-types/plots/tod-sol-best-1.png" width="80%" style="display: block; margin: auto;" />
]
]

---
class: center middle

.large-left[
# lubridate
]

.small-right[
<img src="images/hex/lubridate.png" width="100%" style="display: block; margin: auto;" />
]

---
# `ymd()` family

To parse strings as dates, use a *y*, *m*, *d*, *h*, *m*, *s* combination.

```r
ymd("2021-05-04")
#> [1] "2021-05-04"

mdy("May 4, 2021")
#> [1] "2021-05-04"

ymd_hms("2021-05-04 14:52:34")
#> [1] "2021-05-04 14:52:34 UTC"
```
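
---
# `ymd()` family

The order of the letters in the function name matches the order of the components in the string. A short sketch with a few more members of the family (the input strings are made up to match the date used above):

```r
library(lubridate)

dmy("04/05/2021")              # day, month, year
#> [1] "2021-05-04"

ymd_hm("2021-05-04 14:52")     # date plus hours and minutes, no seconds
#> [1] "2021-05-04 14:52:00 UTC"

ymd("2021-05-04") == dmy("04/05/2021")   # different strings, same date
#> [1] TRUE
```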

---
# Accessing components

Extract a component with the function that shares its **singular** name.

```r
star_wars <- ymd("2021-05-04")
year(star_wars)
#> [1] 2021

month(star_wars)
#> [1] 5

day(star_wars)
#> [1] 4
```
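
---
# Accessing components

The same accessors work on date-times, and `wday()` gives the day of the week. A sketch assuming a hypothetical departure time `departure`; `week_start = 7` is passed explicitly so that Sunday counts as day 1:

```r
library(lubridate)

# Hypothetical departure time, in the same format as time_hour
departure <- ymd_hms("2013-01-01 05:15:00")

hour(departure)
#> [1] 5

minute(departure)
#> [1] 15

wday(departure, week_start = 7)   # Sunday = 1, so Tuesday = 3
#> [1] 3

wday(departure, label = TRUE)     # ordered factor; labels depend on the locale
```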

---
# Setting components

Use the same functions to set components

```r
star_wars
#> [1] "2021-05-04"

year(star_wars) <- 2002
star_wars
#> [1] "2002-05-04"
```

---
# Date and time components

#eaybdjiknb .gt_sourcenotes {
  color: #333333;
  background-color: #F0F0F0;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 0px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 0px;
  border-right-color: #D3D3D3;
}

#eaybdjiknb .gt_sourcenote {
  font-size: 12px;
  padding: 10px;
}

#eaybdjiknb .gt_left {
  text-align: left;
}

#eaybdjiknb .gt_center {
  text-align: center;
}

#eaybdjiknb .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#eaybdjiknb .gt_font_normal {
  font-weight: normal;
}

#eaybdjiknb .gt_font_bold {
  font-weight: bold;
}

#eaybdjiknb .gt_font_italic {
  font-style: italic;
}

#eaybdjiknb .gt_super {
  font-size: 65%;
}

#eaybdjiknb .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
</style>
<div id="eaybdjiknb" style="overflow-x:auto;overflow-y:auto;width:auto;height:auto;"><table class="gt_table" style="table-layout: fixed;; width: 0px">
  <colgroup>
    <col style="width:120px;"/>
    <col style="width:150px;"/>
    <col style="width:350px;"/>
  </colgroup>
  
  <thead class="gt_col_headings">
    <tr>
      <th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1" style="text-align: center; vertical-align: middle; font-weight: bold; border-bottom-width: 3px; border-bottom-style: solid; border-bottom-color: black;">function</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1" style="text-align: center; vertical-align: middle; font-weight: bold; border-bottom-width: 3px; border-bottom-style: solid; border-bottom-color: black;">extracts</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1" style="text-align: center; vertical-align: middle; font-weight: bold; border-bottom-width: 3px; border-bottom-style: solid; border-bottom-color: black;">extra arguments</th>
    </tr>
  </thead>
  <tbody class="gt_table_body">
    <tr>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';">year()</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0;">year</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';"></td>
    </tr>
    <tr>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';">month()</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0;">month</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';">label = FALSE, abbr = TRUE</td>
    </tr>
    <tr>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';">week()</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0;">week</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';"></td>
    </tr>
    <tr>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';">day()</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0;">day of month</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';"></td>
    </tr>
    <tr>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';">wday()</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0;">day of week</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';">label = FALSE, abbr = TRUE</td>
    </tr>
    <tr>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';">qday()</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0;">day of quarter</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';"></td>
    </tr>
    <tr>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';">yday()</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0;">day of year</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';"></td>
    </tr>
    <tr>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';">hour()</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0;">hour</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';"></td>
    </tr>
    <tr>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';">minute()</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0;">minute</td>
      <td class="gt_row gt_center" style="background-color: #F0F0F0; font-family: 'Source Code Pro';"></td>
    </tr>
    <tr>
      <td class="gt_row gt_center" style="border-bottom-width: 2px; border-bottom-style: solid; border-bottom-color: transparent; background-color: #F0F0F0; font-family: 'Source Code Pro';">second()</td>
      <td class="gt_row gt_center" style="border-bottom-width: 2px; border-bottom-style: solid; border-bottom-color: transparent; background-color: #F0F0F0;">second</td>
      <td class="gt_row gt_center" style="border-bottom-width: 2px; border-bottom-style: solid; border-bottom-color: transparent; background-color: #F0F0F0; font-family: 'Source Code Pro';"></td>
    </tr>
  </tbody>
  
  
</table></div>
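
---
# Accessing components

A minimal sketch of a few accessors from the table, assuming lubridate is loaded. The name `first_flight` is made up for this example; its value is the first `time_hour` shown in the pop quiz.

```r
library(lubridate)

# First `time_hour` value from the pop quiz table (parsed as UTC)
first_flight <- ymd_hms("2013-01-01 05:00:00")

year(first_flight)   # calendar year
#> [1] 2013
month(first_flight)  # month as a number
#> [1] 1
day(first_flight)    # day of the month
#> [1] 1
yday(first_flight)   # day of the year
#> [1] 1
hour(first_flight)   # hour of the day
#> [1] 5

# label = TRUE returns an ordered factor instead of a number;
# the label text depends on the session's locale
month(first_flight, label = TRUE, abbr = FALSE)
```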

---
# Accessing components

```r
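# Day of the week as a number (1 = Sunday by default)
wday(star_wars)
#> [1] 7

# label = TRUE returns an ordered factor with abbreviated names
wday(star_wars, label = TRUE)
#> [1] Sat
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

# abbr = FALSE spells out the full day name
wday(star_wars, label = TRUE, abbr = FALSE)
#> [1] Saturday
#> 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
```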
```
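
---
# Accessing components

`wday()` counts from Sunday unless told otherwise. A short sketch of the `week_start` argument, assuming a recent version of lubridate; `new_year` is a name made up for this example.

```r
library(lubridate)

# 2013-01-01 was a Tuesday (see the `day` column in the pop quiz)
new_year <- ymd("2013-01-01")

# week_start = 7: weeks start on Sunday (the usual default), so Tuesday is day 3
wday(new_year, week_start = 7)
#> [1] 3

# week_start = 1: weeks start on Monday, so Tuesday is day 2
wday(new_year, week_start = 1)
#> [1] 2

# With labels, week_start also sets the order of the factor levels,
# which controls the order of the days on a plot axis
levels(wday(new_year, label = TRUE, week_start = 1))
```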

---
class: your-turn

# Your turn 6

Complete the following code to:

1. Extract the day of the week for each flight (as a full name) from **`time_hour`**.
2. Calculate the average **`arr_delay`** by day of the week.
3. Plot the results as a column chart (bar chart) with `geom_col()`.

<code class ='r hljs remark-code'>flights %>% <br>&nbsp;&nbsp;drop_na(arr_delay) %>% <br>&nbsp;&nbsp;mutate(weekday = <span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>) %>% <br>&nbsp;&nbsp;<span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> %>%<br>&nbsp;&nbsp;<span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>(avg_delay = <span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>) %>% <br>&nbsp;&nbsp;ggplot(mapping = aes(x = weekday, y = avg_delay)) +<br>&nbsp;&nbsp;<span style="background-color:#FED766"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></code>

---
class: your-turn

.panelset[
.panel[.panel-name[Code]

```r
flights %>% 
  drop_na(arr_delay) %>% 
  mutate(weekday = wday(time_hour, label = TRUE, abbr = FALSE)) %>% 
  group_by(weekday) %>%
  summarize(avg_delay = mean(arr_delay)) %>% 
  ggplot(mapping = aes(x = weekday, y = avg_delay)) +
  geom_col()
```

]

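.panel[.panel-name[Plot]

Column chart from `geom_col()`: **`weekday`** on the x-axis, **`avg_delay`** on the y-axis, one column per day of the week.

]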
]
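
---
# Accessing components

The same `mutate()`, `group_by()`, `summarize()` pattern on a tiny data set, so the averages can be checked by hand. This is only a sketch: `toy_flights` and its delay values are hypothetical, not taken from `flights`.

```r
library(dplyr)
library(lubridate)

# Hypothetical data: two flights on each of two known dates
toy_flights <- tibble(
  time_hour = ymd_hms(c(
    "2013-01-01 05:00:00", "2013-01-01 06:00:00",  # a Tuesday
    "2013-01-02 05:00:00", "2013-01-02 06:00:00"   # a Wednesday
  )),
  arr_delay = c(10, 20, -5, 15)
)

toy_summary <- toy_flights %>%
  # Full weekday name as an ordered factor
  mutate(weekday = wday(time_hour, label = TRUE, abbr = FALSE)) %>%
  # One group per weekday that appears in the data
  group_by(weekday) %>%
  # Tuesday: (10 + 20) / 2 = 15; Wednesday: (-5 + 15) / 2 = 5
  summarize(avg_delay = mean(arr_delay))

toy_summary
```

Because `weekday` is an ordered factor, the rows (and the columns in a later plot) follow calendar order rather than alphabetical order.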
