class: title-slide, center <span class="fa-stack fa-4x"> <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i> <strong class="fa-stack-1x" style="color:#009FB7;">4</strong> </span> # Data Types ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).] <div style = "position:fixed; visibility: hidden"> `$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$` `$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$` </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { yellow: ["{\\color{yellow}{#1}}", 1], blue: ["{\\color{blue}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> <style> .yellow {color: #FED766;} .blue {color: #009FB7;} </style> --- class: pop-quiz # Pop quiz! What types of data are in this data set? 
``` # A tibble: 336,776 x 6 time_hour name air_time distance day delayed <dttm> <chr> <Duration> <dbl> <ord> <lgl> 1 2013-01-01 05:00:00 United Air L… 13620s (~3.78 hours) 1400 Tues… TRUE 2 2013-01-01 05:00:00 United Air L… 13620s (~3.78 hours) 1416 Tues… TRUE 3 2013-01-01 05:00:00 American Air… 9600s (~2.67 hours) 1089 Tues… TRUE 4 2013-01-01 05:00:00 JetBlue Airw… 10980s (~3.05 hours) 1576 Tues… FALSE 5 2013-01-01 06:00:00 Delta Air Li… 6960s (~1.93 hours) 762 Tues… FALSE 6 2013-01-01 05:00:00 United Air L… 9000s (~2.5 hours) 719 Tues… TRUE 7 2013-01-01 06:00:00 JetBlue Airw… 9480s (~2.63 hours) 1065 Tues… TRUE 8 2013-01-01 06:00:00 ExpressJet A… 3180s (~53 minutes) 229 Tues… FALSE 9 2013-01-01 06:00:00 JetBlue Airw… 8400s (~2.33 hours) 944 Tues… FALSE 10 2013-01-01 06:00:00 American Air… 8280s (~2.3 hours) 733 Tues… TRUE # … with 336,766 more rows ``` --- background-image: url(images/data-types/applied-ds-prog.png) background-position: center 60% background-size: 85% # .nobold[(Applied)] Data Science --- background-image: url(images/data-types/applied-ds-trans-prog.png) background-position: center 60% background-size: 85% # .nobold[(Applied)] Data Science --- name: logicals class: center middle # logicals --- # Logicals R's data type for Boolean values (i.e., `TRUE` and `FALSE`) ```r typeof(TRUE) #> [1] "logical" typeof(FALSE) #> [1] "logical" typeof(c(TRUE, TRUE, FALSE)) #> [1] "logical" ``` --- ```r flights %>% mutate(delayed = arr_delay > 0) %>% select(arr_delay, delayed) #> # A tibble: 336,776 x 2 #> arr_delay delayed #> <dbl> <lgl> #> 1 11 TRUE #> 2 20 TRUE #> 3 33 TRUE #> 4 -18 FALSE #> 5 -25 FALSE #> 6 12 TRUE #> 7 19 TRUE #> 8 -14 FALSE #> 9 -8 FALSE #> 10 8 TRUE #> # … with 336,766 more rows ``` -- Can we compute the proportion of flights that arrived late? 
---

# Most useful skills

* Math with logicals
* When you do math with logicals, `TRUE` becomes **1** and `FALSE` becomes **0**

--

* The **sum** of a logical vector is the **count of `TRUE`s**

```r
sum(c(TRUE, FALSE, TRUE, TRUE))
#> [1] 3
```

--

* The **mean** of a logical vector is the **proportion of `TRUE`s**

```r
mean(c(1, 2, 3, 4) < 4)
#> [1] 0.75
```

---
class: your-turn

# Your turn 1

.big[
Use the `flights` data set to create a new variable, **`delayed`**, that indicates whether the flight was delayed (**`arr_delay > 0`**).

Then, remove all rows that contain an `NA` in the **`delayed`** variable.

Finally, create a summary table that shows:

1. How many flights were delayed?
2. What proportion of flights were delayed?
]
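---

# Logicals and `NA`

A minimal sketch of why the exercise above removes `NA`s before summarizing: a single `NA` in a logical vector makes `sum()` and `mean()` return `NA`. The vector `delayed` below is a made-up toy example, not the `flights` data.

```r
# Toy logical vector (hypothetical values); NA stands for a missing arrival delay
delayed <- c(TRUE, FALSE, NA, TRUE)

# A single NA propagates through the arithmetic
sum(delayed)
#> [1] NA

# na.rm = TRUE drops the missing values before computing
n_delayed <- sum(delayed, na.rm = TRUE)
prop_delayed <- mean(delayed, na.rm = TRUE)
n_delayed
#> [1] 2
prop_delayed
#> [1] 0.6666667
```

Passing `na.rm = TRUE` is one alternative to removing the incomplete rows first; which reads better depends on whether later steps also need those rows gone.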
---
class: your-turn

```r
flights %>%
  mutate(delayed = arr_delay > 0) %>%
  drop_na(delayed) %>%
  summarize(total = sum(delayed),
            prop = mean(delayed))
#> # A tibble: 1 x 2
#>    total  prop
#>    <int> <dbl>
#> 1 133004 0.406
```

---
name: strings
class: center middle

# strings

---

# .nobold[(character)] strings

Anything surrounded by double quotes (`"`) or single quotes (`'`)

```r
> "one"
> "1"
> "one's"
> '"Hello World"'
> "foo
+
+
+ oops. I'm stuck in a string."
```

---
class: pop-quiz

# Consider

Discuss in the chat:

Are boys' or girls' names more likely to end in a vowel?
---
class: pop-quiz

How can we calculate the proportion of boys' and girls' names that end in a vowel?

```r
babynames
#> # A tibble: 1,924,665 x 5
#>     year sex   name          n   prop
#>    <dbl> <chr> <chr>     <int>  <dbl>
#>  1  1880 F     Mary       7065 0.0724
#>  2  1880 F     Anna       2604 0.0267
#>  3  1880 F     Emma       2003 0.0205
#>  4  1880 F     Elizabeth  1939 0.0199
#>  5  1880 F     Minnie     1746 0.0179
#>  6  1880 F     Margaret   1578 0.0162
#>  7  1880 F     Ida        1472 0.0151
#>  8  1880 F     Alice      1414 0.0145
#>  9  1880 F     Bertha     1320 0.0135
#> 10  1880 F     Sarah      1288 0.0132
#> # … with 1,924,655 more rows
```

---

# Most useful skills

1\. How to extract or replace substrings.

.fade[
2\. How to find matches for patterns.

3\. Regular expressions.
]

---

<div class="hex-book">
  <a href="https://stringr.tidyverse.org">
    <img class="hex" src="images/hex/stringr.png">
  </a>
  <a href="https://r4ds.had.co.nz/strings.html">
    <img class="book" src="images/books/r4ds-strings.png">
  </a>
</div>

---

# `str_sub()`

Extract or replace portions of a string with **`str_sub()`**

<code class ='r hljs remark-code'>str_sub(string, start = 1, end = -1)</code>

---

# `str_sub()`

Extract or replace portions of a string with **`str_sub()`**

<code class ='r hljs remark-code'>str_sub(<span style="background-color:#FED766;color:#009FB7">string</span>, start = 1, end = -1)</code>

???

string(s) to manipulate

---

# `str_sub()`

Extract or replace portions of a string with **`str_sub()`**

<code class ='r hljs remark-code'>str_sub(string, start = <span style="background-color:#FED766;color:#009FB7">1</span>, end = -1)</code>

???

position of the first character to extract within each string

---

# `str_sub()`

Extract or replace portions of a string with **`str_sub()`**

<code class ='r hljs remark-code'>str_sub(string, start = 1, end = <span style="background-color:#FED766;color:#009FB7">-1</span>)</code>

???

position of the last character to extract within each string

---
class: pop-quiz

# Pop quiz!

What will this return?
```r
str_sub("Mephisto", 1, 2)
```

--

```
#> [1] "Me"
```

---
class: pop-quiz

# Pop quiz!

What will this return?

```r
str_sub("Mephisto", 2)
```

--

```
#> [1] "ephisto"
```

---
class: pop-quiz

# Pop quiz!

What will this return?

```r
str_sub("Mephisto", -3)
```

--

```
#> [1] "sto"
```

---
class: pop-quiz

# Pop quiz!

What will this return?

```r
m <- "Mephisto"
str_sub(m, -3) <- "--Agatha!"
m
```

--

```
#> [1] "Mephi--Agatha!"
```

---
class: your-turn

# Your turn 2

Complete the following code to:

1. Isolate the last letter of every **`name`**.
2. Create a variable that indicates whether the last letter is one of "a", "e", "i", "o", "u", or "y".
3. Calculate the proportion of children whose name ends in a vowel, by **`year`** and **`sex`**.
4. Display the results as a line plot.

<code class ='r hljs remark-code'>babynames %>%<br> <span style="background-color:#FED766"> </span>(last = <span style="background-color:#FED766"> </span>,<br> vowel = <span style="background-color:#FED766"> </span>) %>%<br> group_by(<span style="background-color:#FED766"> </span>) %>%<br> <span style="background-color:#FED766"> </span>(prop_vowel = weighted.mean(vowel, w = n)) %>%<br> ggplot(mapping = aes(x = <span style="background-color:#FED766"> </span>, y = <span style="background-color:#FED766;color:#FED766">prop_vowel</span>)) +<br> <span style="background-color:#FED766"> </span>(mapping = <span style="background-color:#FED766"> </span>)</code>
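---

# `str_sub()` is vectorized

A small sketch, using made-up words rather than `babynames`, of how `str_sub()` works on a whole character vector at once and how negative positions count from the end of each string. The vector `words` is an assumption for illustration only.

```r
library(stringr)

# Hypothetical example strings
words <- c("tidyverse", "stringr")

# First four characters of every element
first_four <- str_sub(words, start = 1, end = 4)
first_four
#> [1] "tidy" "stri"

# Negative positions count backwards from the end of each string
last_three <- str_sub(words, start = -3)
last_three
#> [1] "rse" "ngr"
```

Because the result has one element per input string, a call like this can be used directly inside `mutate()`.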
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r babynames %>% mutate(last = str_sub(name, start = -1, end = -1), vowel = last %in% c("a", "e", "i", "o", "u", "y")) %>% group_by(year, sex) %>% summarize(prop_vowel = weighted.mean(vowel, w = n)) %>% ggplot(mapping = aes(x = year, y = prop_vowel)) + geom_line(mapping = aes(color = sex)) ``` ] .panel[.panel-name[Plot] <img src="images/data-types/plots/yt-year-sol-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- name: factors class: center middle # factors --- # Factors R's representation of categorical data. Consists of: 1. A set of **values** 2. An ordered set of **valid levels** ```r eyes <- factor(x = c("blue", "green", "green"), levels = c("blue", "brown", "green", "hazel")) eyes #> [1] blue green green #> Levels: blue brown green hazel ``` --- # Factors Stored internally as an integer vector with a levels attribute ```r unclass(eyes) #> [1] 1 3 3 #> attr(,"levels") #> [1] "blue" "brown" "green" "hazel" ``` --- <div class="hex-book"> <a href="https://forcats.tidyverse.org"> <img class="hex" src="images/hex/forcats.png"> </a> <a href="https://r4ds.had.co.nz/factors.html"> <img class="book" src="images/books/r4ds-factors.png"> </a> </div> --- class: pop-quiz # Consider Discuss in the chat: Do married people watch more or less TV than single people?
---

# Example data: `gss_cat`

```r
gss_cat
#> # A tibble: 21,483 x 9
#>     year marital     age race  rincome    partyid     relig     denom    tvhours
#>    <int> <fct>     <int> <fct> <fct>      <fct>       <fct>     <fct>      <int>
#>  1  2000 Never ma…    26 White $8000 to … Ind,near r… Protesta… Souther…      12
#>  2  2000 Divorced     48 White $8000 to … Not str re… Protesta… Baptist…      NA
#>  3  2000 Widowed      67 White Not appli… Independent Protesta… No deno…       2
#>  4  2000 Never ma…    39 White Not appli… Ind,near r… Orthodox… Not app…       4
#>  5  2000 Divorced     25 White Not appli… Not str de… None      Not app…       1
#>  6  2000 Married      25 White $20000 - … Strong dem… Protesta… Souther…      NA
#>  7  2000 Never ma…    36 White $25000 or… Not str re… Christian Not app…       3
#>  8  2000 Divorced     44 White $7000 to … Ind,near d… Protesta… Luthera…      NA
#>  9  2000 Married      44 White $25000 or… Not str de… Protesta… Other          0
#> 10  2000 Married      47 White $25000 or… Strong rep… Protesta… Souther…       3
#> # … with 21,473 more rows
```

???

A sample of data from the General Social Survey, a long-running US survey conducted by NORC at the University of Chicago.

---

# Which religions watch the least TV?

```r
gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = relig)) +
  geom_point()
```

---

# Which plot do you prefer?

.pull-left[
<img src="images/data-types/plots/religion-tv-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="images/data-types/plots/religion-tv-order-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: center middle

<img src="images/data-types/plots/religion-tv-1.png" width="90%" style="display: block; margin: auto;" />

???

Why is the y-axis in this order?
--- # `levels()` Use **`levels()`** to access a factor's levels ```r levels(gss_cat$relig) #> [1] "No answer" "Don't know" #> [3] "Inter-nondenominational" "Native american" #> [5] "Christian" "Orthodox-christian" #> [7] "Moslem/islam" "Other eastern" #> [9] "Hinduism" "Buddhism" #> [11] "Other" "None" #> [13] "Jewish" "Catholic" #> [15] "Protestant" "Not applicable" ``` --- .smallish[ ```r levels(gss_cat$relig) #> [1] "No answer" "Don't know" "Inter-nondenominational" #> [4] "Native american" "Christian" "Orthodox-christian" #> [7] "Moslem/islam" "Other eastern" "Hinduism" #> [10] "Buddhism" "Other" "None" #> [13] "Jewish" "Catholic" "Protestant" #> [16] "Not applicable" ``` ] <img src="images/data-types/plots/religion-tv-1.png" width="65%" style="display: block; margin: auto;" /> --- # Most useful skills 1. Reorder the levels 2. Recode the levels 3. Collapse levels --- class: center middle .large-left[ # Reordering levels ] .small-right[ <img src="images/hex/forcats.png" width="100%" style="display: block; margin: auto;" /> ] --- # `fct_reorder()` Reorders the levels of a factor based on the result of **`fun(x)`** applied to each group of cases (grouped by level). <code class ='r hljs remark-code'>fct_reorder(f, x, fun = median, ..., .desc = FALSE)</code> --- # `fct_reorder()` Reorders the levels of a factor based on the result of **`fun(x)`** applied to each group of cases (grouped by level). <code class ='r hljs remark-code'>fct_reorder(<span style="background-color:#FED766;color:#009FB7">f</span>, x, fun = median, ..., .desc = FALSE)</code> ??? Factor to reorder --- # `fct_reorder()` Reorders the levels of a factor based on the result of **`fun(x)`** applied to each group of cases (grouped by level). <code class ='r hljs remark-code'>fct_reorder(f, <span style="background-color:#FED766;color:#009FB7">x</span>, fun = median, ..., .desc = FALSE)</code> ??? 
variable to reorder by (in conjunction with `fun`) --- # `fct_reorder()` Reorders the levels of a factor based on the result of **`fun(x)`** applied to each group of cases (grouped by level). <code class ='r hljs remark-code'>fct_reorder(f, x, <span style="background-color:#FED766;color:#009FB7">fun = median</span>, ..., .desc = FALSE)</code> ??? function to reorder by (in conjunction with `x`) --- # `fct_reorder()` Reorders the levels of a factor based on the result of **`fun(x)`** applied to each group of cases (grouped by level). <code class ='r hljs remark-code'>fct_reorder(f, x, fun = median, ..., <span style="background-color:#FED766;color:#009FB7">.desc = FALSE</span>)</code> ??? put in descending order? --- .panelset[ .panel[.panel-name[Code] <code class ='r hljs remark-code'>gss_cat %>%<br> drop_na(tvhours) %>%<br> group_by(relig) %>%<br> summarize(tvhours = mean(tvhours)) %>%<br> ggplot(mapping = aes(x = tvhours, y = <span style="background-color:#FED766;color:#009FB7">fct_reorder(relig, tvhours, mean)</span>)) +<br> geom_point()</code> ] .panel[.panel-name[Plot] <img src="images/data-types/plots/religion-tv-order-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- class: your-turn # Your turn 3 Complete the following code to: 1. Calculate the average number of **`tvhours`**, by marital status. 2. Create a sensible plot of average TV consumption by marital status. <code class ='r hljs remark-code'>gss_cat %>%<br> drop_na(tvhours) %>%<br> group_by(<span style="background-color:#FED766;color:#FED766">marital</span>) %>%<br> summarize(<span style="background-color:#FED766"> </span>) %>%<br> ggplot(mapping = <span style="background-color:#FED766"> </span>) +<br> geom_col()</code>
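---

# `fct_reorder()` on a toy factor

A minimal sketch of what `fct_reorder()` does to the levels of a factor, using invented values (`f` and `x` are assumptions, not columns of `gss_cat`). Arguments are passed by position, as in the slide examples.

```r
library(forcats)

# Hypothetical factor and a numeric variable to order it by
f <- factor(c("a", "b", "c", "a", "b", "c"))
x <- c(3, 1, 2, 5, 1, 2)

# The default level order is alphabetical
levels(f)
#> [1] "a" "b" "c"

# Group means are a = 4, b = 1, c = 2, so the levels are sorted b, c, a
reordered <- fct_reorder(f, x, mean)
levels(reordered)
#> [1] "b" "c" "a"
```

Only the level order changes; the values of the factor stay the same, which is why the reordering shows up on a plot axis but not in the data itself.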
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r gss_cat %>% drop_na(tvhours) %>% group_by(marital) %>% summarize(tvhours = mean(tvhours)) %>% ggplot(mapping = aes(x = tvhours,y = fct_reorder(marital, tvhours, mean))) + geom_col() ``` ] .panel[.panel-name[Plot] <img src="images/data-types/plots/yt-married-tv-sol-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- # `fct_infreq()` ```r ggplot(gss_cat, mapping = aes(x = fct_infreq(marital))) + geom_bar() ``` .pull-left[ <img src="images/data-types/plots/infreq-default-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/data-types/plots/infreq-order-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # `fct_rev()` ```r ggplot(gss_cat, mapping = aes(x = fct_rev(fct_infreq(marital)))) + geom_bar() ``` .pull-left[ <img src="images/data-types/plots/infreq-order-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/data-types/plots/rev-order-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: center middle .large-left[ # Recoding levels ] .small-right[ <img src="images/hex/forcats.png" width="100%" style="display: block; margin: auto;" /> ] --- class: your-turn # Your turn 4 Do liberals or conservatives watch more TV? Compute the average TV consumption by party identification, and plot the results.
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r gss_cat %>% drop_na(tvhours) %>% group_by(partyid) %>% summarize(tvhours = mean(tvhours)) %>% ggplot(mapping = aes(x = tvhours,y = fct_reorder(partyid, tvhours, mean))) + geom_col() + labs(y = "partyid") ``` ] .panel[.panel-name[Plot] <img src="images/data-types/plots/party-tv-sol-1.png" width="80%" style="display: block; margin: auto;" /> ] ] ??? How can we improve these labels? --- # `fct_recode()` Change the values of the levels for a factor <code class ='r hljs remark-code'>fct_recode(f, ...)</code> --- # `fct_recode()` Change the values of the levels for a factor <code class ='r hljs remark-code'>fct_recode(<span style="background-color:#FED766;color:#009FB7">f</span>, ...)</code> ??? factor variable --- # `fct_recode()` Change the values of the levels for a factor <code class ='r hljs remark-code'>fct_recode(f, <span style="background-color:#FED766;color:#009FB7">...</span>)</code> ??? new level = old level pairs --- .panelset[ .panel[.panel-name[Code] ```r gss_cat %>% drop_na(tvhours) %>% mutate(partyid = fct_recode(partyid, "Republican, strong" = "Strong republican", "Republican, weak" = "Not str republican", "Republican, lean" = "Ind,near rep", "Democrat, lean" = "Ind,near dem", "Democrat, weak" = "Not str democrat", "Democrat, strong" = "Strong democrat")) %>% group_by(partyid) %>% summarize(tvhours = mean(tvhours)) %>% ggplot(mapping = aes(x = tvhours, y = fct_reorder(partyid, tvhours, mean))) + geom_col() + labs(y = "partyid") ``` ] .panel[.panel-name[Plot] <img src="images/data-types/plots/party-recode-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- <img src="images/data-types/plots/party-recode-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/data-types/plots/red-party-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/data-types/plots/blue-party-1.png" width="90%" style="display: block; margin: auto;" /> ??? 
How can we combine these groups?

---
class: center middle

.large-left[
# Collapsing levels
]

.small-right[
<img src="images/hex/forcats.png" width="100%" style="display: block; margin: auto;" />
]

---

# `fct_collapse()`

Changes multiple levels into a single level.

<code class ='r hljs remark-code'>fct_collapse(f, ...)</code>

---

# `fct_collapse()`

Changes multiple levels into a single level.

<code class ='r hljs remark-code'>fct_collapse(<span style="background-color:#FED766;color:#009FB7">f</span>, ...)</code>

???

Factor with levels

---

# `fct_collapse()`

Changes multiple levels into a single level.

<code class ='r hljs remark-code'>fct_collapse(f, <span style="background-color:#FED766;color:#009FB7">...</span>)</code>

???

Named levels to be collapsed

---

.panelset[
.panel[.panel-name[Code]

```r
gss_cat %>%
  drop_na(tvhours) %>%
  mutate(new_party = fct_collapse(partyid,
                                  Conservative = c("Strong republican",
                                                   "Not str republican",
                                                   "Ind,near rep"),
                                  Liberal = c("Strong democrat",
                                              "Not str democrat",
                                              "Ind,near dem"))) %>%
  ggplot(mapping = aes(x = fct_infreq(new_party))) +
  geom_bar() +
  labs(x = "party")
```
]

.panel[.panel-name[Plot]
<img src="images/data-types/plots/collapse-count-1.png" width="80%" style="display: block; margin: auto;" />
]
]

---

<img src="images/data-types/plots/need-lump-1.png" width="90%" style="display: block; margin: auto;" />

---

# `fct_lump_n()`

Keeps the **`n`** most frequent levels and collapses all other levels into a single level.

<code class ='r hljs remark-code'>fct_lump_n(f, n, other_level = "Other", ...)</code>

---

# `fct_lump_n()`

Keeps the **`n`** most frequent levels and collapses all other levels into a single level.

<code class ='r hljs remark-code'>fct_lump_n(<span style="background-color:#FED766;color:#009FB7">f</span>, n, other_level = "Other", ...)</code>

???

Factor with levels

---

# `fct_lump_n()`

Keeps the **`n`** most frequent levels and collapses all other levels into a single level.

<code class ='r hljs remark-code'>fct_lump_<span style="background-color:#FED766;color:#009FB7">n</span>(f, <span style="background-color:#FED766;color:#009FB7">n</span>, other_level = "Other", ...)</code>

???

number of levels to keep (the n most frequent; all other levels are lumped together)

---

# `fct_lump_n()`

Keeps the **`n`** most frequent levels and collapses all other levels into a single level.

<code class ='r hljs remark-code'>fct_lump_n(f, n, <span style="background-color:#FED766;color:#009FB7">other_level = "Other"</span>, ...)</code>

???

name of new level

---

.panelset[
.panel[.panel-name[Code]

```r
gss_cat %>%
  drop_na(tvhours) %>%
  mutate(new_party = fct_collapse(partyid,
                                  Conservative = c("Strong republican",
                                                   "Not str republican",
                                                   "Ind,near rep"),
                                  Liberal = c("Strong democrat",
                                              "Not str democrat",
                                              "Ind,near dem")),
         new_party = fct_lump_n(new_party, other_level = "Other", n = 3)) %>%
  group_by(new_party) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(x = tvhours, y = fct_reorder(new_party, tvhours, mean))) +
  geom_col() +
  labs(y = "partyid")
```
]

.panel[.panel-name[Plot]
<img src="images/data-types/plots/show-lump-1.png" width="80%" style="display: block; margin: auto;" />
]
]

---
name: dates
class: center middle

# dates and times

---
class: pop-quiz

# Pop quiz!

Does every year have 365 days?

--

Does every day have 24 hours?

--

Does every minute have 60 seconds?

--

What does a month measure?

---

# Most useful skills

1\. Creating dates/times (i.e., parsing)

2\. Access and change parts of a date

.fade[
3\. Deal with time zones

4\. Do math with instants and time spans
]

---
class: pop-quiz

# Consider

Discuss in the chat:

* What is the best time of day to fly?
* What is the best day of the week to fly?
--- ```r flights %>% select(year, month, day, hour, minute, sched_dep_time, time_hour) #> # A tibble: 336,776 x 7 #> year month day hour minute sched_dep_time time_hour #> <int> <int> <int> <dbl> <dbl> <int> <dttm> #> 1 2013 1 1 5 15 515 2013-01-01 05:00:00 #> 2 2013 1 1 5 29 529 2013-01-01 05:00:00 #> 3 2013 1 1 5 40 540 2013-01-01 05:00:00 #> 4 2013 1 1 5 45 545 2013-01-01 05:00:00 #> 5 2013 1 1 6 0 600 2013-01-01 06:00:00 #> 6 2013 1 1 5 58 558 2013-01-01 05:00:00 #> 7 2013 1 1 6 0 600 2013-01-01 06:00:00 #> 8 2013 1 1 6 0 600 2013-01-01 06:00:00 #> 9 2013 1 1 6 0 600 2013-01-01 06:00:00 #> 10 2013 1 1 6 0 600 2013-01-01 06:00:00 #> # … with 336,766 more rows ``` ??? Let's focus briefly on the relationship between the scheduled departure time, `sched_dep_time`, and the average flight delay. --- .panelset[ .panel[.panel-name[Code] ```r flights %>% ggplot(mapping = aes(x = sched_dep_time, y = arr_delay)) + geom_point(alpha = 0.2) + geom_smooth() ``` ] .panel[.panel-name[Plot] <img src="images/data-types/plots/bad-flight-time-1.png" width="80%" style="display: block; margin: auto;" /> ] ] ??? We might start by making a plot of departure time vs. arrival delay. What's wrong with this? Minutes stop at 59, so there are gaps from 60-99 for every "hour". Time counting is not the same as our normal number line. --- class: center middle .large-left[ # Creating dates and times ] .small-right[ <img src="images/hex/hms.png" width="100%" style="display: block; margin: auto;" /> ] <a href="https://r4ds.had.co.nz/dates-and-times.html"> <img class="norm-book" src="images/books/r4ds-dates-times.png"> </a> --- # hms .center[ .fade[2021-05-04] .blue-highlight[**14:52:34**] ] Stored as the number of seconds since 00:00:00.<sup>*</sup> ```r library(hms) hms(seconds = 34, minutes = 52, hours = 14) #> 14:52:34 unclass(hms(34, 52, 14)) #> [1] 53554 #> attr(,"units") #> [1] "secs" ``` .footnote[*On a typical day.] 
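---

# From clock integers to seconds

A sketch of why integer clock times such as `515` distort distances, and how converting them to `hms` removes the distortion. The three values below are hypothetical scheduled departure times written in the same `HHMM` style as `sched_dep_time`.

```r
library(hms)

# Hypothetical HHMM clock times: 5:15, 5:59, 6:00
clock <- c(515, 559, 600)

# As plain numbers, 5:59 and 6:00 look 41 units apart
diff(clock)
#> [1] 44 41

# Split into hours and minutes, then build hms values
tod <- hms(hours = clock %/% 100, minutes = clock %% 100)

# Stored as seconds since midnight, the gaps are 44 minutes and 1 minute
gaps <- diff(as.numeric(tod))
gaps
#> [1] 2640   60
```

Integer division (`%/%`) and modulo (`%%`) are just one way to split a clock value; `flights` already provides separate `hour` and `minute` columns.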
---
class: your-turn

# Your turn 5

.big[
What is the best time of day to fly?

1. Use the **`hour`** and **`minute`** variables in `flights` to compute the time of day for each flight as an `hms`.

2. Use a smooth line to plot the relationship between time of day and **`arr_delay`**.
]
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r flights %>% mutate(time = hms(hours = hour, minutes = minute)) %>% ggplot(mapping = aes(x = time, y = arr_delay)) + geom_point(alpha = 0.2) + geom_smooth() ``` ] .panel[.panel-name[Plot] <img src="images/data-types/plots/tod-sol-1.png" width="80%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Improved] <img src="images/data-types/plots/tod-sol-best-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- class: center middle .large-left[ # lubridate ] .small-right[ <img src="images/hex/lubridate.png" width="100%" style="display: block; margin: auto;" /> ] <a href="https://r4ds.had.co.nz/dates-and-times.html"> <img class="norm-book" src="images/books/r4ds-dates-times.png"> </a> --- # `ymd()` family To parse strings as dates, use a *y*, *m*, *d*, *h*, *m*, *s* combination. ```r ymd("2021-05-04") #> [1] "2021-05-04" mdy("May 4, 2021") #> [1] "2021-05-04" ymd_hms("2021-05-04 14:52:34") #> [1] "2021-05-04 14:52:34 UTC" ``` --- # Accessing components Extract components by name with a **singular** name. ```r star_wars <- ymd("2021-05-04") year(star_wars) #> [1] 2021 month(star_wars) #> [1] 5 day(star_wars) #> [1] 4 ``` --- # Setting components Use the same functions to set components ```r star_wars #> [1] "2021-05-04" year(star_wars) <- 2002 star_wars #> [1] "2002-05-04" ``` --- # Date and time components
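
| function   | extracts       | extra arguments             |
|:-----------|:---------------|:----------------------------|
| `year()`   | year           |                             |
| `month()`  | month          | `label = FALSE, abbr = TRUE` |
| `week()`   | week           |                             |
| `day()`    | day of month   |                             |
| `wday()`   | day of week    | `label = FALSE, abbr = TRUE` |
| `qday()`   | day of quarter |                             |
| `yday()`   | day of year    |                             |
| `hour()`   | hour           |                             |
| `minute()` | minute         |                             |
| `second()` | second         |                             |
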
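---

# Component helpers in action

A short sketch applying a few of the functions from the table to a single date. The date is the one parsed earlier in these slides; the object names `d`, `doy`, `wk`, and `month_label` are assumptions for illustration.

```r
library(lubridate)

d <- ymd("2021-05-04")

# Numeric components
month(d)
#> [1] 5
doy <- yday(d)  # 31 + 28 + 31 + 30 + 4 days into the year
doy
#> [1] 124
wk <- week(d)   # complete seven-day periods since January 1, plus one
wk
#> [1] 18

# label = TRUE returns an ordered factor instead of a number
month_label <- month(d, label = TRUE)
is.ordered(month_label)
#> [1] TRUE
```

The text of the label depends on the session locale, so it is not printed here.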
--- # Accessing components ```r wday(star_wars) #> [1] 7 wday(star_wars, label = TRUE) #> [1] Sat #> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat wday(star_wars, label = TRUE, abbr = FALSE) #> [1] Saturday #> 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday ``` --- class: your-turn # Your turn 6 Complete the following code to: 1. Extract the day of the week for each flight (as a full name) from **`time_hour`**. 2. Calculate the average **`arr_delay`** by day of the week. 3. Plot the results as a column chart (bar chart) with `geom_col()`. <code class ='r hljs remark-code'>flights %>% <br> drop_na(arr_delay) %>% <br> mutate(weekday = <span style="background-color:#FED766"> </span>) %>% <br> <span style="background-color:#FED766"> </span> %>%<br> <span style="background-color:#FED766"> </span>(avg_delay = <span style="background-color:#FED766"> </span>) %>% <br> ggplot(mapping = aes(x = weekday, y = avg_delay)) +<br> <span style="background-color:#FED766"> </span></code>
--- class: your-turn .panelset[ .panel[.panel-name[Code] ```r flights %>% drop_na(arr_delay) %>% mutate(weekday = wday(time_hour, label = TRUE, abbr = FALSE)) %>% group_by(weekday) %>% summarize(avg_delay = mean(arr_delay)) %>% ggplot(mapping = aes(x = weekday, y = avg_delay)) + geom_col() ``` ] .panel[.panel-name[Plot] <img src="images/data-types/plots/yt-flight-day-sol-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- class: title-slide, center # Data Types <img src="images/data-types/hex-group.png" width="30%" style="display: block; margin: auto;" /> ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]