Tidy Data Science with the Tidyverse and Tidymodels is licensed under a Creative Commons Attribution 4.0 International License.
What types of data are in this data set?
# A tibble: 336,776 x 6
   time_hour           name          air_time             distance day   delayed
   <dttm>              <chr>         <Duration>              <dbl> <ord> <lgl>
 1 2013-01-01 05:00:00 United Air L… 13620s (~3.78 hours)     1400 Tues… TRUE
 2 2013-01-01 05:00:00 United Air L… 13620s (~3.78 hours)     1416 Tues… TRUE
 3 2013-01-01 05:00:00 American Air… 9600s (~2.67 hours)      1089 Tues… TRUE
 4 2013-01-01 05:00:00 JetBlue Airw… 10980s (~3.05 hours)     1576 Tues… FALSE
 5 2013-01-01 06:00:00 Delta Air Li… 6960s (~1.93 hours)       762 Tues… FALSE
 6 2013-01-01 05:00:00 United Air L… 9000s (~2.5 hours)        719 Tues… TRUE
 7 2013-01-01 06:00:00 JetBlue Airw… 9480s (~2.63 hours)      1065 Tues… TRUE
 8 2013-01-01 06:00:00 ExpressJet A… 3180s (~53 minutes)       229 Tues… FALSE
 9 2013-01-01 06:00:00 JetBlue Airw… 8400s (~2.33 hours)       944 Tues… FALSE
10 2013-01-01 06:00:00 American Air… 8280s (~2.3 hours)        733 Tues… TRUE
# … with 336,766 more rows
R's data type for Boolean values (i.e., TRUE and FALSE)
typeof(TRUE)
#> [1] "logical"
typeof(FALSE)
#> [1] "logical"
typeof(c(TRUE, TRUE, FALSE))
#> [1] "logical"
flights %>%
  mutate(delayed = arr_delay > 0) %>%
  select(arr_delay, delayed)
#> # A tibble: 336,776 x 2
#>    arr_delay delayed
#>        <dbl> <lgl>
#>  1        11 TRUE
#>  2        20 TRUE
#>  3        33 TRUE
#>  4       -18 FALSE
#>  5       -25 FALSE
#>  6        12 TRUE
#>  7        19 TRUE
#>  8       -14 FALSE
#>  9        -8 FALSE
#> 10         8 TRUE
#> # … with 336,766 more rows
Can we compute the proportion of flights that arrived late?
Math with logicals
When you do math with logicals, TRUE becomes 1 and FALSE becomes 0.
The sum of a logical vector is the count of TRUEs.
sum(c(TRUE, FALSE, TRUE, TRUE))
#> [1] 3
The mean of a logical vector is the proportion of TRUEs.
mean(c(1, 2, 3, 4) < 4)
#> [1] 0.75
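One caveat when summing or averaging logicals: comparing NA with a number yields NA, and NA propagates through sum() and mean(). A minimal sketch, using a made-up delays vector (hypothetical values, not taken from flights):

```r
# Hypothetical arrival delays in minutes; one value is missing
delays <- c(11, -18, NA, 8)

# Comparing NA with 0 gives NA, not TRUE or FALSE
late <- delays > 0

# NA propagates through arithmetic on logicals
total_with_na <- sum(late)

# na.rm = TRUE drops the missing values before summarizing
total_late <- sum(late, na.rm = TRUE)  # count of TRUEs among known values
prop_late <- mean(late, na.rm = TRUE)  # proportion of TRUEs among known values

print(late)
print(total_with_na)
print(total_late)
print(prop_late)
```

Removing the NA rows up front, or passing na.rm = TRUE, are two routes to the same summary; which is clearer depends on whether the missing rows matter elsewhere in the analysis.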
Use the flights data set to create a new variable, delayed, that indicates if the flight was delayed (arr_delay > 0).
Then, remove all rows that contain an NA in the delayed variable.
Finally, create a summary table that shows the total number of delayed flights and the proportion of flights that were delayed.
flights %>%
  mutate(delayed = arr_delay > 0) %>%
  drop_na(delayed) %>%
  summarize(total = sum(delayed), prop = mean(delayed))
#> # A tibble: 1 x 2
#>    total  prop
#>    <int> <dbl>
#> 1 133004 0.406
Anything surrounded by quotes (") or single quotes (')
> "one"
> "1"
> "one's"
> '"Hello World"'
> "foo
+
+
+ oops. I'm stuck in a string."
Discuss in the chat: Are boys' or girls' names more likely to end in a vowel?
How can we calculate the proportion of boys' and girls' names that end in a vowel?
babynames
#> # A tibble: 1,924,665 x 5
#>     year sex   name          n   prop
#>    <dbl> <chr> <chr>     <int>  <dbl>
#>  1  1880 F     Mary       7065 0.0724
#>  2  1880 F     Anna       2604 0.0267
#>  3  1880 F     Emma       2003 0.0205
#>  4  1880 F     Elizabeth  1939 0.0199
#>  5  1880 F     Minnie     1746 0.0179
#>  6  1880 F     Margaret   1578 0.0162
#>  7  1880 F     Ida        1472 0.0151
#>  8  1880 F     Alice      1414 0.0145
#>  9  1880 F     Bertha     1320 0.0135
#> 10  1880 F     Sarah      1288 0.0132
#> # … with 1,924,655 more rows
1. How to extract or replace substrings.
2. How to find matches for patterns.
3. Regular expressions.
str_sub()
Extract or replace portions of a string with str_sub()
str_sub(string, start = 1, end = -1)
string: string(s) to manipulate
start: position of the first character to extract within each string
end: position of the last character to extract within each string
What will this return?
str_sub("Mephisto", 1, 2)
#> [1] "Me"
What will this return?
str_sub("Mephisto", 2)
#> [1] "ephisto"
What will this return?
str_sub("Mephisto", -3)
#> [1] "sto"
What will this return?
m <- "Mephisto"
str_sub(m, -3) <- "--Agatha!"
m
#> [1] "Mephi--Agatha!"
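Complete the following code to:
1. Isolate the last letter of every name.
2. Create a logical variable, vowel, that indicates whether the last letter is a vowel.
3. Calculate the proportion of children whose names end in a vowel, by year and sex.
4. Plot the results.
babynames %>%
  _____(last = _____,
        vowel = _____) %>%
  group_by(_____) %>%
  _____(prop_vowel = weighted.mean(vowel, w = n)) %>%
  ggplot(mapping = aes(x = _____, y = prop_vowel)) +
  _____(mapping = _____)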
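The same idea can be sketched on a tiny made-up table before trying it on babynames. Everything here is an assumption for illustration: the toy_names tibble is hypothetical, and the vowel set (a, e, i, o, u) is one reasonable choice rather than the only one. Plotting is left out so the numbers are easy to check by hand.

```r
library(dplyr)
library(stringr)

# Hypothetical data with the same columns the exercise relies on
toy_names <- tibble(
  year = c(2000, 2000, 2000, 2000),
  sex  = c("F", "F", "M", "M"),
  name = c("Emma", "Megan", "Noah", "Levi"),
  n    = c(30, 10, 20, 20)
)

# Assumed vowel set; whether "y" counts is a judgment call
vowels <- c("a", "e", "i", "o", "u")

result <- toy_names %>%
  mutate(
    last  = str_sub(name, -1),   # negative start counts from the end
    vowel = last %in% vowels     # logical: does the name end in a vowel?
  ) %>%
  group_by(year, sex) %>%
  # weight each name by the number of children who received it
  summarize(prop_vowel = weighted.mean(vowel, w = n), .groups = "drop")

print(result)
```

Weighting by n matters: an unweighted mean would treat a rare name and a very common name as equally important.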
R's representation of categorical data. Consists of a set of values and an ordered set of valid levels.
eyes <- factor(x = c("blue", "green", "green"),
               levels = c("blue", "brown", "green", "hazel"))
eyes
#> [1] blue  green green
#> Levels: blue brown green hazel
Stored internally as an integer vector with a levels attribute
unclass(eyes)
#> [1] 1 3 3
#> attr(,"levels")
#> [1] "blue"  "brown" "green" "hazel"
Discuss in the chat: Do married people watch more or less TV than single people?
gss_cat
gss_cat
#> # A tibble: 21,483 x 9
#>     year marital     age race  rincome    partyid     relig     denom    tvhours
#>    <int> <fct>     <int> <fct> <fct>      <fct>       <fct>     <fct>      <int>
#>  1  2000 Never ma…    26 White $8000 to … Ind,near r… Protesta… Souther…      12
#>  2  2000 Divorced     48 White $8000 to … Not str re… Protesta… Baptist…      NA
#>  3  2000 Widowed      67 White Not appli… Independent Protesta… No deno…       2
#>  4  2000 Never ma…    39 White Not appli… Ind,near r… Orthodox… Not app…       4
#>  5  2000 Divorced     25 White Not appli… Not str de… None      Not app…       1
#>  6  2000 Married      25 White $20000 - … Strong dem… Protesta… Souther…      NA
#>  7  2000 Never ma…    36 White $25000 or… Not str re… Christian Not app…       3
#>  8  2000 Divorced     44 White $7000 to … Ind,near d… Protesta… Luthera…      NA
#>  9  2000 Married      44 White $25000 or… Strong rep… Protesta… Other          0
#> 10  2000 Married      47 White $25000 or… Strong rep… Protesta… Souther…       3
#> # … with 21,473 more rows
A sample of data from the General Social Survey, a long-running US survey conducted by NORC at the University of Chicago.
gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = relig)) +
  geom_point()
Why is the y-axis in this order?
levels()
Use levels() to access a factor's levels
levels(gss_cat$relig)
#>  [1] "No answer"               "Don't know"              "Inter-nondenominational"
#>  [4] "Native american"         "Christian"               "Orthodox-christian"
#>  [7] "Moslem/islam"            "Other eastern"           "Hinduism"
#> [10] "Buddhism"                "Other"                   "None"
#> [13] "Jewish"                  "Catholic"                "Protestant"
#> [16] "Not applicable"
fct_reorder()
Reorders the levels of a factor based on the result of .fun(.x) applied to each group of cases (grouped by level).
fct_reorder(.f, .x, .fun = median, ..., .desc = FALSE)
.f: factor to reorder
.x: variable to reorder by (in conjunction with .fun)
.fun: function to reorder by (in conjunction with .x)
.desc: put in descending order?
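A small sketch of fct_reorder() on made-up data may help before applying it to gss_cat. The group and score vectors are hypothetical, and the sketch assumes a recent forcats release in which the summary function argument is named .fun.

```r
library(forcats)

# Hypothetical factor with explicit levels and a numeric variable to order by
group <- factor(c("a", "a", "b", "b", "c", "c"), levels = c("a", "b", "c"))
score <- c(5, 7, 1, 3, 9, 11)

# Group means are a = 6, b = 2, c = 10, so ascending order is b, a, c
ascending <- fct_reorder(group, score, .fun = mean)

# .desc = TRUE flips the ordering
descending <- fct_reorder(group, score, .fun = mean, .desc = TRUE)

print(levels(group))       # "a" "b" "c"
print(levels(ascending))   # "b" "a" "c"
print(levels(descending))  # "c" "a" "b"
```

Only the level order changes; the values stored in the factor are untouched, which is why this is safe to do inside aes() purely for plotting.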
Complete the following code to plot the average tvhours, by marital status.
gss_cat %>%
  drop_na(tvhours) %>%
  group_by(marital) %>%
  summarize(_____) %>%
  ggplot(mapping = _____) +
  geom_col()
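fct_infreq()
Reorders the levels of a factor by how often each level occurs, most frequent first.
ggplot(gss_cat, mapping = aes(x = fct_infreq(marital))) +
  geom_bar()
fct_rev()
Reverses the order of a factor's levels.
ggplot(gss_cat, mapping = aes(x = fct_rev(fct_infreq(marital)))) +
  geom_bar()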
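To see what these two helpers do to the levels themselves, here is a minimal sketch on a hypothetical answer vector (not from gss_cat):

```r
library(forcats)

# Hypothetical responses: "no" occurs 3 times, "yes" twice, "maybe" once
answer <- factor(c("yes", "no", "no", "maybe", "no", "yes"))

# Most frequent level first
by_freq <- fct_infreq(answer)

# Same ordering, reversed (least frequent first)
by_freq_rev <- fct_rev(fct_infreq(answer))

print(levels(by_freq))      # "no" "yes" "maybe"
print(levels(by_freq_rev))  # "maybe" "yes" "no"
```

In a bar chart, the first ordering puts the tallest bar on the left, and the reversed ordering puts it on the right.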
Do liberals or conservatives watch more TV?
Compute the average TV consumption by party identification, and plot the results.
How can we improve these labels?
fct_recode()
Change the values of the levels for a factor
fct_recode(.f, ...)
.f: factor variable
...: new level = old level pairs
gss_cat %>%
  drop_na(tvhours) %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong" = "Strong republican",
    "Republican, weak" = "Not str republican",
    "Republican, lean" = "Ind,near rep",
    "Democrat, lean" = "Ind,near dem",
    "Democrat, weak" = "Not str democrat",
    "Democrat, strong" = "Strong democrat")) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = fct_reorder(partyid, tvhours, mean))) +
  geom_col() +
  labs(y = "partyid")
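Stripped of the plotting pipeline, the relabeling step looks like this. It is a sketch on a hypothetical three-element factor that borrows two of the partyid level names.

```r
library(forcats)

# Hypothetical factor using two of the gss_cat partyid labels
partyid <- factor(
  c("Ind,near rep", "Strong democrat", "Ind,near rep"),
  levels = c("Ind,near rep", "Strong democrat")
)

# Each argument is "new label" = "old label"
relabeled <- fct_recode(partyid,
  "Republican, lean" = "Ind,near rep",
  "Democrat, strong" = "Strong democrat"
)

print(levels(relabeled))      # "Republican, lean" "Democrat, strong"
print(as.integer(relabeled))  # underlying integer codes are unchanged
```

Levels that are not mentioned in the call keep their existing labels.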
How can we combine these groups?
fct_collapse()
Changes multiple levels into a single level.
fct_collapse(.f, ...)
.f: factor with levels
...: named levels to be collapsed
gss_cat %>%
  drop_na(tvhours) %>%
  mutate(new_party = fct_collapse(partyid,
    Conservative = c("Strong republican", "Not str republican", "Ind,near rep"),
    Liberal = c("Strong democrat", "Not str democrat", "Ind,near dem"))) %>%
  ggplot(mapping = aes(x = fct_infreq(new_party))) +
  geom_bar() +
  labs(x = "party")
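A minimal sketch of the collapsing step alone, on a hypothetical five-element factor that reuses some partyid labels:

```r
library(forcats)

# Hypothetical factor; level names borrowed from gss_cat$partyid
partyid <- factor(c(
  "Strong republican", "Ind,near rep",
  "Strong democrat", "Not str democrat",
  "Independent"
))

# Each argument is new_level = c(old levels to merge)
grouped <- fct_collapse(partyid,
  Conservative = c("Strong republican", "Ind,near rep"),
  Liberal      = c("Strong democrat", "Not str democrat")
)

# Counts per level after collapsing: two Conservative, two Liberal, one Independent
print(table(grouped))
```

Levels that are not named in the call, such as "Independent" here, are left as they are.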
fct_lump_n()
Lumps all levels except the n most frequent into a single level.
fct_lump_n(f, n, other_level = "Other", ...)
f: factor with levels
n: number of levels to keep (the n most frequent); the remaining levels are lumped together
other_level: name of the new level
gss_cat %>%
  drop_na(tvhours) %>%
  mutate(new_party = fct_collapse(partyid,
           Conservative = c("Strong republican", "Not str republican", "Ind,near rep"),
           Liberal = c("Strong democrat", "Not str democrat", "Ind,near dem")),
         new_party = fct_lump_n(new_party, other_level = "Other", n = 3)) %>%
  group_by(new_party) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(x = tvhours, y = fct_reorder(new_party, tvhours, mean))) +
  geom_col() +
  labs(y = "partyid")
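The effect of n in fct_lump_n() is easiest to see on a small hypothetical factor: with a positive n, the n most frequent levels are kept and everything else is merged into other_level.

```r
library(forcats)

# Hypothetical factor: A occurs 3 times, B twice, C and D once each
party <- factor(c("A", "A", "A", "B", "B", "C", "D"))

# Keep the 2 most frequent levels; lump the rest into "Other"
lumped <- fct_lump_n(party, n = 2, other_level = "Other")

print(levels(lumped))  # "A" "B" "Other"
print(table(lumped))   # A: 3, B: 2, Other: 2
```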
Does every year have 365 days?
Does every day have 24 hours?
Does every minute have 60 seconds?
What does a month measure?
1. Create dates/times (i.e., parsing)
2. Access and change parts of a date
3. Deal with time zones
4. Do math with instants and time spans
Discuss in the chat:
What is the best time of day to fly?
What is the best day of the week to fly?
flights %>%
  select(year, month, day, hour, minute, sched_dep_time, time_hour)
#> # A tibble: 336,776 x 7
#>     year month   day  hour minute sched_dep_time time_hour
#>    <int> <int> <int> <dbl>  <dbl>          <int> <dttm>
#>  1  2013     1     1     5     15            515 2013-01-01 05:00:00
#>  2  2013     1     1     5     29            529 2013-01-01 05:00:00
#>  3  2013     1     1     5     40            540 2013-01-01 05:00:00
#>  4  2013     1     1     5     45            545 2013-01-01 05:00:00
#>  5  2013     1     1     6      0            600 2013-01-01 06:00:00
#>  6  2013     1     1     5     58            558 2013-01-01 05:00:00
#>  7  2013     1     1     6      0            600 2013-01-01 06:00:00
#>  8  2013     1     1     6      0            600 2013-01-01 06:00:00
#>  9  2013     1     1     6      0            600 2013-01-01 06:00:00
#> 10  2013     1     1     6      0            600 2013-01-01 06:00:00
#> # … with 336,766 more rows
Let's focus briefly on the relationship between the scheduled departure time, sched_dep_time, and the average flight delay.
We might start by making a plot of departure time vs. arrival delay. What's wrong with this? sched_dep_time is stored as an integer in HHMM form (e.g., 515 means 05:15), so minutes stop at 59 and every "hour" has a gap from 60 to 99. Clock time does not behave like the ordinary number line.
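The gap is easy to demonstrate with integer division and modulo. This is a minimal base R sketch on a few made-up HHMM values (hypothetical, chosen to straddle an hour boundary):

```r
# Hypothetical scheduled departure times stored as HHMM integers
sched_dep_time <- c(515, 559, 600, 1345)

# Split HHMM into its parts
hour   <- sched_dep_time %/% 100  # integer division: 5, 5, 6, 13
minute <- sched_dep_time %% 100   # remainder: 15, 59, 0, 45

# A scale that does behave like a number line
minutes_since_midnight <- hour * 60 + minute

# Differences on the raw HHMM scale overstate the one-minute step from 5:59 to 6:00
print(diff(sched_dep_time))          # 44 41 745
print(diff(minutes_since_midnight))  # 44 1 465
```

On the raw scale, 5:59 and 6:00 appear to be 41 units apart, which is exactly the kind of distortion that shows up as gaps in the plot.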
2021-05-04 14:52:34
Stored as the number of seconds since 00:00:00.*
library(hms)
hms(seconds = 34, minutes = 52, hours = 14)
#> 14:52:34
unclass(hms(34, 52, 14))
#> [1] 53554
#> attr(,"units")
#> [1] "secs"
*On a typical day.
What is the best time of day to fly?
1. Use the hour and minute variables in flights to compute the time of day for each flight as an hms.
2. Plot the relationship between the time of day and arr_delay.
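As a warm-up for the first step, here is a sketch of building hms values from separate hour and minute vectors. The vectors are hypothetical stand-ins for the flights columns; hms() is vectorized, so the same call would work inside mutate().

```r
library(hms)

# Hypothetical hour and minute values, shaped like the flights columns
hour   <- c(5, 5, 6)
minute <- c(15, 29, 0)

# Combine the parts into a time of day
time_of_day <- hms(hours = hour, minutes = minute)
print(time_of_day)

# Underneath, each value is the number of seconds since midnight
seconds_since_midnight <- as.numeric(time_of_day)
print(seconds_since_midnight)  # 18900 19740 21600
```

Because the values are seconds under the hood, they fall on a continuous axis when plotted, without the 60 to 99 gaps.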
The ymd() family
To parse strings as dates, use a y, m, d, h, m, s combination.
ymd("2021-05-04")
#> [1] "2021-05-04"
mdy("May 4, 2021")
#> [1] "2021-05-04"
ymd_hms("2021-05-04 14:52:34")
#> [1] "2021-05-04 14:52:34 UTC"
Extract components by name, using the function with the singular form of the component.
star_wars <- ymd("2021-05-04")
year(star_wars)
#> [1] 2021
month(star_wars)
#> [1] 5
day(star_wars)
#> [1] 4
Use the same functions to set components
star_wars
#> [1] "2021-05-04"
year(star_wars) <- 2002
star_wars
#> [1] "2002-05-04"
function | extracts | extra arguments |
---|---|---|
year() | year | |
month() | month | label = FALSE, abbr = TRUE |
week() | week | |
day() | day of month | |
wday() | day of week | label = FALSE, abbr = TRUE |
qday() | day of quarter | |
yday() | day of year | |
hour() | hour | |
minute() | minute | |
second() | second | |
wday(star_wars)
#> [1] 7
wday(star_wars, label = TRUE)
#> [1] Sat
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
wday(star_wars, label = TRUE, abbr = FALSE)
#> [1] Saturday
#> 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
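Complete the following code to:
1. Extract the day of the week of each flight from time_hour.
2. Calculate the average arr_delay by day of the week.
3. Plot the results as a column chart with geom_col().
flights %>%
  drop_na(arr_delay) %>%
  mutate(weekday = _____) %>%
  _____ %>%
  _____(avg_delay = _____) %>%
  ggplot(mapping = aes(x = weekday, y = avg_delay)) +
  _____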
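The grouping logic can be checked by hand on a tiny made-up table before running it on flights. The toy_flights tibble below is hypothetical; it only mimics the time_hour and arr_delay columns, and the plot is omitted.

```r
library(dplyr)
library(lubridate)

# Hypothetical flights: three on Tuesdays, one on a Wednesday
toy_flights <- tibble(
  time_hour = ymd_hms(c(
    "2013-01-01 05:00:00",  # Tuesday
    "2013-01-01 06:00:00",  # Tuesday
    "2013-01-02 05:00:00",  # Wednesday
    "2013-01-08 07:00:00"   # Tuesday
  )),
  arr_delay = c(10, 20, -5, 30)
)

by_weekday <- toy_flights %>%
  # label = TRUE returns an ordered factor, so weekdays sort in calendar order
  mutate(weekday = wday(time_hour, label = TRUE)) %>%
  group_by(weekday) %>%
  summarize(avg_delay = mean(arr_delay))

# Two rows: the Tuesday average is 20, the Wednesday average is -5
print(by_weekday)
```

The printed weekday labels depend on the session locale, but the ordering of the rows does not.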