4

Data Types

Tidy Data Science with the Tidyverse and Tidymodels

W. Jake Thompson

https://tidyds-2021.wjakethompson.com · https://bit.ly/tidyds-2021

Tidy Data Science with the Tidyverse and Tidymodels is licensed under a Creative Commons Attribution 4.0 International License.

Pop quiz!

What types of data are in this data set?

# A tibble: 336,776 x 6
   time_hour           name          air_time             distance day   delayed
   <dttm>              <chr>         <Duration>              <dbl> <ord> <lgl>
 1 2013-01-01 05:00:00 United Air L… 13620s (~3.78 hours)     1400 Tues  TRUE
 2 2013-01-01 05:00:00 United Air L… 13620s (~3.78 hours)     1416 Tues  TRUE
 3 2013-01-01 05:00:00 American Air… 9600s (~2.67 hours)      1089 Tues  TRUE
 4 2013-01-01 05:00:00 JetBlue Airw… 10980s (~3.05 hours)     1576 Tues  FALSE
 5 2013-01-01 06:00:00 Delta Air Li… 6960s (~1.93 hours)       762 Tues  FALSE
 6 2013-01-01 05:00:00 United Air L… 9000s (~2.5 hours)        719 Tues  TRUE
 7 2013-01-01 06:00:00 JetBlue Airw… 9480s (~2.63 hours)      1065 Tues  TRUE
 8 2013-01-01 06:00:00 ExpressJet A… 3180s (~53 minutes)       229 Tues  FALSE
 9 2013-01-01 06:00:00 JetBlue Airw… 8400s (~2.33 hours)       944 Tues  FALSE
10 2013-01-01 06:00:00 American Air… 8280s (~2.3 hours)        733 Tues  TRUE
# … with 336,766 more rows

(Applied) Data Science

logicals

Logicals

R's data type for Boolean values (i.e., TRUE and FALSE)

typeof(TRUE)
#> [1] "logical"
typeof(FALSE)
#> [1] "logical"
typeof(c(TRUE, TRUE, FALSE))
#> [1] "logical"
flights %>%
  mutate(delayed = arr_delay > 0) %>%
  select(arr_delay, delayed)
#> # A tibble: 336,776 x 2
#> arr_delay delayed
#> <dbl> <lgl>
#> 1 11 TRUE
#> 2 20 TRUE
#> 3 33 TRUE
#> 4 -18 FALSE
#> 5 -25 FALSE
#> 6 12 TRUE
#> 7 19 TRUE
#> 8 -14 FALSE
#> 9 -8 FALSE
#> 10 8 TRUE
#> # … with 336,766 more rows

Can we compute the proportion of flights that arrived late?

Most useful skills

  • Math with logicals

    • When you do math with logicals, TRUE becomes 1 and FALSE becomes 0
  • The sum of a logical vector is the count of TRUEs

sum(c(TRUE, FALSE, TRUE, TRUE))
#> [1] 3
  • The mean of a logical vector is the proportion of TRUEs
mean(c(1, 2, 3, 4) < 4)
#> [1] 0.75
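
One detail worth seeing before the exercise: a single NA in a logical vector makes both sum() and mean() return NA. The sketch below uses a small made-up vector of arrival delays (the values are hypothetical, not taken from flights) to show the effect and the na.rm = TRUE workaround.

```r
# Hypothetical arrival delays in minutes; NA stands for a flight with no recorded arrival.
arr_delay <- c(11, -18, NA, 33, -25)

# Comparing with NA yields NA, so the logical vector keeps the missing value.
delayed <- arr_delay > 0   # TRUE FALSE NA TRUE FALSE

sum(delayed)    # NA: one missing value makes the whole sum missing
mean(delayed)   # NA

# Dropping missing values first gives the count and proportion of TRUEs.
n_delayed    <- sum(delayed, na.rm = TRUE)    # 2
prop_delayed <- mean(delayed, na.rm = TRUE)   # 2 / 4 = 0.5
```

Removing the NA rows up front (for example with drop_na()) has the same effect as passing na.rm = TRUE to each summary function.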

Your turn 1

Use the flights data set to create a new variable, delayed, that indicates whether the flight was delayed (arr_delay > 0).

Then, remove all rows that contain an NA in the delayed variable.

Finally, create a summary table that shows:

  1. How many flights were delayed?
  2. What proportion of flights were delayed?
04:00
flights %>%
  mutate(delayed = arr_delay > 0) %>%
  drop_na(delayed) %>%
  summarize(total = sum(delayed), prop = mean(delayed))
#> # A tibble: 1 x 2
#> total prop
#> <int> <dbl>
#> 1 133004 0.406

strings

(character) strings

Anything surrounded by double quotes (") or single quotes (')

> "one"
> "1"
> "one's"
> '"Hello World"'
> "foo
+
+
+ oops. I'm stuck in a string."

Consider

Discuss in the chat: Are boys' or girls' names more likely to end in a vowel?

01:00

How can we calculate the proportion of boys' and girls' names that end in a vowel?

babynames
#> # A tibble: 1,924,665 x 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1880 F Mary 7065 0.0724
#> 2 1880 F Anna 2604 0.0267
#> 3 1880 F Emma 2003 0.0205
#> 4 1880 F Elizabeth 1939 0.0199
#> 5 1880 F Minnie 1746 0.0179
#> 6 1880 F Margaret 1578 0.0162
#> 7 1880 F Ida 1472 0.0151
#> 8 1880 F Alice 1414 0.0145
#> 9 1880 F Bertha 1320 0.0135
#> 10 1880 F Sarah 1288 0.0132
#> # … with 1,924,655 more rows

Most useful skills

1. How to extract or replace substrings.

2. How to find matches for patterns.

3. Regular expressions.
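
As a small illustration of skills 2 and 3, the sketch below checks which names end in a vowel using a regular expression. It assumes the stringr package is installed; names_sample is a hypothetical vector built from a few names in the babynames preview above, and "y" is counted as a vowel here purely as a choice for this example.

```r
library(stringr)

# A few names copied from the babynames preview (hypothetical sample vector).
names_sample <- c("Mary", "Anna", "Elizabeth", "Alice", "Sarah")

# "[aeiouy]" matches any one of the listed letters; "$" anchors the match
# to the end of the string.
ends_in_vowel <- str_detect(names_sample, "[aeiouy]$")
ends_in_vowel
#> [1]  TRUE  TRUE FALSE  TRUE FALSE

# Because the result is logical, mean() gives the proportion of matches.
mean(ends_in_vowel)
#> [1] 0.6
```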

str_sub()

Extract or replace portions of a string with str_sub()

str_sub(string, start = 1, end = -1)

  • string: string(s) to manipulate
  • start: position of the first character to extract within each string
  • end: position of the last character to extract within each string

Pop quiz!

What will this return?

str_sub("Mephisto", 1, 2)
#> [1] "Me"

Pop quiz!

What will this return?

str_sub("Mephisto", 2)
#> [1] "ephisto"

Pop quiz!

What will this return?

str_sub("Mephisto", -3)
#> [1] "sto"

Pop quiz!

What will this return?

m <- "Mephisto"
str_sub(m, -3) <- "--Agatha!"
m
#> [1] "Mephi--Agatha!"
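
The quiz answers use a single string, but str_sub() also works element by element on a character vector. A minimal sketch, assuming stringr is installed and using a hypothetical three-name vector:

```r
library(stringr)

# Hypothetical sample of names.
names_sample <- c("Mary", "Emma", "Margaret")

# A negative start counts from the end, so -1 is the last character.
last_letter <- str_sub(names_sample, start = -1)
last_letter
#> [1] "y" "a" "t"

# Positive positions count from the beginning of each string.
first_three <- str_sub(names_sample, start = 1, end = 3)
first_three
#> [1] "Mar" "Emm" "Mar"
```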

Your turn 2

Complete the following code to:

  1. Isolate the last letter of every name.
  2. Create a variable that indicates whether the last letter is one of "a", "e", "i", "o", "u", or "y".
  3. Calculate the proportion of children whose name ends in a vowel, by year and sex.
  4. Display the results as a line plot.

babynames %>%
        (last =                                   ,
         vowel =                                         ) %>%
  group_by(         ) %>%
           (prop_vowel = weighted.mean(vowel, w = n)) %>%
  ggplot(mapping = aes(x =    , y = prop_vowel)) +
           (mapping =                )

05:00
babynames %>%
  mutate(last = str_sub(name, start = -1, end = -1),
         vowel = last %in% c("a", "e", "i", "o", "u", "y")) %>%
  group_by(year, sex) %>%
  summarize(prop_vowel = weighted.mean(vowel, w = n)) %>%
  ggplot(mapping = aes(x = year, y = prop_vowel)) +
  geom_line(mapping = aes(color = sex))

factors

Factors

R's representation of categorical data. Consists of:

  1. A set of values
  2. An ordered set of valid levels
eyes <- factor(x = c("blue", "green", "green"),
               levels = c("blue", "brown", "green", "hazel"))
eyes
#> [1] blue green green
#> Levels: blue brown green hazel

Factors

Stored internally as an integer vector with a levels attribute

unclass(eyes)
#> [1] 1 3 3
#> attr(,"levels")
#> [1] "blue" "brown" "green" "hazel"
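
The integer storage has two practical consequences that a short base R sketch can show: unused levels are still counted, and a value outside the set of valid levels becomes NA. The value "purple" below is a made-up example of an invalid level.

```r
eyes <- factor(x = c("blue", "green", "green"),
               levels = c("blue", "brown", "green", "hazel"))

# The integer codes index into the levels attribute.
codes <- as.integer(eyes)
codes
#> [1] 1 3 3

# Levels with no observations still appear, with a count of zero.
counts <- table(eyes)
counts[["brown"]]
#> [1] 0

# A value that is not a valid level is stored as NA.
bad <- factor(c("blue", "purple"), levels = levels(eyes))
is.na(bad)
#> [1] FALSE  TRUE
```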

Consider

Discuss in the chat: Do married people watch more or less TV than single people?

01:00

Example data: gss_cat

gss_cat
#> # A tibble: 21,483 x 9
#> year marital age race rincome partyid relig denom tvhours
#> <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
#> 1 2000 Never ma… 26 White $8000 to … Ind,near r… Protesta… Souther… 12
#> 2 2000 Divorced 48 White $8000 to … Not str re… Protesta… Baptist… NA
#> 3 2000 Widowed 67 White Not appli… Independent Protesta… No deno… 2
#> 4 2000 Never ma… 39 White Not appli… Ind,near r… Orthodox… Not app… 4
#> 5 2000 Divorced 25 White Not appli… Not str de… None Not app… 1
#> 6 2000 Married 25 White $20000 - … Strong dem… Protesta… Souther… NA
#> 7 2000 Never ma… 36 White $25000 or… Not str re… Christian Not app… 3
#> 8 2000 Divorced 44 White $7000 to … Ind,near d… Protesta… Luthera… NA
#> 9 2000 Married 44 White $25000 or… Not str de… Protesta… Other 0
#> 10 2000 Married 47 White $25000 or… Strong rep… Protesta… Souther… 3
#> # … with 21,473 more rows

A sample of data from the General Social Survey, a long-running US survey conducted by NORC at the University of Chicago.

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = relig)) +
  geom_point()

Which plot do you prefer?

Why is the y-axis in this order?

levels()

Use levels() to access a factor's levels

levels(gss_cat$relig)
#> [1] "No answer" "Don't know"
#> [3] "Inter-nondenominational" "Native american"
#> [5] "Christian" "Orthodox-christian"
#> [7] "Moslem/islam" "Other eastern"
#> [9] "Hinduism" "Buddhism"
#> [11] "Other" "None"
#> [13] "Jewish" "Catholic"
#> [15] "Protestant" "Not applicable"

Most useful skills

  1. Reorder the levels
  2. Recode the levels
  3. Collapse levels

Reordering levels

fct_reorder()

Reorders the levels of a factor based on the result of .fun(.x) applied to each group of cases (grouped by level).

fct_reorder(.f, .x, .fun = median, ..., .desc = FALSE)

  • .f: factor to reorder
  • .x: variable to reorder by (in conjunction with .fun)
  • .fun: function to reorder by (in conjunction with .x)
  • .desc: put in descending order?

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = fct_reorder(relig, tvhours, mean))) +
  geom_point()
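
To see what fct_reorder() changes without drawing a plot, the sketch below inspects the levels directly. It assumes forcats is installed; group and value are hypothetical toy vectors, and the summary function is passed by position.

```r
library(forcats)

# Hypothetical toy data: a grouping factor and a numeric value per case.
group <- factor(c("a", "a", "b", "b", "c", "c"))
value <- c(5, 7, 1, 3, 9, 11)

levels(group)
#> [1] "a" "b" "c"

# Group means are a = 6, b = 2, c = 10, so ascending order is b, a, c.
by_mean <- fct_reorder(group, value, mean)
levels(by_mean)
#> [1] "b" "a" "c"

# .desc = TRUE reverses the ordering.
by_mean_desc <- fct_reorder(group, value, mean, .desc = TRUE)
levels(by_mean_desc)
#> [1] "c" "a" "b"
```

Only the level order changes; the values stored in the factor stay the same, which is why the plot's axis order changes while the points do not.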

Your turn 3

Complete the following code to:

  1. Calculate the average number of tvhours, by marital status.
  2. Create a sensible plot of average TV consumption by marital status.

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(marital) %>%
  summarize(                       ) %>%
  ggplot(mapping =                                                        ) +
  geom_col()

05:00
gss_cat %>%
  drop_na(tvhours) %>%
  group_by(marital) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = fct_reorder(marital, tvhours, mean))) +
  geom_col()

fct_infreq()

ggplot(gss_cat, mapping = aes(x = fct_infreq(marital))) +
  geom_bar()

fct_rev()

ggplot(gss_cat, mapping = aes(x = fct_rev(fct_infreq(marital)))) +
  geom_bar()
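
The two plots differ only in level order, which can be checked on a small factor. A minimal sketch, assuming forcats is installed; the marital vector here is a hypothetical six-row stand-in, not the gss_cat column.

```r
library(forcats)

# Hypothetical responses: Married x3, Never married x2, Divorced x1.
marital <- factor(
  c("Married", "Never married", "Married", "Divorced", "Married", "Never married"),
  levels = c("Divorced", "Married", "Never married")
)

# fct_infreq() orders levels from most to least frequent.
by_freq <- fct_infreq(marital)
levels(by_freq)
#> [1] "Married"       "Never married" "Divorced"

# fct_rev() reverses whatever order the factor currently has.
by_freq_rev <- fct_rev(by_freq)
levels(by_freq_rev)
#> [1] "Divorced"      "Never married" "Married"
```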

Recoding levels

Your turn 4

Do liberals or conservatives watch more TV?

Compute the average TV consumption by party identification, and plot the results.

05:00
gss_cat %>%
  drop_na(tvhours) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = fct_reorder(partyid, tvhours, mean))) +
  geom_col() +
  labs(y = "partyid")

How can we improve these labels?

fct_recode()

Change the values of the levels for a factor

fct_recode(.f, ...)

  • .f: factor variable
  • ...: new level = old level pairs

gss_cat %>%
  drop_na(tvhours) %>%
  mutate(partyid = fct_recode(partyid,
                              "Republican, strong" = "Strong republican",
                              "Republican, weak" = "Not str republican",
                              "Republican, lean" = "Ind,near rep",
                              "Democrat, lean" = "Ind,near dem",
                              "Democrat, weak" = "Not str democrat",
                              "Democrat, strong" = "Strong democrat")) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = fct_reorder(partyid, tvhours, mean))) +
  geom_col() +
  labs(y = "partyid")
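
The direction of the pairs (new name on the left, old name on the right) is easy to get backwards, so here is a small sketch on a hypothetical three-value factor. It assumes forcats is installed; levels that are not mentioned are left unchanged.

```r
library(forcats)

# Hypothetical factor using three of the partyid labels, with explicit level order.
party <- factor(
  c("Strong republican", "Independent", "Ind,near dem"),
  levels = c("Strong republican", "Independent", "Ind,near dem")
)

# Each argument is "new level" = "old level".
recoded <- fct_recode(party,
                      "Republican, strong" = "Strong republican",
                      "Democrat, lean" = "Ind,near dem")

as.character(recoded)
#> [1] "Republican, strong" "Independent"        "Democrat, lean"
```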

How can we combine these groups?

Collapsing levels

fct_collapse()

Changes multiple levels into a single level.

fct_collapse(.f, ...)

  • .f: factor with levels
  • ...: named levels to be collapsed

gss_cat %>%
  drop_na(tvhours) %>%
  mutate(new_party = fct_collapse(partyid,
                                  Conservative = c("Strong republican",
                                                   "Not str republican",
                                                   "Ind,near rep"),
                                  Liberal = c("Strong democrat",
                                              "Not str democrat",
                                              "Ind,near dem"))) %>%
  ggplot(mapping = aes(x = fct_infreq(new_party))) +
  geom_bar() +
  labs(x = "party")
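
A quick way to confirm what fct_collapse() did is to compare values and level counts before and after. The sketch below assumes forcats is installed and uses a hypothetical four-value factor rather than the full partyid column.

```r
library(forcats)

# Hypothetical factor with four of the partyid labels.
party <- factor(
  c("Strong republican", "Not str republican", "Strong democrat", "Independent"),
  levels = c("Strong republican", "Not str republican", "Strong democrat", "Independent")
)

# Each argument names the new level and lists the old levels it absorbs.
collapsed <- fct_collapse(party,
                          Conservative = c("Strong republican", "Not str republican"),
                          Liberal = "Strong democrat")

as.character(collapsed)
#> [1] "Conservative" "Conservative" "Liberal"      "Independent"

# Four levels before, three after; "Independent" was not mentioned and is kept.
nlevels(party)
#> [1] 4
nlevels(collapsed)
#> [1] 3
```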

fct_lump_n()

Lumps every level except the n most frequent into a single level.

fct_lump_n(f, n, other_level = "Other", ...)

  • f: factor with levels
  • n: number of most frequent levels to keep; the remaining levels are lumped together
  • other_level: name of new level

gss_cat %>%
  drop_na(tvhours) %>%
  mutate(new_party = fct_collapse(partyid,
                                  Conservative = c("Strong republican",
                                                   "Not str republican",
                                                   "Ind,near rep"),
                                  Liberal = c("Strong democrat",
                                              "Not str democrat",
                                              "Ind,near dem")),
         new_party = fct_lump_n(new_party, other_level = "Other", n = 3)) %>%
  group_by(new_party) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(x = tvhours, y = fct_reorder(new_party, tvhours, mean))) +
  geom_col() +
  labs(y = "partyid")
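
To make the role of n concrete, the sketch below lumps a toy factor and tabulates the result. It assumes a version of forcats that provides fct_lump_n(); the letters and counts are hypothetical.

```r
library(forcats)

# Hypothetical factor: "a" x5, "b" x4, "c" x2, "d" x1.
x <- factor(c(rep("a", 5), rep("b", 4), rep("c", 2), "d"))

# Keep the 2 most frequent levels; everything else becomes "Other".
lumped <- fct_lump_n(x, n = 2)

levels(lumped)
#> [1] "a"     "b"     "Other"

# "c" (2 cases) and "d" (1 case) are combined into 3 "Other" cases.
other_count <- sum(lumped == "Other")
other_count
#> [1] 3
```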

dates and times

Pop quiz!

Does every year have 365 days?

Does every day have 24 hours?

Does every minute have 60 seconds?

What does a month measure?
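
Two of these questions can be answered directly in R. The sketch below is only an illustration, assuming lubridate is installed; it covers leap years and month lengths and does not address daylight saving time or leap seconds.

```r
library(lubridate)

# Not every year has 365 days: 2020 was a leap year, 2021 was not.
leap <- leap_year(c(2020, 2021))
leap
#> [1]  TRUE FALSE

year_length <- as.numeric(ymd("2021-01-01") - ymd("2020-01-01"))
year_length
#> [1] 366

# A month is not a fixed number of days either.
month_days <- unname(days_in_month(ymd(c("2021-02-01", "2020-02-01", "2021-05-01"))))
month_days
#> [1] 28 29 31
```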

Most useful skills

1. Create dates/times (i.e., parsing)

2. Access and change parts of a date

3. Deal with time zones

4. Do math with instants and time spans

Consider

Discuss in the chat:

  • What is the best time of day to fly?

  • What is the best day of the week to fly?

01:00
flights %>%
  select(year, month, day, hour, minute, sched_dep_time, time_hour)
#> # A tibble: 336,776 x 7
#> year month day hour minute sched_dep_time time_hour
#> <int> <int> <int> <dbl> <dbl> <int> <dttm>
#> 1 2013 1 1 5 15 515 2013-01-01 05:00:00
#> 2 2013 1 1 5 29 529 2013-01-01 05:00:00
#> 3 2013 1 1 5 40 540 2013-01-01 05:00:00
#> 4 2013 1 1 5 45 545 2013-01-01 05:00:00
#> 5 2013 1 1 6 0 600 2013-01-01 06:00:00
#> 6 2013 1 1 5 58 558 2013-01-01 05:00:00
#> 7 2013 1 1 6 0 600 2013-01-01 06:00:00
#> 8 2013 1 1 6 0 600 2013-01-01 06:00:00
#> 9 2013 1 1 6 0 600 2013-01-01 06:00:00
#> 10 2013 1 1 6 0 600 2013-01-01 06:00:00
#> # … with 336,766 more rows

Let's focus briefly on the relationship between the scheduled departure time, sched_dep_time, and the average flight delay.

flights %>%
  ggplot(mapping = aes(x = sched_dep_time, y = arr_delay)) +
  geom_point(alpha = 0.2) +
  geom_smooth()

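We might start by making a plot of departure time vs. arrival delay. What's wrong with this? sched_dep_time stores the clock time as an integer in HHMM form (for example, 559 means 5:59), and minutes stop at 59, so the values jump from 559 to 600 and leave a gap from 60 to 99 within every "hour". Clock time does not count the same way as our normal number line.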
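
The gap can be made concrete with integer division. In the sketch below, the sched_dep_time values are a few hypothetical HHMM integers of the kind shown in the flights preview; base R's %/% and %% split them into hours and minutes.

```r
# Hypothetical scheduled departure times in HHMM form.
sched_dep_time <- c(515, 559, 600, 1345)

dep_hour   <- sched_dep_time %/% 100   # integer division: 5 5 6 13
dep_minute <- sched_dep_time %% 100    # remainder: 15 59 0 45

# Minutes since midnight form a continuous scale with no 60-99 gap.
minutes_since_midnight <- dep_hour * 60 + dep_minute
minutes_since_midnight
#> [1] 315 359 360 825

# As raw numbers, 559 and 600 look 41 units apart; as times they are 1 minute apart.
diff(sched_dep_time[2:3])
#> [1] 41
diff(minutes_since_midnight[2:3])
#> [1] 1
```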

Creating dates and times

hms

2021-05-04 14:52:34

Stored as the number of seconds since 00:00:00.*

library(hms)
hms(seconds = 34, minutes = 52, hours = 14)
#> 14:52:34
unclass(hms(34, 52, 14))
#> [1] 53554
#> attr(,"units")
#> [1] "secs"

*On a typical day.
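
hms() also accepts vectors, which is what makes it usable inside mutate(). A minimal sketch, assuming the hms package is installed; the hours and minutes are hypothetical. The hms:: prefix is used because lubridate exports a different function that is also called hms().

```r
# Hypothetical departure times: 5:15 and 13:45.
dep <- hms::hms(hours = c(5, 13), minutes = c(15, 45))

# The underlying values are seconds since midnight.
dep_seconds <- as.numeric(dep)
dep_seconds
#> [1] 18900 49500
```

Because the stored values are plain seconds, an hms column sits on a continuous axis when plotted.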

Your turn 5

What is the best time of day to fly?

  1. Use the hour and minute variables in flights to compute the time of day for each flight as an hms.
  2. Use a smooth line to plot the relationship between time of day and arr_delay.
05:00
flights %>%
  mutate(time = hms(hours = hour, minutes = minute)) %>%
  ggplot(mapping = aes(x = time, y = arr_delay)) +
  geom_point(alpha = 0.2) +
  geom_smooth()

lubridate

ymd() family

To parse strings as dates, use a y, m, d, h, m, s combination.

ymd("2021-05-04")
#> [1] "2021-05-04"
mdy("May 4, 2021")
#> [1] "2021-05-04"
ymd_hms("2021-05-04 14:52:34")
#> [1] "2021-05-04 14:52:34 UTC"
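
The letters in the function name only need to match the order of the components in the input. The sketch below shows a few more members of the family, assuming lubridate is installed; the input strings are made-up examples, and numeric month formats are used so the results do not depend on the locale.

```r
library(lubridate)

# Day-month-year input needs dmy().
d1 <- dmy("04/05/2021")
d1
#> [1] "2021-05-04"

# Unquoted numbers are accepted as well.
d2 <- ymd(20210504)
d2
#> [1] "2021-05-04"

# Add _hm or _hms when the string also carries a time.
t1 <- mdy_hm("05/04/2021 14:52")
t1
#> [1] "2021-05-04 14:52:00 UTC"

# Impossible dates fail to parse and come back as NA (with a warning).
d3 <- suppressWarnings(ymd("2021-02-30"))
is.na(d3)
#> [1] TRUE
```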

Accessing components

Extract a component with the function named after it in the singular (e.g., year(), not years()).

star_wars <- ymd("2021-05-04")
year(star_wars)
#> [1] 2021
month(star_wars)
#> [1] 5
day(star_wars)
#> [1] 4

Setting components

Use the same functions to set components

star_wars
#> [1] "2021-05-04"
year(star_wars) <- 2002
star_wars
#> [1] "2002-05-04"

Date and time components

function    extracts          extra arguments
year()      year
month()     month             label = FALSE, abbr = TRUE
week()      week
day()       day of month
wday()      day of week       label = FALSE, abbr = TRUE
qday()      day of quarter
yday()      day of year
hour()      hour
minute()    minute
second()    second

Accessing components

wday(star_wars)
#> [1] 7
wday(star_wars, label = TRUE)
#> [1] Sat
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
wday(star_wars, label = TRUE, abbr = FALSE)
#> [1] Saturday
#> 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
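
The numeric result of wday() depends on which day is treated as the first day of the week. The sketch below assumes lubridate is installed with its default week start (Sunday) and uses the week_start argument to switch to Monday.

```r
library(lubridate)

star_wars <- ymd("2002-05-04")   # a Saturday

# With the default week start (Sunday = 1), Saturday is day 7.
wday_default <- wday(star_wars)
wday_default
#> [1] 7

# With week_start = 1 (Monday = 1), Saturday is day 6.
wday_monday <- wday(star_wars, week_start = 1)
wday_monday
#> [1] 6
```

Using label = TRUE avoids this ambiguity in output meant for people, since the day name does not depend on where the week starts.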

Your turn 6

Complete the following code to:

  1. Extract the day of the week for each flight (as a full name) from time_hour.
  2. Calculate the average arr_delay by day of the week.
  3. Plot the results as a column chart (bar chart) with geom_col().

flights %>%
  drop_na(arr_delay) %>%
  mutate(weekday =                                           ) %>%
                    %>%
           (avg_delay =               ) %>%
  ggplot(mapping = aes(x = weekday, y = avg_delay)) +
            

05:00
flights %>%
  drop_na(arr_delay) %>%
  mutate(weekday = wday(time_hour, label = TRUE, abbr = FALSE)) %>%
  group_by(weekday) %>%
  summarize(avg_delay = mean(arr_delay)) %>%
  ggplot(mapping = aes(x = weekday, y = avg_delay)) +
  geom_col()
