Data Types

Tidy Data Science with the Tidyverse and Tidymodels

W. Jake Thompson

https://tidyds-2021.wjakethompson.com · https://bit.ly/tidyds-2021

Tidy Data Science with the Tidyverse and Tidymodels is licensed under a Creative Commons Attribution 4.0 International License.

Pop quiz!

What types of data are in this data set?

# A tibble: 336,776 x 6
   time_hour           name          air_time             distance day   delayed
   <dttm>              <chr>         <Duration>              <dbl> <ord> <lgl>  
 1 2013-01-01 05:00:00 United Air L… 13620s (~3.78 hours)     1400 Tues… TRUE   
 2 2013-01-01 05:00:00 United Air L… 13620s (~3.78 hours)     1416 Tues… TRUE   
 3 2013-01-01 05:00:00 American Air… 9600s (~2.67 hours)      1089 Tues… TRUE   
 4 2013-01-01 05:00:00 JetBlue Airw… 10980s (~3.05 hours)     1576 Tues… FALSE  
 5 2013-01-01 06:00:00 Delta Air Li… 6960s (~1.93 hours)       762 Tues… FALSE  
 6 2013-01-01 05:00:00 United Air L… 9000s (~2.5 hours)        719 Tues… TRUE   
 7 2013-01-01 06:00:00 JetBlue Airw… 9480s (~2.63 hours)      1065 Tues… TRUE   
 8 2013-01-01 06:00:00 ExpressJet A… 3180s (~53 minutes)       229 Tues… FALSE  
 9 2013-01-01 06:00:00 JetBlue Airw… 8400s (~2.33 hours)       944 Tues… FALSE  
10 2013-01-01 06:00:00 American Air… 8280s (~2.3 hours)        733 Tues… TRUE   
# … with 336,766 more rows

(Applied) Data Science

logicals

Logicals

R's data type for Boolean values (i.e., TRUE and FALSE)

typeof(TRUE)
#> [1] "logical"
typeof(FALSE)
#> [1] "logical"
typeof(c(TRUE, TRUE, FALSE))
#> [1] "logical"

flights %>%
  mutate(delayed = arr_delay > 0) %>%
  select(arr_delay, delayed)
#> # A tibble: 336,776 x 2
#>    arr_delay delayed
#>        <dbl> <lgl>  
#>  1        11 TRUE   
#>  2        20 TRUE   
#>  3        33 TRUE   
#>  4       -18 FALSE  
#>  5       -25 FALSE  
#>  6        12 TRUE   
#>  7        19 TRUE   
#>  8       -14 FALSE  
#>  9        -8 FALSE  
#> 10         8 TRUE   
#> # … with 336,766 more rows

Can we compute the proportion of flights that arrived late?

Most useful skills

Math with logicals
- When you do math with logicals, TRUE becomes 1 and FALSE becomes 0
The sum of a logical vector is the count of TRUEs

sum(c(TRUE, FALSE, TRUE, TRUE))
#> [1] 3

The mean of a logical vector is the proportion of TRUEs

mean(c(1, 2, 3, 4) < 4)
#> [1] 0.75
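
Missing values propagate through this arithmetic, which matters for columns such as arr_delay that contain NA. A minimal sketch with a made-up vector (late is a hypothetical name, not a column in flights):

```r
# Hypothetical logical vector with one missing value
late <- c(TRUE, FALSE, NA, TRUE)

# NA propagates through sum() and mean()
sum(late)
#> [1] NA

# na.rm = TRUE drops missing values before computing
sum(late, na.rm = TRUE)
#> [1] 2
mean(late, na.rm = TRUE)
#> [1] 0.6666667
```

Dropping the NA rows first, as the exercise below asks, is one alternative to passing na.rm = TRUE in every summary call.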

Your turn 1

Use the flights data set to create a new variable, delayed, that indicates whether the flight was delayed (arr_delay > 0).

Then, remove all rows that contain an NA in the delayed variable.

Finally, create a summary table that shows:

How many flights were delayed?
What proportion of flights were delayed?

flights %>%
  mutate(delayed = arr_delay > 0) %>%
  drop_na(delayed) %>%
  summarize(total = sum(delayed), prop = mean(delayed))
#> # A tibble: 1 x 2
#>    total  prop
#>    <int> <dbl>
#> 1 133004 0.406

strings

(character) strings

Anything surrounded by double quotes (") or single quotes (')

> "one"
> "1"
> "one's"
> '"Hello World"'
> "foo
+
+
+ oops. I'm stuck in a string."
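
The two quote styles are interchangeable, so one kind of quote can appear inside the other without escaping. A small base R sketch; the variable names are illustrative only:

```r
# A single quote inside double quotes needs no escaping
a <- "one's"

# Double quotes inside single quotes need no escaping either
b <- '"Hello World"'

# A backslash escapes a quote of the same kind as the delimiters
c1 <- "\"Hello World\""

# Both spellings produce the same string
identical(b, c1)
#> [1] TRUE

# writeLines() shows the contents without the escape characters
writeLines(c1)
#> "Hello World"
```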

Consider

Discuss in the chat: Are boys' or girls' names more likely to end in a vowel?

How can we calculate the proportion of boys' and girls' names that end in a vowel?

babynames
#> # A tibble: 1,924,665 x 5
#>     year sex   name          n   prop
#>    <dbl> <chr> <chr>     <int>  <dbl>
#>  1  1880 F     Mary       7065 0.0724
#>  2  1880 F     Anna       2604 0.0267
#>  3  1880 F     Emma       2003 0.0205
#>  4  1880 F     Elizabeth  1939 0.0199
#>  5  1880 F     Minnie     1746 0.0179
#>  6  1880 F     Margaret   1578 0.0162
#>  7  1880 F     Ida        1472 0.0151
#>  8  1880 F     Alice      1414 0.0145
#>  9  1880 F     Bertha     1320 0.0135
#> 10  1880 F     Sarah      1288 0.0132
#> # … with 1,924,655 more rows

Most useful skills

1. How to extract or replace substrings.

2. How to find matches for patterns.

3. Regular expressions.

`str_sub()`

Extract or replace portions of a string with str_sub()

str_sub(string, start = 1, end = -1)

string: string(s) to manipulate
start: position of the first character to extract within each string
end: position of the last character to extract within each string

Pop quiz!

What will this return?

str_sub("Mephisto", 1, 2)

#> [1] "Me"

Pop quiz!

What will this return?

str_sub("Mephisto", 2)

#> [1] "ephisto"

Pop quiz!

What will this return?

str_sub("Mephisto", -3)

#> [1] "sto"

Pop quiz!

What will this return?

m <- "Mephisto"
str_sub(m, -3) <- "--Agatha!"
m

#> [1] "Mephi--Agatha!"
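
The quiz calls above each use a single string, but str_sub() is vectorized over its string argument, so it works on a whole column at once. A minimal sketch, assuming the stringr package is installed; names_vec is a hypothetical vector built from a few names in the babynames preview:

```r
library(stringr)

# A few names from the babynames preview; any character vector works
names_vec <- c("Mary", "Anna", "Elizabeth")

# First letter of each string
str_sub(names_vec, 1, 1)
#> [1] "M" "A" "E"

# Last two letters of each string, counting from the end
str_sub(names_vec, -2)
#> [1] "ry" "na" "th"
```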

Your turn 2

Complete the following code to:

Isolate the last letter of every name.
Create a variable that indicates whether the last letter is one of "a", "e", "i", "o", "u", or "y".
Calculate the proportion of children whose name ends in a vowel, by year and sex.
Display the results as a line plot.

babynames %>%
  ______(last = ______,
         vowel = ______) %>%
  group_by(______) %>%
  ______(prop_vowel = weighted.mean(vowel, w = n)) %>%
  ggplot(mapping = aes(x = ______, y = prop_vowel)) +
  ______(mapping = ______)

babynames %>%
  mutate(last = str_sub(name, start = -1, end = -1),
         vowel = last %in% c("a", "e", "i", "o", "u", "y")) %>%
  group_by(year, sex) %>%
  summarize(prop_vowel = weighted.mean(vowel, w = n)) %>%
  ggplot(mapping = aes(x = year, y = prop_vowel)) +
  geom_line(mapping = aes(color = sex))
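
The solution uses weighted.mean() rather than mean() because each row of babynames is a name, not a child; weighting by n counts every child who received that name. A toy sketch with invented counts (the numbers are not from babynames):

```r
# Invented example: three names, whether each ends in a vowel,
# and how many children received each name
vowel <- c(TRUE, FALSE, TRUE)
n     <- c(10, 70, 20)

# Unweighted: proportion of *names* ending in a vowel
mean(vowel)
#> [1] 0.6666667

# Weighted by n: proportion of *children* whose name ends in a vowel
weighted.mean(vowel, w = n)
#> [1] 0.3
```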

factors

Factors

R's representation of categorical data. Consists of:

A set of values
An ordered set of valid levels

eyes <- factor(x = c("blue", "green", "green"),
               levels = c("blue", "brown", "green", "hazel"))
eyes
#> [1] blue  green green
#> Levels: blue brown green hazel

Factors

Stored internally as an integer vector with a levels attribute

unclass(eyes)
#> [1] 1 3 3
#> attr(,"levels")
#> [1] "blue"  "brown" "green" "hazel"
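
The integer codes are positions in the levels vector, which is why "green" is stored as 3. A short base R sketch that rebuilds eyes from above and also shows what happens to a value outside the levels ("purple" is an invented example):

```r
eyes <- factor(x = c("blue", "green", "green"),
               levels = c("blue", "brown", "green", "hazel"))

# The integer codes index into the levels
as.integer(eyes)
#> [1] 1 3 3
levels(eyes)[as.integer(eyes)]
#> [1] "blue"  "green" "green"

# A value that is not among the levels becomes NA
factor(c("blue", "purple"), levels = levels(eyes))
#> [1] blue <NA>
#> Levels: blue brown green hazel
```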

Consider

Discuss in the chat: Do married people watch more or less TV than single people?

Example data: `gss_cat`

gss_cat
#> # A tibble: 21,483 x 9
#>     year marital     age race  rincome    partyid     relig     denom    tvhours
#>    <int> <fct>     <int> <fct> <fct>      <fct>       <fct>     <fct>      <int>
#>  1  2000 Never ma…    26 White $8000 to … Ind,near r… Protesta… Souther…      12
#>  2  2000 Divorced     48 White $8000 to … Not str re… Protesta… Baptist…      NA
#>  3  2000 Widowed      67 White Not appli… Independent Protesta… No deno…       2
#>  4  2000 Never ma…    39 White Not appli… Ind,near r… Orthodox… Not app…       4
#>  5  2000 Divorced     25 White Not appli… Not str de… None      Not app…       1
#>  6  2000 Married      25 White $20000 - … Strong dem… Protesta… Souther…      NA
#>  7  2000 Never ma…    36 White $25000 or… Not str re… Christian Not app…       3
#>  8  2000 Divorced     44 White $7000 to … Ind,near d… Protesta… Luthera…      NA
#>  9  2000 Married      44 White $25000 or… Not str de… Protesta… Other          0
#> 10  2000 Married      47 White $25000 or… Strong rep… Protesta… Souther…       3
#> # … with 21,473 more rows

A sample of data from the General Social Survey, a long-running US survey conducted by NORC at the University of Chicago.

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(mapping = aes(x = tvhours, y = relig)) +
  geom_point()

Which plot do you prefer?

Why is the y-axis in this order?

`levels()`

Use levels() to access a factor's levels

levels(gss_cat$relig)
#>  [1] "No answer"               "Don't know"             
#>  [3] "Inter-nondenominational" "Native american"        
#>  [5] "Christian"               "Orthodox-christian"     
#>  [7] "Moslem/islam"            "Other eastern"          
#>  [9] "Hinduism"                "Buddhism"               
#> [11] "Other"                   "None"                   
#> [13] "Jewish"                  "Catholic"               
#> [15] "Protestant"              "Not applicable"
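
Level order, not alphabetical order, is what sorting and discrete plot axes follow. A small base R sketch with a made-up factor (sizes is hypothetical, not part of gss_cat):

```r
# Hypothetical factor whose levels are deliberately not alphabetical
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

# Sorting a factor follows the level order
as.character(sort(sizes))
#> [1] "small"  "small"  "medium" "large"

# Sorting the underlying strings is alphabetical instead
sort(as.character(sizes))
#> [1] "large"  "medium" "small"  "small"
```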

Most useful skills

Reorder the levels
Recode the levels
Collapse levels

Reordering levels

`fct_reorder()`