Wrap-Up

Tidy Data Science with the Tidyverse and Tidymodels

W. Jake Thompson

https://tidyds-2021.wjakethompson.com · https://bit.ly/tidyds-2021

Tidy Data Science with the Tidyverse and Tidymodels is licensed under a Creative Commons Attribution 4.0 International License.

(Applied) Data Science

Please help!

Create a reproducible example (reprex)

reprex

Goal: create the simplest example possible to illustrate the problem/question, that anyone can run on their own machine

  • Can't use data stored on your computer (others won't have that)

  • Can't assume options or settings are the same across computers
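A minimal sketch of data that satisfies both constraints, using base R only. The seed, the sample size, and the object name `sim` are assumptions made for illustration, not values from the workshop data.

```r
# Simulate skill-mastery data instead of reading a local file,
# so anyone can recreate the same object on their own machine.
set.seed(2021)      # hypothetical seed; fixes the random draws
n_students <- 10    # hypothetical sample size
n_skills <- 3       # change this to simulate more skills

sim <- data.frame(student_id = seq_len(n_students))
for (i in seq_len(n_skills)) {
  # Each skill is a 0/1 mastery indicator drawn with probability 0.5
  sim[[paste0("skill_", i)]] <- rbinom(n_students, size = 1, prob = 0.5)
}

sim
```

Because the data are generated inside the script, nothing depends on files or settings that exist only on one computer.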

reprex to the rescue!

Example

Question: How do I sort by a sum and then all component columns?

My use case

dat3
#> # A tibble: 50 x 4
#>    student_id skill_1 skill_2 skill_3
#>         <int>   <int>   <int>   <int>
#>  1       3462       0       0       1
#>  2       3510       1       1       1
#>  3       9717       1       0       1
#>  4       3985       0       1       0
#>  5       2841       1       0       1
#>  6       4370       1       0       1
#>  7       5760       0       0       1
#>  8       7745       0       0       0
#>  9       3756       0       0       1
#> 10       6106       1       0       1
#> # … with 40 more rows
dat4
#> # A tibble: 50 x 5
#>    student_id skill_1 skill_2 skill_3 skill_4
#>         <int>   <int>   <int>   <int>   <int>
#>  1       1472       0       1       1       1
#>  2       7097       0       1       0       1
#>  3       2148       0       1       1       0
#>  4       3036       0       1       0       1
#>  5       3312       1       1       1       1
#>  6       8740       0       1       0       0
#>  7       9649       0       1       1       1
#>  8       2077       0       0       0       1
#>  9       6014       0       1       0       0
#> 10       6657       1       0       0       1
#> # … with 40 more rows

I have some data that shows which of 3 skills each student has mastered. I want to sort the data by the total number of skills mastered, and then by each skill. But the number of skills can change. How can I write a solution that will work for any number of skills?

Include data

Question: How do I sort a data frame by total skills and then each component skill?

dat3
#> # A tibble: 50 x 4
#>    student_id skill_1 skill_2 skill_3
#>         <int>   <int>   <int>   <int>
#>  1       3462       0       0       1
#>  2       3510       1       1       1
#>  3       9717       1       0       1
#>  4       3985       0       1       0
#>  5       2841       1       0       1
#>  6       4370       1       0       1
#>  7       5760       0       0       1
#>  8       7745       0       0       0
#>  9       3756       0       0       1
#> 10       6106       1       0       1
#> # … with 40 more rows

You can't do anything with this. You don't have dat3 on your computer, and you can't copy and paste this printed data frame into an R object. You would have to build it by hand.
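One way around this, sketched here in base R, is `dput()`, which prints code that recreates an object rather than a formatted table. The object names `small` and `rebuilt` are hypothetical and used only for illustration.

```r
# A small stand-in for a few rows of real data
small <- data.frame(
  stu     = c(1, 2, 3),
  skill_1 = c(0, 1, 1),
  skill_2 = c(1, 0, 0)
)

# dput() prints R code (a structure(...) call) that others can paste and run
dput(small)

# Round trip: capture that printed code as text, then evaluate it
code <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = code))

# Check that the pasted code reproduces the original object
all.equal(small, rebuilt)
```

Pasting the `dput()` output into a question gives readers an object they can assign directly, which a printed tibble does not.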

Include reproducible example data

library(tidyverse)
ex_data <- tibble(stu = c(1, 2, 3, 4, 5),
                  skill_1 = c(0, 0, 1, 1, 1),
                  skill_2 = c(1, 1, 0, 0, 0),
                  skill_3 = c(0, 1, 0, 1, 1))
ex_data
#> # A tibble: 5 x 4
#>     stu skill_1 skill_2 skill_3
#>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1     1       0       1       0
#> 2     2       0       1       1
#> 3     3       1       0       0
#> 4     4       1       0       1
#> 5     5       1       0       1

Asking questions

Bad: How do I sort by a sum and then all component columns?

Better: How can I sort a data frame by total skills and then each component skill?

Best: Provide an example of what you want (including the better question), and solutions you've tried.

# What I want:
ex_data %>%
  mutate(total = skill_1 + skill_2 + skill_3) %>%
  arrange(total, desc(skill_1), desc(skill_2), desc(skill_3))
#> # A tibble: 5 x 5
#>     stu skill_1 skill_2 skill_3 total
#>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>
#> 1     3       1       0       0     1
#> 2     1       0       1       0     1
#> 3     4       1       0       1     2
#> 4     5       1       0       1     2
#> 5     2       0       1       1     2

But without specifying each skill individually, because the number of skills may change.

# What I've tried
ex_data %>%
  rowwise() %>%
  mutate(total = sum(c_across(starts_with("skill")))) %>%
  ungroup() %>%
  arrange(total, desc(starts_with("skill")))
#> Error: arrange() failed at implicit mutate() step.
#> * Problem with `mutate()` input `..2`.
#> x `starts_with()` must be used within a *selecting* function.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-selection-context.html>.
#> ℹ Input `..2` is `starts_with("skill")`.

Anatomy of a good question

  • Brief description of what you're doing

  • Reproducible data

  • What you've tried

  • What you've gotten

  • What you want to get

reprex makes this part easier

reprex()

The reprex() function from the reprex package will run code, format it nicely, and render the output to your clipboard.

reprex(x = NULL, venue, session_info, style)

  • x: The code to turn into a reprex. If NULL, reprex() looks first on the clipboard.

  • venue: Where the question is being posted.

  • session_info: Whether or not to include session information.

  • style: Whether or not to format code in tidy style.
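A minimal sketch of a call that sets these arguments explicitly. It assumes the reprex package and its rendering dependencies (rmarkdown, pandoc) are installed; the code inside the braces and the name `rendered` are placeholders rather than part of the workshop example.

```r
# Only run if reprex is available on this machine (assumption: its
# rendering dependencies, rmarkdown and pandoc, are installed too).
if (requireNamespace("reprex", quietly = TRUE)) {
  rendered <- reprex::reprex(
    {
      scores <- c(1, 0, 1)   # placeholder code to be rendered
      sum(scores)
    },
    venue = "gh",            # format for GitHub-flavored Markdown
    session_info = TRUE,     # append session information
    style = FALSE            # leave code formatting as written
  )
}
```

In an interactive session, copying the code first and calling `reprex()` with no `x` picks it up from the clipboard instead.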

Demo

Answer

ex_data %>%
  rowwise() %>%
  mutate(total = sum(c_across(starts_with("skill")))) %>%
  ungroup() %>%
  arrange(total, across(starts_with("skill"), desc))
#> # A tibble: 5 x 5
#>     stu skill_1 skill_2 skill_3 total
#>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>
#> 1     3       1       0       0     1
#> 2     1       0       1       0     1
#> 3     4       1       0       1     2
#> 4     5       1       0       1     2
#> 5     2       0       1       1     2

We need to wrap the selection in across() inside arrange(), because starts_with() only works within a selecting function.
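To check that this approach does not depend on the number of skills, here is a sketch that wraps the same pipeline in a helper and runs it on four skills. The helper name `sort_by_skills()` and the data `ex_data4` are hypothetical, and the sketch assumes dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: same pipeline as the answer, wrapped in a function.
# Sorts by total skills mastered, then by each skill column (descending).
sort_by_skills <- function(data) {
  data %>%
    rowwise() %>%
    mutate(total = sum(c_across(starts_with("skill")))) %>%
    ungroup() %>%
    arrange(total, across(starts_with("skill"), desc))
}

# Made-up example data with four skills instead of three
ex_data4 <- tibble(
  stu     = c(1, 2, 3, 4, 5),
  skill_1 = c(0, 0, 1, 1, 1),
  skill_2 = c(1, 1, 0, 0, 0),
  skill_3 = c(0, 1, 0, 1, 1),
  skill_4 = c(1, 0, 0, 0, 1)
)

sorted4 <- sort_by_skills(ex_data4)
sorted4
```

No skill column is named in the helper, so adding a `skill_5` column would not require any change to the code.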

Shannon Pileggi for @WeAreRLadies

What's Next

Data Science

R4DS: Expanding on this workshop. Much more to learn!

MDSR: Beginning to end: data management, programming, statistics, machine learning, special topics in DS

MD: More statistics (more regression, hypothesis testing, confidence intervals, etc.)

R Programming

AdvR: How R works (environments, data structures, meta programming)

R Packages: How to make your own package! 2nd edition work in progress

HOPR: Intro to R as a programming language, in the context of data science/data analysis

Data Visualization

SocViz: Intro to good looking graphics with ggplot2

Cookbook: Basic recipes for creating and customizing plots

Fundamentals: Made with ggplot2 & Rmd, but no code in the book. Focus is on what makes a graphic informative and appealing.

Machine Learning

TMWR: How to use tidymodels, best practices, etc.

FEATENG: Recipes -- how to extract more information from your data, including best practices, recommendations, etc.

HOML: Focused on machine learning methods and models - random forest, clustering algos, gradient boosting machines, neural networks, stacking, more!

R Markdown

RMD: Everything you could ever want to know about R Markdown. Includes chapters on extensions as well.

Cookbook: Popular how-tos for common tasks in R Markdown

bookdown: Writing books, articles, dissertations, etc.
