Lab 3

A pivot and a join walk into a pipe…

Due: End of lab on Mon, Sep 22

Introduction

In this lab you’ll get to practice your data tidying skills, specifically reshaping data with pivot functions and bringing two data frames together with join functions.

Make sure to upload your completed lab to Gradescope by the end of your lab session and commit and push your final version to GitHub.

Getting started

By now you should be familiar with how to get started with a lab assignment by cloning the GitHub repo for the assignment.

If you need a refresher on how to get started with a lab assignment, follow these steps:
  • Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and password.
  • Click STA199 under My reservations to log into your container. You should now see the RStudio environment.
  • Go to the course organization at github.com/sta199-f25 on GitHub. Click on the repo with the prefix lab-3. It contains the starter documents you need to complete the lab.
  • Click on the green CODE button, select Use SSH. Click on the clipboard icon to copy the repo URL.
  • In RStudio, go to File → New Project → Version Control → Git.
  • Copy and paste the URL of your assignment repo into the Repository URL field of the dialog box. Again, please make sure to have SSH highlighted under Clone when you copy the address.
  • Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.

Open the lab-3.qmd template Quarto file and update the authors field to add your name first (first and last) and then your teammates’ names (first and last). Render the document. Examine the rendered document and make sure your and your teammates’ names are updated in it. Commit your changes with a meaningful commit message and push to GitHub.

If you need a refresher on the assignment guidelines, see below.

Code

Code should follow the tidyverse style. In particular,

  • there should be spaces before and line breaks after each + when building a ggplot,
  • there should also be spaces before and line breaks after each |> in a data transformation pipeline,
  • code should be properly indented,
  • there should be spaces around = signs and spaces after commas.

Additionally, all code should be visible in the PDF output, i.e., it should not run off the page. Long lines that run off the page should be split across multiple lines with line breaks.[1]
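The style guidelines above can be sketched as follows. This is an illustration only: it uses the mpg data frame that ships with ggplot2 rather than any data from this lab, and the object names mpg_summary and mpg_plot are made up for the example.

```r
library(tidyverse)

# Illustrative only: mpg ships with ggplot2 and is not part of this lab.
# Each |> has a space before it and a line break after it, and the
# arguments of summarize() are indented on their own lines.
mpg_summary <- mpg |>
  filter(year == 2008) |>
  group_by(class) |>
  summarize(
    mean_hwy = mean(hwy),
    n_cars = n()
  )

# Each + has a space before it and a line break after it, so no single
# line is long enough to run off the page in the PDF.
mpg_plot <- ggplot(mpg_summary, aes(x = mean_hwy, y = class)) +
  geom_col()

mpg_summary
```

Note the spaces around = and after commas throughout.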

Plots

  • Plots should have an informative title and, if needed, also a subtitle.
  • Axes and legends should be labeled with both the variable name and its units (if applicable).
  • Careful consideration should be given to aesthetic choices.
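As a rough sketch of these plot guidelines, the example below labels a scatterplot of the mpg data frame from ggplot2 (not data from this lab). The title, subtitle, and label wording are only one possible choice, and hwy_plot is a made-up name.

```r
library(tidyverse)

# Illustrative only: mpg ships with ggplot2 and is not part of this lab.
hwy_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.7) +
  labs(
    # Informative title, with a subtitle for extra context
    title = "Cars with larger engines tend to get lower highway mileage",
    subtitle = "By type of drive train",
    # Axis and legend labels give the variable and, where applicable, its units
    x = "Engine displacement (liters)",
    y = "Highway mileage (miles per gallon)",
    color = "Drive train"
  )

hwy_plot
```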

Workflow

Continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course.

  • You should have at least 3 commits with meaningful commit messages by the end of the assignment.
  • Final versions of both your .qmd file and the rendered PDF should be pushed to GitHub.

Packages

In this lab we will work with the tidyverse package.
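A minimal sketch of loading it, assuming the tidyverse is already installed in your container:

```r
# Load the tidyverse, which attaches tidyr, dplyr, tibble, ggplot2,
# and several other packages in one call.
library(tidyverse)

# The functions used in this lab come from these packages:
#   tribble()                       -> tibble
#   pivot_longer(), pivot_wider()   -> tidyr
#   left_join(), anti_join(), etc.  -> dplyr
```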

Questions

Question 1

Pivot longer and wider.

  a. For this part, you will work with the following dataset called scores.
scores <- tribble(
  ~student_id, ~math, ~english,
  "S1",        90,    92,
  "S2",        85,    80,
  "S3",        88,    85,
  "S4",        95,    74
)

scores
# A tibble: 4 × 3
  student_id  math english
  <chr>      <dbl>   <dbl>
1 S1            90      92
2 S2            85      80
3 S3            88      85
4 S4            95      74
Note

The tribble() function is helpful for creating small data frames (tibbles) with an easier-to-read, row-by-row layout.
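To illustrate the row-by-row layout, here is a small sketch that builds the same tibble two ways. The data (city and temp_f) and the object names are hypothetical and not part of the lab.

```r
library(tidyverse)

# Hypothetical data, not part of the lab.

# Row by row with tribble(): column names are written as ~name,
# and each following line of values is one row.
by_row <- tribble(
  ~city,    ~temp_f,
  "Durham", 75,
  "Boston", 64
)

# Column by column with tibble(): each argument is one whole column.
by_col <- tibble(
  city = c("Durham", "Boston"),
  temp_f = c(75, 64)
)

# Compare the two results.
all.equal(by_row, by_col)
```

Both calls should produce the same two-row, two-column tibble; tribble() simply lets you read the values the way they will appear when printed.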

The scores data frame has three variables (student_id, math, and english) and four rows (one for each student).

Before writing any code, answer the following questions:

  • Suppose you want to reshape the data frame so that there is one row per student per subject. What function would you use to do this?

  • If you reshaped the data to have one row per student per subject,

    • how many rows would the resulting data frame have?
    • how many columns would the resulting data frame have and what would the column names be?

Then, write the code to reshape the data frame as described above.

Render, commit, and push your changes to GitHub with a succinct and informative commit message.

Make sure to commit and push all changed files so that your Git pane is empty afterward.

  b. For this part, you will work with the following dataset called patients.
patients <- tribble(
  ~patient_id, ~measurement_time_, ~systolic_bp,
  "P1",        "Morning",          120,
  "P1",        "Noon",             115,
  "P1",        "Evening",          123,
  "P2",        "Morning",          118,
  "P2",        "Evening",          121
)

patients
# A tibble: 5 × 3
  patient_id measurement_time_ systolic_bp
  <chr>      <chr>                   <dbl>
1 P1         Morning                   120
2 P1         Noon                      115
3 P1         Evening                   123
4 P2         Morning                   118
5 P2         Evening                   121

The patients data frame has three variables (patient_id, measurement_time_, and systolic_bp, short for systolic blood pressure) and five rows (one per patient per measurement time).

Before writing any code, answer the following questions:

  • Suppose you want to reshape the data frame so that there is one row per patient and measurements at different times of the day are recorded in different columns. What function would you use to do this?

  • If you reshaped the data to have one row per patient,

    • how many rows would the resulting data frame have?
    • how many columns would the resulting data frame have and what would the column names be?

Then, write the code to reshape the data frame as described above. What does the NA value mean in the resulting data frame?

Render, commit, and push your changes to GitHub with a succinct and informative commit message.

Make sure to commit and push all changed files so that your Git pane is empty afterward.

Question 2

For this question, you will work with the following dataset called grad_years as well as the student scores dataset from Question 1a (scores), and join them with various join functions.

grad_years <- tribble(
  ~id,  ~graduation_year,
  "S1", 2023,
  "S3", 2023,
  "S5", 2025,
  "S6", 2024
)

grad_years
# A tibble: 4 × 2
  id    graduation_year
  <chr>           <dbl>
1 S1               2023
2 S3               2023
3 S5               2025
4 S6               2024
  a. Your friend writes the following code to join the scores and grad_years data frames and gets an error message:
scores |>
  left_join(grad_years)
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

What does the error message mean and why does it occur? How would you fix the code?

  b. How many rows and columns does the resulting data frame from part (a) have? Explain why.

  c. Don’t write any code yet: Suppose you join the two data frames, scores and grad_years, with a right_join(), in that order. How many rows and columns would the resulting data frame have? Then, write the code to perform the join. If your guess wasn’t correct, discuss with your teammates before proceeding.

  d. Start with the code this time. Join the two data frames, scores and grad_years, with an inner_join(), in that order. How many rows and columns does the resulting data frame have? Explain why.

  e. Don’t write any code yet: Suppose you join the two data frames with an inner_join() again, but in the reverse order (grad_years first, then scores). Would you expect the resulting data frame to have the same number of rows as in the previous part or a different number of rows? Explain your reasoning. Then, write the code to perform the join. If your guess wasn’t correct, discuss with your teammates before proceeding.

  f. Don’t write any code yet: Suppose you join the two data frames, scores and grad_years, with an anti_join(), in that order. Which observation(s) would be in the resulting data frame? Then, write the code to perform the join. If your guess wasn’t correct, discuss with your teammates before proceeding.

  g. Don’t write any code yet: Suppose you join the two data frames with an anti_join() again, but in the reverse order (grad_years first, then scores). Which observation(s) would be in the resulting data frame? Then, write the code to perform the join. If your guess wasn’t correct, discuss with your teammates before proceeding.

  h. Start with the code this time. Join the two data frames, scores and grad_years, with a full_join(), in that order. How many rows and columns does the resulting data frame have? Explain why.

  i. You’re interested in all students who are in the scores data frame and you also need their graduation years. You want to find out which students are missing graduation years. Which join function would you use to achieve this? Write the code to perform the join to confirm your answer.

Render, commit, and push your changes to GitHub with a succinct and informative commit message.

Make sure to commit and push all changed files so that your Git pane is empty afterward.

Wrap-up

Warning

Before you wrap up the assignment, make sure that you render, commit, and push one final time so that the final versions of both your .qmd file and the rendered PDF are pushed to GitHub and your Git pane is empty. We will be checking these to make sure you have been practicing how to commit and push changes.

Submission

By now you should also be familiar with how to submit your assignment in Gradescope.

If you need a refresher on how to submit your assignment in Gradescope, follow the steps below.

Submit your PDF document to Gradescope by the end of the lab to be considered “on time”:

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials → Duke NetID and log in using your NetID credentials.
  • Click on your STA 199 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark the pages associated with each question. All the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Checklist

Make sure you have:

  • attempted all questions
  • rendered your Quarto document
  • committed and pushed everything to your GitHub repository such that the Git pane in RStudio is empty
  • uploaded your PDF to Gradescope

Grading and feedback

  • This lab is worth 30 points:
    • 10 points for being in lab and turning in something – no partial credit for this part.
    • 20 points for:
      • answering the questions correctly – there is partial credit for this part.
      • following the workflow – there is partial credit for this part.
  • The workflow points are for:
    • committing at least three times as you work through your lab,
    • having your final version of .qmd and .pdf files in your GitHub repository, and
    • overall organization.
  • You’ll receive feedback on your lab on Gradescope within a week.

Good luck, and have fun with it!

Footnotes

  1. Remember, haikus not novellas when writing code!