Lab 3
A pivot and a join walk into a pipe…
Introduction
In this lab you’ll get to practice your data tidying skills, specifically reshaping data with pivot functions and bringing two data frames together with join functions.
Make sure to upload your completed lab to Gradescope by the end of your lab session and commit and push your final version to GitHub.
Getting started
By now you should be familiar with how to get started with a lab assignment by cloning the GitHub repo for the assignment.
If you need a refresher on how to get started with a lab assignment, follow these steps:
- Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and password.
- Click STA199 under My reservations to log into your container. You should now see the RStudio environment.
- Go to the course organization at github.com/sta199-f25 on GitHub. Click on the repo with the prefix lab-3. It contains the starter documents you need to complete the lab.
- Click on the green CODE button, select Use SSH. Click on the clipboard icon to copy the repo URL.
- In RStudio, go to File ➛ New Project ➛ Version Control ➛ Git.
- Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
- Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Open the lab-3.qmd template Quarto file and update the authors field to add your name first (first and last) and then your teammates’ names (first and last). Render the document. Examine the rendered document and make sure your and your teammates’ names are updated in the document. Commit your changes with a meaningful commit message and push to GitHub.
If you need a refresher on assignment guidelines, review the following.
Code
Code should follow the tidyverse style. Particularly,
- there should be spaces before and line breaks after each + when building a ggplot,
- there should also be spaces before and line breaks after each |> in a data transformation pipeline,
- code should be properly indented,
- there should be spaces around = signs and spaces after commas.
Additionally, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.1
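As a rough illustration of the style points above, here is a minimal sketch using the mpg dataset that ships with ggplot2. The object name efficient_cars, the hwy_kml column, and the plot labels are made up for this example and are not part of the lab.

```r
library(tidyverse)

# Pipeline: a space before and a line break after each |>,
# with the continued lines indented by two spaces
efficient_cars <- mpg |>
  filter(year == 2008) |>
  mutate(hwy_kml = hwy * 0.425) # approximate miles per gallon to km per liter

# Plot: a space before and a line break after each +,
# spaces around = signs and after commas, long calls split across lines
ggplot(efficient_cars, aes(x = displ, y = hwy_kml)) +
  geom_point() +
  labs(
    title = "Highway fuel economy vs. engine displacement",
    subtitle = "2008 model-year cars in the mpg dataset",
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (km per liter)"
  )
```

Splitting the labs() call over several lines keeps every line short enough to stay on the page in the rendered PDF.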
Plots
- Plots should have an informative title and, if needed, also a subtitle.
- Axes and legends should be labeled with both the variable name and its units (if applicable).
- Careful consideration should be given to aesthetic choices.
Workflow
Continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course.
- You should have at least 3 commits with meaningful commit messages by the end of the assignment.
- Final versions of both your .qmd file and the rendered PDF should be pushed to GitHub.
Packages
In this lab we will work with the tidyverse package.
Questions
Question 1
Pivot longer and wider.
a. For this part, you will work with the following dataset called scores.
scores <- tribble(
  ~student_id, ~math, ~english,
  "S1", 90, 92,
  "S2", 85, 80,
  "S3", 88, 85,
  "S4", 95, 74
)
scores
# A tibble: 4 × 3
  student_id  math english
  <chr>      <dbl>   <dbl>
1 S1            90      92
2 S2            85      80
3 S3            88      85
4 S4            95      74
The tribble() function is helpful for creating small data frames (tibbles) with an easier-to-read row-by-row layout.
The scores data frame has three variables (student_id, math, and english) and four rows (one for each student).
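To see what the row-by-row layout buys you, the sketch below builds the same small data frame twice, once with tribble() and once column by column with tibble(). The pets data is hypothetical and only serves as an illustration.

```r
library(tidyverse)

# Hypothetical data, not part of the lab.
# Row-by-row layout: each line of code reads like one row of the table
pets_by_row <- tribble(
  ~name, ~age,
  "Rex", 3,
  "Mia", 5
)

# Column-by-column layout: the same values, grouped by variable instead
pets_by_col <- tibble(
  name = c("Rex", "Mia"),
  age = c(3, 5)
)

# Compare the two data frames
all.equal(pets_by_row, pets_by_col)
```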
Before writing any code, answer the following questions:
- Suppose you want to reshape the data frame so that there is one row per student per subject. What function would you use to do this?
- If you reshaped the data to have one row per student per subject,
  - how many rows would the resulting data frame have?
  - how many columns would the resulting data frame have and what would the column names be?
Then, write the code to reshape the data frame as described above.
Render, commit, and push your changes to GitHub with a succinct and informative commit message.
Make sure to commit and push all changed files so that your Git pane is empty afterward.
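If the mechanics of lengthening a data frame feel fuzzy, the following sketch may help. It uses a made-up rainfall data frame (the data, the rainfall_long name, and the new column names are all assumptions for illustration) and is meant as a pattern to adapt, not as the answer to this part.

```r
library(tidyverse)

# Hypothetical data: one row per city, one column per month
rainfall <- tribble(
  ~city, ~jan, ~feb, ~mar,
  "Durham", 89, 76, 104,
  "Raleigh", 90, 80, 105
)

# Lengthen: the month columns become values of a new "month" column,
# and their cell values move into a new "rainfall_mm" column
rainfall_long <- rainfall |>
  pivot_longer(
    cols = jan:mar,
    names_to = "month",
    values_to = "rainfall_mm"
  )

rainfall_long
```

Each of the 2 cities contributes one row per month column, so the result has 2 × 3 = 6 rows and the columns city, month, and rainfall_mm.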
b. For this part, you will work with the following dataset called patients.
patients <- tribble(
  ~patient_id, ~measurement_time_, ~systolic_bp,
  "P1", "Morning", 120,
  "P1", "Noon", 115,
  "P1", "Evening", 123,
  "P2", "Morning", 118,
  "P2", "Evening", 121
)
patients
# A tibble: 5 × 3
  patient_id measurement_time_ systolic_bp
  <chr>      <chr>                   <dbl>
1 P1         Morning                   120
2 P1         Noon                      115
3 P1         Evening                   123
4 P2         Morning                   118
5 P2         Evening                   121
The patients data frame has three variables (patient_id, measurement_time_, and systolic_bp, which is short for systolic blood pressure) and five rows (one per patient per measurement time).
Before writing any code, answer the following questions:
- Suppose you want to reshape the data frame so that there is one row per patient and measurements at different times of the day are recorded in different columns. What function would you use to do this?
- If you reshaped the data to have one row per patient,
  - how many rows would the resulting data frame have?
  - how many columns would the resulting data frame have and what would the column names be?
Then, write the code to reshape the data frame as described above. What does the NA value mean in the resulting data frame?
Render, commit, and push your changes to GitHub with a succinct and informative commit message.
Make sure to commit and push all changed files so that your Git pane is empty afterward.
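For the opposite direction, here is a minimal sketch of widening, again on made-up data. The sales data frame and the sales_wide name are hypothetical, and every store has a value for every quarter, so this example deliberately does not show what happens with missing combinations.

```r
library(tidyverse)

# Hypothetical data: one row per store per quarter
sales <- tribble(
  ~store, ~quarter, ~revenue,
  "A", "Q1", 10,
  "A", "Q2", 12,
  "B", "Q1", 8,
  "B", "Q2", 9
)

# Widen: each distinct value of quarter becomes its own column,
# filled with the matching revenue values
sales_wide <- sales |>
  pivot_wider(
    names_from = quarter,
    values_from = revenue
  )

sales_wide
```

The result has one row per store and the columns store, Q1, and Q2.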
Question 2
For this question, you will work with the following dataset called grad_years as well as the student scores dataset from Question 1a (scores), and join them with various join functions.
grad_years <- tribble(
  ~id, ~graduation_year,
  "S1", 2023,
  "S3", 2023,
  "S5", 2025,
  "S6", 2024
)
grad_years
# A tibble: 4 × 2
  id    graduation_year
  <chr>           <dbl>
1 S1               2023
2 S3               2023
3 S5               2025
4 S6               2024
a. Your friend writes the following code to join the scores and grad_years data frames and gets an error message:
scores |>
  left_join(grad_years)
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
What does the error message mean and why does it occur? How would you fix the code?
b. How many rows and columns does the resulting data frame from part (a) have? Explain why.
c. Don’t write any code yet: Suppose you join the two data frames, scores and grad_years, with a right_join(), in that order. How many rows and columns would the resulting data frame have? Then, write the code to perform the join. If your guess wasn’t correct, discuss with your teammates before proceeding.
d. Start with the code this time. Join the two data frames, scores and grad_years, with an inner_join(), in that order. How many rows and columns does the resulting data frame have? Explain why.
e. Don’t write any code yet: Suppose you join the two data frames, grad_years and scores, with an inner_join() again, but in the reverse order (grad_years first, then scores). Would you expect the resulting data frame to have the same number of rows as the previous part or a different number of rows? Explain your reasoning. Then, write the code to perform the join. If your guess wasn’t correct, discuss with your teammates before proceeding.
f. Don’t write any code yet: Suppose you join the two data frames, scores and grad_years, with an anti_join(). Which observation(s) would be in the resulting data frame? Then, write the code to perform the join. If your guess wasn’t correct, discuss with your teammates before proceeding.
g. Don’t write any code yet: Suppose you join the two data frames, grad_years and scores, with an anti_join() again, but in the reverse order (grad_years first, then scores). Which observation(s) would be in the resulting data frame? Then, write the code to perform the join. If your guess wasn’t correct, discuss with your teammates before proceeding.
h. Start with the code this time. Join the two data frames, scores and grad_years, with a full_join(), in that order. How many rows and columns does the resulting data frame have? Explain why.
i. You’re interested in all students who are in the scores data frame and you also need their graduation years. You want to find out which students are missing graduation years. Which join function would you use to achieve this? Write the code to perform the join to confirm your answer.
Render, commit, and push your changes to GitHub with a succinct and informative commit message.
Make sure to commit and push all changed files so that your Git pane is empty afterward.
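As a general pattern for joining tables whose key columns have different names, the sketch below uses two made-up data frames, employees and badges (both hypothetical, as are the result names). It assumes dplyr 1.1.0 or later, where join_by() is available, and shows one possible way to specify keys rather than the required approach for this lab.

```r
library(tidyverse)

# Hypothetical data: the key column is emp_id in one table and id in the other
employees <- tribble(
  ~emp_id, ~name,
  "E1", "Ana",
  "E2", "Ben",
  "E3", "Cy"
)

badges <- tribble(
  ~id, ~badge_color,
  "E2", "blue",
  "E4", "red"
)

# join_by() states which column in x (left) matches which column in y (right)
with_badges <- employees |>
  left_join(badges, by = join_by(emp_id == id))

# anti_join() keeps only the rows of x that have no match in y
no_badge <- employees |>
  anti_join(badges, by = join_by(emp_id == id))

with_badges
no_badge
```

Printing each result next to the two inputs is a quick way to check a prediction about row and column counts before moving on.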
Wrap-up
Before you wrap up the assignment, make sure that you render, commit, and push one final time so that the final versions of both your .qmd file and the rendered PDF are pushed to GitHub and your Git pane is empty. We will be checking these to make sure you have been practicing how to commit and push changes.
Submission
By now you should also be familiar with how to submit your assignment in Gradescope.
If you need a refresher on how to submit your assignment, follow the steps below.
Submit your PDF document to Gradescope by the end of the lab to be considered “on time”:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials ➛ Duke NetID and log in using your NetID credentials.
- Click on your STA 199 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each question. All the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Make sure you have:
- attempted all questions
- rendered your Quarto document
- committed and pushed everything to your GitHub repository such that the Git pane in RStudio is empty
- uploaded your PDF to Gradescope
Grading and feedback
- This lab is worth 30 points:
- 10 points for being in lab and turning in something – no partial credit for this part.
- 20 points for:
- answering the questions correctly – there is partial credit for this part.
- following the workflow – there is partial credit for this part.
- The workflow points are for:
- committing at least three times as you work through your lab,
- having your final version of .qmd and .pdf files in your GitHub repository, and
- overall organization.
- You’ll receive feedback on your lab on Gradescope within a week.
Good luck, and have fun with it!
Footnotes
1. Remember, haikus not novellas when writing code!