HW 3

Inflation everywhere

Due: Sun, Sep 28, 11:59 pm

Introduction

This is a two-part homework assignment:

Part 1 – 🤖 Feedback from AI: Not graded, for practice, you get immediate feedback with AI, based on rubrics designed by the course instructor. Complete in hw-3-part-1.qmd, no submission required.
Part 2 – 🧑🏽‍🏫 Feedback from Humans: Graded, you get feedback from the course instructional team within a week. Complete in hw-3-part-2.qmd, submit on Gradescope.

By now you should be familiar with how to get started with a homework assignment by cloning the GitHub repo for the assignment.

Refresher: how to get started with a homework assignment

Go to https://cmgr.oit.duke.edu/containers and login with your Duke NetID and Password.
Click STA199 under My reservations to log into your container. You should now see the RStudio environment.
Go to the course organization at github.com/sta199-f25 on GitHub. Click on the repo with the prefix hw-3. It contains the starter documents you need to complete the homework.
Click on the green CODE button, select Use SSH. Click on the clipboard icon to copy the repo URL.
In RStudio, go to File ➛ New Project ➛ Version Control ➛ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.

By now you should also be familiar with guidelines for formatting your code and plots as well as your Git and Gradescope workflow.

Refresher: assignment guidelines

Code

Code should follow the tidyverse style. Particularly,

there should be spaces before and line breaks after each + when building a ggplot,
there should also be spaces before and line breaks after each |> in a data transformation pipeline,
code should be properly indented,
there should be spaces around = signs and spaces after commas.

Additionally, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.
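
As a rough illustration of these style points, here is a minimal sketch on a made-up data frame. The names toy_prices, toy_growth, and toy_plot are hypothetical and are not part of the assignment data.

```r
library(tidyverse)

# Hypothetical toy data, not part of the assignment
toy_prices <- tibble(
  year = c(2021, 2022, 2023),
  price = c(100, 108, 112)
)

# Space before and line break after each |>, with indented continuation lines
toy_growth <- toy_prices |>
  mutate(change = price - lag(price)) |>
  filter(!is.na(change))

# Space before and line break after each +, spaces around = and after commas
toy_plot <- ggplot(toy_growth, aes(x = year, y = change)) +
  geom_col() +
  labs(
    title = "Year-over-year change in a toy price index",
    x = "Year",
    y = "Change (index points)"
  )
```

Breaking the long labs() call across several lines is also one way to keep code from running off the page in the PDF.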

Plots

Plots should have an informative title and, if needed, also a subtitle.
Axes and legends should be labeled with both the variable name and its units (if applicable).
Careful consideration should be given to aesthetic choices.

Workflow

Continuing to develop a sound workflow for reproducible data analysis is important as you complete this and other assignments in this course.

You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Final versions of both your .qmd file and the rendered PDF should be pushed to GitHub.

Part 1 – Feedback from AI

Your answers to the questions in this part should go in the file hw-3-part-1.qmd.

Instructions

Write your answer to each question in the appropriate section of the hw-3-part-1.qmd file. Then, highlight your answer to a question and click on Addins > AIFEEDR > Get feedback. In the app that opens, select the appropriate homework number (3) and question number. Then click on Get Feedback. Please be patient; feedback generation can take a few seconds. Once you have read the feedback, you can go back to your Quarto document to improve your answer. You will then need to click the red X in the top left corner of the Viewer pane to stop the feedback app from running before you can re-render your Quarto document.

Catch the AI making a mistake?

Submit a bug report with evidence of the mistake as well as your rationale for why it is a mistake to earn the opportunity for extra credit.

Your bug report must include screenshots of the relevant parts of your Quarto document as well as the “incorrect” feedback you received. Your bug report should be printed and turned in to Dr. Çetinkaya-Rundel in person by the beginning of class (11:45 am) on the Tuesday following the assignment due date, e.g. for HW 3 due on Sun, Sep 28, your bug report is due by the beginning of class on Tue, Sep 30. Electronic submissions will not be accepted. Your bug report, including screenshots, should fit on no more than two pages.

If your bug report is confirmed (i.e., I agree that AI indeed made a mistake), you will receive 1 extra point that will be added to your total points for the assignment.

Packages

In this part you will work with the tidyverse package, which is a collection of packages for doing data analysis in a “tidy” way.

library(tidyverse)

Data

For this part of the analysis you will work with inflation data from various countries in the world over 30 years.

country_inflation <- read_csv("data/country-inflation.csv")

Questions

Do not forget to render, commit, and push regularly, after each substantial change to your document (e.g., after answering each question). Use succinct and informative commit messages. Make sure to commit and push all changed files so that your Git pane is empty afterward.

Question 1

Get to know the data.

glimpse() at the country_inflation data frame and answer the following questions based on the output.
- How many rows does country_inflation have and what does each row represent?
- How many columns does country_inflation have and what does each column represent?
Display the names of the countries included in the dataset. How many distinct countries are there?

Tip

Find distinct countries with distinct() and print all countries with print(n = ___) added to your pipeline, where the blank is the number of distinct countries.
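
Here is a minimal sketch of that pattern on a made-up data frame; toy_pets and its columns are hypothetical stand-ins, not the assignment data.

```r
library(tidyverse)

# Hypothetical toy data with repeated values in the owner column
toy_pets <- tibble(
  owner = c("Ana", "Ben", "Ana", "Cy", "Ben"),
  pet = c("cat", "dog", "fish", "cat", "bird")
)

# Keep one row per distinct owner
toy_owners <- toy_pets |>
  distinct(owner)

# Ask print() to show all 3 rows instead of the default number of rows
toy_owners |>
  print(n = 3)
```

With only three rows, print(n = 3) makes no visible difference here; it matters once the result has more rows than a tibble prints by default.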

Question 2

Which countries had the top three highest inflation rates in 2023? Your output should be a data frame with two columns, country and 2023, with inflation rates in descending order, and three rows for the top three countries. Briefly comment on how the inflation rates for these countries compare to the inflation rate for the United States in that year.

Tip

Column names that are numbers are not considered “proper” in R, therefore to select them you’ll need to surround them with backticks (`).
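
A small sketch of backtick-quoted column names, using a made-up data frame (toy_scores is hypothetical and unrelated to the inflation data):

```r
library(tidyverse)

# Hypothetical toy data whose column names are years
toy_scores <- tibble(
  team = c("A", "B", "C"),
  `2022` = c(10, 30, 20),
  `2023` = c(15, 5, 25)
)

# Backticks tell R that `2023` is a column name;
# a bare 2023 inside select() would be read as a column position instead
toy_2023 <- toy_scores |>
  select(team, `2023`) |>
  arrange(desc(`2023`))
```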

Question 3

In a single pipeline,

calculate the ratio of the inflation in 2023 and inflation in 1993 for each country and store this information in a new column called inf_ratio,
select the variables country and inf_ratio, and
store the result in a new data frame called country_inflation_ratios.

Then, in two separate pipelines,

arrange country_inflation_ratios in increasing order of inf_ratio and
arrange country_inflation_ratios in decreasing order of inf_ratio.

Which country’s inflation increase is the largest over this time period and by how much? Which country’s inflation decrease is the largest over this time period and by how much?

Tip

For this question you’ll once again need to use variables whose names are numbers (years) in your pipeline. Make sure to surround the names of such variables with backticks (`).

Question 4

Reshape (pivot) country_inflation such that each row represents a country/year combination. Then, display the resulting data frame and write a sentence stating the number of rows and columns as well as the names of the columns of the resulting data frame.

Requirements:

Your code must use one of pivot_longer() or pivot_wider(). There are other ways you can do this reshaping move in R, but this question requires solving this problem by pivoting.
In your pivot_*() function, you must use names_transform = as.numeric as an argument to transform the variable type to numeric as you pivot the data so that in the resulting data frame the year variable is numeric.
The resulting data frame must be saved as something other than country_inflation so you (1) can refer to this data frame later in your analysis and (2) do not overwrite country_inflation. Use a short but informative name.
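
To see what names_transform does, here is a hedged sketch on a made-up wide data frame. The names toy_wide, toy_long, and value are hypothetical, and pivot_longer() is used purely for illustration.

```r
library(tidyverse)

# Hypothetical wide data: one row per city, one column per year
toy_wide <- tibble(
  city = c("X", "Y"),
  `2022` = c(1.5, 2.0),
  `2023` = c(1.8, 2.4)
)

# Pivot every column except city; the old column names land in year.
# names_transform converts them from character to numeric while pivoting.
toy_long <- toy_wide |>
  pivot_longer(
    cols = -city,
    names_to = "year",
    values_to = "value",
    names_transform = as.numeric
  )
```

Without names_transform, the year column would be character, which affects how it sorts, filters, and appears on a plot axis.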

Warning

The last question in Part 1 as well as some questions in Part 2 require the use of the pivoted data frame from Question 4.

Question 5

Use a separate, single pipeline to answer each of the following questions.

Requirement: Your code must use the filter() function for each part, not arrange().

a. What is the highest inflation rate observed between 1993 and 2023? The output of the pipeline should be a data frame with one row and three columns. In addition to code and output, your response should include a single sentence stating the country and year.
b. What is the lowest inflation rate observed between 1993 and 2023? The output of the pipeline should be a data frame with one row and three columns. In addition to code and output, your response should include a single sentence stating the country and year.
c. Putting (a) and (b) together: What are the highest and the lowest inflation rates observed between 1993 and 2023? The output of the pipeline should be a data frame with two rows and three columns.
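
One way to find extreme values with filter() rather than arrange() is to compare a column to its own max() or min(). The sketch below uses a made-up data frame (toy_temps is hypothetical) and is only one possible approach.

```r
library(tidyverse)

# Hypothetical toy data in long format
toy_temps <- tibble(
  city = c("X", "X", "Y", "Y"),
  year = c(2022, 2023, 2022, 2023),
  temp = c(14.2, 15.1, 9.8, 21.3)
)

# Keep only the row(s) where temp equals the largest value
toy_max <- toy_temps |>
  filter(temp == max(temp))

# Combine two conditions with | to keep both extremes in one pipeline
toy_extremes <- toy_temps |>
  filter(temp == max(temp) | temp == min(temp))
```

Two caveats: ties produce more than one row per extreme, and if the column contains NA values then max() and min() return NA unless you add na.rm = TRUE.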

STOP: You must do this before proceeding to Part 2

At the end of your Part 1 Quarto file you’ll see a code cell like the following:

```{r}
#| eval: false
write_csv(_____, file = "data/country-inflation-pivoted.csv")
```

This will allow you to save the pivoted data frame from Question 4 as a CSV (comma-separated values) file called country-inflation-pivoted.csv in your data folder so that you can read it and use it in Part 2.

You need to modify the code cell by doing the following two things:

fill in the blank with the name of the pivoted data frame from Question 4 and
change the value of the code cell option eval from false to true so that the code cell will run and save the CSV file when you render your Quarto document.

Now is another good time to render, commit, and push your changes to GitHub with an informative and concise commit message. And once again, make sure to commit and push all changed files including the newly created dataset so that your Git pane is empty afterward. We keep repeating this because it’s important and because we see students forget to do this. So take a moment to make sure you’re practicing good version control habits.

Part 2 – Feedback from Humans

Your answers to the questions in this part should go in the file hw-3-part-2.qmd.

Packages

You will use the tidyverse package for data wrangling and visualization.

You may choose to load other packages as needed, particularly for improving your plots. If you do so, add the functions to load these packages in the code cell labeled load-packages.

library(tidyverse)

Question 6

More on inflation around the world.

For this question you will use the pivoted version of the country_inflation dataset that you created in Question 4 of Part 1. Load this dataset from the CSV file you created at the end of Part 1. To do so, you can use the following code, which is already in your hw-3-part-2.qmd file. However, you’ll need to set eval to true in the code cell options so that this code runs when you render your Quarto document. And, before you can do that, you must have completed Question 4 of Part 1 and saved the pivoted data frame as a CSV file in your data folder as described in the previous section.

country_inflation_pivoted <- read_csv("data/country-inflation-pivoted.csv")

a. Create a vector called countries_of_interest which contains the names of up to five countries you want to visualize the inflation rates for over the years. For example, if these countries are Türkiye and United States, you can express this as follows:

countries_of_interest <- c("Türkiye", "United States")

If they are Türkiye, United States, and Chile, you can express this as follows:

countries_of_interest <- c("Türkiye", "United States", "Chile")

So on and so forth… Then, in 1-2 sentences, state why you chose these countries.

Note

Your countries_of_interest should consist of no more than five countries. Make sure that the spelling of your countries matches how they appear in the dataset.

b. In a single pipeline, filter your pivoted dataset to include only the countries_of_interest from part (a), and save the resulting data frame with a new name so you (1) can refer to this data frame later in your analysis and (2) do not overwrite the data frame you’re starting with. Use a short but informative name. Then, in a new pipeline, find the distinct() countries in the data frame you created.

Tip

The number of distinct countries in the filtered data frame you created in part (b) should equal the number of countries you chose in part (a). If it doesn’t, you might have misspelled a country name or made a mistake in filtering for these countries. Go back and correct your work.
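
The check described in this tip can be sketched on made-up data as follows. The names toy_sales, stores_of_interest, and toy_subset are hypothetical, and %in% is one common way to filter against a vector of values.

```r
library(tidyverse)

# Hypothetical toy data
toy_sales <- tibble(
  store = c("North", "South", "East", "North", "West"),
  units = c(5, 7, 3, 6, 2)
)

stores_of_interest <- c("North", "East")

# %in% keeps rows whose store matches any element of the vector
toy_subset <- toy_sales |>
  filter(store %in% stores_of_interest)

# The number of distinct stores should equal length(stores_of_interest)
toy_subset |>
  distinct(store)
```

Using == in place of %in% compares element by element with recycling, so it can silently drop rows you meant to keep.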

c. Using your data frame from the previous question, create a plot of annual inflation vs. year for these countries. Then, in a few sentences, describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of these countries' economies.

Requirements for the plot:

Data should be represented with points as well as lines connecting the points for each country.
Each country should be represented by a different color line and different color and shape points.
Axes and legend should be properly labeled.
The plot should have an appropriate title (and optionally a subtitle).
Plot should be customized in at least one way – you could use a different than default color scale, or different than default theme, or some other customization.
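
As a rough template for a plot that meets requirements like these, here is a sketch on made-up data. The data frame toy_series, its columns, and the particular color scale and theme are all illustrative assumptions, not the expected answer.

```r
library(tidyverse)

# Hypothetical toy data: two groups observed over three years
toy_series <- tibble(
  group = rep(c("A", "B"), each = 3),
  year = rep(2021:2023, times = 2),
  value = c(1.0, 1.4, 1.2, 2.1, 1.9, 2.5)
)

# Mapping color and shape to the same variable gives one merged legend,
# as long as both legends get the same title in labs()
toy_plot <- ggplot(
  toy_series,
  aes(x = year, y = value, color = group, shape = group)
) +
  geom_line() +
  geom_point() +
  scale_color_viridis_d() +
  labs(
    title = "Toy values over time",
    subtitle = "Two hypothetical groups",
    x = "Year",
    y = "Value (arbitrary units)",
    color = "Group",
    shape = "Group"
  ) +
  theme_minimal()
```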

Questions 7-9

Inflation in the US.

The OECD defines inflation as follows:

Inflation is a rise in the general level of prices of goods and services that households acquire for the purpose of consumption in an economy over a period of time.

The main measure of inflation is the annual inflation rate which is the movement of the Consumer Price Index (CPI) from one month/period to the same month/period of the previous year expressed as percentage over time.

Source: OECD CPI FAQ
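
To make the definition concrete, here is a small worked calculation. It assumes the annual rate is the percent change in CPI relative to the same period one year earlier, and the two CPI values are made up for illustration.

```r
# Hypothetical CPI values for the same month in two consecutive years
cpi_last_year <- 250
cpi_this_year <- 260

# Percent change relative to the earlier period
annual_inflation <- (cpi_this_year - cpi_last_year) / cpi_last_year * 100
annual_inflation
```

Under these made-up values, prices rose by 4 percent over the year.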

CPI is broken down into 12 expenditures such as food, housing, health, etc. Your goal in this part is to create another time series plot of annual inflation, this time for the US only.

The data you will need to create this visualization is spread across two files:

us-inflation.csv: Annual inflation rate for the US for 12 CPI expenditures. Each expenditure is identified by an ID number.
cpi-expenditures.csv: A “lookup table” of CPI expenditure ID numbers and their descriptions.

Let’s load both of these files.

us_inflation <- read_csv("data/us-inflation.csv")
cpi_expenditures <- read_csv("data/cpi-expenditures.csv")

Question 7

a. How many columns and how many rows does the us_inflation dataset have? What are the variables in it? Which years do these data span? Write a brief (1-2 sentences) narrative summarizing this information.

b. How many columns and how many rows does the cpi_expenditures dataset have? What are the variables in it? Write a brief (1-2 sentences) narrative summarizing this information.

c. Create a new dataset by joining the us_inflation dataset with the cpi_expenditures dataset.

Determine which type of join is the most appropriate one and use that.
Note that the two datasets don’t have a variable with a common name, though they do have variables that contain common information but are named differently. You will need to first figure out which variables those are, and then define the by argument and use the join_by() function to indicate these variables to join the datasets by.
Use a short but informative name for the joined dataset, and do not overwrite either of the datasets that go into creating it.
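
The syntax for joining on differently named key columns can be sketched on made-up data as follows. The data frames toy_orders and toy_products and their columns are hypothetical, and left_join() is used only to illustrate the syntax; deciding which join type suits the question is still up to you.

```r
library(tidyverse)

# Hypothetical toy data: the key is called product_code here...
toy_orders <- tibble(
  order = 1:4,
  product_code = c(10, 20, 10, 30)
)

# ...and id in the lookup table
toy_products <- tibble(
  id = c(10, 20, 30),
  product_name = c("pen", "ink", "pad")
)

# join_by(x_column == y_column) matches keys with different names;
# the result keeps the key column from the left data frame only
toy_joined <- toy_orders |>
  left_join(toy_products, by = join_by(product_code == id))
```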

Then, find the number of rows and columns of the resulting dataset and report the names of its columns. Add a brief (1-2 sentences) narrative summarizing this information.

Question 8

a. Create a vector called expenditures_of_interest which contains the descriptions or IDs of CPI expenditures you want to visualize. Your expenditures_of_interest should consist of no more than five expenditures. If you’re using descriptions, make sure that the spelling of your expenditures matches how they appear in the dataset. Then, in 1-2 sentences, state why you chose these expenditures.

Tip

Refer back to the guidance provided in Question 6 if you’re not sure how to create this vector.

b. In a single pipeline, filter your joined dataset to include only the expenditures_of_interest from part (a), and save the resulting data frame with a new name so you (1) can refer to this data frame later in your analysis and (2) do not overwrite the data frame you’re starting with. Use a short but informative name. Then, in a new pipeline, find the distinct() expenditures in the data frame you created.

Question 9

Using your data frame from the previous question, create a plot of annual inflation vs. year for these expenditures. Then, in a few sentences, describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of inflation rates in the US over the last decade.

Requirements for the plot:

Data should be represented with points as well as lines connecting the points for each expenditure.
Each expenditure should be represented by a different color line and different color and shape points.
Axes and legend should be properly labeled.
The plot should have an appropriate title (and optionally a subtitle).
Plot should be customized in at least one way – you could use a different than default color scale, or different than default theme, or some other customization.
If your legend has labels that are too long, you can try moving the legend to the bottom and stacking the labels vertically. Hint: The legend.position and legend.direction arguments of the theme() function will be useful.

ggplot(...) +
  ... +
  theme(
    legend.position = "bottom",
    legend.direction = "vertical"
  )

Question 10

Handling missing values after joins.

In class last week, we worked with the sales_taxes dataset and joined it to another dataset called us_regions so that we ended up with a data frame containing information on each state (plus DC), the sales tax rate in that state, and the region the state is in. While DC was in our first dataset, it was not in the second, resulting in an NA value for the region variable in the joined dataset. We discussed that there are three approaches we could take to address this issue before calculating the average sales tax of states in each region.

a. Read in the two datasets, sales-taxes-25.csv and us-regions.csv, which are both in your data folder. Then, join the two datasets using the appropriate type of join to be able to calculate the average sales tax rate by region, not overwriting either data frame that goes into the join.

b. Apply the three approaches we discussed in class to address the issue that DC has an NA value for region in the joined dataset, and then calculate the average sales tax of states in each region.

c. Choose one of the three approaches as the “most appropriate” approach and provide a brief (1-2 sentences) rationale for your choice.

Tip

You can refer back to AE 06: Sales taxes + data joining if you need a reminder of these data and how to join them.

Wrap-up

Warning

Before you wrap up the assignment, make sure that you render, commit, and push one final time so that the final versions of both your .qmd file and the rendered PDF are pushed to GitHub and your Git pane is empty. We will be checking these to make sure you have been practicing how to commit and push changes.

Submission

Submit your PDF document to Gradescope by the deadline to be considered “on time”:

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➛ Duke NetID and log in using your NetID credentials.
Click on your STA 199 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with each question. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).

Checklist

Make sure you have:

attempted all questions
rendered your Quarto document
committed and pushed everything to your GitHub repository such that the Git pane in RStudio is empty
uploaded your PDF to Gradescope

Grading and feedback

Questions 1-5 are not graded, but you should complete them to get practice.
Questions 6-10 are graded, and you will receive feedback on Gradescope from the course instructional team within a week.
- Questions will be graded for accuracy and completeness.
- Partial credit will be given where appropriate.
- There are also workflow points for:
  - committing at least three times as you work through your homework,
  - having your final version of .qmd and .pdf files in your GitHub repository, and
  - overall organization.