HW 3
Inflation everywhere
Introduction
This is a two-part homework assignment:
- Part 1 – 🤖 Feedback from AI: Not graded, for practice; you get immediate feedback from AI, based on rubrics designed by the course instructor. Complete in hw-3-part-1.qmd, no submission required.
- Part 2 – 🧑🏽🏫 Feedback from Humans: Graded; you get feedback from the course instructional team within a week. Complete in hw-3-part-2.qmd, submit on Gradescope.
By now you should be familiar with how to get started with a homework assignment by cloning the GitHub repo for the assignment.
Click to expand if you need a refresher on how to get started with a homework assignment.
- Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and Password.
- Click STA199 under My reservations to log into your container. You should now see the RStudio environment.
- Go to the course organization at github.com/sta199-f25 on GitHub. Click on the repo with the prefix hw-3. It contains the starter documents you need to complete the homework.
- Click on the green CODE button, select Use SSH. Click on the clipboard icon to copy the repo URL.
- In RStudio, go to File ➛ New Project ➛ Version Control ➛ Git.
- Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
- Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
By now you should also be familiar with guidelines for formatting your code and plots as well as your Git and Gradescope workflow.
Click to expand if you need a refresher on assignment guidelines.
Code
Code should follow the tidyverse style. Particularly,
- there should be spaces before and line breaks after each + when building a ggplot,
- there should also be spaces before and line breaks after each |> in a data transformation pipeline,
- code should be properly indented,
- there should be spaces around = signs and spaces after commas.
Additionally, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.
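As a rough illustration of the style points above, here is a minimal sketch on made-up data. The data frame toy and its values are hypothetical and are not part of the assignment's datasets.

```r
library(tidyverse)

# Toy data (hypothetical values, not from the assignment's datasets)
toy <- tibble(
  year = c(2021, 2022, 2023),
  rate = c(1.5, 3.0, 2.2)
)

# Pipeline: a space before and a line break after each |>,
# with continuation lines indented
toy_recent <- toy |>
  filter(year >= 2022) |>
  mutate(rate_doubled = rate * 2)

# Plot: a space before and a line break after each +,
# spaces around = and after commas, long calls split across lines
p <- ggplot(toy, aes(x = year, y = rate)) +
  geom_line() +
  labs(
    title = "Toy inflation rate over time",
    x = "Year",
    y = "Inflation rate (%)"
  )
```

Splitting the labs() call across lines also keeps the code from running off the page in the PDF.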
Plots
- Plots should have an informative title and, if needed, also a subtitle.
- Axes and legends should be labeled with both the variable name and its units (if applicable).
- Careful consideration should be given to aesthetic choices.
Workflow
Continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course.
- You should have at least 3 commits with meaningful commit messages by the end of the assignment.
- Final versions of both your .qmd file and the rendered PDF should be pushed to GitHub.
Part 1 – Feedback from AI
Your answers to the questions in this part should go in the file hw-3-part-1.qmd.
Instructions
Write your answer to each question in the appropriate section of the hw-3-part-1.qmd file. Then, highlight your answer to a question and click on Addins > AIFEEDR > Get feedback. In the app that opens, select the appropriate homework number (3) and question number. Then click on Get Feedback. Please be patient; feedback generation can take a few seconds. Once you have read the feedback, you can go back to your Quarto document to improve your answer based on it. Before you can re-render your Quarto document, you will need to click the red X in the top left corner of the Viewer pane to stop the feedback app from running.
Click to expand if you want to review the video that demonstrates how to use the AI feedback tool.
Packages
In this part you will work with the tidyverse package, which is a collection of packages for doing data analysis in a “tidy” way.
Data
For this part of the analysis you will work with inflation data from various countries in the world over 30 years.
country_inflation <- read_csv("data/country-inflation.csv")
Questions
Do not forget to render, commit, and push regularly, after each substantial change to your document (e.g., after answering each question). Use succinct and informative commit messages. Make sure to commit and push all changed files so that your Git pane is empty afterward.
Question 1
Get to know the data.
- glimpse() at the country_inflation data frame and answer the following questions based on the output.
  - How many rows does country_inflation have and what does each row represent?
  - How many columns does country_inflation have and what does each column represent?
- Display the names of the countries included in the dataset. How many distinct countries are there?
Find distinct countries with distinct() and print all countries with print(n = ___) added to your pipeline, where the blank is the number of distinct countries.
Question 2
Which countries had the top three highest inflation rates in 2023? Your output should be a data frame with two columns, country and 2023, with inflation rates in descending order, and three rows for the top three countries. Briefly comment on how the inflation rates for these countries compare to the inflation rate for the United States in that year.
Column names that are numbers are not considered “proper” in R, therefore to select them you’ll need to surround them with backticks (`).
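To see how backticks work with numeric column names, here is a minimal sketch on a made-up data frame. The name toy_wide, its countries, and its values are hypothetical and do not come from country_inflation.

```r
library(tidyverse)

# Toy data with year-named columns (hypothetical values)
toy_wide <- tibble(
  country = c("A", "B", "C"),
  `1993` = c(2.0, 10.0, 4.0),
  `2023` = c(4.0, 5.0, 4.0)
)

# Backticks tell R that 2023 is a column name;
# a bare 2023 would be read as a column position instead
toy_selected <- toy_wide |>
  select(country, `2023`)

# The same applies when computing with these columns
toy_change <- toy_wide |>
  mutate(change = `2023` - `1993`)
```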
Question 3
In a single pipeline,
- calculate the ratio of the inflation in 2023 and inflation in 1993 for each country and store this information in a new column called inf_ratio,
- select the variables country and inf_ratio, and
- store the result in a new data frame called country_inflation_ratios.
Then, in two separate pipelines,
- arrange country_inflation_ratios in increasing order of inf_ratio and
- arrange country_inflation_ratios in decreasing order of inf_ratio.
Which country’s inflation increase is the largest over this time period and by how much? Which country’s inflation decrease is the largest over this time period and by how much?
For this question you’ll once again need to use variables whose names are numbers (years) in your pipeline. Make sure to surround the names of such variables with backticks (`).
Question 4
Reshape (pivot) country_inflation such that each row represents a country/year combination. Then, display the resulting data frame and write a sentence stating its number of rows and columns as well as the names of its columns.
Requirements:
- Your code must use one of pivot_longer() or pivot_wider(). There are other ways you can do this reshaping move in R, but this question requires solving this problem by pivoting.
- In your pivot_*() function, you must use names_transform = as.numeric as an argument to transform the variable type to numeric as you pivot the data so that in the resulting data frame the year variable is numeric.
- The resulting data frame must be saved as something other than country_inflation so you (1) can refer to this data frame later in your analysis and (2) do not overwrite country_inflation. Use a short but informative name.
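To show what names_transform does during a pivot, here is a minimal sketch on a made-up data frame. The names toy_wide and toy_long, the region column, and all values are hypothetical. The sketch uses pivot_longer() only as an illustration of the argument.

```r
library(tidyverse)

# Toy wide data: one column per year (hypothetical values)
toy_wide <- tibble(
  region = c("North", "South"),
  `2001` = c(1.1, 2.2),
  `2002` = c(3.3, 4.4)
)

# Column names are character strings, so without names_transform
# the new year column would be character rather than numeric
toy_long <- toy_wide |>
  pivot_longer(
    cols = -region,
    names_to = "year",
    values_to = "value",
    names_transform = as.numeric
  )

glimpse(toy_long)
```

Saving the result under a new name leaves toy_wide untouched, which mirrors the third requirement above.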
The last question in Part 1 as well as some questions in Part 2 require the use of the pivoted data frame from Question 4.
Question 5
Use a separate, single pipeline to answer each of the following questions.
Requirement: Your code must use the filter() function for each part, not arrange().
a. What is the highest inflation rate observed between 1993 and 2023? The output of the pipeline should be a data frame with one row and three columns. In addition to code and output, your response should include a single sentence stating the country and year.
b. What is the lowest inflation rate observed between 1993 and 2023? The output of the pipeline should be a data frame with one row and three columns. In addition to code and output, your response should include a single sentence stating the country and year.
c. Putting (a) and (b) together: What are the highest and the lowest inflation rates observed between 1993 and 2023? The output of the pipeline should be a data frame with two rows and three columns.
At the end of your Part 1 Quarto file you’ll see a code cell like the following:
```{r}
#| eval: false
write_csv(_____, file = "data/country-inflation-pivoted.csv")
```
This will allow you to save the pivoted data frame from Question 4 as a CSV (comma-separated values) file called country-inflation-pivoted.csv in your data folder so that you can read it and use it in Part 2.
You need to modify the code cell by doing the following two things:
- fill in the blank with the name of the pivoted data frame from Question 4 and
- change the value of the code cell option eval from false to true so that the code cell will run and save the CSV file when you render your Quarto document.
Now is another good time to render, commit, and push your changes to GitHub with an informative and concise commit message. And once again, make sure to commit and push all changed files including the newly created dataset so that your Git pane is empty afterward. We keep repeating this because it’s important and because we see students forget to do this. So take a moment to make sure you’re practicing good version control habits.
Part 2 – Feedback from Humans
Your answers to the questions in this part should go in the file hw-3-part-2.qmd.
Packages
You will use the tidyverse package for data wrangling and visualization.
You may choose to load other packages as needed, particularly for improving your plots. If you do so, add the functions to load these packages in the code cell labeled load-packages.
Question 6
More on inflation around the world.
For this question you will use the pivoted version of the country_inflation dataset that you created in Question 4 of Part 1. Load this dataset from the CSV file you created at the end of Part 1. To do so, you can use the following code, which is already in your hw-3-part-2.qmd file. However, you’ll need to set eval to true in the code cell options so that this code runs when you render your Quarto document. And, before you can do that, you must have completed Question 4 of Part 1 and saved the pivoted data frame as a CSV file in your data folder as described in the previous section.
country_inflation_pivoted <- read_csv("data/country-inflation-pivoted.csv")
a. Create a vector called countries_of_interest which contains the names of up to five countries you want to visualize the inflation rates for over the years. For example, if these countries are Türkiye and United States, you can express this as follows:
countries_of_interest <- c("Türkiye", "United States")
If they are Türkiye, United States, and Chile, you can express this as follows:
countries_of_interest <- c("Türkiye", "United States", "Chile")
So on and so forth… Then, in 1-2 sentences, state why you chose these countries.
Your countries_of_interest should consist of no more than five countries. Make sure that the spelling of your countries matches how they appear in the dataset.
b. In a single pipeline, filter your pivoted dataset to include only the countries_of_interest from part (a), and save the resulting data frame with a new name so you (1) can refer to this data frame later in your analysis and (2) do not overwrite the data frame you’re starting with. Use a short but informative name. Then, in a new pipeline, find the distinct() countries in the data frame you created.
The number of distinct countries in the filtered data frame you created in part (b) should equal the number of countries you chose in part (a). If it doesn’t, you might have misspelled a country name or made a mistake in filtering for these countries. Go back and correct your work.
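The check described above can be sketched on made-up data. All names here (toy_long, picked, toy_picked) and values are hypothetical, and using %in% inside filter() is one possible way to match against a vector, not something the assignment prescribes.

```r
library(tidyverse)

# Toy long-format data (hypothetical values)
toy_long <- tibble(
  country = c("A", "A", "B", "C"),
  year = c(2022, 2023, 2023, 2023),
  rate = c(1, 2, 3, 4)
)

# Hypothetical vector of names to keep
picked <- c("A", "C")

# Keep only rows whose country appears in the vector
toy_picked <- toy_long |>
  filter(country %in% picked)

# The number of distinct countries should match length(picked);
# a misspelled name in picked would make this count too small
toy_distinct <- toy_picked |>
  distinct(country)
```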
c. Using your data frame from the previous question, create a plot of annual inflation vs. year for these countries. Then, in a few sentences, describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of these countries’ economies.
Requirements for the plot:
- Data should be represented with points as well as lines connecting the points for each country.
- Each country should be represented by a different color line and different color and shape points.
- Axes and legend should be properly labeled.
- The plot should have an appropriate title (and optionally a subtitle).
- Plot should be customized in at least one way – you could use a different than default color scale, or different than default theme, or some other customization.
Do not forget to render, commit, and push regularly, after each substantial change to your document (e.g., after answering each question). Use succinct and informative commit messages. Make sure to commit and push all changed files so that your Git pane is empty afterward.
Questions 7-9
Inflation in the US.
The OECD defines inflation as follows:
Inflation is a rise in the general level of prices of goods and services that households acquire for the purpose of consumption in an economy over a period of time.
The main measure of inflation is the annual inflation rate which is the movement of the Consumer Price Index (CPI) from one month/period to the same month/period of the previous year expressed as percentage over time.
Source: OECD CPI FAQ
CPI is broken down into 12 expenditures such as food, housing, health, etc. Your goal in this part is to create another time series plot of annual inflation, this time for the US only.
The data you will need to create this visualization is spread across two files:
- us-inflation.csv: Annual inflation rate for the US for 12 CPI expenditures. Each expenditure is identified by an ID number.
- cpi-expenditures.csv: A “lookup table” of CPI expenditure ID numbers and their descriptions.
Let’s load both of these files.
Question 7
a. How many columns and how many rows does the us_inflation dataset have? What are the variables in it? Which years do these data span? Write a brief (1-2 sentences) narrative summarizing this information.
b. How many columns and how many rows does the cpi_expenditures dataset have? What are the variables in it? Write a brief (1-2 sentences) narrative summarizing this information.
c. Create a new dataset by joining the us_inflation dataset with the cpi_expenditures dataset.
Determine which type of join is the most appropriate one and use that.
Note that the two datasets don’t have a variable with a common name, though they do have variables that contain common information but are named differently. You will need to first figure out which variables those are, and then define the by argument and use the join_by() function to indicate these variables to join the datasets by.
Use a short but informative name for the joined dataset, and do not overwrite either of the datasets that go into creating it.
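Joining on differently named key variables can be sketched on made-up data as follows. The data frames toy_rates and toy_lookup and their column names are hypothetical, and left_join() appears only for illustration; it is not meant as the answer to which join type is most appropriate here.

```r
library(tidyverse)

# Toy data: the key is called item_id in one table and id in the other
toy_rates <- tibble(
  item_id = c(1, 2, 2),
  rate = c(0.5, 1.5, 2.5)
)

toy_lookup <- tibble(
  id = c(1, 2),
  label = c("Food", "Housing")
)

# join_by() names the key in the left table, then the key in the right table
toy_joined <- toy_rates |>
  left_join(toy_lookup, by = join_by(item_id == id))

dim(toy_joined)
names(toy_joined)
```

With an equality join like this one, the joined data frame keeps the left table's key name by default.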
Then, find the number of rows and columns of the resulting dataset and report the names of its columns. Add a brief (1-2 sentences) narrative summarizing this information.
Question 8
a. Create a vector called expenditures_of_interest which contains the descriptions or IDs of CPI expenditures you want to visualize. Your expenditures_of_interest should consist of no more than five expenditures. If you’re using descriptions, make sure that the spelling of your expenditures matches how they appear in the dataset. Then, in 1-2 sentences, state why you chose these expenditures.
Refer back to the guidance provided in Question 6 if you’re not sure how to create this vector.
b. In a single pipeline, filter your joined dataset to include only the expenditures_of_interest from part (a), and save the resulting data frame with a new name so you (1) can refer to this data frame later in your analysis and (2) do not overwrite the data frame you’re starting with. Use a short but informative name. Then, in a new pipeline, find the distinct() expenditures in the data frame you created.
Question 9
Using your data frame from the previous question, create a plot of annual inflation vs. year for these expenditures. Then, in a few sentences, describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of inflation rates in the US over the last decade.
Requirements for the plot:
- Data should be represented with points as well as lines connecting the points for each expenditure.
- Each expenditure should be represented by a different color line and different color and shape points.
- Axes and legend should be properly labeled.
- The plot should have an appropriate title (and optionally a subtitle).
- Plot should be customized in at least one way – you could use a different than default color scale, or different than default theme, or some other customization.
- If your legend has labels that are too long, you can try moving the legend to the bottom and stacking the labels vertically. Hint: The legend.position and legend.direction arguments of the theme() function will be useful.
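The legend hint above can be tried out on made-up data first. The data frame toy, its group labels, and its values are hypothetical.

```r
library(tidyverse)

# Toy data with long group labels (hypothetical values)
toy <- tibble(
  x = c(1, 2, 1, 2),
  y = c(1, 2, 2, 3),
  group = c(
    "A fairly long legend label",
    "A fairly long legend label",
    "Another fairly long legend label",
    "Another fairly long legend label"
  )
)

# Move the legend below the plot and stack its entries vertically
p <- ggplot(toy, aes(x = x, y = y, color = group, shape = group)) +
  geom_point() +
  geom_line() +
  theme(
    legend.position = "bottom",
    legend.direction = "vertical"
  )
```

Mapping both color and shape to the same variable produces a single combined legend.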
Question 10
Handling missing values after joins.
In class last week, we worked with the sales_taxes dataset and joined it to another dataset called us_regions so that we ended up with a data frame containing information on each state (plus DC), the sales tax rate in that state, and the region the state is in. While DC was in our first dataset, it was not in the second, resulting in an NA value for the region variable in the joined dataset. We discussed that there are three approaches we could take to address this issue before calculating the average sales tax of states in each region.
a. Read in the two datasets, sales-taxes-25.csv and us-regions.csv, which are both in your data folder. Then, join the two datasets using the appropriate type of join to be able to calculate the average sales tax rate by region, not overwriting either data frame that goes into the join.
b. Apply the three approaches we discussed in class to address the issue that DC has an NA value for region in the joined dataset, and then calculate the average sales tax of states in each region.
c. Choose one of the three approaches as the “most appropriate” approach and provide a brief (1-2 sentences) rationale for your choice.
You can refer back to AE 06: Sales taxes + data joining if you need a reminder of these data and how to join them.
Wrap-up
Before you wrap up the assignment, make sure that you render, commit, and push one final time so that the final versions of both your .qmd file and the rendered PDF are pushed to GitHub and your Git pane is empty. We will be checking these to make sure you have been practicing how to commit and push changes.
Submission
Submit your PDF document to Gradescope by the deadline to be considered “on time”:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials ➛ Duke NetID and log in using your NetID credentials.
- Click on your STA 199 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each question. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
Make sure you have:
- attempted all questions
- rendered your Quarto document
- committed and pushed everything to your GitHub repository such that the Git pane in RStudio is empty
- uploaded your PDF to Gradescope
Grading and feedback
- Questions 1-5 are not graded, but you should complete them to get practice.
- Questions 6-10 are graded, and you will receive feedback on Gradescope from the course instructional team within a week.
  - Questions will be graded for accuracy and completeness.
  - Partial credit will be given where appropriate.
  - There are also workflow points for:
    - committing at least three times as you work through your lab,
    - having your final version of .qmd and .pdf files in your GitHub repository, and
    - overall organization.