AE 05: Tidying Stat Sci

Goal

Our ultimate goal in this application exercise is to make the following data visualization.

Data

The data come from the Office of the University Registrar. They make the data available as a table that you can download as a PDF, but I’ve exported the data to a CSV file for you. Let’s load that in.

library(tidyverse)

statsci <- read_csv("data/statsci_clean.csv")

And let’s take a look at the data.

statsci
# A tibble: 4 × 16
  degree_type `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019`
  <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 AB2              0      1      0      0      4      4      1      0      0
2 AB               2      2      4      1      3      6      3      4      4
3 BS2              2      6      1      0      5      6      6      8      8
4 BS               5      9      4     13     10     17     24     21     26
# ℹ 6 more variables: `2020` <dbl>, `2021` <dbl>, `2022` <dbl>, `2023` <dbl>,
#   `2024` <dbl>, `2025` <dbl>

Pivoting

  • Demo: Pivot the statsci data frame longer such that:

    • Each row represents a degree type / year combination
    • year and number of graduates for that year are columns in the data frame
    • The resulting year column is numeric
statsci |>
  pivot_longer(
    cols = -degree_type,
    names_to = "year",
    values_to = "n",
    names_transform = as.numeric
  )
# A tibble: 60 × 3
   degree_type  year     n
   <chr>       <dbl> <dbl>
 1 AB2          2011     0
 2 AB2          2012     1
 3 AB2          2013     0
 4 AB2          2014     0
 5 AB2          2015     4
 6 AB2          2016     4
 7 AB2          2017     1
 8 AB2          2018     0
 9 AB2          2019     0
10 AB2          2020     1
# ℹ 50 more rows
  • Your turn: Now, repeat your code from above, but this time assign the result to a new object called statsci_longer.
statsci_longer <- statsci |>
  pivot_longer(
    cols = -degree_type,
    names_to = "year",
    names_transform = as.numeric,
    values_to = "n"
  )
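
A quick way to convince yourself the pivot did what you expected is to check the dimensions: 4 degree types times 15 year columns should give the 60 rows printed above. The sketch below runs the same check on a toy subset (the first three year columns from the printed output), so it runs without the CSV file. The names toy_wide and toy_longer are made up for illustration.

```r
library(tidyr)
library(tibble)

# Toy subset of the wide data: values copied from the 2011-2013 columns above
toy_wide <- tibble(
  degree_type = c("AB2", "AB", "BS2", "BS"),
  `2011` = c(0, 2, 2, 5),
  `2012` = c(1, 2, 6, 9),
  `2013` = c(0, 4, 1, 4)
)

toy_longer <- toy_wide |>
  pivot_longer(
    cols = -degree_type,
    names_to = "year",
    values_to = "n",
    names_transform = as.numeric
  )

# One row per degree type / year combination: 4 x 3 = 12 rows here
nrow(toy_longer) == nrow(toy_wide) * (ncol(toy_wide) - 1)

# The year column was converted from column names (text) to numbers
is.numeric(toy_longer$year)
```

Both checks should return TRUE; the same two lines can be adapted to statsci and statsci_longer.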

Plotting

  • Your turn: Now we will start making our plot, but let’s not get too fancy right away. Create the following plot, which will serve as the “first draft” on the way to our Goal. Do this by adding on to your pipeline from earlier.

Line plot of the number of Statistical Science majors over the years (2011 - 2025). Degree types represented are BS, BS2, AB, AB2. There is an increasing trend in BS degrees and somewhat steady trend in AB degrees.

statsci_longer |>
  ggplot(aes(x = year, y = n, color = degree_type)) +
  geom_point() +
  geom_line()

  • Question: Why was the pivot necessary in order to create this plot?

Add your response here!

  • Question: What aspects of the plot need to be updated to go from the draft you created above to the Goal plot at the beginning of this application exercise?

Add your response here.

  • Demo: Update x-axis scale such that the years displayed go from 2011 to 2025 in increments of 2 years. Do this by adding on to your pipeline from earlier.
statsci_longer |>
  ggplot(aes(x = year, y = n, color = degree_type)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(2011, 2025, 2))
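
If you are unsure which tick marks the breaks argument will produce, you can run the seq() call on its own first. This is just base R; the name year_breaks is made up for illustration.

```r
# Every second year from 2011 up to and including 2025
year_breaks <- seq(2011, 2025, by = 2)

year_breaks
# 2011 2013 2015 2017 2019 2021 2023 2025

length(year_breaks)
# 8
```

Because 2025 - 2011 is a multiple of 2, the last break lands exactly on 2025.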

  • Demo: Update line colors using the following level / color assignments. Once again, do this by adding on to your pipeline from earlier.
    • “BS” = “cadetblue4”

    • “BS2” = “cadetblue3”

    • “AB” = “lightgoldenrod4”

    • “AB2” = “lightgoldenrod3”
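
One optional way to organize these assignments is to store them in a named vector before plotting. The sketch below does that and checks two things: that every degree type has a color, and that each color name is one R recognizes. The names degree_colors and degree_types are made up for illustration.

```r
# Level / color assignments as a named character vector
degree_colors <- c(
  "BS"  = "cadetblue4",
  "BS2" = "cadetblue3",
  "AB"  = "lightgoldenrod4",
  "AB2" = "lightgoldenrod3"
)

# Degree types as they appear in the data
degree_types <- c("AB2", "AB", "BS2", "BS")

# Any degree types without a color? (empty result means none are missing)
setdiff(degree_types, names(degree_colors))

# Are all of these built-in R color names?
all(degree_colors %in% colors())
```

When the vector is named, scale_color_manual() matches colors to levels by name, so the order in which you list them does not matter.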

statsci_longer |>
  ggplot(aes(x = year, y = n, color = degree_type)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(2011, 2025, 2)) +
  scale_color_manual(
    values = c("BS" = "cadetblue4", 
               "BS2" = "cadetblue3", 
               "AB" = "lightgoldenrod4", 
               "AB2" = "lightgoldenrod3"))

  • Your turn: Update the plot labels (title, subtitle, x, y, and caption) and use theme_minimal(). Once again, do this by adding on to your pipeline from earlier.
statsci_longer |>
  ggplot(aes(x = year, y = n, color = degree_type)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(2011, 2025, 2)) +
  scale_color_manual(
    values = c("BS" = "cadetblue4",
               "BS2" = "cadetblue3",
               "AB" = "lightgoldenrod4",
               "AB2" = "lightgoldenrod3")) +
  labs(
    x = "Graduation year",
    y = "Number of majors graduating",
    color = "Degree type",
    title = "Statistical Science majors over the years",
    subtitle = "Academic years 2011 - 2025",
    caption = "Source: Office of the University Registrar\nhttps://registrar.duke.edu/registration/enrollment-statistics"
  ) +
  theme_minimal()

  • Demo: Move the legend into the plot, make its background white, and its border gray.
statsci_longer |>
  ggplot(aes(x = year, y = n, color = degree_type)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(2011, 2025, 2)) +
  scale_color_manual(
    values = c("BS" = "cadetblue4",
               "BS2" = "cadetblue3",
               "AB" = "lightgoldenrod4",
               "AB2" = "lightgoldenrod3")) +
  labs(
    x = "Graduation year",
    y = "Number of majors graduating",
    color = "Degree type",
    title = "Statistical Science majors over the years",
    subtitle = "Academic years 2011 - 2025",
    caption = "Source: Office of the University Registrar\nhttps://registrar.duke.edu/registration/enrollment-statistics"
  ) +
  theme_minimal() +
  theme(
    legend.position = "inside",
    legend.position.inside = c(0.1, 0.7),
    legend.background = element_rect(fill = "white", color = "grey")
  )
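
A note on versions: the legend.position = "inside" and legend.position.inside arguments were introduced in ggplot2 3.5.0. If you are running an older version, the coordinates go directly into legend.position instead. The sketch below picks the right form based on the installed version; legend_coords and legend_theme are made-up names, and you would add legend_theme to a plot with + like any other theme() call.

```r
library(ggplot2)

# Relative position of the legend, as fractions between 0 and 1
legend_coords <- c(0.1, 0.7)

# Choose the syntax that matches the installed ggplot2 version
legend_theme <- if (packageVersion("ggplot2") >= "3.5.0") {
  theme(
    legend.position = "inside",
    legend.position.inside = legend_coords
  )
} else {
  theme(legend.position = legend_coords)
}
```

In this exercise you can simply use the code from the demo; the check above is only useful if your code needs to run on machines with different ggplot2 versions.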

  • Demo: Finally, set fig-width: 7 and fig-height: 5 for your plot in the chunk options.
#| fig-width: 7
#| fig-height: 5
statsci_longer |>
  ggplot(aes(x = year, y = n, color = degree_type)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(2011, 2025, 2)) +
  scale_color_manual(
    values = c("BS" = "cadetblue4",
               "BS2" = "cadetblue3",
               "AB" = "lightgoldenrod4",
               "AB2" = "lightgoldenrod3")) +
  labs(
    x = "Graduation year",
    y = "Number of majors graduating",
    color = "Degree type",
    title = "Statistical Science majors over the years",
    subtitle = "Academic years 2011 - 2025",
    caption = "Source: Office of the University Registrar\nhttps://registrar.duke.edu/registration/enrollment-statistics"
  ) +
  theme_minimal() +
  theme(
    legend.position = "inside",
    legend.position.inside = c(0.1, 0.7),
    legend.background = element_rect(fill = "white", color = "grey")
  )

Let’s now pivot wider!

  • Demo: Just like you can pivot longer, you can pivot wider. Let’s convert our longer data frame back into the wider one in a single pipeline.
statsci_wider <- statsci_longer |>
  pivot_wider(
    names_from = "year", 
    values_from = "n"
  )
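
To see that pivoting wider really undoes pivoting longer, you can compare the round-tripped data frame with the original. The sketch below does this on a toy subset (the first three year columns printed earlier) so that it runs without the CSV file. The names toy_wide and toy_round_trip are made up for illustration.

```r
library(tidyr)
library(tibble)

# Toy subset of the wide data: values copied from the 2011-2013 columns
toy_wide <- tibble(
  degree_type = c("AB2", "AB", "BS2", "BS"),
  `2011` = c(0, 2, 2, 5),
  `2012` = c(1, 2, 6, 9),
  `2013` = c(0, 4, 1, 4)
)

# Pivot longer, then straight back to wide, in a single pipeline
toy_round_trip <- toy_wide |>
  pivot_longer(
    cols = -degree_type,
    names_to = "year",
    values_to = "n",
    names_transform = as.numeric
  ) |>
  pivot_wider(
    names_from = "year",
    values_from = "n"
  )

# Same column names, in the same order, as the original
names(toy_round_trip)

# Same contents as the original
all.equal(as.data.frame(toy_round_trip), as.data.frame(toy_wide))
```

The same comparison can be adapted to statsci and statsci_wider.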