Exam 1 review

Lecture 11

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2025

September 30, 2025

Warm-up

While you wait: Participate 📱💻

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.

Announcements

  • Cheat sheet: 8.5" x 11", both sides, handwritten or typed, any content you want, must be prepared by you

  • Bring a pencil and eraser (you’re allowed to use a pen, but you might not want to)

  • Reminder: Academic dishonesty / Duke Community Standard

From last time

Finish up: ae-08-durham-climate-factors

  • Go to your ae project in RStudio.

  • Open ae-08-durham-climate-factors.qmd and pick up at “Pivot”.

Joins

Setup

library(tidyverse) # provides tribble(), the join functions, if_else(), ggplot(), and fct_*()

students <- tribble(
  ~student_id, ~name,  ~major,
  "S001",      "Abby", "History",
  "S002",      "Jinu", "Mathematics",
  "S003",      "Mira", "Political Science",
  "S004",      "Rumi", "Statistical Science",
  "S005",      "Zoey", "Computer Science"
)
enrollments <- tribble(
  ~sid,   ~course,
  "S003", "POLSCI 175",
  "S003", "STA 199",
  "S003", "RELIGION 228",
  "S004", "CS 201",
  "S004", "STA 240",
  "S004", "STA 221",
  "S004", "THEATRST 202",
  "S005", "CS 201",
  "S005", "STA 199",
  "S005", "RELIGION 228",
  "S005", "THEATRST 202"
)

What type of join?

Which type of join would you use to list the courses each student is enrolled in, keeping all students?

students |>
  left_join(enrollments, by = join_by(student_id == sid))
# A tibble: 13 × 4
   student_id name  major               course      
   <chr>      <chr> <chr>               <chr>       
 1 S001       Abby  History             <NA>        
 2 S002       Jinu  Mathematics         <NA>        
 3 S003       Mira  Political Science   POLSCI 175  
 4 S003       Mira  Political Science   STA 199     
 5 S003       Mira  Political Science   RELIGION 228
 6 S004       Rumi  Statistical Science CS 201      
 7 S004       Rumi  Statistical Science STA 240     
 8 S004       Rumi  Statistical Science STA 221     
 9 S004       Rumi  Statistical Science THEATRST 202
10 S005       Zoey  Computer Science    CS 201      
11 S005       Zoey  Computer Science    STA 199     
12 S005       Zoey  Computer Science    RELIGION 228
13 S005       Zoey  Computer Science    THEATRST 202

What type of join?

Which type of join would you use to find the students for whom we have enrollment information?

students |>
  inner_join(enrollments, by = join_by(student_id == sid))
# A tibble: 11 × 4
   student_id name  major               course      
   <chr>      <chr> <chr>               <chr>       
 1 S003       Mira  Political Science   POLSCI 175  
 2 S003       Mira  Political Science   STA 199     
 3 S003       Mira  Political Science   RELIGION 228
 4 S004       Rumi  Statistical Science CS 201      
 5 S004       Rumi  Statistical Science STA 240     
 6 S004       Rumi  Statistical Science STA 221     
 7 S004       Rumi  Statistical Science THEATRST 202
 8 S005       Zoey  Computer Science    CS 201      
 9 S005       Zoey  Computer Science    STA 199     
10 S005       Zoey  Computer Science    RELIGION 228
11 S005       Zoey  Computer Science    THEATRST 202

What type of join?

Which type of join would you use to find the students for whom we have no enrollment information?

students |>
  anti_join(enrollments, by = join_by(student_id == sid))
# A tibble: 2 × 3
  student_id name  major      
  <chr>      <chr> <chr>      
1 S001       Abby  History    
2 S002       Jinu  Mathematics
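
One way to sanity-check the three joins above is to compare row counts. The sketch below uses smaller, hypothetical stand-ins (`students_mini`, `enrollments_mini`) rather than the tibbles from the slides. It also shows `semi_join()`, a filtering join that may be handy if you only want each matched student once instead of once per course.

```r
library(dplyr)
library(tibble)

# Hypothetical, smaller stand-ins for `students` and `enrollments`
students_mini <- tibble(
  student_id = c("S001", "S003", "S005"),
  name = c("Abby", "Mira", "Zoey")
)
enrollments_mini <- tibble(
  sid = c("S003", "S003", "S005"),
  course = c("POLSCI 175", "STA 199", "CS 201")
)

by_id <- join_by(student_id == sid)

left <- left_join(students_mini, enrollments_mini, by = by_id)   # all students
inner <- inner_join(students_mini, enrollments_mini, by = by_id) # matched rows only
anti <- anti_join(students_mini, enrollments_mini, by = by_id)   # unmatched students
semi <- semi_join(students_mini, enrollments_mini, by = by_id)   # matched students, once each

nrow(left)  # 4: one NA row for S001, two rows for S003, one for S005
nrow(inner) # 3
nrow(anti)  # 1
nrow(semi)  # 2

# Every left-join row is either a matched row or an unmatched student
nrow(left) == nrow(inner) + nrow(anti) # TRUE
```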

if_else() / case_when()

Collecting data

Suppose you conduct a survey and ask students for their student ID number and the number of credits they’re taking this semester. What is the type of each variable?

survey_raw <- tibble(
  student_id = c(273674, 298765, 287129, "I don't remember"),
  n_credits = c(4, 4.5, "I'm not sure yet", "2 - underloading")
)
survey_raw
# A tibble: 4 × 2
  student_id       n_credits       
  <chr>            <chr>           
1 273674           4               
2 298765           4.5             
3 287129           I'm not sure yet
4 I don't remember 2 - underloading

Cleaning data

survey <- survey_raw |>
  mutate(
    student_id = if_else(student_id == "I don't remember", NA, student_id),
    n_credits = case_when(
      n_credits == "I'm not sure yet" ~ NA,
      n_credits == "2 - underloading" ~ "2",
      .default = n_credits
    ),
    n_credits = as.numeric(n_credits)
  )
survey
# A tibble: 4 × 2
  student_id n_credits
  <chr>          <dbl>
1 273674           4  
2 298765           4.5
3 287129          NA  
4 <NA>             2  

Type coercion

  • If the values collected for a variable are of more than one type (e.g., numbers and text), R will coerce them into a single type, which may or may not be what you want.

  • If R’s default is not what you want, use explicit coercion functions like as.numeric(), as.character(), etc. to convert to the type you need. This generally also involves cleaning up the values that caused the unwanted implicit coercion in the first place.
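
A minimal base R sketch of both kinds of coercion. The vector `mixed` is a made-up example modeled on the survey responses above.

```r
# Implicit coercion: mixing numbers and text in c() yields a character vector
mixed <- c(4, 4.5, "2 - underloading")
class(mixed) # "character"

# Explicit coercion without cleaning first: text that is not a number becomes NA
# (R also emits a warning, suppressed here to keep the output quiet)
uncleaned <- suppressWarnings(as.numeric(mixed))
uncleaned # 4.0 4.5 NA

# Cleaning first, then coercing, keeps the information
cleaned <- as.numeric(ifelse(mixed == "2 - underloading", "2", mixed))
cleaned # 4.0 4.5 2.0

# Coercion also happens between other types: logicals become 0/1 next to numbers
c(TRUE, FALSE, 3) # 1 0 3
```

This is why the cleaning step above recodes "2 - underloading" to "2" before calling as.numeric().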

Aesthetic mappings

openintro::loan50

library(openintro)
library(ggthemes)
loan50 |>
  select(annual_income, interest_rate, homeownership)
# A tibble: 50 × 3
   annual_income interest_rate homeownership
           <dbl>         <dbl> <fct>        
 1         59000         10.9  rent         
 2         60000          9.92 rent         
 3         75000         26.3  mortgage     
 4         75000          9.92 rent         
 5        254000          9.43 mortgage     
 6         67000          9.92 mortgage     
 7         28800         17.1  rent         
 8         80000          6.08 mortgage     
 9         34000          7.97 rent         
10         80000         12.6  mortgage     
# ℹ 40 more rows

Aesthetic mappings

What will the following code result in?

ggplot(
  loan50,
  aes(
    x = annual_income,
    y = interest_rate,
    color = homeownership,
    shape = homeownership
  )
) +
  geom_point() +
  scale_color_colorblind()

Global mappings

What will the following code result in?

ggplot(
  loan50,
  aes(
    x = annual_income,
    y = interest_rate,
    color = homeownership,
    shape = homeownership
  )
) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'

Local mappings

What will the following code result in?

ggplot(
  loan50,
  aes(x = annual_income, y = interest_rate)
) +
  geom_point(aes(color = homeownership)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'

Mapping vs. setting

What will the following code result in?

ggplot(
  loan50,
  aes(x = annual_income, y = interest_rate)
) +
  geom_point(aes(color = homeownership)) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'

Recap: Aesthetic mappings

  • Aesthetic mappings defined at the global level (in ggplot()) are used by all geoms for which the aesthetic is defined.

  • Aesthetic mappings defined at the local level (in a geom_*()) are used only by the geom they’re defined in.
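
One way to check this without looking at the plot is to count how many lines geom_smooth() fits in each case. The sketch below assumes the openintro package is installed and uses ggplot2’s layer_data() to pull out the data computed for the second layer. The object names `p_global` and `p_local` are introduced here for illustration.

```r
library(ggplot2)
library(openintro)

# Global mapping: color is declared in ggplot(), so every layer inherits it
p_global <- ggplot(
  loan50,
  aes(x = annual_income, y = interest_rate, color = homeownership)
) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Local mapping: color is declared only inside geom_point()
p_local <- ggplot(loan50, aes(x = annual_income, y = interest_rate)) +
  geom_point(aes(color = homeownership)) +
  geom_smooth(method = "lm", se = FALSE)

# Layer 2 is geom_smooth(); count the groups it fitted a line for
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local <- length(unique(layer_data(p_local, 2)$group))

n_lines_global # one line per homeownership level
n_lines_local  # a single line through all points
```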

Aside: Legends

ggplot(
  loan50,
  aes(
    x = annual_income,
    y = interest_rate,
    color = homeownership,
    shape = homeownership
  )
) +
  geom_point() +
  scale_color_colorblind()

Aside: Legends

ggplot(
  loan50,
  aes(
    x = annual_income,
    y = interest_rate,
    color = homeownership,
    shape = homeownership
  )
) +
  geom_point() +
  scale_color_colorblind() +
  labs(color = "Home ownership")

Aside: Legends

ggplot(
  loan50,
  aes(
    x = annual_income,
    y = interest_rate,
    color = homeownership,
    shape = homeownership
  )
) +
  geom_point() +
  scale_color_colorblind() +
  labs(
    color = "Home ownership",
    shape = "Home ownership"
  )

Factors

Factors

  • Factors are used for categorical variables – variables that have a fixed and known set of possible values.

  • They are also useful when you want to display character vectors in a non-alphabetical order.
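
A small base R sketch of both points, using a made-up character vector `tenure`: by default the levels are the sorted unique values, and supplying `levels` fixes both the set of allowed values and their display order.

```r
tenure <- c("rent", "mortgage", "own", "rent")

# Default: levels are the sorted unique values
levels(factor(tenure)) # "mortgage" "own" "rent"

# Custom order: tables and plots follow the level order, not the alphabet
tenure_fct <- factor(tenure, levels = c("own", "rent", "mortgage"))
levels(tenure_fct) # "own" "rent" "mortgage"
table(tenure_fct)  # counts appear in level order: own 1, rent 2, mortgage 1

# Values outside the fixed set of levels become NA
factor(c("rent", "lease"), levels = c("own", "rent", "mortgage")) # rent <NA>
```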

Bar plot

ggplot(loan50, aes(x = homeownership)) +
  geom_bar()

Bar plot - reordered

loan50 |>
  mutate(
    homeownership = fct_relevel(homeownership, "mortgage", "rent", "own")
  ) |>
  ggplot(aes(x = homeownership)) +
  geom_bar()

Frequency table

loan50 |>
  count(homeownership)
# A tibble: 3 × 2
  homeownership     n
  <fct>         <int>
1 rent             21
2 mortgage         26
3 own               3

Frequency table - reordered

loan50 |>
  mutate(
    homeownership = fct_relevel(homeownership, "own", "rent", "mortgage")
  ) |>
  count(homeownership)
# A tibble: 3 × 2
  homeownership     n
  <fct>         <int>
1 own               3
2 rent             21
3 mortgage         26

Under the hood

class(loan50$homeownership)
[1] "factor"
typeof(loan50$homeownership)
[1] "integer"
levels(loan50$homeownership)
[1] "rent"     "mortgage" "own"     

Recap: Factors

  • The forcats package has a bunch of functions (that start with fct_*()) for dealing with factors and their levels: https://forcats.tidyverse.org/reference/index.html

  • Factors and the order of their levels are relevant for displays (tables, plots) and they’ll be relevant for modeling (later in the course)

  • factor is a data class
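
As a brief illustration of the forcats helpers, the sketch below uses a made-up factor `tenure`. fct_relevel() appears in the slides above; fct_infreq() and fct_rev() are other forcats functions that were not used there and are included only as examples of what the reference page lists.

```r
library(forcats)

tenure <- factor(c("rent", "mortgage", "mortgage", "own", "rent", "mortgage"))

levels(tenure)                     # "mortgage" "own" "rent" (alphabetical default)
levels(fct_relevel(tenure, "own")) # "own" "mortgage" "rent" (move "own" to the front)
levels(fct_infreq(tenure))         # "mortgage" "rent" "own" (most frequent first)
levels(fct_rev(tenure))            # "rent" "own" "mortgage" (reverse current order)

# A factor is a class layered on top of integer codes
class(tenure)  # "factor"
typeof(tenure) # "integer"
```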

Aside: ==

loan50 |>
  mutate(
    homeownership_new = if_else(
      homeownership == "rent",
      "don't own",
      homeownership
    )
  ) |>
  distinct(homeownership, homeownership_new)
# A tibble: 3 × 2
  homeownership homeownership_new
  <fct>         <chr>            
1 rent          don't own        
2 mortgage      mortgage         
3 own           own              

Aside: |

loan50 |>
  mutate(
    homeownership_new = if_else(
      homeownership == "rent" | homeownership == "mortgage",
      "don't own",
      homeownership
    )
  ) |>
  distinct(homeownership, homeownership_new)
# A tibble: 3 × 2
  homeownership homeownership_new
  <fct>         <chr>            
1 rent          don't own        
2 mortgage      don't own        
3 own           own              

Aside: %in%

loan50 |>
  mutate(
    homeownership_new = if_else(
      homeownership %in% c("rent", "mortgage"),
      "don't own",
      homeownership
    )
  ) |>
  distinct(homeownership, homeownership_new)
# A tibble: 3 × 2
  homeownership homeownership_new
  <fct>         <chr>            
1 rent          don't own        
2 mortgage      don't own        
3 own           own              
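
A common slip when moving from `==` to several values is to write `x == c("rent", "mortgage")`. The base R sketch below, on a made-up vector `tenure`, shows why that is not equivalent: `==` compares element by element and recycles the shorter vector, while `%in%` checks membership.

```r
tenure <- c("mortgage", "rent", "own", "rent")

# Membership test: is each value one of "rent" or "mortgage"?
tenure %in% c("rent", "mortgage") # TRUE TRUE FALSE TRUE

# Equivalent, spelled out with |
tenure == "rent" | tenure == "mortgage" # TRUE TRUE FALSE TRUE

# Not equivalent: position 1 is compared to "rent", position 2 to "mortgage",
# position 3 to "rent" again, and so on
tenure == c("rent", "mortgage") # FALSE FALSE FALSE FALSE
```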

Other questions?