Evaluating models

Lecture 22

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2025

November 13, 2025

Warm-up

While you wait: Participate 📱💻

What is sensitivity also known as?

  • True positive rate
  • True negative rate
  • False positive rate
  • False negative rate
  • Recall

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.

Announcements

  • Projects due tonight, peer evals due tomorrow night

  • Practice Exam 2 is posted on the course website

  • Reply to the post on Ed with your requests for topics / concepts for exam review [thread]

From last class: Participate 📱💻

Which of the following best describes the area annotated on the ROC curve?

  • Where all positives are classified as positive and all negatives are classified as negative
  • Where true positive rate = false positive rate
  • Where all positives are classified as negative and all negatives are classified as positive

ROC curve

Which corner of the plot indicates the best model performance?
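
As a reminder of what an ROC curve plots, here is a minimal sketch that computes the true positive rate and the false positive rate at a handful of thresholds. The `predictions` data frame and its values are made up for illustration; they are not course data.

```r
library(tidyverse)

# Hypothetical predictions: true class and predicted probability of "yes"
predictions <- tibble(
  truth = c("yes", "yes", "yes", "yes", "no", "no", "no", "no"),
  prob_yes = c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1)
)

# Predicted probabilities, separated by true class
prob_pos <- predictions |> filter(truth == "yes") |> pull(prob_yes)
prob_neg <- predictions |> filter(truth == "no") |> pull(prob_yes)

# For each threshold, classify as "yes" when prob_yes >= threshold, then
# compute the share of true positives and true negatives classified as "yes"
roc_points <- tibble(threshold = c(1, 0.75, 0.5, 0.25, 0)) |>
  mutate(
    tpr = map_dbl(threshold, \(thr) mean(prob_pos >= thr)),
    fpr = map_dbl(threshold, \(thr) mean(prob_neg >= thr))
  )

roc_points
```

Each row of `roc_points` is one point on the curve, with `fpr` on the x-axis and `tpr` on the y-axis.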

Next steps

  • Fit models on training data

  • Make predictions on testing data

  • Evaluate predictions on testing data:

    • Linear models: R-squared, adjusted R-squared, RMSE (root mean squared error), etc.
    • Logistic models: False negative and positive rates, AUC (area under the curve), etc.
  • Make decisions based on model predictive performance, validity across various testing/training splits (aka “cross-validation”), and explainability
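
The steps above can be sketched as follows for a logistic model. This is a minimal illustration under stated assumptions: the data are simulated, the 0.5 cutoff is an arbitrary choice, and base R's `glm()` stands in for whatever modeling workflow you use in the application exercise.

```r
library(tidyverse)

set.seed(199)

# Simulated (hypothetical) data: binary outcome y depends on predictor x
n <- 400
sim <- tibble(
  x = rnorm(n),
  y = rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))
)

# Split into training (75%) and testing (25%) sets
train_rows <- sample(n, size = 0.75 * n)
train <- sim |> slice(train_rows)
test <- sim |> slice(-train_rows)

# Fit a logistic regression model on the training data
fit <- glm(y ~ x, data = train, family = binomial)

# Predict on the testing data; classify as 1 when the probability exceeds 0.5
test_pred <- test |>
  mutate(
    prob = predict(fit, newdata = test, type = "response"),
    pred = if_else(prob > 0.5, 1, 0)
  )

# False positive rate: share of actual 0s predicted as 1
# False negative rate: share of actual 1s predicted as 0
error_rates <- test_pred |>
  summarize(
    false_positive_rate = mean(pred[y == 0] == 1),
    false_negative_rate = mean(pred[y == 1] == 0)
  )

error_rates
```

Changing the cutoff trades one error rate against the other, which is what the ROC curve summarizes across all cutoffs.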

Note

We will only learn about a subset of these in this course, but you can go further into these ideas in STA 210 or STA 221 as well as in various machine learning courses.

ae-14-chicago-taxi-classification

  • Go to your ae project in RStudio.

  • If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • If you haven’t yet done so, click Pull to get today’s application exercise file: ae-14-chicago-taxi-classification.qmd.

  • Work through the application exercise in class, and render, commit, and push your edits.

Recap

  • Split data into training and testing sets (generally 75/25)

  • Fit models on training data and reduce to a few candidate models

  • Make predictions on testing data

  • Evaluate predictions on testing data using appropriate predictive performance metrics

    • Linear models: Adjusted R-squared, AIC, etc.
    • Logistic models: False negative and positive rates, AUC (area under the curve), etc.
  • Don’t forget to also consider explainability and domain knowledge when selecting a final model

  • In a future machine learning course: Cross-validation (partitioning training data into training and validation sets and repeating this many times to evaluate model predictive performance before using the testing data), feature engineering, hyperparameter tuning, more complex models (random forests, gradient boosting machines, neural networks, etc.)
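
To make the cross-validation idea concrete, here is a rough sketch of 5-fold cross-validation for a linear model, done by hand. The data are simulated, and the names `fold`, `analysis`, and `assessment` are choices made for this illustration; dedicated packages automate these steps.

```r
library(tidyverse)

set.seed(199)

# Simulated (hypothetical) training data for a linear model
train <- tibble(
  x = runif(100, min = 0, max = 10),
  y = 2 + 3 * x + rnorm(100, sd = 2)
)

# Randomly assign each training row to one of k folds
k <- 5
train <- train |>
  mutate(fold = sample(rep(1:k, length.out = n())))

# For each fold: fit on the other folds, then compute RMSE on the held-out fold
cv_rmse <- map_dbl(1:k, \(i) {
  analysis <- train |> filter(fold != i)
  assessment <- train |> filter(fold == i)
  fit <- lm(y ~ x, data = analysis)
  predicted <- predict(fit, newdata = assessment)
  sqrt(mean((assessment$y - predicted)^2))
})

# One RMSE per fold, and their average
cv_rmse
mean(cv_rmse)
```

The testing data are never touched here; they stay in reserve for the final evaluation.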

[Time permitting] The 🐘 in the room: ChatGPT (and friends)

Take a pulse

How are you using ChatGPT or similar AI tools for this class?

Example 1 - Similar to Lab 4

SurveyUSA polled 900 NC adults between September 4-7, 2024. Of the 900 NC adults, 771 were identified by SurveyUSA as being registered to vote. The following question was asked to these 771 adults: “Should the use of marijuana for medical use remain against the law in North Carolina? Or be legalized?”

Of the 771 responses, 391 were between the ages of 18-49. Of the individuals that are between 18-49, 59 individuals responded that they think medical marijuana should remain against the law, 292 said it should be made legal, and the remainder were not sure. Of the individuals that are 50+, 67 individuals responded that they think medical marijuana should remain against the law, 245 said it should be made legal, and the remainder were not sure.

Calculate the proportions of 18-49 year-olds and 50+ year-olds in this sample. Make sure the result is an ungrouped data frame with a column for relevant counts, a column for relevant proportions, and a column for the groups you’re interested in.

ChatGPT - Take 1 prompt

What is wrong with this prompt? How would you correct it?

SurveyUSA polled 900 NC adults between September 4-7, 2024. Of the 900 NC adults, 771 were identified by SurveyUSA as being registered to vote. The following question was asked to these 771 adults: “Should the use of marijuana for medical use remain against the law in North Carolina? Or be legalized?”

Of the 771 responses, 391 were between the ages of 18-49. Of the individuals that are between 18-49, 59 individuals responded that they think medical marijuana should remain against the law, 292 said it should be made legal, and the remainder were not sure. Of the individuals that are 50+, 67 individuals responded that they think medical marijuana should remain against the law, 245 said it should be made legal, and the remainder were not sure.

Calculate the proportions of 18-49 year olds and 50+ year-olds in this sample. Make sure the result is an ungrouped data frame with a column for relevant counts, a column for relevant proportions, and a column for the groups you’re interested in.

ChatGPT - Take 1 response

Identify at least one mistake or code smell/style issue with the code below.

library(tidyverse)

# Given data
total_registered <- 771

# Responses by age group
responses <- tribble(
  ~age_group , ~against , ~legalize , ~not_sure                ,
  "18-49"    ,       59 ,       292 , 391 - (59 + 292)         ,
  "50+"      ,       67 ,       245 , (771 - 391) - (67 + 245)
)

# Calculate total per group
responses <- responses %>%
  mutate(total = against + legalize + not_sure)

# Then calculate proportions of each age group in the total sample
age_props <- responses %>%
  summarise(
    total_18_49 = sum(total[age_group == "18-49"]),
    total_50_plus = sum(total[age_group == "50+"])
  ) %>%
  pivot_longer(everything(), names_to = "age_group", values_to = "count") %>%
  mutate(
    age_group = recode(
      age_group,
      "total_18_49" = "18-49",
      "total_50_plus" = "50+"
    ),
    proportion = count / total_registered
  )

# Show both tables
list(
  responses_by_age = responses,
  proportions_by_age = age_props
)
$responses_by_age
# A tibble: 2 × 5
  age_group against legalize not_sure total
  <chr>       <dbl>    <dbl>    <dbl> <dbl>
1 18-49          59      292       40   391
2 50+            67      245       68   380

$proportions_by_age
# A tibble: 2 × 3
  age_group count proportion
  <chr>     <dbl>      <dbl>
1 18-49       391      0.507
2 50+         380      0.493
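
For comparison, here is one possible shorter version that sticks to the age-group counts the question asks about. It is a sketch rather than an official solution; the column names `n` and `prop` are choices made for this illustration.

```r
library(tidyverse)

# Age-group counts taken from the problem statement:
# 391 of the 771 respondents are 18-49, so the rest are 50+
age_props <- tibble(
  age_group = c("18-49", "50+"),
  n = c(391, 771 - 391)
) |>
  mutate(prop = n / sum(n))

age_props
```

The result is a single ungrouped data frame with one column each for the group, the count, and the proportion, written with the base pipe `|>`.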

Let’s follow along with ChatGPT…

https://chatgpt.com/share/e/690a259e-d0ac-8005-a3d8-0c1f80d22e7c

Example 2

Each CSI franchise includes a number of series, each with the character string "CSI" (in capital letters) in their title. Using this information, create a new data frame called csi that only has titles from the CSI-verse and all of the columns from tv for these titles. Then, display the first 10 rows of csi. Note: The titles and seasons included in the dataset may not be all CSI titles and titles that have ever aired.

ChatGPT - Take 1 prompt

Each CSI franchise includes a number of series, each with the character string "CSI" (in capital letters) in their title. Using this information, create a new data frame called csi that only has titles from the CSI-verse and all of the columns from tv for these titles. Then, display the first 10 rows of csi. Note: The titles and seasons included in the dataset may not be all CSI titles and titles that have ever aired.

ChatGPT - Take 1 response

Here’s a short, safe solution. It keeps all columns from tv, selects only rows whose title contains the exact uppercase string “CSI”, stores that in csi, then prints the first 10 rows.

# Base R solution (case-sensitive match for "CSI")
csi <- tv[grepl("CSI", tv$title, fixed = TRUE), ]

# Show the first 10 rows
head(csi, 10)

If you prefer dplyr:

library(dplyr)

csi <- tv %>%
  filter(grepl("CSI", title, fixed = TRUE))

csi %>% slice_head(n = 10)

Notes: The code above searches for the exact uppercase substring "CSI" (per your instructions). If you want a case-insensitive match (e.g., to also catch "csi"), use grepl("CSI", tv$title, ignore.case = TRUE) or fixed = FALSE with ignore.case = TRUE.

ChatGPT - Take 1 code

What is wrong with the following code?

tv <- read_csv("data/tv.csv")
library(dplyr)

csi <- tv %>%
  filter(grepl("CSI", title, fixed = TRUE))

csi %>% slice_head(n = 10)
# A tibble: 10 × 9
   season title     year month   day av_rating genre_1 genre_2 genre_3
    <dbl> <chr>    <dbl> <dbl> <dbl>     <dbl> <chr>   <chr>   <chr>  
 1      1 CSI: Cr…  2001     1    20      8.32 Crime   Drama   Mystery
 2      2 CSI: Cr…  2002     1    10      8.26 Crime   Drama   Mystery
 3      3 CSI: Cr…  2003     1    15      8.30 Crime   Drama   Mystery
 4      4 CSI: Cr…  2004     1    18      8.33 Crime   Drama   Mystery
 5      5 CSI: Cr…  2005     1    24      8.38 Crime   Drama   Mystery
 6      6 CSI: Cr…  2006     1    16      8.21 Crime   Drama   Mystery
 7      7 CSI: Cr…  2007     1    14      8.43 Crime   Drama   Mystery
 8      8 CSI: Cr…  2008     1     7      8.07 Crime   Drama   Mystery
 9      9 CSI: Cr…  2009     1    27      7.80 Crime   Drama   Mystery
10     10 CSI: Cr…  2010     1    23      7.69 Crime   Drama   Mystery
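
One possible revision, sketched under assumptions: `read_csv()` comes from readr, which `library(dplyr)` alone does not load, so the tidyverse is loaded first. The small `tv` data frame below is a made-up stand-in so the example runs on its own; in the real exercise `tv` would be read from data/tv.csv after loading the packages.

```r
library(tidyverse)

# Hypothetical stand-in for tv, with made-up rows and ratings
tv <- tibble(
  season = c(1, 2, 1, 1),
  title = c(
    "CSI: Crime Scene Investigation",
    "CSI: Miami",
    "Friends",
    "Csi parody"
  ),
  av_rating = c(8.0, 7.5, 8.5, 6.0)
)

# Keep rows whose title contains "CSI" (str_detect() is case-sensitive by default)
csi <- tv |>
  filter(str_detect(title, "CSI"))

# Display the first 10 rows (fewer if csi has fewer rows)
csi |>
  slice_head(n = 10)
```

This version uses the base pipe `|>` and `str_detect()` from stringr in place of `%>%` and `grepl()`, and it keeps all columns of `tv`.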

Let’s follow along with ChatGPT…

https://chatgpt.com/share/e/690a2790-6a9c-8005-b6db-ca8e37607244

Guidelines and best practices for using ChatGPT

  • Do not just copy-paste the prompt – for appropriate academic conduct, for your own learning, and for getting to better results faster
  • Engineer the prompt until the response starts to look like code you’re learning in the course
  • If the response is not correct, ask for a correction
  • If the response doesn’t follow the guidelines, ask for a correction
  • When you ask for a correction, be ready for it to “correct” (i.e., change) more than you asked for
  • Do not just copy-paste code from ChatGPT responses; run it line by line and edit as needed – ChatGPT does not execute R code, so there’s no guarantee that any of the code it gives you works
  • Watch out for clear mistakes in the response and correct them: do not keep loading packages that are already loaded, use the base pipe |> instead of %>%, follow the tidyverse style, avoid unnecessary print() statements, be critical of code smells, etc.