Quantifying uncertainty

Lecture 24

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2025

November 25, 2025

Warm-up

While you wait: Participate 📱💻

Given that HW 6 will be assigned on Monday, Dec 1, what day should HW 5 be due? It will be due at 11:59 pm on the deadline we decide on.

  • Sunday, Nov 30 - no change
  • Monday, Dec 1 - postponed by 1 day
  • Tuesday, Dec 2 - postponed by 2 days
  • Wednesday, Dec 3 - postponed by 3 days

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.

Announcements

  • HW:
    • HW 5 due [whichever date we decided on in the previous slide]
    • HW 6 due at 11:59 pm on Fri, Dec 6, but will be accepted until Sun, Dec 7 at 11:59 pm without penalty
  • Final exam:
    • In-class only, “cheat sheet” with same specs as usual allowed
    • Cumulative – will definitely have material since Exam 2, but there’s so little of it that it won’t be a huge portion of the exam
    • Final exam review during reading period – date/time/location TBA
    • Office hours during reading period + final exam week – schedule TBA
  • Grades:
    • Exam 2 in-class grades posted – can review questions in my office hours anytime before the final
    • Exam 2 take-home + project grades to be posted after Thanksgiving break

Quantifying uncertainty

Goal

Find range of plausible values for the slope using bootstrap confidence intervals.

Packages

# load packages
library(tidyverse) # for data wrangling and visualization
library(tidymodels) # for modeling
library(openintro) # for Duke Forest dataset
library(scales) # for pretty axis labels
library(glue) # for constructing character strings
library(knitr) # for neatly formatted tables

Data: Houses in Duke Forest

  • Data on houses that were sold in the Duke Forest neighborhood of Durham, NC around November 2020
  • Scraped from Zillow
  • Source: openintro::duke_forest

Home in Duke Forest

Goal: Use the area (in square feet) to understand variability in the price of houses in Duke Forest.

Exploratory data analysis

ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(alpha = 0.7) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Price and area of houses in Duke Forest"
  ) +
  scale_y_continuous(labels = label_dollar())

Modeling

df_fit <- linear_reg() |>
  fit(price ~ area, data = duke_forest)

tidy(df_fit) |>
  kable(digits = 2) # neatly format table to 2 digits
term          estimate  std.error  statistic  p.value
(Intercept)  116652.33   53302.46       2.19     0.03
area            159.48      18.17       8.78     0.00

Participate 📱💻

Which of the following is the correct interpretation of the intercept?

term         estimate  std.error  statistic  p.value
(Intercept)    116652      53302          2        0
area              159         18          9        0
  • For each additional square foot, the model predicts the sale price of Duke Forest houses to be higher by $116,652, on average.
  • Duke Forest houses that are 0 square feet are predicted to sell for $116,652, on average.
  • For each additional square foot, the model predicts the sale price of Duke Forest houses to be lower by $15,900, on average.
  • Duke Forest houses that are 0 square feet are predicted to sell for $15,900, on average.

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.

Participate 📱💻

Which of the following is the correct interpretation of the slope?

term         estimate  std.error  statistic  p.value
(Intercept)    116652      53302          2        0
area              159         18          9        0
  • For each additional square foot, the model predicts the sale price of Duke Forest houses to be higher by $159, on average.
  • Duke Forest houses that are 0 square feet are predicted to sell for $116,652, on average.
  • For each additional square foot, the model predicts the sale price of Duke Forest houses to be lower by $15,900, on average.
  • Duke Forest houses that are 0 square feet are predicted to sell for $159, on average.

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.
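
The intercept and slope interpretations in the two questions above can be checked with a little arithmetic. The sketch below copies the two estimates from the model output and plugs in house sizes of 0, 2000, and 2001 square feet; those sizes are hypothetical values chosen only for illustration, not houses from the data.

```r
# coefficient estimates copied from the tidy() output of df_fit
intercept <- 116652.33
slope <- 159.48

# hypothetical house sizes in square feet (illustrative assumptions)
area <- c(0, 2000, 2001)

# predicted price = intercept + slope * area
predicted_price <- intercept + slope * area
predicted_price

# a house of 0 square feet is predicted at the intercept
predicted_price[1]

# two houses that differ by one square foot differ by the slope
predicted_price[3] - predicted_price[2]
```

The first prediction equals the intercept, and the gap between the last two predictions equals the slope, which is what the correct interpretations describe.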

From sample to population

For each additional square foot, we expect the sale price of Duke Forest houses to be higher by $159, on average.


  • This estimate is valid for the single sample of 98 houses.
  • But what if we’re not interested in quantifying the relationship between the size and price of a house in this single sample?
  • What if we want to say something about the relationship between these variables for all houses in Duke Forest?

Statistical inference

  • Statistical inference provides methods and tools so we can use the single observed sample to make valid statements (inferences) about the population it comes from

  • For our inferences to be valid, the sample should be random and representative of the population we’re interested in

Soup in a bowl

Inference for simple linear regression

  • Calculate a confidence interval for the slope, \(\beta_1\) (today)

  • Conduct a hypothesis test for the slope, \(\beta_1\) (next week)

Confidence interval for the slope

Confidence interval

  • A plausible range of values for a population parameter is called a confidence interval
  • Using only a single point estimate is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net
    • We can throw a spear where we saw a fish, but we will probably miss; if we toss a net in that area, we have a good chance of catching the fish
    • Similarly, if we report a point estimate, we probably will not hit the exact population parameter, but if we report a range of plausible values we have a good shot at capturing the parameter

Confidence interval for the slope

A confidence interval will allow us to make a statement like “For each additional square foot, the model predicts the sale price of Duke Forest houses to be higher, on average, by $159, plus or minus X dollars.”

  • Should X be $10? $100? $1000?

  • If we were to take another sample of 98 houses, would we expect the slope calculated based on that sample to be exactly $159? Off by $10? $100? $1000?

  • The answer depends on how variable (from one sample to another sample) the sample statistic (the slope) is

  • We need a way to quantify the variability of the sample statistic

Quantify the variability of the slope

for estimation

  • Two approaches:
    1. Via simulation (what we’ll do in this course)
    2. Via mathematical models (what you can learn about in future courses)
  • Bootstrapping to quantify the variability of the slope for the purpose of estimation:
    • Bootstrap new samples from the original sample
    • Fit models to each of the samples and estimate the slope
    • Use features of the distribution of the bootstrapped slopes to construct a confidence interval
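
The bootstrap steps above can be sketched in a few lines of R. This is a minimal illustration of the mechanics, not necessarily the code used in the application exercise: it resamples rows by hand with base R's sample() and refits the model with lm(). The seed and the choice of 1000 resamples are assumptions made here for illustration.

```r
library(openintro) # provides the duke_forest data

set.seed(199) # arbitrary seed (assumption) so the resampling is reproducible

n_reps <- 1000          # number of bootstrap samples (arbitrary choice)
n <- nrow(duke_forest)  # size of the original sample

boot_slopes <- replicate(n_reps, {
  # step 1: draw n rows from the original sample, with replacement
  rows <- sample(n, size = n, replace = TRUE)
  boot_sample <- duke_forest[rows, ]

  # step 2: fit the model to the bootstrap sample and keep its slope
  boot_fit <- lm(price ~ area, data = boot_sample)
  coef(boot_fit)[["area"]]
})

# step 3: inspect the distribution of the bootstrapped slopes
summary(boot_slopes)
sd(boot_slopes) # spread of the slope from one resample to another
```

Each bootstrap sample has the same number of rows as the original data, and because rows are drawn with replacement, some houses appear more than once while others are left out.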

Bootstrap sample 1

Bootstrap sample 2

Bootstrap sample 3

Bootstrap sample 4

Bootstrap sample 5

so on and so forth…

Bootstrap samples 1 - 5

Bootstrap samples 1 - 100

Look familiar?


Slopes of bootstrap samples

Fill in the blank: For each additional square foot, the model predicts the sale price of Duke Forest houses to be higher, on average, by $159, plus or minus ___ dollars.

Slopes of bootstrap samples

Fill in the blank: For each additional square foot, we expect the sale price of Duke Forest houses to be higher, on average, by $159, plus or minus ___ dollars.

Confidence level

How confident are you that the true slope is between $0 and $250? How about $150 and $170? How about $90 and $210?

95% confidence interval

  • A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution
  • We are 95% confident that for each additional square foot, the model predicts the sale price of Duke Forest houses to be higher, on average, by $90.43 to $205.77.
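
One way to compute a percentile interval like the one above is with the infer package, which is loaded as part of tidymodels. The sketch below is an assumption about the workflow rather than the exact code behind the numbers on this slide. The seed and the number of resamples are arbitrary, so the endpoints will likely differ somewhat from $90.43 and $205.77.

```r
library(openintro) # provides the duke_forest data
library(infer)     # specify(), generate(), fit(), get_confidence_interval()

set.seed(199) # arbitrary seed (assumption) for reproducibility

# intercept and slope estimated from the observed sample
observed_fit <- duke_forest |>
  specify(price ~ area) |>
  fit()

# refit the model on 1000 bootstrap samples (1000 is an arbitrary choice)
boot_fits <- duke_forest |>
  specify(price ~ area) |>
  generate(reps = 1000, type = "bootstrap") |>
  fit()

# bounds of the middle 95% of the bootstrapped estimates, per term
ci_95 <- get_confidence_interval(
  boot_fits,
  point_estimate = observed_fit,
  level = 0.95,
  type = "percentile"
)
ci_95
```

The row for area gives the lower and upper bounds for the slope. Changing level to 0.90 or 0.99 narrows or widens the interval accordingly.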

Application exercise

ae-15-duke-forest-bootstrap

  • Go to your ae project in RStudio.

  • If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • If you haven’t yet done so, click Pull to get today’s application exercise file: ae-15-duke-forest-bootstrap.qmd.

  • Work through the application exercise in class, and render, commit, and push your edits.