Linear models with a multiple predictors

Lecture 17

Author
Affiliation

Dr. Mine Γ‡etinkaya-Rundel

Duke University
STA 199 - Fall 2025

Published

October 28, 2025

Warm-up

While you wait: Participate πŸ“±πŸ’»

Which of the following is true about linear regression models?

  • The regression line always passes through the origin (x = 0, y = 0).
  • The slope indicates the average change in the outcome for a one-unit increase in the predictor.
  • The intercept indicates the average value of the predictor when the outcome is 0.
  • Least squares regression line minimizes the sum of the vertical distances between the observed and predicted values of the outcome.

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.

Announcements

  • HW 4 is due on Sunday, 11:59 pm

  • Make use of office hours!!!

  • Reminder: Class starts at 11:45 am, ends at 1:00 pm (not 12:50 pm)

  • If you missed yesterdays project milestone due to a documented illness, look out for an email from me for an alternate task. (This likely means you also missed the lab, that can be one of your dropped labs.)

  • Email me to let me know if you’d like a live review of your project in class next Tuesday! I’ll take the first 2 respondents.

Linear regression with a categorical predictor

Packages

── Attaching core tidyverse packages ────────────── tidyverse 2.0.0 ──
βœ” dplyr     1.1.4          βœ” readr     2.1.5     
βœ” forcats   1.0.0          βœ” stringr   1.5.1     
βœ” ggplot2   3.5.2          βœ” tibble    3.3.0.9004
βœ” lubridate 1.9.4          βœ” tidyr     1.3.1     
βœ” purrr     1.1.0          
── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
βœ– dplyr::filter() masks stats::filter()
βœ– dplyr::lag()    masks stats::lag()
β„Ή Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ──────────────────────────── tidymodels 1.3.0 ──
βœ” broom        1.0.9     βœ” rsample      1.3.1
βœ” dials        1.4.1     βœ” tune         1.3.0
βœ” infer        1.0.9     βœ” workflows    1.2.0
βœ” modeldata    1.5.1     βœ” workflowsets 1.1.1
βœ” parsnip      1.3.2     βœ” yardstick    1.3.2
βœ” recipes      1.3.1     
── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
βœ– scales::discard() masks purrr::discard()
βœ– dplyr::filter()   masks stats::filter()
βœ– recipes::fixed()  masks stringr::fixed()
βœ– dplyr::lag()      masks stats::lag()
βœ– yardstick::spec() masks readr::spec()
βœ– recipes::step()   masks stats::step()

Attaching package: 'palmerpenguins'

The following object is masked from 'package:modeldata':

    penguins

The following objects are masked from 'package:datasets':

    penguins, penguins_raw

From last time (with penguins)

A different researcher wants to look at body weight of penguins based on the island they were recorded on. How are the variables involved in this analysis different?

. . .

  • outcome: body weight (numerical)

  • predictor: island (categorical)

Visualize body weight vs. island

Determine whether each of the following plot types would be an appropriate choice for visualizing the relationship between body weight and island of penguins.

  • Scatterplot ❌

  • Box plot βœ…

  • Violin plot βœ…

  • Density plot βœ…

  • Bar plot ❌

  • Stacked bar plot ❌

Boxplot

Visualize the relationship between body weight and island of penguins. Also calculate the average body weight per island.

ggplot(
  penguins,
  aes(x = island, y = body_mass_g)
) +
  geom_boxplot() +
  labs(
    x = "Island",
    y = "Body mass (g)",
    title = "Body mass by island"
  )

penguins |>
  group_by(island) |>
  summarize(
    bm_mean = mean(body_mass_g, na.rm = TRUE)
  )
# A tibble: 3 Γ— 2
  island    bm_mean
  <fct>       <dbl>
1 Biscoe      4716.
2 Dream       3713.
3 Torgersen   3706.

Violin plot

ggplot(
  penguins,
  aes(x = island, y = body_mass_g)
) +
  geom_violin() +
  labs(
    x = "Island",
    y = "Body mass (g)",
    title = "Body mass by island"
  )

Density plot

ggplot(
  penguins,
  aes(x = body_mass_g, color = island, fill = island)
) +
  geom_density(alpha = 0.5) +
  scale_color_colorblind() +
  scale_fill_colorblind() +
  labs(
    x = "Island",
    y = "Body mass (g)",
    title = "Body mass by island"
  )

Multiple geoms

ggplot(
  penguins,
  aes(x = island, y = body_mass_g, color = island)
) +
  geom_boxplot() +
  geom_beeswarm(size = 2, alpha = 0.5) +
  scale_color_colorblind() +
  labs(
    x = "Island",
    y = "Body mass (g)",
    title = "Body mass by island"
  ) +
  theme(legend.position = "none")

Model - fit

Fit a linear regression model predicting body weight from island and display the results. Why is Biscoe not on the output?

bm_island_fit <- linear_reg() |>
  fit(body_mass_g ~ island, data = penguins)

tidy(bm_island_fit)
# A tibble: 3 Γ— 5
  term            estimate std.error statistic   p.value
  <chr>              <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        4716.      48.5      97.3 8.93e-250
2 islandDream       -1003.      74.2     -13.5 1.42e- 33
3 islandTorgersen   -1010.     100.      -10.1 4.66e- 21

Model - interpret

\[ \widehat{body~mass} = 4716 - 1003 \times islandDream - 1010 \times islandTorgersen \]

  • Intercept: Penguins from Biscoe island are expected to weigh, on average, 4,716 grams.

  • Slope - islandDream: Penguins from Dream island are expected to weigh, on average, 1,003 grams less than those from Biscoe island.

  • Slope - islandTorgersen: Penguins from Torgersen island are expected to weigh, on average, 1,010 grams less than those from Biscoe island.

Model - predict

What is the predicted body weight of a penguin on Biscoe island? What are the estimated body weights of penguins on Dream and Torgersen islands? Where have we seen these values before?

new_penguins <- tibble(
  island = c("Biscoe", "Dream", "Torgersen")
)

augment(bm_island_fit, new_data = new_penguins)
# A tibble: 3 Γ— 2
  .pred island   
  <dbl> <chr>    
1 4716. Biscoe   
2 3713. Dream    
3 3706. Torgersen

Model - predict

Calculate the predicted body weights of penguins on Biscoe, Dream, and Torgersen islands by hand.

\[ \widehat{body~mass} = 4716 - 1003 \times islandDream - 1010 \times islandTorgersen \]

. . .

  • Biscoe: \(\widehat{body~mass} = 4716 - 1003 \times 0 - 1010 \times 0 = 4716\)

. . .

  • Dream: \(\widehat{body~mass} = 4716 - 1003 \times 1 - 1010 \times 0 = 3713\)

. . .

  • Torgersen: \(\widehat{body~mass} = 4716 - 1003 \times 0 - 1010 \times 1 = 3706\)

Models with categorical predictors

  • When the categorical predictor has many levels, they’re encoded to dummy variables.

  • The first level of the categorical variable is the baseline level. In a model with one categorical predictor, the intercept is the predicted value of the outcome for the baseline level (x = 0).

  • Each slope coefficient describes the difference between the predicted value of the outcome for that level of the categorical variable compared to the baseline level.

Changing the baseline level

  • By default, R uses the first level of a categorical variable as the baseline level.

  • We can change the baseline level by reordering the levels of the categorical variable.

Participate πŸ“±πŸ’»

What function goes in the blank to make β€œDream” the baseline level in the island variable?

penguins |>
  mutate(
    island = ___(island, "Dream")
  )
  • level_reorder
  • fct_relevel
  • fct_reorder
  • str_relevel

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.

Participate πŸ“±πŸ’»

What is the baseline level of island in the following model??

linear_reg() |>
  fit(body_mass_g ~ island, data = penguins) |>
  tidy()
# A tibble: 3 Γ— 5
  term         estimate std.error statistic   p.value
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   3706.        87.7   42.3    3.64e-137
2 islandBiscoe  1010.       100.    10.1    4.66e- 21
3 islandDream      6.53     104.     0.0627 9.50e-  1
  • Biscoe
  • Dream
  • Torgersen
  • Cannot tell from the output

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.

Application exercise

ae-12-penguins-model-multi

  • Go to your ae project in RStudio.

  • If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • If you haven’t yet done so, click Pull to get today’s application exercise file: ae-12-penguins-model-multi.qmd.

  • Work through the application exercise in class, and render, commit, and push your edits.