
Lecture 18
Duke University
STA 199 - Fall 2025
October 30, 2025
Read the NY Times story

Peer eval 2 due tonight at 11:59 pm
Next semester: STA 313 - Advanced Data Visualization - vizdata.org
Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise file: ae-12-penguins-model-multi.qmd.
Work through the application exercise in class, and render, commit, and push your edits.
Fit two (or more) models to the same data set with different sets of predictors, selected based on subject-matter knowledge and/or exploratory data analysis; compare their adjusted \(R^2\) values; and select the model with the highest adjusted \(R^2\).
Perform stepwise selection (forward or backward) to find a model that maximizes adjusted \(R^2\).
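The backward version of this idea can be sketched in a few lines of base R. This is a minimal illustration, not the application exercise code: it uses the built-in `mtcars` data (rather than the penguins data) so it is self-contained, and it greedily drops whichever predictor most improves adjusted \(R^2\), stopping when no drop helps.

```r
# Backward stepwise selection by adjusted R^2 (sketch, mtcars data)
adj_r2 <- function(fit) summary(fit)$adj.r.squared

predictors <- c("cyl", "disp", "hp", "wt")
current <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)

repeat {
  best_drop <- NULL
  best_fit  <- current
  for (p in predictors) {
    remaining <- setdiff(predictors, p)
    if (length(remaining) == 0) next  # keep at least one predictor
    candidate <- lm(reformulate(remaining, response = "mpg"), data = mtcars)
    if (adj_r2(candidate) > adj_r2(best_fit)) {
      best_drop <- p
      best_fit  <- candidate
    }
  }
  if (is.null(best_drop)) break  # no single drop improves adjusted R^2
  predictors <- setdiff(predictors, best_drop)
  current    <- best_fit
}

summary(current)$adj.r.squared
```

Note that R's built-in `step()` function performs stepwise selection too, but it compares models by AIC rather than adjusted \(R^2\).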
The same model selection approaches (subject-matter knowledge, stepwise selection) can be used with other model comparison criteria, such as AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion).
Note
We mention criteria other than adjusted \(R^2\) here only to build awareness that other criteria exist. We will not cover them in detail in this course (and this is not a complete list), but you will encounter them in higher-level modeling courses and in practice.
More complex models (i.e., models with more predictors) tend to fit the data at hand better, but may not generalize well to new data.
Model selection criteria, like adjusted \(R^2\), help balance model fit and complexity to avoid overfitting by penalizing models with more predictors.
Overfitting occurs when a model captures not only the underlying relationship between predictors and outcome but also the random noise in the data.
Overfitted models tend to perform well on the observed data but poorly on new, unseen data.
Outliers are observations that fall far from the main cloud of points.
They can be outlying in their \(x\) value, their \(y\) value, or both.
However, being outlying in a univariate sense does not always mean being outlying from the bivariate model.
Points that are in-line with the bivariate model usually do not influence the least squares line, even if they are extreme in \(x\), \(y\), or both.
Which of the following best describes the circled point?


Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.
Which of the following best describes the circled point?


Outliers: Points or groups of points that stand out from the rest of the data.
Leverage points: Points that fall horizontally far from the center of the cloud tend to pull harder on the regression line, so we call them high-leverage points (or leverage points).
Influential points: Points, usually with high leverage, that substantially alter the slope or position of the regression line when included.
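Base R provides numeric versions of these ideas. The sketch below (using `mtcars` rather than the penguins data, so it is self-contained) computes leverage with `hatvalues()` and influence with `cooks.distance()`; the `2 * (p + 1) / n` cutoff is a common rule of thumb, not a hard rule.

```r
# Leverage and influence diagnostics for a simple regression (mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

lev  <- hatvalues(fit)        # leverage: how far each point is from the
                              # center of the cloud in the x direction
cook <- cooks.distance(fit)   # influence: combines leverage with how far
                              # the point falls from the fitted line

# Rule-of-thumb flag: leverage above 2 * (p + 1) / n, here p = 1 predictor
n <- nrow(mtcars)
names(lev)[lev > 2 * 2 / n]

# Observations with the largest Cook's distance
head(sort(cook, decreasing = TRUE), 3)
```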
Test your analysis with and without outliers.
Compare and discuss the impact of outliers on model fit.
Present both models to stakeholders to choose the most reasonable interpretation.
Warning
Removing outliers should only be done with strong justification – excluding interesting or extreme cases can lead to misleading models, poor predictive performance, and flawed conclusions.
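A with/without sensitivity check like the one described above can be done in a few lines. This is a sketch (again using `mtcars` for self-containedness): refit after removing the most influential observation and compare the slopes from the two fits; a large change suggests the point is influential.

```r
# Sensitivity check: refit without the most influential point (mtcars)
fit_all <- lm(mpg ~ wt, data = mtcars)

drop_id  <- which.max(cooks.distance(fit_all))
fit_drop <- lm(mpg ~ wt, data = mtcars[-drop_id, ])

# Compare the slope of wt with and without the flagged observation
c(with = coef(fit_all)["wt"], without = coef(fit_drop)["wt"])
```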
How China Raced Ahead of the U.S. on Nuclear Power