Lecture 23
Duke University
STA 199 - Fall 2025
November 18, 2025
Cheat sheet: 8.5x11, both sides, handwritten or typed, any content you want, must be prepared by you
Bring a pencil and eraser (you’re allowed to use a pen, but you might not want to)
Reminder: Academic dishonesty / Duke Community Standard
Finish up the application exercise by finding the area under the ROC curve.
Split data into training and testing sets (generally 75/25)
Fit models on training data and reduce to a few candidate models
Make predictions on testing data
Evaluate predictions on testing data using appropriate predictive performance metrics
Don’t forget to also consider explainability and domain knowledge when selecting a final model
In a future machine learning course: Cross-validation (partitioning training data into training and validation sets and repeating this many times to evaluate model predictive performance before using the testing data), feature engineering, hyperparameter tuning, more complex models (random forests, gradient boosting machines, neural networks, etc.)
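The split / fit / predict / evaluate steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the data are simulated (not the hotels data), the names `sim`, `sim_train`, and `sim_test` are hypothetical, base R's `glm()` stands in for the tidymodels `logistic_reg()` used in these slides, and the AUC is computed with the rank-based (Mann-Whitney) formula rather than a tidymodels metric function.

```r
# Simulated stand-in data (NOT the hotels data): one predictor, binary outcome
set.seed(199)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
sim <- data.frame(x = x, y = y)

# 1. Split the data 75/25 into training and testing sets
train_rows <- sample(n, size = 0.75 * n)
sim_train <- sim[train_rows, ]
sim_test <- sim[-train_rows, ]

# 2. Fit a logistic regression model on the training data only
fit <- glm(y ~ x, data = sim_train, family = binomial)

# 3. Predict probabilities of y = 1 for the testing data
p_hat <- predict(fit, newdata = sim_test, type = "response")

# 4. Evaluate: AUC = probability that a randomly chosen positive case
#    receives a higher predicted probability than a randomly chosen negative case
n_pos <- sum(sim_test$y == 1)
n_neg <- sum(sim_test$y == 0)
rank_sum_pos <- sum(rank(p_hat)[sim_test$y == 1])
auc <- (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc
```

An AUC near 0.5 would indicate predictions no better than random guessing, while values closer to 1 indicate better separation of the two classes.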
is_canceled
# Packages assumed to be loaded earlier in the slides
library(tidyverse)
library(tidymodels)

hotels <- hotels |>
  mutate(
    is_canceled = if_else(is_canceled == 1, "canceled", "not canceled"),
    is_canceled = fct_relevel(is_canceled, "not canceled", "canceled")
  ) |>
  filter(adr <= 1000, adults < 5)
set.seed(1117)
hotels_split <- initial_split(hotels)
hotels_train <- training(hotels_split)
hotels_test <- testing(hotels_split)
What type of model would you use to answer the question:
Are reservations earlier in the month or later in the month more likely to be canceled?

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.
Numerical outcome:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \]
\[ \widehat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \]
Numerical outcome, log-transformed for a better linear fit:
\[ \log(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \]
\[ \log(\widehat{y}) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \]
Binary outcome, where \(p = P(Y=1)\), the probability of success:
\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \]
\[ \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \]
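Solving the fitted log-odds equation for \(\widehat{p}\) gives the predicted probability directly. The symbol \(\widehat{\eta}\) is a shorthand introduced here for the fitted linear predictor; it does not appear elsewhere in these slides.

\[ \widehat{\eta} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k, \qquad \widehat{p} = \frac{e^{\widehat{\eta}}}{1 + e^{\widehat{\eta}}} = \frac{1}{1 + e^{-\widehat{\eta}}} \]

Exponentiating the log-odds gives the odds, \(\widehat{p} / (1 - \widehat{p}) = e^{\widehat{\eta}}\), which is why exponentiated coefficients are interpreted on the odds scale.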
is_canceled_fit_1 <- logistic_reg() |>
  fit(is_canceled ~ arrival_date_day_of_month, data = hotels_train)
tidy(is_canceled_fit_1)
# A tibble: 2 × 5
  term                      estimate std.error statistic   p.value
  <chr>                        <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)               -0.510    0.0142      -35.8  1.80e-280
2 arrival_date_day_of_month -0.00152  0.000789     -1.93 5.41e-  2
# A tibble: 2 × 3
  term                      estimate exp_estimate
  <chr>                        <dbl>        <dbl>
1 (Intercept)               -0.510          0.601
2 arrival_date_day_of_month -0.00152        0.998
For each additional day later in the month, the odds of a reservation being canceled are predicted to be lower by a factor of 0.998, on average.
On the 0th day of the month, the odds of a reservation being canceled are predicted to be 0.601, on average. The intercept is not meaningful in this context since there is no 0th day of the month.
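The exp_estimate column comes from exponentiating each estimate, which moves it from the log-odds scale to the odds scale. A minimal base R sketch, assuming the estimates are typed in by hand from the tidy() output (the vector name `estimates` is hypothetical). Because the typed-in values are rounded, the last digit can differ slightly from the exp_estimate column.

```r
# Estimates copied from the tidy() output (rounded as displayed)
estimates <- c(
  "(Intercept)" = -0.510,
  "arrival_date_day_of_month" = -0.00152
)

# Exponentiate to move from the log-odds scale to the odds scale
exp_estimates <- exp(estimates)
round(exp_estimates, 3)

# Percent change in the odds for each additional day of the month
pct_change <- (exp_estimates[["arrival_date_day_of_month"]] - 1) * 100
pct_change
```

A multiplicative factor just below 1 corresponds to a small negative percent change in the odds per day.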
Predict the probability of cancellation for a booking arriving on the 18th day of the month.
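This prediction can be checked by hand from the fitted coefficients: plug day 18 into the log-odds equation, then convert to a probability. The sketch below assumes the rounded estimates from the tidy() output for is_canceled_fit_1 and uses base R's `plogis()` (the inverse logit); the variable names are hypothetical.

```r
# Rounded estimates from the tidy() output for is_canceled_fit_1
b0 <- -0.510
b1 <- -0.00152

# Log-odds of cancellation for an arrival on the 18th
day <- 18
log_odds <- b0 + b1 * day

# Inverse logit: convert log-odds to a probability
p_canceled <- plogis(log_odds)
p_not_canceled <- 1 - p_canceled

# Classify with the default 0.5 cutoff
predicted_class <- if (p_canceled > 0.5) "canceled" else "not canceled"

round(c(canceled = p_canceled, not_canceled = p_not_canceled), 3)
predicted_class
```

The result agrees with the .pred_canceled value of 0.369 and the predicted class "not canceled" reported by augment().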
augment() vs. predict()
augment() returns the data frame passed to new_data, augmented by the predicted class and the predicted probability of each class:
# A tibble: 1 × 4
  .pred_class  `.pred_not canceled` .pred_canceled
  <fct>                       <dbl>          <dbl>
1 not canceled                0.631          0.369
# ℹ 1 more variable: arrival_date_day_of_month <dbl>
predict() returns the predicted class (based on a 0.5 cutoff by default).
Fit another model to predict whether a reservation was canceled from arrival_date_day_of_month and hotel type (Resort or City Hotel), without allowing the relationship between arrival_date_day_of_month and is_canceled to vary based on hotel type.
is_canceled_fit_2 <- logistic_reg() |>
  fit(is_canceled ~ arrival_date_day_of_month + hotel, data = hotels_train)
tidy(is_canceled_fit_2)
# A tibble: 3 × 5
  term                      estimate std.error statistic  p.value
  <chr>                        <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)               -0.315    0.0151      -20.9  3.46e-97
2 arrival_date_day_of_month -0.00155  0.000797     -1.94 5.18e- 2
3 hotelResort Hotel         -0.614    0.0153      -40.0  0
# A tibble: 3 × 3
  term                      estimate exp_estimate
  <chr>                        <dbl>        <dbl>
1 (Intercept)               -0.315          0.729
2 arrival_date_day_of_month -0.00155        0.998
3 hotelResort Hotel         -0.614          0.541
Holding hotel type constant, for each additional day later in the month, the odds of a reservation being canceled are predicted to be lower by a factor of 0.998, on average.
Holding arrival day of month constant, the odds of a Resort Hotel reservation being canceled are predicted to be lower by a factor of 0.541 compared to a City Hotel reservation, on average.
On the 0th day of the month, the odds of a City Hotel reservation being canceled are predicted to be 0.729, on average. The intercept is not meaningful in this context since there is no 0th day of the month.
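In a main effects model, "holding arrival day constant" means the Resort-to-City odds ratio is the same on every day of the month. A minimal base R sketch, assuming the rounded estimates from the tidy() output for is_canceled_fit_2; the helper names `predict_prob()` and `odds()` are hypothetical.

```r
# Rounded estimates from the tidy() output for is_canceled_fit_2
b0 <- -0.315
b_day <- -0.00155
b_resort <- -0.614

# Predicted probability of cancellation; resort = 1 for Resort Hotel, 0 for City Hotel
predict_prob <- function(day, resort) {
  plogis(b0 + b_day * day + b_resort * resort)
}

# Convert a probability to odds
odds <- function(p) p / (1 - p)

# Resort-to-City odds ratio on two different days of the month
odds_ratio_day18 <- odds(predict_prob(18, 1)) / odds(predict_prob(18, 0))
odds_ratio_day1 <- odds(predict_prob(1, 1)) / odds(predict_prob(1, 0))

round(c(day18 = odds_ratio_day18, day1 = odds_ratio_day1), 3)
```

Both ratios equal exp(-0.614), the exponentiated hotel coefficient, because the day term cancels when the two odds are divided.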
Fit another model to predict whether a reservation was canceled from arrival_date_day_of_month and hotel type (Resort or City Hotel), allowing the relationship between arrival_date_day_of_month and is_canceled to vary based on hotel type.
is_canceled_fit_3 <- logistic_reg() |>
  fit(is_canceled ~ arrival_date_day_of_month * hotel, data = hotels_train)
tidy(is_canceled_fit_3)
# A tibble: 4 × 5
  term                           estimate std.error statistic  p.value
  <chr>                             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                    -0.321    0.0172     -18.7   6.73e-78
2 arrival_date_day_of_month      -0.00117  0.000954    -1.23  2.19e- 1
3 hotelResort Hotel              -0.594    0.0313     -19.0   3.50e-80
4 arrival_date_day_of_month:hot… -0.00125  0.00174     -0.719 4.72e- 1
# A tibble: 4 × 2
  term                                        estimate
  <chr>                                          <dbl>
1 (Intercept)                                 -0.321
2 arrival_date_day_of_month                   -0.00117
3 hotelResort Hotel                           -0.594
4 arrival_date_day_of_month:hotelResort Hotel -0.00125
\[ \begin{aligned} \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) &= -0.321 \\ &- 0.00117 \times \texttt{arrival_date_day_of_month} \\ &- 0.594 \times \texttt{hotelResort Hotel} \\ &- 0.00125 \times \texttt{arrival_date_day_of_month:hotelResort Hotel} \\ \end{aligned} \]
City Hotel:
\[ \begin{aligned} \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) &= -0.321 \\ &- 0.00117 \times \texttt{arrival_date_day_of_month} \\ &- 0.594 \times 0 \\ &- 0.00125 \times \texttt{arrival_date_day_of_month} \times 0 \\ \\ &= -0.321 - 0.00117 \times \texttt{arrival_date_day_of_month} \\ \end{aligned} \]
Resort Hotel:
\[ \begin{aligned} \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) &= -0.321 \\ &- 0.00117 \times \texttt{arrival_date_day_of_month} \\ &- 0.594 \times 1 \\ &- 0.00125 \times \texttt{arrival_date_day_of_month} \times 1 \\ \\ &= -(0.321+0.594) - (0.00117+0.00125) \times \texttt{arrival_date_day_of_month} \\ \\ &= -0.915 - 0.00242 \times \texttt{arrival_date_day_of_month} \\ \end{aligned} \]
City Hotel:
\[ \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) = -0.321 - 0.00117 \times \texttt{arrival_date_day_of_month} \]
exp(-0.00117) = 0.999: In City Hotels, for each additional day later in the month, the odds of a reservation being canceled are predicted to be lower by a factor of 0.999, on average.
exp(-0.321) = 0.725: In City Hotels, on the 0th day of the month, the odds of a reservation being canceled are predicted to be 0.725, on average. The intercept is not meaningful in this context since there is no 0th day of the month.
Resort Hotel:
\[ \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) = -0.915 - 0.00242 \times \texttt{arrival_date_day_of_month} \]
exp(-0.00242) = 0.998: In Resort Hotels, for each additional day later in the month, the odds of a reservation being canceled are predicted to be lower by a factor of 0.998, on average.
exp(-0.915) = 0.401: In Resort Hotels, on the 0th day of the month, the odds of a reservation being canceled are predicted to be 0.401, on average. The intercept is not meaningful in this context since there is no 0th day of the month.
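The per-hotel intercepts and slopes of the interaction model can be recomputed from the four coefficients: the City Hotel line uses the baseline terms, and the Resort Hotel line adds the hotel and interaction terms. A minimal base R sketch, assuming the rounded estimates from the tidy() output for is_canceled_fit_3; the vector names are hypothetical.

```r
# Rounded estimates from the tidy() output for is_canceled_fit_3
coefs <- c(
  intercept = -0.321,
  day = -0.00117,
  resort = -0.594,
  day_resort = -0.00125
)

# City Hotel is the baseline level: only the intercept and day terms apply
city <- c(intercept = coefs[["intercept"]], slope = coefs[["day"]])

# Resort Hotel adds the hotel term to the intercept and the interaction term to the slope
resort <- c(
  intercept = coefs[["intercept"]] + coefs[["resort"]],
  slope = coefs[["day"]] + coefs[["day_resort"]]
)

# Exponentiate to interpret each quantity on the odds scale
round(exp(city), 3)
round(exp(resort), 3)
```

The Resort Hotel values are -0.915 for the intercept and -0.00242 for the slope, so those are the quantities to exponentiate for the Resort Hotel interpretations.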
Suppose we want to select a final model to predict whether a reservation was canceled. Which metric would be most appropriate to evaluate the predictive performance of our logistic regression models?

Single predictor:
Main effects:
Which model would you select as your final model based on AUC?
The dataset also contains information about the average daily rate (adr) for each reservation. The following model predicts adr from adults and hotel type.
# A tibble: 3 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)           50.5     0.605      83.5 0
2 adults                29.5     0.311      94.9 0
3 hotelResort Hotel    -10.7     0.322     -33.1 2.48e-239
Which of the following is the best interpretation of the slope coefficient for adults?
Holding hotel type constant, for each additional adult in the booking, the average daily rate is predicted to be higher by $29.50, on average.

Which of the following is the correct interpretation of the slope coefficient for hotel?
Which of the following is the correct interpretation of the intercept?
Which of the following (Plot A or Plot B) is the correct visual representation of this model?
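One way to reason about the plot is to compute predictions from the model. Because the model for adr has main effects only (no interaction between adults and hotel), it implies two parallel lines: the same slope for adults at both hotel types, with a constant vertical gap equal to the hotel coefficient. A minimal base R sketch, assuming the rounded estimates from the table for the adr model; the helper name `predict_adr()` is hypothetical.

```r
# Rounded estimates from the tidy() output for the adr model
b0 <- 50.5
b_adults <- 29.5
b_resort <- -10.7

# Predicted average daily rate; resort = 1 for Resort Hotel, 0 for City Hotel
predict_adr <- function(adults, resort) {
  b0 + b_adults * adults + b_resort * resort
}

# Predictions for 1 to 4 adults (the data were filtered to adults < 5)
adults <- 1:4
city_adr <- predict_adr(adults, resort = 0)
resort_adr <- predict_adr(adults, resort = 1)

# The gap between hotel types is the same at every number of adults
gap <- resort_adr - city_adr
data.frame(adults, city_adr, resort_adr, gap)
```

A plot whose two lines have different slopes would correspond to an interaction model, not to this one.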


