Web scraping
many pages

Lecture 14

Author: Dr. Mine Çetinkaya-Rundel
Affiliation: Duke University, STA 199 - Fall 2025
Published: October 9, 2025

Warm-up

While you wait: Participate 📱💻

The following code in chronicle-scrape.R extracts the titles of the opinion articles from The Chronicle website:

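# rvest provides read_html(), html_elements(), and html_text()
library(rvest)

# Read the saved copy of The Chronicle's opinion section
page <- read_html(
  "https://www2.stat.duke.edu/~cr173/data/dukechronicle-opinion/www.dukechronicle.com/section/opinionabc4.html"
)

# Select the elements matching the CSS selector and extract their text
titles <- page |>
  html_elements(".space-y-4 .font-extrabold") |>
  html_text()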

Which of the following needs to change to extract column titles instead?

Change the URL in read_html()
Change the function html_elements() to html_element()
Change the CSS selector .space-y-4 .font-extrabold to .space-y-4 .text-brand
Change the function html_text() to html_attr()

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.
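
To make the difference between these options concrete, here is a minimal sketch on a toy page. The HTML snippet and its class names (`card`, `headline`, `column`) are made up for illustration and are not taken from The Chronicle's site. Only the rvest functions are real.

```r
library(rvest)

# Hypothetical HTML snippet; the class names are invented for this sketch
toy_page <- minimal_html('
  <div class="card">
    <a class="headline" href="/article/one">First headline</a>
    <span class="column">Column A</span>
  </div>
  <div class="card">
    <a class="headline" href="/article/two">Second headline</a>
    <span class="column">Column B</span>
  </div>
')

# html_elements() returns every match; html_text() extracts the text
headlines <- toy_page |>
  html_elements(".card .headline") |>
  html_text()

# A different CSS selector picks out different elements on the same page
columns <- toy_page |>
  html_elements(".card .column") |>
  html_text()

# html_element() (singular) returns only the first match
first_headline <- toy_page |>
  html_element(".card .headline") |>
  html_text()

# html_attr() extracts the value of an attribute instead of the text
links <- toy_page |>
  html_elements(".card .headline") |>
  html_attr("href")
```

Printing `headlines`, `columns`, `first_headline`, and `links` side by side shows which change affects what is selected and which affects what is extracted from the selection.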

Announcements

HW 2, Question 7: Reproduce the colorful box plot – We caught an error in grading (any theme with a white background would have worked). If you originally missed points due to not using theme_bw(), but you used another theme with a white background, we’ve updated your grade.
Midsemester course survey due tonight at 11:59pm
Project proposals (Milestone 2) + first peer evaluation due next Thursday at 11:59pm – any questions?

From last time

Opinion articles in The Chronicle

Go to https://www2.stat.duke.edu/~cr173/data/dukechronicle-opinion/www.dukechronicle.com/section/opinionabc4.html (copy of The Chronicle opinion section as of October 7, 2025).

Goal

Scrape data and organize it in a tidy format in R
Perform light text parsing to clean data
Summarize and visualize the data

ae-09-chronicle-scrape

Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise files: ae-09-chronicle-scrape.qmd and chronicle-scrape.R.

Participate 📱💻

Put the following tasks in order to scrape data from a website:

Use the SelectorGadget to identify tags for elements you want to grab
Use read_html() to read the page’s source code into R
Use other functions from the rvest package to parse the elements you’re interested in
Put the components together in a data frame (a tibble) and analyze it like you analyze any other data

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.
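
A minimal sketch of how these tasks can fit together in code, assuming a toy page in place of a real website. The HTML snippet, its class names (`title`, `author`), and the object names are hypothetical. On a real site, the selectors would come from the SelectorGadget and the page from `read_html()` with a URL.

```r
library(rvest)
library(tibble)

# Read the page source into R; an inline snippet stands in for read_html(url)
page <- minimal_html('
  <div class="article">
    <h2 class="title">Title one</h2>
    <p class="author">Author A</p>
  </div>
  <div class="article">
    <h2 class="title">Title two</h2>
    <p class="author">Author B</p>
  </div>
')

# Parse the elements of interest with the (hypothetical) selectors
titles <- page |>
  html_elements(".title") |>
  html_text()

authors <- page |>
  html_elements(".author") |>
  html_text()

# Put the components together in a tibble, one row per article
articles <- tibble(title = titles, author = authors)
```

From here, `articles` can be summarized and plotted like any other data frame.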

A new R workflow

When working in a Quarto document, your analysis is re-run each time you render
If web scraping in a Quarto document, you’d be re-scraping the data each time you render, which is undesirable (and not nice)!
An alternative workflow:
- Use an R script to save your code
- Save the interim data scraped by the code in the script as CSV or RDS files
- Use the saved data in your analysis in your Quarto document
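
The alternative workflow might look roughly like the sketch below. The tibble stands in for scraped data, and the file name `chronicle.csv` is an assumption for illustration. The sketch writes to a temporary directory so it runs anywhere, whereas a real project would more likely save into a folder inside the project.

```r
library(readr)
library(tibble)

# --- In the R script: scrape once, then save the result ---

# Stand-in for the data frame produced by the scraping code
scraped <- tibble(
  title = c("Title one", "Title two"),
  author = c("Author A", "Author B")
)

# Hypothetical file name; tempdir() is used only to keep the sketch portable
csv_path <- file.path(tempdir(), "chronicle.csv")
write_csv(scraped, csv_path)

# --- In the Quarto document: read the saved file instead of re-scraping ---

chronicle <- read_csv(csv_path, show_col_types = FALSE)
```

With this split, rendering the Quarto document only reads the saved file, so the website is contacted only when the script is run deliberately. `write_rds()` and `read_rds()` from readr play the same role for RDS files.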