page <- read_html(
  "https://www2.stat.duke.edu/~cr173/data/dukechronicle-opinion/www.dukechronicle.com/section/opinionabc4.html"
)
titles <- page |>
  html_elements(".space-y-4 .font-extrabold") |>
  html_text()
Web scraping
many pages
Lecture 14
Warm-up
While you wait: Participate 📱💻
The code above, from chronicle-scrape.R, extracts the titles of opinion articles from The Chronicle website.
Which of the following needs to change to extract column titles instead?
- Change the URL in read_html()
- Change the function html_elements() to html_element()
- Change the CSS selector .space-y-4 .font-extrabold to .space-y-4 .text-brand
- Change the function html_text() to html_attr()
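To see what each of these changes would do, here is a minimal sketch on a made-up page built with rvest's minimal_html(). The HTML and the object names are assumptions for illustration; the real Chronicle page is far more complex.

```r
library(rvest)

# Hypothetical stand-in for the opinion page: two entries, each with a
# column label (.text-brand) and a title (.font-extrabold)
toy_page <- minimal_html('
  <div class="space-y-4">
    <a class="text-brand" href="/column/a">Campus Voices</a>
    <h2 class="font-extrabold">First title</h2>
    <a class="text-brand" href="/column/b">Opinion</a>
    <h2 class="font-extrabold">Second title</h2>
  </div>
')

# Same functions, different CSS selector: titles vs. column labels
titles <- toy_page |>
  html_elements(".space-y-4 .font-extrabold") |>
  html_text()

columns <- toy_page |>
  html_elements(".space-y-4 .text-brand") |>
  html_text()

# html_element() (singular) keeps only the first match
first_title <- toy_page |>
  html_element(".space-y-4 .font-extrabold") |>
  html_text()

# html_attr() returns the value of an attribute rather than the text
links <- toy_page |>
  html_elements(".space-y-4 .text-brand") |>
  html_attr("href")
```

On this toy page, only the selector change switches the result from titles to column labels; the other changes alter how many elements come back or which part of each element is returned.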
Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.
Announcements
HW 2, Question 7: Reproduce the colorful box plot – We caught an error in grading (any theme with a white background would have worked). If you originally missed points due to not using theme_bw(), but you used another theme with a white background, we’ve updated your grade.
Midsemester course survey due tonight at 11:59pm
Project proposals (Milestone 2) + first peer evaluation due next Thursday at 11:59pm – any questions?
From last time
Opinion articles in The Chronicle
Go to https://www2.stat.duke.edu/~cr173/data/dukechronicle-opinion/www.dukechronicle.com/section/opinionabc4.html (copy of The Chronicle opinion section as of October 7, 2025).
Goal
ae-09-chronicle-scrape
Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise files: ae-09-chronicle-scrape.qmd and chronicle-scrape.R.
Participate 📱💻
Put the following tasks in order to scrape data from a website:
- Use the SelectorGadget to identify tags for elements you want to grab
- Use read_html() to read the page’s source code into R
- Use other functions from the rvest package to parse the elements you’re interested in
- Put the components together in a data frame (a tibble) and analyze it like you analyze any other data
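These four tasks can be sketched end to end on a tiny made-up page. The HTML, the selectors .title and .author, and the object names are all hypothetical; with a real site, read_html() with the page's URL would take the place of minimal_html().

```r
library(rvest)
library(tibble)

# Task 1 happens in the browser: assume SelectorGadget suggested the
# selectors ".title" and ".author" for this hypothetical page

# Task 2: read the page's source code into R
page <- minimal_html('
  <div class="item"><h2 class="title">Post A</h2><p class="author">Ana</p></div>
  <div class="item"><h2 class="title">Post B</h2><p class="author">Ben</p></div>
')

# Task 3: parse the elements of interest with other rvest functions
titles <- page |>
  html_elements(".title") |>
  html_text2()

authors <- page |>
  html_elements(".author") |>
  html_text2()

# Task 4: put the components together in a tibble
posts <- tibble(title = titles, author = authors)
posts
```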
A new R workflow
When working in a Quarto document, your analysis is re-run each time you knit
If web scraping in a Quarto document, you’d be re-scraping the data each time you knit, which is undesirable (and not nice)!
An alternative workflow:
- Use an R script to save your code
- Save interim data scraped using the code in the script as CSV or RDS files
- Use the saved data in your analysis in your Quarto document
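A minimal sketch of this workflow, assuming a small hand-made tibble in place of real scraped data. The object names and file paths are hypothetical, and tempdir() is used only to keep the example self-contained; in a project you would more likely write to a data/ folder.

```r
library(tibble)
library(readr)

# --- In the R script (run once) ---
# Hypothetical stand-in for data produced by scraping
scraped <- tibble(
  title = c("Post A", "Post B"),
  n_words = c(120, 340)
)

csv_path <- file.path(tempdir(), "scraped.csv")
rds_path <- file.path(tempdir(), "scraped.rds")

write_csv(scraped, csv_path)  # plain text, readable by other tools
write_rds(scraped, rds_path)  # R-specific, keeps the R object as is

# --- In the Quarto document (run each time you knit) ---
# Only the saved files are read; nothing is re-scraped
from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)
```

CSV files are easy to inspect and share, while RDS files preserve column types without re-guessing them on read; either one avoids hitting the website again.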
Web scraping considerations
Ethics: “Can you?” vs “Should you?”
Source: Brian Resnick, Researchers just released profile data on 70,000 OkCupid users without permission, Vox.
“Can you?” vs “Should you?”
Challenges: Unreliable formatting
Challenges: Data broken into many pages
Scraping from many pages
Packages
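The slides that follow call functions such as read_csv(), read_html(), str_remove_all(), and kable(). A setup along these lines would make them available; the exact library() calls used in the lecture are not shown here, so treat this as an assumption.

```r
library(tidyverse) # read_csv(), slice_head(), pull(), mutate(), str_remove_all(), ...
library(rvest)     # read_html(), html_elements(), html_text2()
library(knitr)     # kable()
```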
Columns and words
What additional information do we need to produce the table below?
Column | Avg. # words/article | # articles |
---|---|---|
Campus Voices | 942 | 382 |
Letters To The Editor | 307 | 19 |
Opinion | 1020 | 99 |
Start with the URL for the first article
chronicle <- read_csv("data/chronicle.csv")
article_url <- chronicle |>
  slice_head(n = 1) |>
  pull(url) # name the column explicitly rather than relying on it being last
article_url
[1] "https://www2.stat.duke.edu/~cr173/data/dukechronicle-opinion/www.dukechronicle.com/article/the-united-states-is-a-frat-and-i-cant-unsee-it-20251007.html"
Read in the page
article_page <- read_html(article_url)
article_page
{html_document}
<html lang="en" style="scrollbar-gutter: stable;">
[1] <head>\n<meta http-equiv="content-type" content="text/html;char ...
[2] <body class="snw-antialiased">\n <!-- Google Tag Manager (no ...
Identify the elements you want
.article-content
Parse the elements you want - 1
article_page |>
  html_elements(".article-content")
{xml_nodeset (1)}
[1] <article class="full-article arx-styles prose max-w-3xl lg:prose-lg text-gray-900 bs-shim article-content" id="article-content-d976e3a0-027c-447d-a162-bd10f5d921a3"><p spellcheck="false" aria-label="To enrich screen reader interactions, please activate Accessibility in Grammarly extension settings">Fraternities thrive on exclusivity. At Duke, that posture may be tolerable, even expected. But once you’ve seen how Greek life operates (gatekeeping at the door, bending rules for insiders, pun ...
Parse the elements you want - 2
article_page |>
  html_elements(".article-content") |>
  html_text2()
[1] "Fraternities thrive on exclusivity. At Duke, that posture may be tolerable, even expected. But once you’ve seen how Greek life operates (gatekeeping at the door, bending rules for insiders, punishing outsiders) it’s hard not to notice the same logic playing out on a national scale. The United States, too, runs like a frat: immigration represents the rush process, the Constitution and the insiders who decide when those rules apply and when they don’t.\n\nThe First Test: Do You Belong?\n\nImmigration is America’s bid night. Who gets past the door and who gets turned away, reveals who the frat thinks deserves to wear its letters. \n\nIn the late 1800s, the Chinese Exclusion Act was the ultimate “you don’t belong here.” It was the equivalent of a fraternity cutting pledges at the door, not because of merit, but because of appearance.\n\nHistorian Andrew Gyory argues in “Closing the Gate: Race, Politics, and the Chinese Exclusion Act,” that the 1882 Chinese Exclusion law wasn’t just about jobs or economics — it was about racial scapegoating weaponized in a crisis. In the recession of the 1870s, white workers blamed Chinese laborers for stealing jobs and depressing wages. Politicians ran with it, branding Chinese immigrants as “coolies” (a slur) who threatened white labor. This act suspended immigration, denied naturalization and subjected exempt merchants and students to humiliating interrogation. \n\nThe same script is running today. Just as Chinese laborers were scapegoated during the 1870s downturn, today’s immigrants are painted as threats to American workers. Trump says migrants “steal jobs” and “drive down wages,” while ignoring the actual structural causes of inequality. Wealth-based hurdles for H-1B visas, restrictions on asylum work permits: all built on the same assumption that keeping immigrants out will magically protect American jobs. 
The target changes, but the logic does not: protect the insiders, keep the outsiders desperate.\n\nIt’s not “No Asians allowed” anymore; it’s “You can only enter if you can pay the toll.” The words have shifted — “merit,” “abuse,” “national security” — but the message hasn’t. Some people belong. Others don’t. \n\nGatekeeping at the border bleeds into gatekeeping the Constitution itself. Once you’re in though, it may not be what you expect it to. \n\nInside the House\n\nOnce you’re in, the house makes the rules. Fraternities are notorious for setting the tone of how members act, who they associate with and even how they speak. The U.S. works the same way: once you’re past the border, culture and politics dictate what’s acceptable and what’s excused.\n\nTrump’s infamous Access Hollywood tape was brushed off as “locker room talk.” Anyone who’s been in a frat knows exactly what that means: a culture where misogyny isn’t condemned; it’s normalized as tradition. And Trump isn’t an outlier. Defense Secretary Pete Hegseth recently scolded generals for being “fat” in the halls of the Pentagon. He said, “It's completely unacceptable to see fat generals and admirals in the halls of the Pentagon and leading commands around the country and the world.”\n\nWhat starts as frat culture seeps into governance. The overlap isn’t just metaphorical anymore. It’s literal.\n\nThe Second Test: Will You Submit?\n\nEvery frat has rules, but everyone knows they bend when it suits those in charge. National bylaws formally ban hazing and restrict alcohol, yet incidents from Penn State and Bowling Green prove those rules are selectively enforced. They are college kids, not U.S. government officials. So, no. No leeway here. \n\nThe Constitution is supposed to be America’s house bylaws, but we’re watching them get bent the same way fraternity brothers bend their codes of conduct. 
\n\nThe Fifth Amendment says: “No person shall … be deprived of life, liberty, or property, without due process of law.” No person. Not no citizen. Yet under Trump’s use of the Alien Enemies Act, Venezuelan migrants have been deported straight into El Salvador’s prisons — without hearings, charges or convictions. That’s not due process. That’s rule by decree.\n\nThe First Amendment is also under attack. Student activists like Rümeysa Öztürk have been detained by ICE for nothing more than co-writing an article. Visas revoked, scholars deported — not because of crime, but because of speech. This amendment also doesn’t specify “citizens,” it specifies “the people.”\n\nThe Fourteenth Amendment’s Equal Protection Clause is hanging by a thread. Trump’s executive order to end birthright citizenship — blocked in court but revealing in intent — openly defies the guarantee that “All persons born or naturalized in the United States… are citizens.” If the president of the United States feels free to ignore the Constitution’s plain text, what comes next?\n\nThe Third Test: Will You Stay Silent? \n\nI don’t deny a country, or a club, has the right to control its borders or rules. Sovereignty matters. Rules are necessary. But we have slipped into what I call selective constitutionalism: rules enforced for some (Second Amendment) and not others (Fifth Amendment). \n\nViktor Orbán’s Hungary is the case study: elections still happen, but rules are bent until liberal democracy is hollowed out and only the shell remains. \n\nA common rebuttal is: “Trump just says things, he doesn’t mean them.” That’s a lie we tell ourselves to stay comfortable. When I say something outrageous, it’s just words. When a president says the same thing, it becomes marching orders. Presidential words are not idle. They are policy.\n\nEven when I disagree with a law — like the Second Amendment — I still believe in respecting due process. 
If Biden abolished gun rights tomorrow by executive order, I would oppose it. Because the process matters. Without it, the rule of law is gone.\n\nI use this metaphor because it shouldn’t be foreign to any of us: if you’ve witnessed the selectivism that exists in Greek life at Duke, or seen rules bent to protect a brother, you’ve seen the same logic, when scaled up, applied to our nation's Constitution. What’s tolerable as a flawed college club becomes catastrophic when it defines a nation.\n\nWhen rules are bent long enough, they don’t snap. They disappear. In a frat, that leaves chaos. In a country, it leaves tyranny.\n\nNoor Nazir is a Trinity junior. Her columns typically run on alternate Tuesdays. "
Parse the elements you want - 3
article_page |>
  html_elements(".article-content") |>
  html_text2() |>
  str_remove_all("\n")
[1] "Fraternities thrive on exclusivity. At Duke, that posture may be tolerable, even expected. But once you’ve seen how Greek life operates (gatekeeping at the door, bending rules for insiders, punishing outsiders) it’s hard not to notice the same logic playing out on a national scale. The United States, too, runs like a frat: immigration represents the rush process, the Constitution and the insiders who decide when those rules apply and when they don’t.The First Test: Do You Belong?Immigration is America’s bid night. Who gets past the door and who gets turned away, reveals who the frat thinks deserves to wear its letters. In the late 1800s, the Chinese Exclusion Act was the ultimate “you don’t belong here.” It was the equivalent of a fraternity cutting pledges at the door, not because of merit, but because of appearance.Historian Andrew Gyory argues in “Closing the Gate: Race, Politics, and the Chinese Exclusion Act,” that the 1882 Chinese Exclusion law wasn’t just about jobs or economics — it was about racial scapegoating weaponized in a crisis. In the recession of the 1870s, white workers blamed Chinese laborers for stealing jobs and depressing wages. Politicians ran with it, branding Chinese immigrants as “coolies” (a slur) who threatened white labor. This act suspended immigration, denied naturalization and subjected exempt merchants and students to humiliating interrogation. The same script is running today. Just as Chinese laborers were scapegoated during the 1870s downturn, today’s immigrants are painted as threats to American workers. Trump says migrants “steal jobs” and “drive down wages,” while ignoring the actual structural causes of inequality. Wealth-based hurdles for H-1B visas, restrictions on asylum work permits: all built on the same assumption that keeping immigrants out will magically protect American jobs. 
The target changes, but the logic does not: protect the insiders, keep the outsiders desperate.It’s not “No Asians allowed” anymore; it’s “You can only enter if you can pay the toll.” The words have shifted — “merit,” “abuse,” “national security” — but the message hasn’t. Some people belong. Others don’t. Gatekeeping at the border bleeds into gatekeeping the Constitution itself. Once you’re in though, it may not be what you expect it to. Inside the HouseOnce you’re in, the house makes the rules. Fraternities are notorious for setting the tone of how members act, who they associate with and even how they speak. The U.S. works the same way: once you’re past the border, culture and politics dictate what’s acceptable and what’s excused.Trump’s infamous Access Hollywood tape was brushed off as “locker room talk.” Anyone who’s been in a frat knows exactly what that means: a culture where misogyny isn’t condemned; it’s normalized as tradition. And Trump isn’t an outlier. Defense Secretary Pete Hegseth recently scolded generals for being “fat” in the halls of the Pentagon. He said, “It's completely unacceptable to see fat generals and admirals in the halls of the Pentagon and leading commands around the country and the world.”What starts as frat culture seeps into governance. The overlap isn’t just metaphorical anymore. It’s literal.The Second Test: Will You Submit?Every frat has rules, but everyone knows they bend when it suits those in charge. National bylaws formally ban hazing and restrict alcohol, yet incidents from Penn State and Bowling Green prove those rules are selectively enforced. They are college kids, not U.S. government officials. So, no. No leeway here. The Constitution is supposed to be America’s house bylaws, but we’re watching them get bent the same way fraternity brothers bend their codes of conduct. The Fifth Amendment says: “No person shall … be deprived of life, liberty, or property, without due process of law.” No person. Not no citizen. 
Yet under Trump’s use of the Alien Enemies Act, Venezuelan migrants have been deported straight into El Salvador’s prisons — without hearings, charges or convictions. That’s not due process. That’s rule by decree.The First Amendment is also under attack. Student activists like Rümeysa Öztürk have been detained by ICE for nothing more than co-writing an article. Visas revoked, scholars deported — not because of crime, but because of speech. This amendment also doesn’t specify “citizens,” it specifies “the people.”The Fourteenth Amendment’s Equal Protection Clause is hanging by a thread. Trump’s executive order to end birthright citizenship — blocked in court but revealing in intent — openly defies the guarantee that “All persons born or naturalized in the United States… are citizens.” If the president of the United States feels free to ignore the Constitution’s plain text, what comes next?The Third Test: Will You Stay Silent? I don’t deny a country, or a club, has the right to control its borders or rules. Sovereignty matters. Rules are necessary. But we have slipped into what I call selective constitutionalism: rules enforced for some (Second Amendment) and not others (Fifth Amendment). Viktor Orbán’s Hungary is the case study: elections still happen, but rules are bent until liberal democracy is hollowed out and only the shell remains. A common rebuttal is: “Trump just says things, he doesn’t mean them.” That’s a lie we tell ourselves to stay comfortable. When I say something outrageous, it’s just words. When a president says the same thing, it becomes marching orders. Presidential words are not idle. They are policy.Even when I disagree with a law — like the Second Amendment — I still believe in respecting due process. If Biden abolished gun rights tomorrow by executive order, I would oppose it. Because the process matters. 
Without it, the rule of law is gone.I use this metaphor because it shouldn’t be foreign to any of us: if you’ve witnessed the selectivism that exists in Greek life at Duke, or seen rules bent to protect a brother, you’ve seen the same logic, when scaled up, applied to our nation's Constitution. What’s tolerable as a flawed college club becomes catastrophic when it defines a nation.When rules are bent long enough, they don’t snap. They disappear. In a frat, that leaves chaos. In a country, it leaves tyranny.Noor Nazir is a Trinity junior. Her columns typically run on alternate Tuesdays. "
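One side effect is visible in the output above: removing newlines outright joins the last word of one paragraph to the first word of the next (for example "don’t.The First Test"). The sketch below shows the effect on a short made-up string and compares it with replacing each run of newlines by a space. The replacement approach is an assumed alternative, not what the slides use.

```r
library(stringr)

# Hypothetical article text with a paragraph break, similar in shape to
# what html_text2() returns
text <- "First paragraph ends here.\n\nSecond paragraph starts here."

# Removing newlines fuses "here." and "Second" into a single token
removed <- str_remove_all(text, "\n")

# Assumed alternative: replace each run of newlines with one space
replaced <- str_replace_all(text, "\n+", " ")

# Word counts using the "number of spaces + 1" rule
n_removed <- str_count(removed, " ") + 1
n_replaced <- str_count(replaced, " ") + 1
```

Here the two approaches differ by one word per paragraph break, so switching approaches would shift any word counts computed from the text.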
Wrap in a function
Functions in R
function_name <- function(argument_1, argument_2, ...) {
  # what the function does
}
For example:
multiply_by_3 <- function(x) {
  3 * x
}
multiply_by_3(2)
[1] 6
multiply_by_3(10)
[1] 30
Revisit: parse_article_page()
parse_article_page <- function(url) {        # define a function with one argument, url
  article_page <- read_html(url)             # read in the page at the url
  article_page |>                            # start with the page
    html_elements(".article-content") |>     # extract element w/ selector .article-content
    html_text2() |>                          # extract text from element (and clean it up)
    str_remove_all("\n")                     # remove all newline characters, return result
}
Test the function
chronicle |>
  slice_head(n = 1) |>
  mutate(article = parse_article_page(url)) |>
  select(title, article)
# A tibble: 1 × 2
title article
<chr> <chr>
1 The United States is a frat. And I can’t unsee it. "Fraternities thrive on exclusivity. At Duke, that posture may be tolerable, even expected. But once you’ve seen how Greek life operates (gatekeep…
Test the function
# A tibble: 3 × 2
# Rowwise:
title article
<chr> <chr>
1 The United States is a frat. And I can’t unsee it. "Fraternities thrive on exclusivity. At Duke, that posture may be tolerable, even expected. But once you’ve seen how Greek life operates (gatekeep…
2 The problem with censorship and discourse at Duke "In the wake of the tragic assassination of right-wing influencer Charlie Kirk, the Trump administration has attempted to crack down on what they …
3 The 'Duke Difference' we actually need "A week ago, hundreds of Duke students filled Page Auditorium to hear Former U.S. Secretary of Transportation Pete Buttigieg speak. He outlined a …
Test the function
# A tibble: 3 × 2
title article
<chr> <chr>
1 The United States is a frat. And I can’t unsee it. "Fraternities thrive on exclusivity. At Duke, that posture may be tolerable, even expected. But once you’ve seen how Greek life operates (gatekeep…
2 The problem with censorship and discourse at Duke "In the wake of the tragic assassination of right-wing influencer Charlie Kirk, the Trump administration has attempted to crack down on what they …
3 The 'Duke Difference' we actually need "A week ago, hundreds of Duke students filled Page Auditorium to hear Former U.S. Secretary of Transportation Pete Buttigieg speak. He outlined a …
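The code behind the two three-row outputs above is not shown. The "# Rowwise:" line in the first output suggests rowwise() was used so that the function runs once per row, since read_html() expects a single URL. A minimal sketch of that pattern, using a hypothetical stand-in function and toy data rather than real scraping:

```r
library(dplyr)

# Hypothetical stand-in for parse_article_page(): it only accepts one
# value at a time
describe_one <- function(url) {
  stopifnot(length(url) == 1)
  paste("text scraped from", url)
}

toy <- tibble(
  title = c("A", "B", "C"),
  url = c("page-a.html", "page-b.html", "page-c.html")
)

# rowwise() makes mutate() call the function once for each row
toy_rowwise <- toy |>
  rowwise() |>
  mutate(article = describe_one(url))

# ungroup() drops the rowwise grouping once the per-row work is done
toy_articles <- toy_rowwise |>
  ungroup()
```

Without rowwise(), mutate() would pass all three URLs to the function at once and the length check would fail.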
All articles
This can take a bit to run!
# A tibble: 500 × 8
title author date_time month day column url article
<chr> <chr> <dttm> <chr> <dbl> <chr> <chr> <chr>
1 The Un… Noor … 2025-10-07 10:00:00 Oct 7 Opini… http… "Frate…
2 The pr… Harri… 2025-10-07 10:00:00 Oct 7 Campu… http… "In th…
3 The 'D… Gabri… 2025-10-06 14:30:00 Oct 6 Campu… http… "A wee…
4 Death … Luke … 2025-10-06 10:00:00 Oct 6 Campu… http… "Some …
5 Hazing… Monda… 2025-10-06 04:00:00 Oct 6 Campu… http… "Edito…
6 Duke’s… Lucas… 2025-10-04 10:00:00 Oct 4 Campu… http… "Duke …
7 The wo… Leo G… 2025-10-03 10:00:00 Oct 3 Campu… http… "Recen…
8 We’ve … Kayle… 2025-10-02 14:00:00 Oct 2 Opini… http… "As a …
9 How Du… Neel … 2025-10-01 10:00:00 Oct 1 Campu… http… "Comin…
10 Why ar… Ryan … 2025-10-01 10:00:00 Oct 1 Campu… http… "The e…
# ℹ 490 more rows
Revisit: Summary table
Now that you have the data, how would you produce the summary table below?
Column | Avg. # words/article | # articles |
---|---|---|
Campus Voices | 942 | 382 |
Letters To The Editor | 307 | 19 |
Opinion | 1020 | 99 |
Summarize 1
# A tibble: 500 × 9
n_words title author date_time month day column url
<dbl> <chr> <chr> <dttm> <chr> <dbl> <chr> <chr>
1 1006 The Un… Noor … 2025-10-07 10:00:00 Oct 7 Opini… http…
2 1189 The pr… Harri… 2025-10-07 10:00:00 Oct 7 Campu… http…
3 778 The 'D… Gabri… 2025-10-06 14:30:00 Oct 6 Campu… http…
4 614 Death … Luke … 2025-10-06 10:00:00 Oct 6 Campu… http…
5 534 Hazing… Monda… 2025-10-06 04:00:00 Oct 6 Campu… http…
6 839 Duke’s… Lucas… 2025-10-04 10:00:00 Oct 4 Campu… http…
7 1212 The wo… Leo G… 2025-10-03 10:00:00 Oct 3 Campu… http…
8 1245 We’ve … Kayle… 2025-10-02 14:00:00 Oct 2 Opini… http…
9 937 How Du… Neel … 2025-10-01 10:00:00 Oct 1 Campu… http…
10 1041 Why ar… Ryan … 2025-10-01 10:00:00 Oct 1 Campu… http…
# ℹ 490 more rows
# ℹ 1 more variable: article <chr>
Summarize 2
# A tibble: 500 × 9
# Groups: column [3]
title author date_time month day column url article
<chr> <chr> <dttm> <chr> <dbl> <chr> <chr> <chr>
1 The Un… Noor … 2025-10-07 10:00:00 Oct 7 Opini… http… "Frate…
2 The pr… Harri… 2025-10-07 10:00:00 Oct 7 Campu… http… "In th…
3 The 'D… Gabri… 2025-10-06 14:30:00 Oct 6 Campu… http… "A wee…
4 Death … Luke … 2025-10-06 10:00:00 Oct 6 Campu… http… "Some …
5 Hazing… Monda… 2025-10-06 04:00:00 Oct 6 Campu… http… "Edito…
6 Duke’s… Lucas… 2025-10-04 10:00:00 Oct 4 Campu… http… "Duke …
7 The wo… Leo G… 2025-10-03 10:00:00 Oct 3 Campu… http… "Recen…
8 We’ve … Kayle… 2025-10-02 14:00:00 Oct 2 Opini… http… "As a …
9 How Du… Neel … 2025-10-01 10:00:00 Oct 1 Campu… http… "Comin…
10 Why ar… Ryan … 2025-10-01 10:00:00 Oct 1 Campu… http… "The e…
# ℹ 490 more rows
# ℹ 1 more variable: n_words <dbl>
Summarize 3
Summarize 4
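The code for the individual Summarize steps is not shown. As a rough sketch of how such a step-by-step build-up might look, here is the same kind of pipeline on a toy tibble; the data, the object names, and the way the steps are divided are assumptions.

```r
library(dplyr)
library(stringr)

# Hypothetical miniature version of the scraped data
toy_articles <- tibble(
  column = c("Opinion", "Opinion", "Campus Voices"),
  article = c("one two three", "one two three four five", "one two")
)

# Count words as (number of spaces + 1)
with_counts <- toy_articles |>
  mutate(n_words = str_count(article, " ") + 1)

# Group by column so summaries are computed per column
grouped <- with_counts |>
  group_by(column)

# Average word count and number of articles in each column
toy_summary <- grouped |>
  summarize(
    avg_n_words = mean(n_words),
    n_articles = n()
  )
```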
Make a pretty table
with the kable() function from the knitr package:
Update column names
chronicle_article |>
  mutate(n_words = str_count(article, " ") + 1) |>
  group_by(column) |>
  summarize(
    avg_n_words = mean(n_words),
    n_articles = n()
  ) |>
  kable(
    col.names = c("Column", "Avg. # words/article", "# articles")
  )
Column | Avg. # words/article | # articles |
---|---|---|
Campus Voices | 941.8848 | 382 |
Letters To The Editor | 306.6316 | 19 |
Opinion | 1020.2121 | 99 |
Update digits
chronicle_article |>
  mutate(n_words = str_count(article, " ") + 1) |>
  group_by(column) |>
  summarize(
    avg_n_words = mean(n_words),
    n_articles = n()
  ) |>
  kable(
    col.names = c("Column", "Avg. # words/article", "# articles"),
    digits = 0
  )
Column | Avg. # words/article | # articles |
---|---|---|
Campus Voices | 942 | 382 |
Letters To The Editor | 307 | 19 |
Opinion | 1020 | 99 |