Exploratory data analysis I

Lecture 5

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2025

September 9, 2025

Warm-up

While you wait: Participate 📱💻

Suppose you have a dataset df with 100 rows and 5 columns: x1, x2, x3, x4, and x5. x1 is a categorical variable with levels a and b. You run the following code:

df |>
  filter(x1 == "a") |>
  select(x1, x2, x5)

The resulting data frame will have:

3 columns, 50 rows
3 columns, 100 rows
3 columns, can’t tell how many rows
5 columns, 100 rows
5 columns, can’t tell how many rows

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.
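
One way to reason about the warm-up is to build a small stand-in for df and run the pipeline on it. The sketch below is only an illustration: the values in df are made up, and the split between a and b is arbitrary.

```r
library(tidyverse)

# Hypothetical data: 100 rows, 5 columns. The a/b split is random,
# so the number of "a" rows is not known in advance.
set.seed(199)
df <- tibble(
  x1 = sample(c("a", "b"), size = 100, replace = TRUE),
  x2 = rnorm(100),
  x3 = rnorm(100),
  x4 = rnorm(100),
  x5 = rnorm(100)
)

result <- df |>
  filter(x1 == "a") |>  # keeps only the rows where x1 is "a"
  select(x1, x2, x5)    # keeps only these three columns

ncol(result)  # number of columns named in select()
nrow(result)  # number of rows in df where x1 is "a"
```

The column count is fixed by select(), while the row count depends on how many observations have x1 equal to "a".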

Announcements

Labs:

Submit PDF on Gradescope by the end of lab session
Make regular commits and push .qmd and PDF to GitHub
Graded primarily for attendance, participation, collaboration, and effort
Feedback provided for correctness

Announcements

Homework: HW 1 due Sunday 11:59pm

Part 1: Feedback from AI
- No Gradescope submission necessary
- Make regular commits and push .qmd and PDF to GitHub
- Immediate feedback from AI, no grading
- Important: Don’t forget to set homework and question number in the app when requesting feedback
Part 2: Feedback from humans
- Submit PDF on Gradescope by the deadline
- Make regular commits and push .qmd and PDF to GitHub
- Graded for correctness, feedback provided within about a week
- Important: Don’t forget to select pages corresponding to each question on Gradescope

Exploratory data analysis

Packages

For the data: usdata

library(usdata)

For the analysis: tidyverse, scales, and ggthemes

library(tidyverse)
library(scales)
library(ggthemes)

Data: `gerrymander`

gerrymander

# A tibble: 435 × 12
   district last_name first_name party16 clinton16 trump16 dem16 state
   <chr>    <chr>     <chr>      <chr>       <dbl>   <dbl> <dbl> <chr>
 1 AK-AL    Young     Don        R            37.6    52.8     0 AK   
 2 AL-01    Byrne     Bradley    R            34.1    63.5     0 AL   
 3 AL-02    Roby      Martha     R            33      64.9     0 AL   
 4 AL-03    Rogers    Mike D.    R            32.3    65.3     0 AL   
 5 AL-04    Aderholt  Rob        R            17.4    80.4     0 AL   
 6 AL-05    Brooks    Mo         R            31.3    64.7     0 AL   
 7 AL-06    Palmer    Gary       R            26.1    70.8     0 AL   
 8 AL-07    Sewell    Terri      D            69.8    28.6     1 AL   
 9 AR-01    Crawford  Rick       R            30.2    65       0 AR   
10 AR-02    Hill      French     R            41.7    52.4     0 AR   
# ℹ 425 more rows
# ℹ 4 more variables: party18 <chr>, dem18 <dbl>, flip18 <dbl>,
#   gerry <fct>

What is gerrymandering?

https://www.washingtonpost.com/business/wonkblog/gerrymandering-explained/2016/04/21/e447f5c2-07fe-11e6-bfed-ef65dff5970d_video.html

Participate 📱💻

You are given a new dataset to analyze. What are some of the first things you would do to get to know the data?

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.

Data: `gerrymander`

glimpse(gerrymander)

Rows: 435
Columns: 12
$ district   <chr> "AK-AL", "AL-01", "AL-02", "AL-03", "AL-04", "AL-…
$ last_name  <chr> "Young", "Byrne", "Roby", "Rogers", "Aderholt", "…
$ first_name <chr> "Don", "Bradley", "Martha", "Mike D.", "Rob", "Mo…
$ party16    <chr> "R", "R", "R", "R", "R", "R", "R", "D", "R", "R",…
$ clinton16  <dbl> 37.6, 34.1, 33.0, 32.3, 17.4, 31.3, 26.1, 69.8, 3…
$ trump16    <dbl> 52.8, 63.5, 64.9, 65.3, 80.4, 64.7, 70.8, 28.6, 6…
$ dem16      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0…
$ state      <chr> "AK", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ party18    <chr> "R", "R", "R", "R", "R", "R", "R", "D", "R", "R",…
$ dem18      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0…
$ flip18     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ gerry      <fct> mid, high, high, high, high, high, high, high, mi…

Data: `gerrymander`

Rows: Congressional districts
Columns:
- Congressional district and state
- 2016 election: winning party, % for Clinton, % for Trump, whether a Democrat won the House election, name of election winner
- 2018 election: winning party, whether a Democrat won the 2018 House election
- Whether a Democrat flipped the seat in the 2018 election
- Prevalence of gerrymandering: low, mid, and high
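
Beyond glimpse(), a few other quick checks can help when getting to know a new dataset. The following is one possible first pass, not a prescribed checklist; the choice of which variables to count is an assumption made for illustration.

```r
library(tidyverse)
library(usdata)

# Dimensions and column names
dim(gerrymander)
names(gerrymander)

# How many districts fall in each gerrymandering category?
gerrymander |>
  count(gerry)

# How do the 2016 and 2018 winning parties line up?
gerrymander |>
  count(party16, party18)
```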

Variable types: `district`

Variable	Type
`district`	categorical, ID
`last_name`
`first_name`
`party16`
`clinton16`
`trump16`
`dem16`
`state`
`party18`
`dem18`
`flip18`
`gerry`

gerrymander |>
  select(district)

# A tibble: 435 × 1
   district
   <chr>   
 1 AK-AL   
 2 AL-01   
 3 AL-02   
 4 AL-03   
 5 AL-04   
 6 AL-05   
 7 AL-06   
 8 AL-07   
 9 AR-01   
10 AR-02   
# ℹ 425 more rows

Variable types: `last_name`

Variable	Type
`district`	categorical, ID
`last_name`	categorical, ID
`first_name`
`party16`
`clinton16`
`trump16`
`dem16`
`state`
`party18`
`dem18`
`flip18`
`gerry`

gerrymander |>
  select(last_name)

# A tibble: 435 × 1
   last_name
   <chr>    
 1 Young    
 2 Byrne    
 3 Roby     
 4 Rogers   
 5 Aderholt 
 6 Brooks   
 7 Palmer   
 8 Sewell   
 9 Crawford 
10 Hill     
# ℹ 425 more rows

Variable types: `first_name`

Variable	Type
`district`	categorical, ID
`last_name`	categorical, ID
`first_name`	categorical, ID
`party16`
`clinton16`
`trump16`
`dem16`
`state`
`party18`
`dem18`
`flip18`
`gerry`

gerrymander |>
  select(first_name)

# A tibble: 435 × 1
   first_name
   <chr>     
 1 Don       
 2 Bradley   
 3 Martha    
 4 Mike D.   
 5 Rob       
 6 Mo        
 7 Gary      
 8 Terri     
 9 Rick      
10 French    
# ℹ 425 more rows

Variable types: `party16`

Variable	Type
`district`	categorical, ID
`last_name`	categorical, ID
`first_name`	categorical, ID
`party16`	categorical
`clinton16`
`trump16`
`dem16`
`state`
`party18`
`dem18`
`flip18`
`gerry`

gerrymander |>
  select(party16)

# A tibble: 435 × 1
   party16
   <chr>  
 1 R      
 2 R      
 3 R      
 4 R      
 5 R      
 6 R      
 7 R      
 8 D      
 9 R      
10 R      
# ℹ 425 more rows

Variable types: `clinton16`

Variable	Type
`district`	categorical, ID
`last_name`	categorical, ID
`first_name`	categorical, ID
`party16`	categorical
`clinton16`	numerical, continuous
`trump16`
`dem16`
`state`
`party18`
`dem18`
`flip18`
`gerry`

gerrymander |>
  select(clinton16)

# A tibble: 435 × 1
   clinton16
       <dbl>
 1      37.6
 2      34.1
 3      33  
 4      32.3
 5      17.4
 6      31.3
 7      26.1
 8      69.8
 9      30.2
10      41.7
# ℹ 425 more rows

Variable types: `trump16`

Variable	Type
`district`	categorical, ID
`last_name`	categorical, ID
`first_name`	categorical, ID
`party16`	categorical
`clinton16`	numerical, continuous
`trump16`	numerical, continuous
`dem16`
`state`
`party18`
`dem18`
`flip18`
`gerry`

gerrymander |>
  select(trump16)

# A tibble: 435 × 1
   trump16
     <dbl>
 1    52.8
 2    63.5
 3    64.9
 4    65.3
 5    80.4
 6    64.7
 7    70.8
 8    28.6
 9    65  
10    52.4
# ℹ 425 more rows

Variable types: `dem16`

Variable	Type
`district`	categorical, ID
`last_name`	categorical, ID
`first_name`	categorical, ID
`party16`	categorical
`clinton16`	numerical, continuous
`trump16`	numerical, continuous
`dem16`	categorical
`state`
`party18`
`dem18`
`flip18`
`gerry`

gerrymander |>
  select(dem16)

# A tibble: 435 × 1
   dem16
   <dbl>
 1     0
 2     0
 3     0
 4     0
 5     0
 6     0
 7     0
 8     1
 9     0
10     0
# ℹ 425 more rows

Variable types: `state`

Variable	Type
`district`	categorical, ID
`last_name`	categorical, ID
`first_name`	categorical, ID
`party16`	categorical
`clinton16`	numerical, continuous
`trump16`	numerical, continuous
`dem16`	categorical
`state`	categorical
`party18`
`dem18`
`flip18`
`gerry`

gerrymander |>
  select(state)

# A tibble: 435 × 1
   state
   <chr>
 1 AK   
 2 AL   
 3 AL   
 4 AL   
 5 AL   
 6 AL   
 7 AL   
 8 AL   
 9 AR   
10 AR   
# ℹ 425 more rows

Variable types: `party18`

Variable	Type
`district`	categorical, ID
`last_name`	categorical, ID
`first_name`	categorical, ID
`party16`	categorical
`clinton16`	numerical, continuous
`trump16`	numerical, continuous
`dem16`	categorical
`state`	categorical
`party18`	categorical
`dem18`
`flip18`
`gerry`

gerrymander |>
  select(party18)

# A tibble: 435 × 1
   party18
   <chr>  
 1 R      
 2 R      
 3 R      
 4 R      
 5 R      
 6 R      
 7 R      
 8 D      
 9 R      
10 R      
# ℹ 425 more rows

Variable types: `dem18`

Variable	Type
`district`	categorical, ID
`last_name`	categorical, ID
`first_name`	categorical, ID
`party16`	categorical
`clinton16`	numerical, continuous
`trump16`	numerical, continuous
`dem16`	categorical
`state`	categorical
`party18`	categorical
`dem18`	categorical
`flip18`
`gerry`

gerrymander |>
  select(dem18)

# A tibble: 435 × 1
   dem18
   <dbl>
 1     0
 2     0
 3     0
 4     0
 5     0
 6     0
 7     0
 8     1
 9     0
10     0
# ℹ 425 more rows

Variable types: `flip18`

Variable	Type
`district`	categorical, ID
`last_name`	categorical, ID
`first_name`	categorical, ID
`party16`	categorical
`clinton16`	numerical, continuous
`trump16`	numerical, continuous
`dem16`	categorical
`state`	categorical
`party18`	categorical
`dem18`	categorical
`flip18`	categorical
`gerry`

gerrymander |>
  select(flip18)

# A tibble: 435 × 1
   flip18
    <dbl>
 1      0
 2      0
 3      0
 4      0
 5      0
 6      0
 7      0
 8      0
 9      0
10      0
# ℹ 425 more rows

Variable types: `gerry`

Variable	Type
`district`	categorical, ID
`last_name`	categorical, ID
`first_name`	categorical, ID
`party16`	categorical
`clinton16`	numerical, continuous
`trump16`	numerical, continuous
`dem16`	categorical
`state`	categorical
`party18`	categorical
`dem18`	categorical
`flip18`	categorical
`gerry`	categorical, ordinal

gerrymander |>
  select(gerry)

# A tibble: 435 × 1
   gerry
   <fct>
 1 mid  
 2 high 
 3 high 
 4 high 
 5 high 
 6 high 
 7 high 
 8 high 
 9 mid  
10 mid  
# ℹ 425 more rows
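
Note that the type R uses to store a variable need not match its statistical type: dem16, dem18, and flip18 are stored as doubles but are treated as categorical above. A minimal sketch of making that explicit follows, assuming it is acceptable to recode them as factors; the name gerrymander_fct is hypothetical.

```r
library(tidyverse)
library(usdata)

# Convert the indicator variables from <dbl> to <fct> so that
# plots and summaries treat them as categories.
gerrymander_fct <- gerrymander |>
  mutate(across(c(dem16, dem18, flip18), as.factor))

gerrymander_fct |>
  select(dem16, dem18, flip18) |>
  glimpse()

# gerry is already a factor; its level order is what makes it ordinal in practice.
levels(gerrymander$gerry)
```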

Univariate analysis

Analyzing a single variable:

Numerical: histogram, box plot, density plot, etc.
Categorical: bar plot, pie chart, etc.
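
The steps that follow focus on a numerical variable. For a categorical variable, a bar plot could be sketched as below; using gerry as the example variable and the axis labels shown are illustrative choices.

```r
library(tidyverse)
library(usdata)

# geom_bar() counts the number of rows at each level of gerry
p_bar <- ggplot(gerrymander, aes(x = gerry)) +
  geom_bar() +
  labs(
    x = "Prevalence of gerrymandering",
    y = "Count"
  )

p_bar
```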

Histogram - Step 1

ggplot(gerrymander)

Histogram - Step 2

ggplot(gerrymander, aes(x = trump16))

Histogram - Step 3

ggplot(gerrymander, aes(x = trump16)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Participate 📱💻

Which of the following histograms has the most appropriate binwidth for visualizing the distribution of trump16?

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.
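
To compare candidate binwidths yourself, one option is to build the same histogram several times with different values. The binwidths below (1, 5, and 20 percentage points) are assumed for illustration and are not necessarily the ones shown in the poll.

```r
library(tidyverse)
library(usdata)

# Candidate binwidths, in percentage points (illustrative values)
binwidths <- c(1, 5, 20)

# One histogram per binwidth, stored in a list
plots <- map(
  binwidths,
  \(bw) ggplot(gerrymander, aes(x = trump16)) +
    geom_histogram(binwidth = bw) +
    labs(title = paste("binwidth =", bw))
)

plots[[1]]  # very narrow bins: noisy
plots[[2]]
plots[[3]]  # very wide bins: detail is lost
```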

Histogram - Step 4

ggplot(gerrymander, aes(x = trump16)) +
  geom_histogram(binwidth = 5) +
  scale_x_continuous(labels = label_percent(scale = 1))

Histogram - Step 5

ggplot(gerrymander, aes(x = trump16)) +
  geom_histogram(binwidth = 5) +
  scale_x_continuous(labels = label_percent(scale = 1)) +
  labs(
    title = "Percent of vote received by Trump in 2016 Presidential Election",
    subtitle = "From each Congressional District",
    x = "Percent of vote",
    y = "Count"
  )

Box plot - Step 1

ggplot(gerrymander)

Box plot - Step 2

ggplot(gerrymander, aes(x = trump16))

Box plot - Step 3

ggplot(gerrymander, aes(x = trump16)) +
  geom_boxplot()

Box plot - Alternative Step 2 + 3

ggplot(gerrymander, aes(y = trump16)) +
  geom_boxplot()

Box plot - Step 4

ggplot(gerrymander, aes(x = trump16)) +
  geom_boxplot() +
  labs(
    title = "Percent of vote received by Trump in 2016 Presidential Election",
    subtitle = "From each Congressional District",
    x = "Percent of vote",
    y = NULL
  )

Density plot - Step 1

ggplot(gerrymander)

Density plot - Step 2

ggplot(gerrymander, aes(x = trump16))

Density plot - Step 3

ggplot(gerrymander, aes(x = trump16)) +
  geom_density()

Density plot - Step 4

ggplot(gerrymander, aes(x = trump16)) +
  geom_density(color = "firebrick")

Density plot - Step 5

ggplot(gerrymander, aes(x = trump16)) +
  geom_density(color = "firebrick", fill = "firebrick1")

Density plot - Step 6

ggplot(gerrymander, aes(x = trump16)) +
  geom_density(color = "firebrick", fill = "firebrick1", alpha = 0.5)

Density plot - Step 7

ggplot(gerrymander, aes(x = trump16)) +
  geom_density(color = "firebrick", fill = "firebrick1", alpha = 0.5, linewidth = 1)

Density plot - Step 8

ggplot(gerrymander, aes(x = trump16)) +
  geom_density(color = "firebrick", fill = "firebrick1", alpha = 0.5, linewidth = 2) +
  labs(
    title = "Percent of vote received by Trump in 2016 Presidential Election",
    subtitle = "From each Congressional District",
    x = "Percent of vote",
    y = "Density"
  )

Summary statistics

gerrymander |>
  summarize(
    mean = mean(trump16),
    median = median(trump16)
  )

# A tibble: 1 × 2
   mean median
  <dbl>  <dbl>
1  45.9   48.7

Distribution of votes for Trump in the 2016 election

Describe the distribution of the percent of vote received by Trump in the 2016 Presidential Election across Congressional Districts.

Shape: The distribution of votes for Trump in the 2016 election from Congressional Districts is unimodal and left-skewed.
Center: The percent of vote received by Trump in the 2016 Presidential Election from a typical Congressional District is 48.7%.
Spread: In the middle 50% of Congressional Districts, 34.8% to 58.1% of voters voted for Trump in the 2016 Presidential Election.
Unusual observations: -

Bivariate analysis

Analyzing the relationship between two variables:

Numerical + numerical: scatterplot
Numerical + categorical: side-by-side box plots, violin plots, etc.
Categorical + categorical: stacked bar plots
Using an aesthetic (e.g., fill, color, shape, etc.) or facets to represent the second variable in any plot
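
The slides that follow cover the numerical + categorical case. As a rough sketch of the other two pairings, the code below draws a scatterplot and a stacked bar plot; the variable pairs chosen here are assumptions made for illustration.

```r
library(tidyverse)
library(usdata)

# Numerical + numerical: scatterplot
p_scatter <- ggplot(gerrymander, aes(x = clinton16, y = trump16)) +
  geom_point()

# Categorical + categorical: stacked bar plot,
# with the second variable mapped to fill
p_stacked <- ggplot(gerrymander, aes(x = gerry, fill = party16)) +
  geom_bar()

p_scatter
p_stacked
```

By default geom_bar() stacks counts; setting position = "fill" inside geom_bar() shows proportions instead.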

Side-by-side box plots

ggplot(
  gerrymander, 
  aes(
    x = trump16, 
    y = gerry
    )
  ) +
  geom_boxplot()

Summary statistics

gerrymander |>
  summarize(
    min = min(trump16),
    q25 = quantile(trump16, 0.25),
    median = median(trump16),
    q75 = quantile(trump16, 0.75),
    max = max(trump16),
  )

# A tibble: 1 × 5
    min   q25 median   q75   max
  <dbl> <dbl>  <dbl> <dbl> <dbl>
1   4.9  34.8   48.7  58.1  80.4

Participate 📱💻

What goes in the [blank] in the code below to do the following step for each level of gerry?

gerrymander |>
  # [blank]
  summarize(
    min = min(trump16),
    q25 = quantile(trump16, 0.25),
    median = median(trump16),
    q75 = quantile(trump16, 0.75),
    max = max(trump16),
  )

filter(gerry)
group_by(gerry)
mutate(gerry)
select(gerry)

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.

Grouped summary statistics

gerrymander |>
  group_by(gerry) |>
  summarize(
    min = min(trump16),
    q25 = quantile(trump16, 0.25),
    median = median(trump16),
    q75 = quantile(trump16, 0.75),
    max = max(trump16),
  )

# A tibble: 3 × 6
  gerry   min   q25 median   q75   max
  <fct> <dbl> <dbl>  <dbl> <dbl> <dbl>
1 low     4.9  36.3   48.4  54.7  74.9
2 mid     6.8  34.8   48.0  57.9  79.9
3 high    9.2  33.5   50.5  60.8  80.4
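
An alternative to group_by() is the .by argument of summarize(), available in dplyr 1.1.0 and later. The sketch below assumes such a version is installed, and it computes only the median and the IQR to keep the example short.

```r
library(tidyverse)
library(usdata)

# Per-operation grouping: the result is not a grouped data frame
by_gerry <- gerrymander |>
  summarize(
    median = median(trump16),
    iqr = IQR(trump16),
    .by = gerry
  )

by_gerry
```

With .by, groups are listed in the order they first appear in the data rather than in factor-level order, so the rows may be arranged differently from the group_by() output.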

Density plots

ggplot(
  gerrymander, 
  aes(
    x = trump16, 
    color = gerry
    )
  ) +
  geom_density()

Filled density plots

ggplot(
  gerrymander, 
  aes(
    x = trump16, 
    color = gerry,
    fill = gerry
    )
  ) +
  geom_density()

Better filled density plots

ggplot(
  gerrymander, 
  aes(x = trump16, color = gerry, fill = gerry)
  ) +
  geom_density(alpha = 0.5)

Better colors

ggplot(
  gerrymander, 
  aes(x = trump16, color = gerry, fill = gerry)
  ) +
  geom_density(alpha = 0.5) +
  scale_color_colorblind() +
  scale_fill_colorblind()

Violin plots

ggplot(
  gerrymander, 
  aes(x = trump16, y = gerry, color = gerry)
  ) +
  geom_violin() +
  scale_color_colorblind() +
  scale_fill_colorblind()

Multiple geoms

ggplot(
  gerrymander, 
  aes(x = trump16, y = gerry, color = gerry)
  ) +
  geom_violin() +
  geom_point() +
  scale_color_colorblind() +
  scale_fill_colorblind()

Multiple geoms

ggplot(
  gerrymander, 
  aes(x = trump16, y = gerry, color = gerry)
  ) +
  geom_violin() +
  geom_jitter() +
  scale_color_colorblind() +
  scale_fill_colorblind()

Remove legend

ggplot(
  gerrymander, 
  aes(x = trump16, y = gerry, color = gerry)
  ) +
  geom_violin() +
  geom_jitter() +
  scale_color_colorblind() +
  scale_fill_colorblind() +
  theme(legend.position = "none")
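
If only one legend should be dropped while others are kept, guides() offers a more targeted alternative to theme(legend.position = "none"). This is a sketch of that option and not a replacement for the approach on the slide. The fill scale is omitted here because no variable is mapped to fill in this plot.

```r
library(tidyverse)
library(ggthemes)
library(usdata)

p_violin <- ggplot(
  gerrymander,
  aes(x = trump16, y = gerry, color = gerry)
  ) +
  geom_violin() +
  geom_jitter() +
  scale_color_colorblind() +
  guides(color = "none")  # drops only the color legend

p_violin
```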

Multivariate analysis

Analyzing the relationship between multiple variables:

In general, one variable is identified as the outcome of interest
The remaining variables are predictors or explanatory variables
Plots for exploring multivariate relationships are the same as those for bivariate relationships, but conditional on one or more variables
- Conditioning can be done via faceting or aesthetic mappings (e.g., scatterplot of y vs. x1, colored by x2, faceted by x3)
Summary statistics for exploring multivariate relationships are the same as those for bivariate relationships, but conditional on one or more variables
- Conditioning can be done via grouping (e.g., correlation between y and x1, grouped by levels of x2 and x3)
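
Both kinds of conditioning can be sketched with the gerrymander data. Treating trump16 as the outcome and clinton16, party16, and gerry as the other variables is an assumption made for illustration only.

```r
library(tidyverse)
library(usdata)

# Conditioning in a plot: scatterplot of trump16 vs. clinton16,
# colored by party16, faceted by gerry
p_multi <- ggplot(
  gerrymander,
  aes(x = clinton16, y = trump16, color = party16)
  ) +
  geom_point(alpha = 0.5) +
  facet_wrap(~gerry)

p_multi

# Conditioning in a summary: correlation between trump16 and clinton16,
# computed within each combination of gerry and party16
cors <- gerrymander |>
  group_by(gerry, party16) |>
  summarize(r = cor(clinton16, trump16), .groups = "drop")

cors
```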

Application exercise

ae-03-gerrymander-explore-I

Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise file: ae-03-gerrymander-explore-I.qmd.
Work through the application exercise in class, and render, commit, and push your edits by the end of class.
