Exploratory data analysis I

Lecture 5

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2025

September 9, 2025

Warm-up

While you wait: Participate 📱💻

Suppose you have a dataset df with 100 rows and 5 columns: x1, x2, x3, x4, and x5. x1 is a categorical variable with levels a and b. You run the following code:

df |>
  filter(x1 == "a") |>
  select(x1, x2, x5)

The resulting data frame will have:

  • 3 columns, 50 rows
  • 3 columns, 100 rows
  • 3 columns, can’t tell how many rows
  • 5 columns, 100 rows
  • 5 columns, can’t tell how many rows

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.
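A minimal sketch of why the row count cannot be read off the code alone. The two data frames below (df_even and df_uneven) and the helper keep_a() are made up for illustration: both have 100 rows and the same 5 columns, but a different split between levels a and b.

```r
library(dplyr)
library(tibble)

# Two hypothetical 100-row data frames (not from the lecture) that differ
# only in how many rows have x1 == "a"
df_even <- tibble(
  x1 = rep(c("a", "b"), times = c(50, 50)),
  x2 = 1:100, x3 = 1:100, x4 = 1:100, x5 = 1:100
)
df_uneven <- df_even |>
  mutate(x1 = rep(c("a", "b"), times = c(20, 80)))

# Same pipeline as the warm-up question, wrapped in a helper
keep_a <- function(df) {
  df |>
    filter(x1 == "a") |>
    select(x1, x2, x5)
}

dim(keep_a(df_even))   # 50 rows, 3 columns
dim(keep_a(df_uneven)) # 20 rows, 3 columns
```

select() fixes the number of columns at 3, while the number of rows kept by filter() depends on the values in x1.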

Announcements

Labs:

  • Submit PDF on Gradescope by the end of lab session

  • Make regular commits and push .qmd and PDF to GitHub

  • Graded primarily for attendance, participation, collaboration, and effort

  • Feedback provided for correctness

Announcements

Homework: HW 1 due Sunday 11:59pm

  • Part 1: Feedback from AI
    • No Gradescope submission necessary
    • Make regular commits and push .qmd and PDF to GitHub
    • Immediate feedback from AI, no grading
    • Important: Don’t forget to set homework and question number in the app when requesting feedback
  • Part 2: Feedback from humans
    • Submit PDF on Gradescope by the deadline
    • Make regular commits and push .qmd and PDF to GitHub
    • Graded for correctness, feedback provided within about a week
    • Important: Don’t forget to select pages corresponding to each question on Gradescope

Exploratory data analysis

Packages

library(usdata)
library(tidyverse)
library(scales)
library(ggthemes)

Data: gerrymander

gerrymander
# A tibble: 435 × 12
   district last_name first_name party16 clinton16 trump16 dem16 state
   <chr>    <chr>     <chr>      <chr>       <dbl>   <dbl> <dbl> <chr>
 1 AK-AL    Young     Don        R            37.6    52.8     0 AK   
 2 AL-01    Byrne     Bradley    R            34.1    63.5     0 AL   
 3 AL-02    Roby      Martha     R            33      64.9     0 AL   
 4 AL-03    Rogers    Mike D.    R            32.3    65.3     0 AL   
 5 AL-04    Aderholt  Rob        R            17.4    80.4     0 AL   
 6 AL-05    Brooks    Mo         R            31.3    64.7     0 AL   
 7 AL-06    Palmer    Gary       R            26.1    70.8     0 AL   
 8 AL-07    Sewell    Terri      D            69.8    28.6     1 AL   
 9 AR-01    Crawford  Rick       R            30.2    65       0 AR   
10 AR-02    Hill      French     R            41.7    52.4     0 AR   
# ℹ 425 more rows
# ℹ 4 more variables: party18 <chr>, dem18 <dbl>, flip18 <dbl>,
#   gerry <fct>

What is gerrymandering?

https://www.washingtonpost.com/business/wonkblog/gerrymandering-explained/2016/04/21/e447f5c2-07fe-11e6-bfed-ef65dff5970d_video.html

Participate 📱💻

You are given a new dataset to analyze. What are some of the first things you would do to get to know the data?

Data: gerrymander

glimpse(gerrymander)
Rows: 435
Columns: 12
$ district   <chr> "AK-AL", "AL-01", "AL-02", "AL-03", "AL-04", "AL-…
$ last_name  <chr> "Young", "Byrne", "Roby", "Rogers", "Aderholt", "…
$ first_name <chr> "Don", "Bradley", "Martha", "Mike D.", "Rob", "Mo…
$ party16    <chr> "R", "R", "R", "R", "R", "R", "R", "D", "R", "R",…
$ clinton16  <dbl> 37.6, 34.1, 33.0, 32.3, 17.4, 31.3, 26.1, 69.8, 3…
$ trump16    <dbl> 52.8, 63.5, 64.9, 65.3, 80.4, 64.7, 70.8, 28.6, 6…
$ dem16      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0…
$ state      <chr> "AK", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ party18    <chr> "R", "R", "R", "R", "R", "R", "R", "D", "R", "R",…
$ dem18      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0…
$ flip18     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ gerry      <fct> mid, high, high, high, high, high, high, high, mi…

Data: gerrymander

  • Rows: Congressional districts

  • Columns:

    • Congressional district and state

    • 2016 election: winning party, % for Clinton, % for Trump, whether a Democrat won the House election, name of election winner

    • 2018 election: winning party, whether a Democrat won the 2018 House election

    • Whether a Democrat flipped the seat in the 2018 election

    • Prevalence of gerrymandering: low, mid, and high

Variable types: district

Variable    Type
district    categorical, ID
last_name
first_name
party16
clinton16
trump16
dem16
state
party18
dem18
flip18
gerry

gerrymander |>
  select(district)
# A tibble: 435 × 1
   district
   <chr>   
 1 AK-AL   
 2 AL-01   
 3 AL-02   
 4 AL-03   
 5 AL-04   
 6 AL-05   
 7 AL-06   
 8 AL-07   
 9 AR-01   
10 AR-02   
# ℹ 425 more rows

Variable types: last_name

Variable    Type
district    categorical, ID
last_name   categorical, ID
first_name
party16
clinton16
trump16
dem16
state
party18
dem18
flip18
gerry

gerrymander |>
  select(last_name)
# A tibble: 435 × 1
   last_name
   <chr>    
 1 Young    
 2 Byrne    
 3 Roby     
 4 Rogers   
 5 Aderholt 
 6 Brooks   
 7 Palmer   
 8 Sewell   
 9 Crawford 
10 Hill     
# ℹ 425 more rows

Variable types: first_name

Variable    Type
district    categorical, ID
last_name   categorical, ID
first_name  categorical, ID
party16
clinton16
trump16
dem16
state
party18
dem18
flip18
gerry

gerrymander |>
  select(first_name)
# A tibble: 435 × 1
   first_name
   <chr>     
 1 Don       
 2 Bradley   
 3 Martha    
 4 Mike D.   
 5 Rob       
 6 Mo        
 7 Gary      
 8 Terri     
 9 Rick      
10 French    
# ℹ 425 more rows

Variable types: party16

Variable    Type
district    categorical, ID
last_name   categorical, ID
first_name  categorical, ID
party16     categorical
clinton16
trump16
dem16
state
party18
dem18
flip18
gerry

gerrymander |>
  select(party16)
# A tibble: 435 × 1
   party16
   <chr>  
 1 R      
 2 R      
 3 R      
 4 R      
 5 R      
 6 R      
 7 R      
 8 D      
 9 R      
10 R      
# ℹ 425 more rows

Variable types: clinton16

Variable    Type
district    categorical, ID
last_name   categorical, ID
first_name  categorical, ID
party16     categorical
clinton16   numerical, continuous
trump16
dem16
state
party18
dem18
flip18
gerry

gerrymander |>
  select(clinton16)
# A tibble: 435 × 1
   clinton16
       <dbl>
 1      37.6
 2      34.1
 3      33  
 4      32.3
 5      17.4
 6      31.3
 7      26.1
 8      69.8
 9      30.2
10      41.7
# ℹ 425 more rows

Variable types: trump16

Variable    Type
district    categorical, ID
last_name   categorical, ID
first_name  categorical, ID
party16     categorical
clinton16   numerical, continuous
trump16     numerical, continuous
dem16
state
party18
dem18
flip18
gerry

gerrymander |>
  select(trump16)
# A tibble: 435 × 1
   trump16
     <dbl>
 1    52.8
 2    63.5
 3    64.9
 4    65.3
 5    80.4
 6    64.7
 7    70.8
 8    28.6
 9    65  
10    52.4
# ℹ 425 more rows

Variable types: dem16

Variable    Type
district    categorical, ID
last_name   categorical, ID
first_name  categorical, ID
party16     categorical
clinton16   numerical, continuous
trump16     numerical, continuous
dem16       categorical
state
party18
dem18
flip18
gerry

gerrymander |>
  select(dem16)
# A tibble: 435 × 1
   dem16
   <dbl>
 1     0
 2     0
 3     0
 4     0
 5     0
 6     0
 7     0
 8     1
 9     0
10     0
# ℹ 425 more rows

Variable types: state

Variable    Type
district    categorical, ID
last_name   categorical, ID
first_name  categorical, ID
party16     categorical
clinton16   numerical, continuous
trump16     numerical, continuous
dem16       categorical
state       categorical
party18
dem18
flip18
gerry

gerrymander |>
  select(state)
# A tibble: 435 × 1
   state
   <chr>
 1 AK   
 2 AL   
 3 AL   
 4 AL   
 5 AL   
 6 AL   
 7 AL   
 8 AL   
 9 AR   
10 AR   
# ℹ 425 more rows

Variable types: party18

Variable    Type
district    categorical, ID
last_name   categorical, ID
first_name  categorical, ID
party16     categorical
clinton16   numerical, continuous
trump16     numerical, continuous
dem16       categorical
state       categorical
party18     categorical
dem18
flip18
gerry

gerrymander |>
  select(party18)
# A tibble: 435 × 1
   party18
   <chr>  
 1 R      
 2 R      
 3 R      
 4 R      
 5 R      
 6 R      
 7 R      
 8 D      
 9 R      
10 R      
# ℹ 425 more rows

Variable types: dem18

Variable    Type
district    categorical, ID
last_name   categorical, ID
first_name  categorical, ID
party16     categorical
clinton16   numerical, continuous
trump16     numerical, continuous
dem16       categorical
state       categorical
party18     categorical
dem18       categorical
flip18
gerry

gerrymander |>
  select(dem18)
# A tibble: 435 × 1
   dem18
   <dbl>
 1     0
 2     0
 3     0
 4     0
 5     0
 6     0
 7     0
 8     1
 9     0
10     0
# ℹ 425 more rows

Variable types: flip18

Variable    Type
district    categorical, ID
last_name   categorical, ID
first_name  categorical, ID
party16     categorical
clinton16   numerical, continuous
trump16     numerical, continuous
dem16       categorical
state       categorical
party18     categorical
dem18       categorical
flip18      categorical
gerry

gerrymander |>
  select(flip18)
# A tibble: 435 × 1
   flip18
    <dbl>
 1      0
 2      0
 3      0
 4      0
 5      0
 6      0
 7      0
 8      0
 9      0
10      0
# ℹ 425 more rows

Variable types: gerry

Variable    Type
district    categorical, ID
last_name   categorical, ID
first_name  categorical, ID
party16     categorical
clinton16   numerical, continuous
trump16     numerical, continuous
dem16       categorical
state       categorical
party18     categorical
dem18       categorical
flip18      categorical
gerry       categorical, ordinal

gerrymander |>
  select(gerry)
# A tibble: 435 × 1
   gerry
   <fct>
 1 mid  
 2 high 
 3 high 
 4 high 
 5 high 
 6 high 
 7 high 
 8 high 
 9 mid  
10 mid  
# ℹ 425 more rows
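
The table classifies dem16, dem18, and flip18 as categorical even though R stores them as <dbl>. A minimal sketch of one way to make that explicit, using a small made-up tibble (toy) rather than the real data. The district codes and the "No"/"Yes" labels are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for a 0/1 indicator column such as dem16
toy <- tibble(
  district = c("AA-01", "AA-02", "AA-03", "AA-04"),
  dem16 = c(0, 0, 1, 0)
)

# Convert the numeric indicator to a factor with readable labels
toy_labeled <- toy |>
  mutate(dem16 = factor(dem16, levels = c(0, 1), labels = c("No", "Yes")))

# Counting levels is a natural summary for a categorical variable
toy_labeled |>
  count(dem16)
```

The storage type (<dbl>, <chr>, <fct>) describes how R holds the values; the variable type describes what the values mean.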

Univariate analysis

Analyzing a single variable:

  • Numerical: histogram, box plot, density plot, etc.

  • Categorical: bar plot, pie chart, etc.
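
The steps that follow focus on a numerical variable. For the categorical case, a bar plot can be sketched as below. The toy tibble is made up for illustration and only mimics the three levels of gerry.

```r
library(ggplot2)
library(tibble)

# Hypothetical categorical variable with ordered levels
toy <- tibble(
  gerry = factor(
    c("low", "mid", "mid", "high", "high", "high"),
    levels = c("low", "mid", "high")
  )
)

# geom_bar() counts the rows in each level and draws one bar per level
p_bar <- ggplot(toy, aes(x = gerry)) +
  geom_bar()

p_bar
```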

Histogram - Step 1

ggplot(gerrymander)

Histogram - Step 2

ggplot(gerrymander, aes(x = trump16))

Histogram - Step 3

ggplot(gerrymander, aes(x = trump16)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
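
The message above is a prompt to choose the bin width deliberately. A rough sketch of how binwidth changes the number of bins, using made-up values (toy) rather than trump16:

```r
library(ggplot2)
library(tibble)

# Hypothetical percentages, made up for illustration
toy <- tibble(pct = c(10, 20, 35, 50, 50, 65, 80))

# Narrow bins show more detail, wide bins smooth it away
p_narrow <- ggplot(toy, aes(x = pct)) +
  geom_histogram(binwidth = 5)
p_wide <- ggplot(toy, aes(x = pct)) +
  geom_histogram(binwidth = 20)

# layer_data() returns the computed bins, one row per bin
n_bins_narrow <- nrow(layer_data(p_narrow))
n_bins_wide <- nrow(layer_data(p_wide))
c(narrow = n_bins_narrow, wide = n_bins_wide)
```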

Participate 📱💻

Which of the following histograms has the most appropriate binwidth for visualizing the distribution of trump16?

Histogram - Step 4

ggplot(gerrymander, aes(x = trump16)) +
  geom_histogram(binwidth = 5) +
  scale_x_continuous(labels = label_percent(scale = 1))
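
Step 4 passes scale = 1 because trump16 is already on a 0 to 100 scale, whereas label_percent() by default multiplies values by 100 before adding the percent sign. A quick sketch with made-up values:

```r
library(scales)

# Values already expressed as percentages (like trump16)
pct <- c(25, 50, 75)
label_percent(scale = 1)(pct)

# The same quantities expressed as proportions
prop <- c(0.25, 0.5, 0.75)
label_percent()(prop)
```

Both calls give the labels "25%", "50%", and "75%".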

Histogram - Step 5

ggplot(gerrymander, aes(x = trump16)) +
  geom_histogram(binwidth = 5) +
  scale_x_continuous(labels = label_percent(scale = 1)) +
  labs(
    title = "Percent of vote received by Trump in 2016 Presidential Election",
    subtitle = "From each Congressional District",
    x = "Percent of vote",
    y = "Count"
  )

Box plot - Step 1

ggplot(gerrymander)

Box plot - Step 2

ggplot(gerrymander, aes(x = trump16))

Box plot - Step 3

ggplot(gerrymander, aes(x = trump16)) +
  geom_boxplot()

Box plot - Alternative Step 2 + 3

ggplot(gerrymander, aes(y = trump16)) +
  geom_boxplot()

Box plot - Step 4

ggplot(gerrymander, aes(x = trump16)) +
  geom_boxplot() +
  labs(
    title = "Percent of vote received by Trump in 2016 Presidential Election",
    subtitle = "From each Congressional District",
    x = "Percent of vote",
    y = NULL
  )

Density plot - Step 1

ggplot(gerrymander)

Density plot - Step 2

ggplot(gerrymander, aes(x = trump16))

Density plot - Step 3

ggplot(gerrymander, aes(x = trump16)) +
  geom_density()

Density plot - Step 4

ggplot(gerrymander, aes(x = trump16)) +
  geom_density(color = "firebrick")

Density plot - Step 5

ggplot(gerrymander, aes(x = trump16)) +
  geom_density(color = "firebrick", fill = "firebrick1")

Density plot - Step 6

ggplot(gerrymander, aes(x = trump16)) +
  geom_density(color = "firebrick", fill = "firebrick1", alpha = 0.5)

Density plot - Step 7

ggplot(gerrymander, aes(x = trump16)) +
  geom_density(color = "firebrick", fill = "firebrick1", alpha = 0.5, linewidth = 1)

Density plot - Step 8

ggplot(gerrymander, aes(x = trump16)) +
  geom_density(color = "firebrick", fill = "firebrick1", alpha = 0.5, linewidth = 2) +
  labs(
    title = "Percent of vote received by Trump in 2016 Presidential Election",
    subtitle = "From each Congressional District",
    x = "Percent of vote",
    y = "Density"
  )

Summary statistics

gerrymander |>
  summarize(
    mean = mean(trump16),
    median = median(trump16)
  )
# A tibble: 1 × 2
   mean median
  <dbl>  <dbl>
1  45.9   48.7

Distribution of votes for Trump in the 2016 election

Describe the distribution of the percent of vote received by Trump in the 2016 Presidential Election across Congressional Districts.

  • Shape: The distribution of votes for Trump in the 2016 election from Congressional Districts is unimodal and left-skewed.

  • Center: The percent of vote received by Trump in the 2016 Presidential Election in a typical Congressional District is 48.7%.

  • Spread: In the middle 50% of Congressional Districts, 34.8% to 58.1% of voters voted for Trump in the 2016 Presidential Election.

  • Unusual observations: -
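
The center and spread statements above rest on the median and the quartiles. A minimal sketch of how those pieces relate, on a small made-up vector (not the real data) with one low value pulling the mean below the median, as happens in a left-skewed distribution:

```r
library(dplyr)
library(tibble)

# Hypothetical percentages with a long left tail
toy <- tibble(pct = c(5, 30, 45, 50, 55, 60, 65))

toy_summary <- toy |>
  summarize(
    mean = mean(pct),
    median = median(pct),
    q25 = quantile(pct, 0.25),  # lower end of the middle 50%
    q75 = quantile(pct, 0.75),  # upper end of the middle 50%
    iqr = IQR(pct)              # width of the middle 50%
  )

toy_summary
```

Here the mean (about 44.3) falls below the median (50), and the middle 50% of values runs from 37.5 to 57.5.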

Bivariate analysis

Analyzing the relationship between two variables:

  • Numerical + numerical: scatterplot

  • Numerical + categorical: side-by-side box plots, violin plots, etc.

  • Categorical + categorical: stacked bar plots

  • Using an aesthetic (e.g., fill, color, shape, etc.) or facets to represent the second variable in any plot
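
The examples that follow cover the numerical + categorical case. For categorical + categorical, a stacked bar plot can be sketched as below, with the second variable mapped to the fill aesthetic. The toy tibble is made up for illustration.

```r
library(ggplot2)
library(tibble)

# Hypothetical pair of categorical variables
toy <- tibble(
  gerry = factor(
    c("low", "low", "mid", "mid", "high", "high"),
    levels = c("low", "mid", "high")
  ),
  party16 = c("D", "R", "D", "R", "R", "R")
)

# One bar per level of gerry, split by party16 via fill
p_stacked <- ggplot(toy, aes(x = gerry, fill = party16)) +
  geom_bar()

p_stacked
```

Using geom_bar(position = "fill") instead would show proportions within each bar rather than counts.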

Side-by-side box plots

ggplot(
  gerrymander, 
  aes(
    x = trump16, 
    y = gerry
    )
  ) +
  geom_boxplot()

Summary statistics

gerrymander |>
  summarize(
    min = min(trump16),
    q25 = quantile(trump16, 0.25),
    median = median(trump16),
    q75 = quantile(trump16, 0.75),
    max = max(trump16),
  )
# A tibble: 1 × 5
    min   q25 median   q75   max
  <dbl> <dbl>  <dbl> <dbl> <dbl>
1   4.9  34.8   48.7  58.1  80.4

Participate 📱💻

What goes in the [blank] in the code below to calculate the following summary statistics for each level of gerry?

gerrymander |>
  # [blank]
  summarize(
    min = min(trump16),
    q25 = quantile(trump16, 0.25),
    median = median(trump16),
    q75 = quantile(trump16, 0.75),
    max = max(trump16),
  )

  • filter(gerry)
  • group_by(gerry)
  • mutate(gerry)
  • select(gerry)

Grouped summary statistics

gerrymander |>
  group_by(gerry) |>
  summarize(
    min = min(trump16),
    q25 = quantile(trump16, 0.25),
    median = median(trump16),
    q75 = quantile(trump16, 0.75),
    max = max(trump16),
  )
# A tibble: 3 × 6
  gerry   min   q25 median   q75   max
  <fct> <dbl> <dbl>  <dbl> <dbl> <dbl>
1 low     4.9  36.3   48.4  54.7  74.9
2 mid     6.8  34.8   48.0  57.9  79.9
3 high    9.2  33.5   50.5  60.8  80.4
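
When comparing groups, it can help to report how many observations each summary is based on. A minimal sketch on made-up data (toy), adding n() to a grouped summary:

```r
library(dplyr)
library(tibble)

# Hypothetical data: a grouping factor and a numerical variable
toy <- tibble(
  gerry = factor(
    c("low", "low", "mid", "mid", "mid", "high"),
    levels = c("low", "mid", "high")
  ),
  pct = c(40, 50, 30, 45, 60, 70)
)

# One row per level of gerry, with the group size alongside the median
by_gerry <- toy |>
  group_by(gerry) |>
  summarize(
    n = n(),
    median = median(pct)
  )

by_gerry
```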

Density plots

ggplot(
  gerrymander, 
  aes(
    x = trump16, 
    color = gerry
    )
  ) +
  geom_density()

Filled density plots

ggplot(
  gerrymander, 
  aes(
    x = trump16, 
    color = gerry,
    fill = gerry
    )
  ) +
  geom_density()

Better filled density plots

ggplot(
  gerrymander, 
  aes(x = trump16, color = gerry, fill = gerry)
  ) +
  geom_density(alpha = 0.5)

Better colors

ggplot(
  gerrymander, 
  aes(x = trump16, color = gerry, fill = gerry)
  ) +
  geom_density(alpha = 0.5) +
  scale_color_colorblind() +
  scale_fill_colorblind()

Violin plots

ggplot(
  gerrymander, 
  aes(x = trump16, y = gerry, color = gerry)
  ) +
  geom_violin() +
  scale_color_colorblind() +
  scale_fill_colorblind()

Multiple geoms

ggplot(
  gerrymander, 
  aes(x = trump16, y = gerry, color = gerry)
  ) +
  geom_violin() +
  geom_point() +
  scale_color_colorblind() +
  scale_fill_colorblind()

Multiple geoms

ggplot(
  gerrymander, 
  aes(x = trump16, y = gerry, color = gerry)
  ) +
  geom_violin() +
  geom_jitter() +
  scale_color_colorblind() +
  scale_fill_colorblind()

Remove legend

ggplot(
  gerrymander, 
  aes(x = trump16, y = gerry, color = gerry)
  ) +
  geom_violin() +
  geom_jitter() +
  scale_color_colorblind() +
  scale_fill_colorblind() +
  theme(legend.position = "none")

Multivariate analysis

Analyzing the relationship between multiple variables:

  • In general, one variable is identified as the outcome of interest

  • The remaining variables are predictors or explanatory variables

  • Plots for exploring multivariate relationships are the same as those for bivariate relationships, but conditional on one or more variables

    • Conditioning can be done via faceting or aesthetic mappings (e.g., scatterplot of y vs. x1, colored by x2, faceted by x3)

  • Summary statistics for exploring multivariate relationships are the same as those for bivariate relationships, but conditional on one or more variables

    • Conditioning can be done via grouping (e.g., correlation between y and x1, grouped by levels of x2 and x3)
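
Both kinds of conditioning can be sketched on a small made-up dataset. The variables y, x1, x2, and x3 below are hypothetical and follow the naming used in the bullets above. In this toy example the relationship between y and x1 is perfectly positive in one group and perfectly negative in the other, so the overall correlation is zero.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Hypothetical data, made up for illustration
toy <- tibble(
  x1 = c(1, 2, 3, 4, 1, 2, 3, 4),
  y  = c(2, 4, 6, 8, 8, 6, 4, 2),
  x2 = rep(c("g1", "g2"), each = 4),
  x3 = rep(c("u", "v"), times = 4)
)

# Correlation ignoring x2
overall_r <- cor(toy$y, toy$x1)

# Correlation between y and x1, conditional on x2
cor_by_group <- toy |>
  group_by(x2) |>
  summarize(r = cor(y, x1))

# Scatterplot of y vs. x1, colored by x2, faceted by x3
p_cond <- ggplot(toy, aes(x = x1, y = y, color = x2)) +
  geom_point() +
  facet_wrap(~x3)

cor_by_group
```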

Application exercise

ae-03-gerrymander-explore-I

  • Go to your ae project in RStudio.

  • If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • If you haven’t yet done so, click Pull to get today’s application exercise file: ae-03-gerrymander-explore-I.qmd.

  • Work through the application exercise in class, and render, commit, and push your edits by the end of class.