Data types and classes

Lecture 9

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2025

September 23, 2025

Warm-up

While you wait: Participate 📱💻

Fill in the blanks:

I’m a _____ (first-year, sophomore, junior, senior)

and on Tuesdays I have _____ class(es).

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.

Announcements

Survey: Confidence in STEM courses at Duke

Exam 1:

  • In class on Thu, Oct 2
  • Take home Thu, Oct 2 after class until Sat, Oct 4 at noon
  • Covers lectures 1-10, labs 1-4, and homeworks 1-3
  • Practice exam to be posted on Friday, exam review on Tue, Sep 30

Recap: The tidyverse package

When you load the tidyverse package, you get access to a suite of packages that work well together for data manipulation and visualization:

library(tidyverse)
── Attaching core tidyverse packages ────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4          ✔ readr     2.1.5     
✔ forcats   1.0.0          ✔ stringr   1.5.1     
✔ ggplot2   3.5.2          ✔ tibble    3.3.0.9004
✔ lubridate 1.9.4          ✔ tidyr     1.3.1     
✔ purrr     1.1.0          
── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You never need to load one of these packages individually after you load the tidyverse, e.g.,

library(dplyr) # not necessary if you already loaded tidyverse in your document/session

Recap: Loading packages

  • You only need to load a package once per R session or Quarto document.
  • It’s good practice to load all the packages you need at the start of your document; that’s why the templates I give you usually have a load-packages code cell at the top.
```{r}
#| label: load-packages
#| message: false
library(tidyverse)
library(ggthemes)
library(scales)
# etc.
```
  • You never need to load these packages again further down in the same document.
  • If you need a new package further down in the document, go back and add it to the load-packages code cell.

Recap: Pipes

This is not a pipe.

Recap: Pipes

This is not our pipe [operator].

Recap: Pipes

This is our pipe [operator].

Data types

How many classes do you have on Tuesdays?

survey
# A tibble: 86 × 2
   year       tue_classes
   <chr>      <chr>      
 1 Senior     3          
 2 Sophomore  4          
 3 Sophomore  3          
 4 Junior     4          
 5 Sophomore  2          
 6 First-year 2          
 7 Junior     2          
 8 Sophomore  3          
 9 First-year 2          
10 Senior     3          
# ℹ 76 more rows

Variable types

What type of variable is tue_classes?

survey
# A tibble: 86 × 2
   year       tue_classes
   <chr>      <chr>      
 1 Senior     3          
 2 Sophomore  4          
 3 Sophomore  3          
 4 Junior     4          
 5 Sophomore  2          
 6 First-year 2          
 7 Junior     2          
 8 Sophomore  3          
 9 First-year 2          
10 Senior     3          
# ℹ 76 more rows

Let’s (attempt to) clean it up…

survey <- survey |>
  mutate(
    tue_classes = case_when(
      tue_classes == "one" ~ "1",
      tue_classes == "two" ~ "2",
      tue_classes == "Two" ~ "2",
      .default = tue_classes
    ),
    tue_classes = as.numeric(tue_classes),
    year = case_when(
      year == "Sophmore" ~ "Sophomore",
      year == "Freshman" ~ "First-year",
      .default = year
    )
  ) |>
  filter(year != "29.32%")

survey
# A tibble: 85 × 2
   year       tue_classes
   <chr>            <dbl>
 1 Senior               3
 2 Sophomore            4
 3 Sophomore            3
 4 Junior               4
 5 Sophomore            2
 6 First-year           2
 7 Junior               2
 8 Sophomore            3
 9 First-year           2
10 Senior               3
# ℹ 75 more rows
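
Why recode the text answers before calling as.numeric()? A minimal sketch, using a small made-up vector (`raw_answers` is hypothetical, not the actual survey data) and base R's ifelse() in place of case_when() so it runs without loading any packages:

```r
# Hypothetical raw answers, not the actual survey data
raw_answers <- c("3", "two", "2")

# Converting directly: "two" cannot be parsed, so it becomes NA (with a warning)
converted <- suppressWarnings(as.numeric(raw_answers))
converted
# [1]  3 NA  2

# Recoding the text answer first avoids the NA
recoded <- ifelse(raw_answers == "two", "2", raw_answers)
as.numeric(recoded)
# [1] 3 2 2
```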

Data types

Data types in R

  • logical
  • double
  • integer
  • character
  • and some more, but we won’t be focusing on those

Logical & character

logical - Boolean values TRUE and FALSE


typeof(TRUE)
[1] "logical"

character - character strings



typeof("First-year")
[1] "character"

Double & integer

double - floating point numerical values (default numerical type)


typeof(2.5)
[1] "double"
typeof(3)
[1] "double"

integer - integer numerical values (indicated with an L)


typeof(3L)
[1] "integer"
typeof(1:3)
[1] "integer"
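
A short base R sketch of how the two numeric types interact: values compare equal across types, but the objects are not identical, and arithmetic that mixes the two returns a double.

```r
is.integer(3)      # FALSE: a bare 3 is a double
is.integer(3L)     # TRUE

3 == 3L            # TRUE: the values are equal
identical(3, 3L)   # FALSE: the types differ

typeof(3L + 1L)    # "integer"
typeof(3L + 1)     # "double"
typeof(3L / 1L)    # "double": division always returns a double
```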

Concatenation

Vectors can be constructed using the c() function.

  • Numeric vector:
c(1, 2, 3)
[1] 1 2 3
  • Character vector:
c("Hello", "World!")
[1] "Hello"  "World!"
  • Vector made of vectors:
c(c("hi", "hello"), c("bye", "jello"))
[1] "hi"    "hello" "bye"   "jello"

Converting between types

with intention…

x <- 1:3
x
[1] 1 2 3
typeof(x)
[1] "integer"
y <- as.character(x)
y
[1] "1" "2" "3"
typeof(y)
[1] "character"

Converting between types

with intention…

x <- c(TRUE, FALSE)
x
[1]  TRUE FALSE
typeof(x)
[1] "logical"
y <- as.numeric(x)
y
[1] 1 0
typeof(y)
[1] "double"
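
Explicit conversions can still lose information. A small base R sketch of a few cases worth knowing:

```r
# as.integer() drops the decimal part rather than rounding
as.integer(2.9)
# [1] 2

# as.logical() only recognises a few spellings; anything else becomes NA
as.logical(c("TRUE", "true", "T", "yes"))
# [1] TRUE TRUE TRUE   NA

# Zero converts to FALSE, any other number to TRUE
as.logical(c(0, 1, 5))
# [1] FALSE  TRUE  TRUE
```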

Converting between types

without intention…

c(2, "Just this one!")
[1] "2"              "Just this one!"

R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that’s not always a great thing!

Converting between types

without intention…

c(FALSE, 3L)
[1] 0 3
c(FALSE, 1.2)
[1] 0.0 1.2
c(2L, "two")
[1] "2"   "two"
c(TRUE, "two")
[1] "TRUE" "two" 

Participate 📱💻

What is the output of typeof(c(1.2, 3L))?

  • "character"
  • "double"
  • "integer"
  • "logical"

Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.

Explicit vs. implicit coercion

Explicit coercion:

When you call a function like as.logical(), as.numeric(), as.integer(), as.double(), or as.character().

Implicit coercion:

Happens when you use a vector in a specific context that expects a certain type of vector.
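
For example, arithmetic functions expect numbers, so a logical vector passed to sum() or mean() is implicitly coerced to 0s and 1s. A minimal base R sketch (`took_class` is a made-up vector):

```r
took_class <- c(TRUE, FALSE, TRUE, TRUE)

sum(took_class)    # number of TRUEs
# [1] 3
mean(took_class)   # proportion of TRUEs
# [1] 0.75

# paste() expects character strings, so the number is coerced to "3"
paste(3, "classes")
# [1] "3 classes"

# The explicit version of the first calculation
sum(as.integer(took_class))
# [1] 3
```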

Data classes

  • Vectors are like Lego building blocks
  • We stick them together to build more complicated constructs, e.g. representations of data
  • The class attribute relates to the S3 class of an object, which determines its behaviour
    • You don’t need to worry about what S3 classes really mean, but you can read more about it here if you’re curious
  • Examples: factors, dates, and data frames

Factors

R uses factors to handle categorical variables, i.e., variables that have a fixed and known set of possible values

class_years <- factor(
  c(
    "First-year",
    "Sophomore",
    "Sophomore",
    "Senior",
    "Junior"
  )
)
class_years
[1] First-year Sophomore  Sophomore  Senior     Junior    
Levels: First-year Junior Senior Sophomore
typeof(class_years)
[1] "integer"
class(class_years)
[1] "factor"

More on factors

We can think of factors like a character vector (level labels) and an integer vector (level numbers) glued together

glimpse(class_years)
 Factor w/ 4 levels "First-year","Junior",..: 1 4 4 3 2
as.integer(class_years)
[1] 1 4 4 3 2
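
A sketch of two practical consequences, in base R. By default the levels are sorted alphabetically, so a meaningful order has to be supplied through the levels argument. And because a factor stores level numbers, converting a factor with number-like labels should go through as.character() first. The names `class_years_ordered` and `sizes` are made up for illustration.

```r
class_years_ordered <- factor(
  c("First-year", "Sophomore", "Sophomore", "Senior", "Junior"),
  levels = c("First-year", "Sophomore", "Junior", "Senior")
)
levels(class_years_ordered)
# [1] "First-year" "Sophomore"  "Junior"     "Senior"
as.integer(class_years_ordered)
# [1] 1 2 2 4 3

sizes <- factor(c("10", "20", "10"))
as.integer(sizes)                 # level numbers, not the values
# [1] 1 2 1
as.numeric(as.character(sizes))   # the values themselves
# [1] 10 20 10
```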

Dates

today <- as.Date("2025-09-23")
today
[1] "2025-09-23"
typeof(today)
[1] "double"
class(today)
[1] "Date"

More on dates

We can think of dates like a number (the number of days since the origin, 1 Jan 1970) and the origin itself glued together

as.integer(today)
[1] 20354
as.integer(today) / 365 # roughly 55 yrs
[1] 55.76438
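
Because a Date is a day count underneath, date arithmetic works directly. A minimal base R sketch; the exam date is taken from the announcements earlier in the lecture, and `exam` is just an illustrative name.

```r
today <- as.Date("2025-09-23")
exam <- as.Date("2025-10-02")

exam - today
# Time difference of 9 days
as.numeric(exam - today)
# [1] 9

today + 7   # one week later
# [1] "2025-09-30"

# Going back from the day count to a date
as.Date(20354, origin = "1970-01-01")
# [1] "2025-09-23"
```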

Data frames

We can think of data frames like vectors of equal length glued together

df <- data.frame(x = 1:2, y = 3:4)
df
  x y
1 1 3
2 2 4
typeof(df)
[1] "list"
class(df)
[1] "data.frame"
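
A short base R sketch of what "a list underneath" means in practice: list functions work on a data frame, and each column keeps its own type. The `result` object is only there to capture the error for illustration.

```r
df <- data.frame(x = 1:2, y = 3:4)

is.list(df)          # TRUE
length(df)           # 2: the number of columns, not rows
names(df)            # "x" "y"
sapply(df, typeof)   # each column keeps its own type
#         x         y
# "integer" "integer"

# Lengths 2 and 3 cannot be combined into columns, so this is an error
result <- try(data.frame(x = 1:2, y = 1:3), silent = TRUE)
inherits(result, "try-error")   # TRUE
```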

Lists

Lists are a generic vector container; vectors of any type can go in them

l <- list(
  x = 1:4,
  y = c("hi", "hello", "jello"),
  z = c(TRUE, FALSE)
)
l
$x
[1] 1 2 3 4

$y
[1] "hi"    "hello" "jello"

$z
[1]  TRUE FALSE
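
A minimal base R sketch of getting things back out of a list: $ and [[ ]] return the element itself, while single brackets return a smaller list.

```r
l <- list(
  x = 1:4,
  y = c("hi", "hello", "jello"),
  z = c(TRUE, FALSE)
)

l$y            # the character vector itself
l[["y"]]       # same as l$y
l["y"]         # a list of length 1 that contains the vector

typeof(l[["y"]])   # "character"
typeof(l["y"])     # "list"

lengths(l)         # number of elements in each component
# x y z
# 4 3 2
```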

Lists and data frames

  • A data frame is a special list containing vectors of equal length
df
  x y
1 1 3
2 2 4
  • When we use the pull() function, we extract a vector from the data frame
df |>
  pull(y)
[1] 3 4
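
For comparison, a base R sketch of the same idea: [[ ]] and $ return the column as a vector, much like pull(), while single brackets keep the data frame structure.

```r
df <- data.frame(x = 1:2, y = 3:4)

df[["y"]]   # a vector, like pull(y)
# [1] 3 4
df$y        # same result
# [1] 3 4

df["y"]     # still a data frame, with one column
#   y
# 1 3
# 2 4
is.data.frame(df["y"])   # TRUE
```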

Working with factors

Read data in as character strings

survey
# A tibble: 85 × 2
   year       tue_classes
   <chr>            <dbl>
 1 Senior               3
 2 Sophomore            4
 3 Sophomore            3
 4 Junior               4
 5 Sophomore            2
 6 First-year           2
 7 Junior               2
 8 Sophomore            3
 9 First-year           2
10 Senior               3
# ℹ 75 more rows

But coerce when plotting

ggplot(survey, mapping = aes(x = year)) +
  geom_bar()

Use forcats to reorder levels

survey |>
  mutate(
    year = fct_relevel(year, "First-year", "Sophomore", "Junior", "Senior")
  ) |>
  ggplot(mapping = aes(x = year)) +
  geom_bar()

A peek into forcats

Reordering levels by:

  • fct_relevel(): hand

  • fct_infreq(): frequency

  • fct_reorder(): sorting along another variable

  • fct_rev(): reversing

Changing level values by:

  • fct_lump(): lumping uncommon levels together into “other”

  • fct_other(): manually replacing some levels with “other”
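
A minimal sketch of a few of these functions on a made-up factor (`year_f` is illustrative, not the survey data). It assumes the forcats package is installed and only inspects the resulting levels.

```r
library(forcats)

# Made-up responses, not the survey data
year_f <- factor(
  c("Sophomore", "Junior", "Sophomore", "Senior", "Sophomore", "Junior")
)

levels(year_f)                        # alphabetical by default
# [1] "Junior"    "Senior"    "Sophomore"

levels(fct_infreq(year_f))            # most frequent first
# [1] "Sophomore" "Junior"    "Senior"

levels(fct_rev(fct_infreq(year_f)))   # reversed
# [1] "Senior"    "Junior"    "Sophomore"

levels(fct_other(year_f, keep = "Sophomore"))   # everything else becomes "Other"
# [1] "Sophomore" "Other"
```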

Application exercise

ae-08-durham-climate-factors

  • Go to your ae project in RStudio.

  • If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • If you haven’t yet done so, click Pull to get today’s application exercise file: ae-08-durham-climate-factors.qmd.

  • Work through the application exercise in class, and render, commit, and push your edits by the end of class.