Lecture 1
Duke University
STA 199 - Fall 2025
August 26, 2025
Dr. Mine Çetinkaya-Rundel
Professor of the Practice
Director of Undergraduate Studies, Statistical Science
Director of First-Year Experience in Trinity College of Arts & Sciences
Office: Old Chem 211C
Office hours:
Please share with at least two classmates…
04:00
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.
We’re going to learn to do this in a modern and tidy
way – more on that later!
This is a course on introduction to data science, with an emphasis on statistical thinking.
Yesterday we collected some data from you!
Today we’re going to explore that data together, following the data science cycle.
Warning
The data is anonymized and randomized and contains only a subset of the data collected. The rows have been shuffled for each column independently, meaning that no row belongs to a single student in the course. Additionally, any identifying information has been removed. Please do not try to de-anonymize or re-identify any individuals in the dataset.
You took a survey:
Canvas stored your data in a CSV (comma-separated values) file:
We want to explore the data that is in that file!
More on what packages are on Thursday, but in a nutshell “get your tools out of the toolbox”:
Import the data called survey-anonymized.csv
in the folder called data
using the read_csv()
function:
# A tibble: 230 × 5
stats_experience programming_experience programming_languages
<chr> <chr> <chr>
1 Yes, I have taken a h… A little — I’ve writt… "Java, Python, Yaml,…
2 No, I have not taken … None <NA>
3 Yes, I have taken a h… A little — I’ve writt… <NA>
4 No, I have not taken … Some — I’ve worked on… <NA>
5 No, I have not taken … Some — I’ve worked on… <NA>
6 No, I have not taken … A little — I’ve writt… <NA>
7 Yes, I have taken a h… None "Python and java"
8 No, I have not taken … None <NA>
9 Yes, I have taken a h… None "Python- a little bi…
10 Yes, I have taken a h… None <NA>
# ℹ 220 more rows
# ℹ 2 more variables: learn_best <chr>, data_interests <chr>
Scan the QR code or go to app.wooclap.com/sta199.
Log in with your Duke NetID.
# A tibble: 230 × 5
stats_experience programming_experience programming_languages
<chr> <chr> <chr>
1 Yes, I have taken a h… A little — I’ve writt… "Java, Python, Yaml,…
2 No, I have not taken … None <NA>
3 Yes, I have taken a h… A little — I’ve writt… <NA>
4 No, I have not taken … Some — I’ve worked on… <NA>
5 No, I have not taken … Some — I’ve worked on… <NA>
6 No, I have not taken … A little — I’ve writt… <NA>
7 Yes, I have taken a h… None "Python and java"
8 No, I have not taken … None <NA>
9 Yes, I have taken a h… None "Python- a little bi…
10 Yes, I have taken a h… None <NA>
# ℹ 220 more rows
# ℹ 2 more variables: learn_best <chr>, data_interests <chr>
Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.
We asked you the following multiple-choice question where you could only pick one option:
Have you taken any statistics courses before?
One way to make sense of data collected via a question like this is to visualize it.
survey |>
count(stats_experience) |>
mutate(prop = n / sum(n)) |>
ggplot(aes(y = stats_experience, x = prop))+
geom_col(show.legend = FALSE) +
scale_y_discrete(labels = label_wrap(20)) +
scale_x_continuous(labels = percent_format(accuracy = 1), breaks = c(0, 0.25, 0.5)) +
labs(
title = "Prior statistics experience among STA 199 students",
y = NULL,
x = "Count"
) +
labs(
caption = "Data are self-reported, collected from STA 199 students on August 25, 2025."
)
We also asked you the following multiple-choice question where you could only pick one option:
How much experience do you have with programming?
survey |>
count(programming_experience) |>
mutate(prop = n / sum(n)) |>
ggplot(aes(y = fct_reorder(programming_experience, prop), x = prop, fill = prop))+
geom_col(show.legend = FALSE) +
scale_y_discrete(labels = label_wrap(25)) +
scale_x_continuous(labels = percent_format(accuracy = 1), breaks = c(0, 0.1, 0.2, 0.3, 0.4)) +
scale_fill_viridis_c(option = "E") +
labs(
title = "Prior programming experience among STA 199 students",
y = NULL,
x = "Count"
) +
labs(
caption = "Data are self-reported, collected from STA 199 students on August 25, 2025."
) +
theme_minimal(base_size = 16)
We also asked you the following multiple-choice question where you could as many options as you liked:
What types of data interest you?
# A tibble: 230 × 1
data_interests
<chr>
1 "Economics,Politics"
2 "Health (e.g.\\, social determinants of health\\, medical),Politic…
3 "Crime,Economics,Education,Environment/Climate,Politics,No prefere…
4 "Crime,Education,Politics,Sports"
5 "Crime,Education,Entertainment (e.g.\\, books\\, movies\\, music),…
6 "Crime,Entertainment (e.g.\\, books\\, movies\\, music),Health (e.…
7 "Health (e.g.\\, social determinants of health\\, medical)"
8 "Education,Entertainment (e.g.\\, books\\, movies\\, music),Health…
9 "Economics,Entertainment (e.g.\\, books\\, movies\\, music),Enviro…
10 "Economics,Environment/Climate,Sports"
# ℹ 220 more rows
Before we can visualize this variable, we need to tidy and transform it.
survey |>
# remove text in parentheses
mutate(data_interests = str_remove_all(data_interests, "\\s*\\(.*?\\)")) |>
# separate into multiple rows for each interest, using comma as delimiter
separate_longer_delim(data_interests, delim = ",") |>
count(data_interests, sort = TRUE) |>
mutate(prop = n / sum(n))
# A tibble: 11 × 3
data_interests n prop
<chr> <int> <dbl>
1 Health 142 0.169
2 Sports 111 0.132
3 Politics 110 0.131
4 Entertainment 107 0.128
5 Crime 97 0.116
6 Economics 93 0.111
7 Education 86 0.103
8 Environment/Climate 76 0.0906
9 Other 12 0.0143
10 No preference 3 0.00358
11 <NA> 2 0.00238
survey |>
mutate(data_interests = str_remove_all(data_interests, "\\s*\\(.*?\\)")) |>
separate_longer_delim(data_interests, delim = ",") |>
count(data_interests, sort = TRUE) |>
mutate(prop = n / sum(n)) |>
filter(!is.na(data_interests)) |>
ggplot(aes(y = fct_reorder(data_interests, prop), x = prop, fill = prop)) +
geom_col(show.legend = FALSE) +
scale_y_discrete(labels = label_wrap(20)) +
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_fill_taylor_c(album = "Speak Now") +
labs(
title = "Data interests among STA 199 students",
y = NULL,
x = "Count",
caption = "Data are self-reported, collected from STA 199 students on August 25, 2025."
) +
theme_minimal(base_size = 16)
We also asked you the following open-ended question:
How do you learn best?
We can use text mining techniques, like tokenizing to words to explore this open-ended question:
# A tibble: 12 × 2
word n
<chr> <int>
1 learn 89
2 practice 66
3 visual 28
4 step 27
5 learning 26
6 hands 24
7 class 22
8 questions 21
9 examples 19
10 lecture 17
11 lectures 17
12 learner 16
We can also tokenize to bigrams (pairs of words):
# A tibble: 20 × 2
bigrams n
<chr> <int>
1 i learn 73
2 learn best 68
3 practice problems 28
4 hands on 24
5 best by 21
6 best through 19
7 i also 16
8 i am 16
9 and then 15
10 in class 14
11 when i 14
12 by step 13
13 i can 13
14 step by 13
15 <NA> 13
16 problems and 12
17 a lot 11
18 a visual 11
19 on my 11
20 visual learner 11
We also asked you the following open-ended question:
If you’ve programmed before, which languages have you used, and how comfortable do you feel with each? If you haven’t programmed before, please leave this question blank.
And the answers are non-trivial to tidy up, e.g.,
- Python - I've used it very little.
- I have not programmed before.
- Java - very comfortable. Python - pretty comfortable. C/C++ - not super comfortable, haven't used since high school.
Scan the QR code or go to app.wooclap.com/sta199. Log in with your Duke NetID.
library(ellmer)
chat <- chat_openai()
prompts <- survey |>
filter(!is.na(programming_languages)) |>
pull(programming_languages) |>
list()
type_language <- type_object(
language = type_string(),
experience = type_string()
)
language_experience <- parallel_chat_structured(chat, prompts, type = type_language)
language_experience
language
1 Python, Java, R, C/C++, JavaScript, HTML/CSS, SQL, C#, VBA, MATLAB, Rust, Assembly, Swift, Stata, Bash, Typescript, Web3 Suite (HTML, CSS, JS), Nextflow, SAS, PsycoPy, Mathematica, React, Vue, Yaml, CourseKata, Flutter, Scratch
experience
1 Most students report a range of programming experience, from never having programmed before to being very comfortable in languages such as Python, Java, and R. Many took introductory courses (e.g., CS101, AP CS A) and are comfortable with basic programming constructs. Proficiency is generally highest in Python and Java, though some only have a little experience with languages like R, C/C++, and JavaScript. Quite a few have only used programming languages for specific classes or research projects and may not feel confident after a long break. Some students have experience with more niche or specialized tools such as MATLAB, SAS, and web technologies (HTML, CSS, JavaScript/JS frameworks). Many express a need for a refresher if not used recently. In summary, while a core group is comfortable with at least one language, especially Python or Java, many others have only basic to moderate experience and express the desire to gain more proficiency.
Mathematica is a software system used for mathematics.
# A tibble: 1 × 1
programming_languages
<chr>
1 Most familiar to least: Java, Mathematica, Python, Matlab, React
Nextflow extends the Unix pipes model with a DSL.
All linked from the course website:
Daily in lecture
Tracked for credit, but not based on correctness, only participation (and often they will be questions designed to make you think that might not have a single right answer!)
Daily-ish in lecture
Not graded, but tracked for feedback on workflow
Hands-on practice with data analysis
A sigle exercise per lab, graded based on being there and in something reasonable + correctness
Completed in-person, in lab, in teams
Teams randomized each week until project teams assigned
Developed collaboratively, but turned in individually by the end of the lab session
8 throughout semester, two lowest scores dropped
No late work accepted
Hands-on practice with data analysis
Some questions for practice with instant feedback by AI
Some questions to be graded by humans for correctness
Can start in lab if time permits, but completed at home
Can consult with course team and peers, but completed and turned in individually by the end of the week
6 throughout semester, lowest score dropped
Up to 3 days late (-5% per day), no late work accepted after that
One-time late penalty waiver: Can be used on any homework assignment, no questions asked, must be requested from Dr. Knox before the deadline
Two exams during semester, comprised of two parts:
In-class: 75 minute in-class exam. Closed book, one sheet of notes – 80% of the grade.
Take-home: Follow from the in class exam and focus on the analysis of a dataset introduced in the take home exam – 20% of the grade.
Final, in-class only: Closed book, one sheet of notes
Notes for exams: No larger than 8 1/2 x 11, both sides, must be prepared by you
No extensions or make-ups
Caution
Exam dates cannot be changed and no make-up exams will be given. If you can’t take the exams on these dates, you should drop this class.
Dataset of your choice, method of your choice
Teamwork
Five milestones, interim deadlines throughout semester
Final milestone: Presentation in lab and write-up
Must be in lab, in-person to present
Peer review between teams for content, peer evaluation within teams for contribution
Some lab sessions allocated to project progress
Caution
Project due date cannot be changed. You must complete the project to pass this class.
Randomized at first for weekly labs
Then selected by you or assigned by me if you don’t express a preference for project and remaining labs
Expectations and roles
For the project: Peer evaluation during teamwork and after completion
Category | Percentage |
---|---|
Lectures (attendance + participation) | 5% |
Labs | 8% |
HW | 12% |
Project | 15% |
Exam 1 | 20% |
Exam 2 | 20% |
Final | 20% |
See course syllabus for how the final letter grade will be determined.
Scan the QR code or go to app.wooclap.com/sta199.
Log in with your Duke NetID.