Lab 8

Leavin’ on a jet plane, hypothetically


Due: End of lab on Mon, Dec 1

Introduction

Last week you explored the relationship between distance and air time of flights out of RDU in 2024. And you did it in some la la land where you had access to the full population. Time to get back to reality! This week you will explore the same relationship, but this time you will work with a single sample from the population.

Getting started

By now you should be familiar with how to get started with a lab assignment by cloning the GitHub repo for the assignment. If you’re not sure how, refer back to an earlier lab.

Open the lab-8.qmd template Quarto file and update the authors field to add your name first (first and last) and then your teammates’ names (first and last). Render the document. Examine the rendered document and make sure your and your teammates’ names are updated in the document. Commit your changes with a meaningful commit message and push to GitHub.

Refresher on assignment guidelines

Code

Code should follow the tidyverse style. Particularly,

there should be spaces before and line breaks after each + when building a ggplot,
there should also be spaces before and line breaks after each |> in a data transformation pipeline,
code should be properly indented,
there should be spaces around = signs and spaces after commas.

Additionally, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.¹
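As a rough illustration of these style points, here is a minimal sketch. The tibble `toy_flights` and its values are made up for illustration and are not the lab data; the object names `toy_summary` and `toy_plot` are hypothetical too.

```r
library(tidyverse)

# Hypothetical toy data for illustrating style only; not the lab data
toy_flights <- tibble(
  distance = c(200, 500, 1000),
  air_time = c(45, 80, 150)
)

# Space before and line break after each |>, spaces around =
toy_summary <- toy_flights |>
  mutate(speed = distance / air_time) |>
  summarize(mean_speed = mean(speed))

# Space before and line break after each +, arguments indented
toy_plot <- ggplot(
  toy_flights,
  aes(x = distance, y = air_time)
) +
  geom_point()
```

Note how the long `ggplot()` call is split across lines so that nothing would run off the page in the PDF.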

Plots

Plots should have an informative title and, if needed, also a subtitle.
Axes and legends should be labeled with both the variable name and its units (if applicable).
Careful consideration should be given to aesthetic choices.
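One way to meet these plot guidelines is with `labs()`. The sketch below uses made-up data, and the units shown (miles and minutes) are an assumption for illustration; check the data documentation for the actual units.

```r
library(tidyverse)

# Hypothetical toy data; the units (miles, minutes) are assumed
toy_flights <- tibble(
  distance = c(200, 500, 1000),
  air_time = c(45, 80, 150)
)

# Title, subtitle, and axis labels that name the variable and its units
labeled_plot <- ggplot(toy_flights, aes(x = distance, y = air_time)) +
  geom_point() +
  labs(
    title = "Longer flights spend more time in the air",
    subtitle = "Toy data for illustration only",
    x = "Distance (miles)",
    y = "Air time (minutes)"
  )
```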

Workflow

Continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course.

You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Final versions of both your .qmd file and the rendered PDF should be pushed to GitHub.

Packages

In this lab we will work with the tidyverse and tidymodels packages.

```
library(tidyverse)
library(tidymodels)
```

Sample data

The dataset, called rdu-flights-sample.csv, can be found in the data folder.

Questions

Question 1 - Observe

(a) Read the observed data:

Read in your random sample of 10 flights and store it as rdu_flights_sample. Each person in the class will have the same observed sample of flights.

(b) Visualize the observed relationship:

Visualize the relationship between distance and air_time using a scatter plot. Also add a regression line to the scatter plot. Do not show the standard error ribbon around the regression line.

(c) Model and estimate the observed relationship:

Fit a model predicting air_time using distance using the sample data. Display the model summary.
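The three steps above could look roughly like the sketch below. It is not the expected solution: the tibble `toy_sample` is a hypothetical stand-in so the code runs anywhere, the relative path in the comment assumes your working directory is the project root, and the object names are made up.

```r
library(tidyverse)
library(tidymodels)

# In the lab you would read the real sample instead, e.g.
# rdu_flights_sample <- read_csv("data/rdu-flights-sample.csv")
# The tibble below is a hypothetical stand-in with made-up values.
toy_sample <- tibble(
  distance = c(200, 400, 600, 800, 1000),
  air_time = c(40, 70, 95, 125, 150)
)

# Scatter plot with a regression line and no standard error ribbon
toy_scatter <- ggplot(toy_sample, aes(x = distance, y = air_time)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Linear model predicting air_time from distance
toy_fit <- linear_reg() |>
  fit(air_time ~ distance, data = toy_sample)

# Model summary: one row per term, with the slope in the distance row
toy_estimates <- tidy(toy_fit)
toy_estimates
```

The slope estimate you will need again in Question 4 is the `estimate` value in the `distance` row of the summary.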

Question 2 - Hypothesize

(a) Set the null hypothesis:

What if there was no relationship between distance and air_time? What would the true slope of the relationship be in that case? This is your null hypothesis. You can articulate it as:

\(H_0:\) There is no relationship between distance and air time, the value of the true slope is ____, i.e., \(\beta_1\) = ____.

Fill in the blanks.

Tip

The null hypothesis always sets the true population parameter equal to a specific value.

(b) Set the alternative hypothesis:

What if there is a relationship between distance and air_time? This is your alternative hypothesis!

\(H_A:\) There is a relationship between distance and air time, the value of the true slope is ____, i.e., \(\beta_1\) ______.

Fill in the blanks.

Tip

The alternative hypothesis always compares the true population parameter to the same specific value set in the null hypothesis. This comparison can be “not equal to”, “greater than”, or “less than” and the choice depends on the research question. In this case, our research question is whether there is some relationship between distance and air time.

Question 3 - Generate

(a) Break the relationship:

Now let’s set aside reality for a bit again, and imagine we live in the land of the null hypothesis where there is no relationship between distance and air_time. Using the sample data, simulate such a world by breaking any relationship between distance and air_time.

How? Randomly permute the air_time values in your sample data to break any relationship between distance and air_time, and store the resulting air time values in a new column called air_time_permuted. Each person in the class should have a different permutation of the air_time values. You could achieve this by setting different seeds, e.g., each person could use their birthday. Mine is Feb 5, so I’ll use 52.
```
set.seed(52)
rdu_flights_sample <- rdu_flights_sample |>
  mutate(
    air_time_permuted = sample(
      air_time,
      size = nrow(rdu_flights_sample),
      replace = FALSE
    )
  )
```
Display rdu_flights_sample to see the new column. Confirm that each air_time_permuted value comes from the original air_time values, but the order is different.
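
If you would rather confirm this with code than by eye, one possible check is sketched below. The `air_time` vector is a hypothetical stand-in for the column in your sample, and `same_values` and `order_changed` are made-up names.

```r
# Hypothetical air time values standing in for the air_time column
air_time <- c(40, 70, 95, 125, 150)

set.seed(52)
air_time_permuted <- sample(
  air_time,
  size = length(air_time),
  replace = FALSE
)

# Once both are sorted they should match exactly, which shows the
# permutation only reorders the original values
same_values <- identical(sort(air_time), sort(air_time_permuted))

# Whether the order changed; usually TRUE, but a random permutation
# can by chance return the original order
order_changed <- !identical(air_time, air_time_permuted)
```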

(b) Visualize, model, and estimate the broken relationship:

Visualize the relationship between distance and air_time_permuted using a scatter plot. Also add a regression line to the scatter plot. Do not show the standard error ribbon around the regression line.

Then, fit a model predicting air_time_permuted using distance using the sample data. Display the model summary.

How does the slope estimate for your permuted air time values compare to that of others in your team or others in the class? Exactly the same? Wildly different? Somewhere in between?

(c) Generate the null distribution (collectively):

Each person in the class has now simulated a slightly different world where there is no relationship between distance and air_time by breaking any relationship in their sample data. Mark your slope estimate from part (b) on the number line on the board. The distribution you constructed on the board is called the null distribution of the slope estimates. This distribution represents what slope estimates we would expect to see if the null hypothesis were true.

When everyone has marked their slope estimates, what does the overall distribution of slope estimates look like? Is it centered around a particular value?
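To get a feel for what the board might look like, the sketch below plays the role of 100 hypothetical classmates, each permuting with a different seed. It uses made-up data and, to keep things light, base R's `lm()` in place of the tidymodels workflow; the helper `permuted_slope()` is a hypothetical name, not something from the lab template.

```r
library(tidyverse)

# Hypothetical stand-in for the sample of 10 flights (made-up values)
toy_sample <- tibble(
  distance = c(200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000),
  air_time = c(42, 68, 97, 121, 150, 171, 204, 226, 251, 280)
)

# One "classmate": permute air_time with a given seed, refit the model,
# and return the slope estimate for distance
permuted_slope <- function(seed, data) {
  set.seed(seed)
  permuted <- data |>
    mutate(
      air_time_permuted = sample(air_time, size = n(), replace = FALSE)
    )
  fit <- lm(air_time_permuted ~ distance, data = permuted)
  unname(coef(fit)["distance"])
}

# 100 hypothetical classmates, each using a different seed
null_slopes <- map_dbl(1:100, permuted_slope, data = toy_sample)

# Center of the simulated null distribution
mean(null_slopes)

# Histogram version of the number line on the board
null_hist <- ggplot(tibble(slope = null_slopes), aes(x = slope)) +
  geom_histogram(bins = 20)
```

With many permutations, the slopes tend to pile up around the value stated in the null hypothesis.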

Question 4 - Calculate and conclude

(a) Calculate the p-value:

The p-value is defined as the probability of the observed or more extreme outcomes assuming the null hypothesis is true. In this case, the p-value is the probability, under the null distribution, of observing a slope estimate as extreme or more extreme than the one you observed in Question 1 - part (c).

Based on the null distribution you constructed collectively in Question 3 - part (c), calculate your p-value.

Tip

First mark the slope estimate you observed in Question 1 - part (c) on the number line on the board. Then count how many slope estimates in the null distribution are beyond your observed slope estimate. Then, don’t forget about the other side of the distribution – we generally account for that side by multiplying the count we obtained by 2. Finally, divide that count by the total number of slope estimates in the null distribution to get your p-value.
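
The counting recipe in this tip can be sketched in a few lines. All numbers below are made up: `null_slopes` stands in for the marks on the board and `observed_slope` for your estimate from Question 1. The sketch assumes the observed slope is positive, so "beyond" means greater than or equal to it, and it caps the result at 1, which is an addition not mentioned in the tip.

```r
# Hypothetical slope estimates standing in for the marks on the board
null_slopes <- c(
  -0.09, -0.06, -0.04, -0.03, -0.01,
  0.00, 0.02, 0.03, 0.05, 0.08
)

# Hypothetical observed slope (assumed positive in this sketch)
observed_slope <- 0.07

# Count null slopes as extreme or more extreme on the observed side
n_beyond <- sum(null_slopes >= observed_slope)

# Double the count for the other side, divide by the total number of
# null slopes, and cap at 1 since a probability cannot exceed 1
p_value <- min(1, 2 * n_beyond / length(null_slopes))
p_value
```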

(b) Conclude the hypothesis test:

Using a significance level of \(\alpha = 0.05\), what is your conclusion regarding the hypothesis test you conducted?

Tip

If your p-value is less than or equal to \(\alpha\), you reject the null hypothesis, and conclude that the data provide convincing evidence of a discernible relationship between distance and air time. If your p-value is greater than \(\alpha\), you fail to reject the null hypothesis, and conclude that the data do not provide convincing evidence of a discernible relationship between distance and air time. Whatever you do, your conclusion should always be stated in the context of the research question.
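The decision rule in this tip boils down to a single comparison. The helper below, `conclude_test()`, is a hypothetical name used only for illustration, and the p-values passed to it are made up. It returns the decision only; stating the conclusion in context is still up to you.

```r
# Hypothetical helper, not part of the lab template
conclude_test <- function(p_value, alpha = 0.05) {
  if (p_value <= alpha) {
    "reject the null hypothesis"
  } else {
    "fail to reject the null hypothesis"
  }
}

# Made-up p-values on either side of alpha
decision_large <- conclude_test(0.2)
decision_small <- conclude_test(0.03)
```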

Wrap-up

Warning

Before you wrap up the assignment, make sure that you render, commit, and push one final time so that the final versions of both your .qmd file and the rendered PDF are pushed to GitHub and your Git pane is empty. We will be checking these to make sure you have been practicing how to commit and push changes.

Submission

By now you should also be familiar with how to submit your assignment in Gradescope.

Refresher on how to submit your assignment

Submit your PDF document to Gradescope by the end of the lab to be considered “on time”:

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
Click on your STA 199 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with each question. All the pages of your lab should be associated with at least one question (i.e., should be “checked”).

Checklist

Make sure you have:

attempted all questions
rendered your Quarto document
committed and pushed everything to your GitHub repository such that the Git pane in RStudio is empty
uploaded your PDF to Gradescope

Grading and feedback

This lab is worth 30 points:

- 10 points for being in lab and turning in something – no partial credit for this part.
- 20 points for:
  - answering the questions correctly – there is partial credit for this part.
  - following the workflow – there is partial credit for this part.

The workflow points are for:

- committing at least three times as you work through your lab,
- having your final version of .qmd and .pdf files in your GitHub repository, and
- overall organization.

You’ll receive feedback on your lab on Gradescope within a week.

Good luck, and have fun with it!

Footnotes

¹ Remember, haikus not novellas when writing code!