Confidence intervals

Author

Dr. Alexander Fisher

View libraries and data sets used in these notes
library(tidyverse)
library(tidymodels)
library(DT) # datatable viewing
library(patchwork)
library(knitr)

knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "80%",
  fig.align = "center"
)

set.seed(221)

football <- 
  read_csv("https://sta221-fa25.github.io/data/ncaa-football-exp.csv") |>
  mutate(nonsense = runif(n(), 0, 10))

Data: NCAA Football expenditures

Same data as before (reminder):

Today’s data come from Equity in Athletics Data Analysis and include information about sports expenditures and revenues for colleges and universities in the United States. This data set was featured in a March 2022 Tidy Tuesday.

We will focus on 2019 expenditures (the 2019 - 2020 season) on football for institutions in the NCAA Division 1 FBS (Football Bowl Subdivision). The variables are:

  • total_exp_m: Total expenditures on football in the 2019 - 2020 academic year (in millions USD)

  • enrollment_th: Total student enrollment in the 2019 - 2020 academic year (in thousands)

  • type: institution type (Public or Private)

  • nonsense: a simulated variable (uniform on 0 to 10, generated in the setup code above) that is unrelated to expenditure

Regression model

exp_fit <- lm(total_exp_m ~ enrollment_th + type + nonsense, data = football)
tidy(exp_fit) |>
  kable(digits = 3)
term           estimate  std.error  statistic  p.value
(Intercept)      17.833      3.523      5.061    0.000
enrollment_th     0.796      0.112      7.095    0.000
typePublic      -13.520      3.178     -4.254    0.000
nonsense          0.298      0.371      0.803    0.423

Confidence interval in R

We can compute the confidence intervals in R easily:

# alpha = 0.05; CI = 95%
tidy(exp_fit, conf.int = TRUE, conf.level = 0.95) |> 
  kable(digits = 3)
term           estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)      17.833      3.523      5.061    0.000    10.859     24.807
enrollment_th     0.796      0.112      7.095    0.000     0.574      1.018
typePublic      -13.520      3.178     -4.254    0.000   -19.812     -7.229
nonsense          0.298      0.371      0.803    0.423    -0.436      1.032
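
For reference, each interval in the table has the usual form for a regression coefficient. The notation here is an assumption, not taken from these notes: \(t_{1-\alpha/2,\, n-p}\) is the \(1 - \alpha/2\) quantile of the t distribution with \(n - p\) degrees of freedom, and \(\alpha = 0.05\) for a 95% interval.

\[
\hat{\beta}_j \;\pm\; t_{1-\alpha/2,\, n-p} \times \mathrm{SE}(\hat{\beta}_j)
\]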

We can verify manually. For example, for the intercept:

n = nrow(football) # number of observations
p = 4 # number of coefficients in beta: intercept + 3 predictors
17.833 - (3.523 * qt(0.975, df = n - p))
[1] 10.85944
17.833 + (3.523 * qt(0.975, df = n - p))
[1] 24.80656
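
The same check can be done for every coefficient at once. The sketch below is illustrative only: it assumes the built-in mtcars data rather than the football data, so that it runs without a download. It computes the standard errors from the matrix formula and compares the result with confint().

```r
# Illustrative only: built-in mtcars data, not the football data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)
n <- nrow(X)
p <- ncol(X)  # number of coefficients, intercept included

beta_hat  <- coef(fit)
sigma_hat <- sqrt(sum(resid(fit)^2) / (n - p))        # residual standard error
se        <- sigma_hat * sqrt(diag(solve(t(X) %*% X)))  # SE of each coefficient
t_crit    <- qt(0.975, df = n - p)

manual_ci <- cbind(lower = beta_hat - t_crit * se,
                   upper = beta_hat + t_crit * se)
manual_ci

# Compare with R's built-in intervals (should agree up to floating-point error)
all.equal(unname(manual_ci), unname(confint(fit, level = 0.95)))
```

Naming the quantile t_crit rather than t avoids masking the transpose function t().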

Notice that

qt(0.975, df = n - p)
[1] 1.979439

is close to the number 2.

So we could approximate the 95% CI by using 2 in place of the exact quantile:

17.833 - (3.523 * 2)
[1] 10.787
17.833 + (3.523 * 2)
[1] 24.879

Question: when is this approximation valid?
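
One way to explore this question is to look at how the t quantile changes with the degrees of freedom. A minimal sketch; the particular df values are arbitrary choices for illustration:

```r
# 97.5th percentile of the t distribution for several degrees of freedom
df_values <- c(5, 10, 30, 60, 120, 1000)
t_crit <- qt(0.975, df = df_values)

# Relative error made by using 2 in place of the exact quantile
data.frame(
  df = df_values,
  t_crit = round(t_crit, 3),
  rel_error_of_2 = round((2 - t_crit) / t_crit, 3)
)

# Limiting value as df grows: the standard normal quantile
qnorm(0.975)
```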

Prediction interval

For a public college with an enrollment of 50,000 and a nonsense value of 9, what is the 95% prediction interval for football expenditure?

Prediction:

xstar_df <- tibble(
  enrollment_th = 50,
  nonsense = 9,
  type = "Public"
)
predict(exp_fit, newdata = xstar_df, interval = "prediction",
        level = 0.95)
     fit      lwr     upr
1 46.811 22.54961 71.0724

Compute the prediction interval manually.
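
As a reference for the steps below, the 95% prediction interval at a new covariate vector \(x_*\) can be written as follows. The notation is assumed here: \(\hat{\sigma}\) is the residual standard error, \(X\) is the design matrix, and \(t_{0.975,\, n-p}\) is the 0.975 quantile of the t distribution with \(n - p\) degrees of freedom.

\[
\hat{y}_* \;\pm\; t_{0.975,\, n-p} \, \hat{\sigma} \sqrt{1 + x_*^\top (X^\top X)^{-1} x_*},
\qquad \hat{y}_* = x_*^\top \hat{\beta}
\]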

\(\hat{y}_*\):

X <- model.matrix(total_exp_m ~ enrollment_th + type + nonsense, 
                  data = football)

y = football$total_exp_m

betaHat = solve(t(X) %*% X) %*% t(X) %*% y

# order matches the columns of X: (Intercept), enrollment_th, typePublic, nonsense
xstar = matrix(c(1, 50, 1, 9), ncol = 1)
prediction = t(xstar) %*% betaHat
prediction
       [,1]
[1,] 46.811

Prediction interval:

sigma_hat <- glance(exp_fit)$sigma
t_crit <- qt(0.975, df = n - p) # named t_crit so it does not mask the transpose function t()
se <- sigma_hat * sqrt(1 + t(xstar) %*% solve(t(X) %*% X) %*% xstar)

prediction - (se * t_crit)
         [,1]
[1,] 22.54961
prediction + (se * t_crit)
        [,1]
[1,] 71.0724
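
The "1 +" inside the square root is what separates a prediction interval for a single new observation from a confidence interval for the mean response at the same covariate values. A minimal sketch of the contrast, assuming the built-in mtcars data and a hypothetical new car (neither appears in these notes):

```r
# Illustrative only: built-in mtcars data, not the football data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
n <- nrow(X)
p <- ncol(X)

new_car <- data.frame(wt = 3, hp = 150)   # hypothetical new observation
xstar <- matrix(c(1, 3, 150), ncol = 1)   # same values, intercept first

y_hat     <- drop(t(xstar) %*% coef(fit))
sigma_hat <- sigma(fit)                   # residual standard error
t_crit    <- qt(0.975, df = n - p)
leverage  <- drop(t(xstar) %*% solve(t(X) %*% X) %*% xstar)

# Interval for the mean response: no "1 +" term
conf_int <- y_hat + c(-1, 1) * t_crit * sigma_hat * sqrt(leverage)
# Interval for a single new observation: adds the error variance
pred_int <- y_hat + c(-1, 1) * t_crit * sigma_hat * sqrt(1 + leverage)

conf_int
pred_int

# Built-in equivalents for comparison
predict(fit, newdata = new_car, interval = "confidence", level = 0.95)
predict(fit, newdata = new_car, interval = "prediction", level = 0.95)
```

The prediction interval is always the wider of the two, because it accounts for the variability of an individual observation in addition to the uncertainty in the estimated mean.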