Confidence intervals

Author

Dr. Alexander Fisher

Libraries and data sets used in these notes:

library(tidyverse)
library(tidymodels)
library(DT) # datatable viewing
library(patchwork)
library(knitr)

knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "80%",
  fig.align = "center"
)

set.seed(221)

football <- 
  read_csv("https://sta221-fa25.github.io/data/ncaa-football-exp.csv") |>
  mutate(nonsense = runif(n(), 0, 10))

Data: NCAA Football expenditures

Same data as before (reminder):

Today’s data come from Equity in Athletics Data Analysis and include information about sports expenditures and revenues for colleges and universities in the United States. The data set was featured in a March 2022 Tidy Tuesday.

We will focus on 2019 football expenditures (the 2019 - 2020 season) for institutions in the NCAA Division 1 FBS (Football Bowl Subdivision). The variables are:

total_exp_m: Total expenditures on football in the 2019 - 2020 academic year (in millions USD)
enrollment_th: Total student enrollment in the 2019 - 2020 academic year (in thousands)
type: institution type (Public or Private)
nonsense: a created variable (see above) which has nothing to do with expenditure

Regression model

exp_fit <- lm(total_exp_m ~ enrollment_th + type + nonsense, data = football)
tidy(exp_fit) |>
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	17.833	3.523	5.061	0.000
enrollment_th	0.796	0.112	7.095	0.000
typePublic	-13.520	3.178	-4.254	0.000
nonsense	0.298	0.371	0.803	0.423

Confidence interval in R

We can compute the confidence intervals in R easily:

# alpha = 0.05; CI = 95%
tidy(exp_fit, conf.int = TRUE, conf.level = 0.95) |> 
  kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	17.833	3.523	5.061	0.000	10.859	24.807
enrollment_th	0.796	0.112	7.095	0.000	0.574	1.018
typePublic	-13.520	3.178	-4.254	0.000	-19.812	-7.229
nonsense	0.298	0.371	0.803	0.423	-0.436	1.032

We can verify manually. For example, for the intercept:

n = nrow(football) # number of observations
p = 4 # number of coefficients in beta (intercept + 3 predictors)
17.833 - (3.523 * qt(0.975, df = n - p))

[1] 10.85944

17.833 + (3.523 * qt(0.975, df = n - p))

[1] 24.80656
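
The same check can be run for every coefficient at once. The sketch below is only an illustration: it uses simulated stand-in data (the names `sim`, `x1`, `x2` and `fit` are hypothetical and not part of these notes) so that it runs without downloading anything. It builds estimate ± critical value × standard error by hand and compares the result with base R's `confint()`.

```r
# Simulated stand-in data (hypothetical; not the football data)
set.seed(1)
sim <- data.frame(x1 = rnorm(50), x2 = runif(50))
sim$y <- 2 + 1.5 * sim$x1 - 0.5 * sim$x2 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = sim)

est <- coef(fit)                              # coefficient estimates
se <- sqrt(diag(vcov(fit)))                   # standard errors
t_crit <- qt(0.975, df = df.residual(fit))    # df = n - p

# 95% interval for each coefficient, computed by hand
manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
manual

# Built-in equivalent for comparison
confint(fit, level = 0.95)
```

Using `df.residual(fit)` avoids hard-coding `n` and `p`, which may be convenient when the model formula changes.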

Notice that

qt(0.975, df = n - p)

[1] 1.979439

is close to the number 2.

We could therefore approximate the 95% CI by using 2 in place of the exact critical value:

17.833 - (3.523 * 2)

[1] 10.787

17.833 + (3.523 * 2)

[1] 24.879

Question: when is this approximation valid?
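
One way to explore this question is to look at how the critical value changes with the residual degrees of freedom. The sketch below is an illustration, not a full answer; the degrees-of-freedom values in `dfs` are arbitrary choices made for this example.

```r
# Arbitrary residual degrees of freedom, chosen for illustration
dfs <- c(5, 10, 30, 100, 1000)

# 0.975 quantile of the t distribution at each df
crit <- qt(0.975, df = dfs)

# Compare with the standard normal quantile and with the rule-of-thumb value 2
comparison <- data.frame(
  df = dfs,
  t_crit = crit,
  diff_from_2 = crit - 2,
  diff_from_normal = crit - qnorm(0.975)
)
comparison
```

The critical value shrinks as the degrees of freedom grow and settles near the normal quantile, so the "multiply by 2" shortcut is worth checking against `qt()` whenever `n - p` is small.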

Prediction interval

For a public college with an enrollment of 50,000 and a nonsense value of 9, what is the 95% prediction interval for football expenditure?

Prediction:

xstar_df <- tibble(
  enrollment_th = 50,
  nonsense = 9,
  type = "Public"
)
predict(exp_fit, newdata = xstar_df, interval = "prediction",
        level = 0.95)

     fit      lwr     upr
1 46.811 22.54961 71.0724
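
A prediction interval for a new observation is different from a confidence interval for the mean response at the same covariate values. The sketch below contrasts the two using the `interval` argument of `predict()`. It is a minimal illustration on simulated stand-in data (`sim`, `fit` and `new_obs` are hypothetical names, not part of these notes).

```r
# Simulated stand-in data (hypothetical; not the football data)
set.seed(2)
sim <- data.frame(x = runif(40, 0, 10))
sim$y <- 3 + 2 * sim$x + rnorm(40, sd = 2)

fit <- lm(y ~ x, data = sim)
new_obs <- data.frame(x = 5)

# Interval for the mean response at x = 5
ci_mean <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)

# Interval for a single new observation at x = 5
pi_new <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

ci_mean
pi_new
```

Both calls return the same fitted value, but the prediction interval is wider because it also accounts for the noise in an individual new observation.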

Exercise

Compute the prediction interval manually.

Solution

First, the point prediction \(\hat{y}_*\):

X <- model.matrix(total_exp_m ~ enrollment_th + type + nonsense, 
                  data = football)

y = football$total_exp_m

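# least-squares estimate: (X'X)^{-1} X'y
betaHat = solve(t(X) %*% X) %*% t(X) %*% y

# entries follow the columns of X: intercept, enrollment_th, typePublic, nonsense
xstar = matrix(c(1, 50, 1, 9), ncol = 1)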
prediction = t(xstar) %*% betaHat
prediction

       [,1]
[1,] 46.811

Prediction interval:

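sigma_hat <- glance(exp_fit)$sigma
t_crit <- qt(0.975, df = n - p) # named t_crit to avoid masking t()
# the leading 1 accounts for the noise in a new observation
se <- sigma_hat * sqrt(1 + t(xstar) %*% solve(t(X) %*% X) %*% xstar)

prediction - (se * t_crit)

         [,1]
[1,] 22.54961

prediction + (se * t_crit)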

        [,1]
[1,] 71.0724
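
In symbols, the manual computation above corresponds to the interval below. The notation is an assumption made here for illustration: \(x_*\) is the covariate vector of the new observation (including the leading 1 for the intercept), \(\hat{\sigma}\) is the residual standard error, and \(t^*_{n-p}\) is the 0.975 quantile of the t distribution with \(n - p\) degrees of freedom.

\[
\hat{y}_* \pm t^*_{n-p} \, \hat{\sigma} \sqrt{1 + x_*^\top (X^\top X)^{-1} x_*},
\qquad \hat{y}_* = x_*^\top \hat{\beta}
\]

Dropping the leading 1 under the square root would give the narrower confidence interval for the mean response at \(x_*\) instead.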