Hypothesis testing

Author

Dr. Alexander Fisher

View libraries and data sets used in these notes
library(tidyverse)
library(tidymodels)
library(DT) # datatable viewing
library(patchwork)
library(knitr)

knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "80%",
  fig.align = "center"
)

set.seed(221)

football <- 
  read_csv("https://sta221-fa25.github.io/data/ncaa-football-exp.csv") |>
  mutate(nonsense = runif(n(), 0, 10))

Data: NCAA Football expenditures

Today’s data come from the Equity in Athletics Data Analysis and include information about sports expenditures and revenues for colleges and universities in the United States. This data set was featured in a March 2022 Tidy Tuesday.

We will focus on the 2019 (2019-2020 season) expenditures on football for institutions in the NCAA Division I FBS (Football Bowl Subdivision). The variables are:

  • total_exp_m: Total expenditures on football in the 2019 - 2020 academic year (in millions USD)

  • enrollment_th: Total student enrollment in the 2019 - 2020 academic year (in thousands)

  • type: institution type (Public or Private)

  • nonsense: a created variable (see above) which has nothing to do with expenditure

Univariate EDA

Bivariate EDA
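The rendered EDA plots are not reproduced in this text version. A minimal sketch of code that could generate them, using the variables described above (the binwidth is an arbitrary choice):

```r
library(tidyverse)
library(patchwork)

# Univariate: distribution of total football expenditures
p1 <- ggplot(football, aes(x = total_exp_m)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Total expenditures (millions USD)", y = "Count")

# Bivariate: expenditures vs. enrollment, colored by institution type
p2 <- ggplot(football, aes(x = enrollment_th, y = total_exp_m, color = type)) +
  geom_point() +
  labs(x = "Enrollment (thousands)", y = "Total expenditures (millions USD)")

p1 + p2  # side-by-side via patchwork
```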

Regression model

exp_fit <- lm(total_exp_m ~ enrollment_th + type + nonsense, data = football)
tidy(exp_fit) |>
  kable(digits = 3)
term           estimate  std.error  statistic  p.value
-------------  --------  ---------  ---------  -------
(Intercept)      17.833      3.523      5.061    0.000
enrollment_th     0.796      0.112      7.095    0.000
typePublic      -13.520      3.178     -4.254    0.000
nonsense          0.298      0.371      0.803    0.423

From sample to population

For every additional 1,000 students, we expect an institution’s total expenditures on football to increase by $796,000, on average, holding institution type and nonsense constant.

  • This estimate is valid for the single sample of 127 higher education institutions in the 2019 - 2020 academic year.
  • But what if we’re not interested in quantifying the relationship between student enrollment, institution type, and football expenditures for this single sample?
  • What if we want to say something about the relationship between these variables for all colleges and universities with football programs and across different years?

Hypothesis testing

Goal: evaluate the evidence that \(\beta_j \neq 0\).

Recall that our assumptions imply

\[ \hat{\beta} \sim \text{MVN}(\beta, \sigma^2 (\boldsymbol{X}^T \boldsymbol{X})^{-1}) \] which further implies

\[ \hat{\beta}_j \sim N(\beta_j, \sigma^2 C_{jj}), \] where \(C = \left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1}\) and \(C_{jj}\) is its \(j\)th diagonal element.

If \(\beta_j = 0\), then \(\hat{\beta}_j \sim N(0, \sigma^2 C_{jj})\).

Testing question: is the observed value of \(\hat{\beta}_j\) “consistent” with the hypothesis \(H_0: \beta_j = 0\)? In other words, is \(\hat{\beta}_j \sim N(0, \sigma^2 C_{jj})\) a reasonable conclusion?

Idea: under \(H_0\), \(Z_{\beta_j} = \frac{\hat{\beta}_j - 0}{\sqrt{\sigma^2 C_{jj}}} \sim N(0, 1)\).

If \(|Z_{\beta_j}|\) is large, we will reject the null. If it is small, we will ‘fail to reject’ the null.

Note

We can’t compute \(Z_{\beta_j}\) directly because we don’t know \(\sigma^2\). What we’ll do is plug in the unbiased estimator \(\hat{\sigma}^2 = \frac{\text{RSS}}{n-p}\) for \(\sigma^2\).
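As a sanity check, \(\hat{\sigma}^2\) can be computed directly from the residuals of the fitted model; it should match glance(exp_fit)$sigma^2:

```r
n <- nrow(football)  # 127 institutions
p <- 4               # coefficients: intercept, enrollment_th, typePublic, nonsense
rss <- sum(residuals(exp_fit)^2)  # residual sum of squares
sigma_sq_hat <- rss / (n - p)     # unbiased estimator of sigma^2
sigma_sq_hat
```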

\[ t_{\beta_j} = \frac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2 C_{jj}}} \sim t_{n-p} \]

\(t_{\beta_j}\) is called a test statistic because it is the statistic (function of the data) that summarizes our evidence against the null.

Let \(t_{1-\frac{\alpha}{2}, n- p}\) be defined as qt(1 - (alpha/2), n - p), i.e. the (1-\(\alpha/2\)) quantile of a t-distribution with \(n-p\) degrees of freedom. Then
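For example, with the \(n = 127\) institutions and \(p = 4\) coefficients from the NCAA model, the \(\alpha = 0.05\) critical value is:

```r
alpha <- 0.05
n <- 127  # institutions in the sample
p <- 4    # coefficients: intercept, enrollment_th, typePublic, nonsense
qt(1 - alpha/2, df = n - p)  # critical value; close to qnorm(0.975) = 1.96 at this df
```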

\[ \begin{aligned} Pr(\text{reject } H_0 \mid \beta_j = 0) &= Pr(|t_{\beta_j}| > t_{1-\frac{\alpha}{2}, n-p} \mid \beta_j = 0)\\ &= Pr(|t| > t_{1-\frac{\alpha}{2}, n-p})\\ &= \frac{\alpha}{2} + \frac{\alpha}{2}\\ &= \alpha, \end{aligned} \] where \(t \sim t_{n-p}\).

p-value

A p-value can be equivalently described as: (1) a tail probability of a statistic under the null hypothesis, (2) the lowest value of \(\alpha\) such that \(H_0\) is rejected, (3) \(Pr(|t| > |t_{\beta_j}^{(obs)}|)\), where \(t \sim t_{n-p}\) and \(t_{\beta_j}^{(obs)}\) is the observed value of the test statistic.

In R, we can compute it as 2 * (1 - pt(abs(t), n - p)).

Hypothesis testing in R

Manually computing the standard error

\(Var(\hat{\boldsymbol{\beta}})\) for NCAA data

X <- model.matrix(total_exp_m ~ enrollment_th + type + nonsense, 
                  data = football)
sigma_sq <- glance(exp_fit)$sigma^2

var_beta <- sigma_sq * solve(t(X) %*% X)
var_beta
              (Intercept) enrollment_th typePublic     nonsense
(Intercept)    12.4139465  -0.170593886 -5.4231006 -0.692309267
enrollment_th  -0.1705939   0.012597357 -0.1315619  0.007350219
typePublic     -5.4231006  -0.131561941 10.1018139 -0.136025423
nonsense       -0.6923093   0.007350219 -0.1360254  0.137611482

\(SE(\hat{\boldsymbol{\beta}})\) for NCAA data

term           estimate  std.error  statistic  p.value
-------------  --------  ---------  ---------  -------
(Intercept)      17.833      3.523      5.061    0.000
enrollment_th     0.796      0.112      7.095    0.000
typePublic      -13.520      3.178     -4.254    0.000
nonsense          0.298      0.371      0.803    0.423
sqrt(diag(var_beta))
  (Intercept) enrollment_th    typePublic      nonsense 
    3.5233431     0.1122379     3.1783351     0.3709602 

Compute p-value

n <- nrow(football)
p <- 4
2 * (1 - pt(abs(0.803), n - p))
[1] 0.4235237

Visually,

Figure 1: Standard normal vs. t distributions
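The figure itself is not reproduced here. A sketch of code that could generate a comparable plot, overlaying the standard normal density with t densities at a small df and at the model’s df (\(n - p = 123\)):

```r
library(ggplot2)

# Overlay the standard normal density with two t densities;
# as df grows, the t distribution approaches N(0, 1)
ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm, aes(color = "N(0, 1)")) +
  stat_function(fun = dt, args = list(df = 5), aes(color = "t, df = 5")) +
  stat_function(fun = dt, args = list(df = 123), aes(color = "t, df = 123")) +
  labs(x = NULL, y = "density", color = NULL)
```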