View libraries and data sets used in these notes
library(tidyverse) # data wrangling and visualization
library(tidymodels) # modeling (includes broom, yardstick, and other packages)
By the end of today you will be able to: write the linear regression model in matrix notation, review the matrix algebra facts we need (transpose, inverse, positive definiteness, vector derivatives), and derive the ordinary least squares (OLS) estimator.
We write down simple linear regression in matrix form as
\[ \boldsymbol{y}= \boldsymbol{X}\beta + \boldsymbol{\varepsilon}, \]
where \(\boldsymbol{y}\) is an \(n \times 1\) vector of responses, \(\boldsymbol{X}\) is an \(n \times p\) matrix of predictors, \(\beta\) is a \(p \times 1\) vector of coefficients, \(\boldsymbol{\varepsilon}\) is an \(n \times 1\) vector of errors, and \(p = 2\), i.e. there's an intercept and one slope. More explicitly:
\[ \mathbf{y} = \mathbf{X}\beta + \boldsymbol{\varepsilon}, \]
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}. \]
\(\boldsymbol{X}\) is called the “design matrix”, “covariate matrix”, “model matrix” or even sometimes the “data matrix”. It contains one column for each predictor, plus a column of ones for the intercept (if applicable).
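For instance, model.matrix() in R builds this design matrix from a model formula. Below is a minimal sketch using a small made-up data frame (not one of the course data sets):
# a small made-up data set, for illustration only
df <- tibble(x = c(1.2, 3.4, 5.6, 7.8),
             y = c(2.1, 3.9, 6.2, 8.0))
# design matrix for y ~ x: a column of ones (intercept) and a column for x
X <- model.matrix(y ~ x, data = df)
X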
The identity matrix of size \(n \times n\), denoted \(\mathbf{I}_n\), is a square matrix with ones on the diagonal and zeros elsewhere:
\[ \mathbf{I}_n = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}. \]
For any vector \(\mathbf{v}\),
\[
\mathbf{I}_n \mathbf{v} = \mathbf{v}.
\]
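In R, diag(n) constructs \(\mathbf{I}_n\). A quick check with an arbitrary vector:
I3 <- diag(3)     # the 3 x 3 identity matrix
v <- c(2, -1, 5)  # an arbitrary vector
I3 %*% v          # returns v (as a column vector)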
An \(m \times n\) matrix \(\mathbf{A}\) represents a linear operator that maps vectors to vectors:
\[ \mathbf{A} : \mathbb{R}^n \to \mathbb{R}^m, \quad \mathbf{v} \mapsto \mathbf{A}\mathbf{v}. \]
Matrix multiplication distributes over addition:
\[ \mathbf{A}(\mathbf{u} + \mathbf{v}) = \mathbf{A}\mathbf{u} + \mathbf{A}\mathbf{v}, \]
and scalar multiplication:
\[ \mathbf{A}(c \mathbf{v}) = c (\mathbf{A}\mathbf{v}). \]
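A quick numerical check of both properties in R, with arbitrary small matrices and vectors chosen for illustration:
A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)   # an arbitrary 2 x 3 matrix
u <- c(1, 0, -1)
v <- c(2, 2, 2)
all.equal(A %*% (u + v), A %*% u + A %*% v)  # TRUE
all.equal(A %*% (3 * v), 3 * (A %*% v))      # TRUE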
A square matrix \(\mathbf{A}\) of size \(n \times n\) has an inverse \(\mathbf{A}^{-1}\) if
\[ \mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n. \]
Only full-rank matrices (rank \(n\)) are invertible.
If \(\det(\mathbf{A}) = 0\), then \(\mathbf{A}\) is singular and has no inverse.
To invert a matrix in R, use solve(), for example:
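A <- matrix(c(2, 1, 1, 3), nrow = 2)  # an arbitrary invertible 2 x 2 matrix
A_inv <- solve(A)
A_inv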
Check:
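A %*% A_inv  # recovers the 2 x 2 identity matrix (up to rounding error)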
A square matrix \(\mathbf{A}\) is symmetric if
\[ \mathbf{A} = \mathbf{A}^T. \]
That is, \(\mathbf{A}_{ij} = \mathbf{A}_{ji}\) for all \(i,j\).
To take a transpose in R, use t(A), for example:
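A <- matrix(1:6, nrow = 2)  # an arbitrary 2 x 3 matrix
t(A)                        # its 3 x 2 transpose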
A symmetric matrix \(\mathbf{A}\) is
Positive Definite (PD) if
\[
\mathbf{x}^T \mathbf{A} \mathbf{x} > 0 \quad \text{for all } \mathbf{x} \neq \mathbf{0}.
\]
Positive Semi-Definite (PSD) if
\[
\mathbf{x}^T \mathbf{A} \mathbf{x} \geq 0 \quad \text{for all } \mathbf{x}.
\]
Equivalently, all eigenvalues of \(\mathbf{A}\) are positive (the matrix is PD) or nonnegative (the matrix is PSD).
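For example, we can check the eigenvalues numerically in R; the symmetric matrix below is arbitrary and chosen only for illustration:
S <- matrix(c(2, 1, 1, 2), nrow = 2)  # an arbitrary symmetric matrix
eigen(S)$values                       # 3 and 1: all positive, so S is PD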
A vector \(\mathbf{v} \in \mathbb{R}^n\) is said to be orthogonal to a matrix \(\mathbf{A} \in \mathbb{R}^{n \times p}\) if it is orthogonal to every column of \(\mathbf{A}\).
That is,
\[ \mathbf{v}^T \mathbf{a}_j = 0 \quad \text{for each column } \mathbf{a}_j \text{ of } \mathbf{A}. \]
Equivalently,
\[ \mathbf{A}^T \mathbf{v} = \mathbf{0}. \]
In this case, \(\mathbf{v}\) lies in the orthogonal complement of the column space of \(\mathbf{A}\), often written as
\[ \mathbf{v} \in \text{Col}(\mathbf{A})^\perp. \]
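A small numerical illustration in R (the matrix and vector are arbitrary choices):
A <- matrix(c(1, 0, 0,
              0, 1, 0), nrow = 3)  # columns are e1 and e2 in R^3
v <- c(0, 0, 4)                    # orthogonal to both columns
t(A) %*% v                         # the zero vector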
Let \(\mathbf{x} \in \mathbb{R}^n\) be a column vector and \(\mathbf{A} \in \mathbb{R}^{n \times n}\) a matrix.
Gradients with respect to vectors are written as
\[ \frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}. \]
Consider the scalar function
\[ f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}, \]
where \(\mathbf{a} \in \mathbb{R}^n\).
Then
\[ \frac{\partial}{\partial \mathbf{x}} \, (\mathbf{a}^T \mathbf{x}) = \mathbf{a}. \]
Equivalently,
\[ \frac{\partial}{\partial \mathbf{x}} \, (\mathbf{x}^T \mathbf{a}) = \mathbf{a}. \]
Consider the quadratic form
\[ f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}. \]
In general,
\[ \frac{\partial}{\partial \mathbf{x}} \, (\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}, \]
and when \(\mathbf{A}\) is symmetric this simplifies to
\[ \frac{\partial}{\partial \mathbf{x}} \, (\mathbf{x}^T \mathbf{A} \mathbf{x}) = 2 \mathbf{A} \mathbf{x}. \]
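These identities can be spot-checked numerically. The helper numeric_grad below is not from the notes; it is just an illustrative central-difference approximation to the gradient:
numeric_grad <- function(f, x, h = 1e-6) {
  # central-difference approximation to the gradient of f at x
  sapply(seq_along(x), function(j) {
    e <- replace(numeric(length(x)), j, h)  # h in coordinate j, 0 elsewhere
    (f(x + e) - f(x - e)) / (2 * h)
  })
}
a <- c(1, -2, 3)
A <- matrix(c(2, 1, 0,
              1, 3, 1,
              0, 1, 4), nrow = 3)  # an arbitrary symmetric matrix
x <- c(0.5, 1, -1)
numeric_grad(function(x) sum(a * x), x)                    # approximately a
numeric_grad(function(x) as.numeric(t(x) %*% A %*% x), x)  # approximately 2 A x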
For everything else, see The Matrix Cookbook.
The ordinary least squares (OLS) estimator is defined as
\[ \hat{\beta}_{\text{OLS}} = \arg \min_{\beta} \; \boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon}, \]
where
\[ \boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{X}\beta. \]
Notice this is equivalent to the sum of squares we saw before:
\[ \boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 & \varepsilon_2 & \cdots & \varepsilon_n \end{bmatrix} \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \sum_{i=1}^n \varepsilon_i^2. \]
To optimize, we need to take the derivative with respect to the vector \(\beta\) and set it equal to zero. To begin, we differentiate,
\[ \begin{aligned} \frac{\partial}{\partial{\beta}} \boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon} &= \frac{\partial}{\partial{\beta}} (\boldsymbol{y}- \boldsymbol{X}\beta)^T(\boldsymbol{y}- \boldsymbol{X}\beta)\\ &= \frac{\partial}{\partial{\beta}} \left( \boldsymbol{y}^T \boldsymbol{y}- \boldsymbol{y}^T \boldsymbol{X}\beta - \beta^T \boldsymbol{X}^T \boldsymbol{y}+ \beta^T \boldsymbol{X}^T \boldsymbol{X}\beta \right)\\ &= \frac{\partial}{\partial{\beta}} \left( \boldsymbol{y}^T \boldsymbol{y} -2 \beta^T \boldsymbol{X}^T \boldsymbol{y}+ \beta^T \boldsymbol{X}^T \boldsymbol{X}\beta \right)\\ &= -2 \boldsymbol{X}^T \boldsymbol{y}+ 2 \boldsymbol{X}^T \boldsymbol{X}\beta. \end{aligned} \]
When we set the derivative equal to zero and solve,
\[ \begin{aligned} -2 \boldsymbol{X}^T \boldsymbol{y}+ 2 \boldsymbol{X}^T \boldsymbol{X}\hat{\beta} &= 0 \implies\\ \hat{\beta} &= \left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1} \boldsymbol{X}^T \boldsymbol{y}, \end{aligned} \]
where we assume \(\boldsymbol{X}^T \boldsymbol{X}\) is invertible, i.e. \(\boldsymbol{X}\) has full column rank.
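As a sanity check, the closed-form estimate matches lm(). The data below are simulated for illustration (not one of the course data sets):
set.seed(1)
n <- 100
x <- runif(n)
y <- 3 + 2 * x + rnorm(n, sd = 0.5)  # simulate from a known line
X <- cbind(1, x)                     # design matrix: intercept column and x
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat
coef(lm(y ~ x))                      # same estimates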
Question: we have found a critical point, but is it the value that minimizes the sum of squared residuals?
Answer: check the Hessian matrix. The Hessian is the square matrix of second-order partial derivatives.
\[ \begin{aligned} \frac{\partial^2}{\partial{\beta}^2} \boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon} &= \frac{\partial}{\partial{\beta}} \left( -2 \boldsymbol{X}^T \boldsymbol{y} + 2 \boldsymbol{X}^T \boldsymbol{X}\beta \right)\\ &= 2 \boldsymbol{X}^T \boldsymbol{X} \end{aligned} \]
If the Hessian is positive definite, we have a minimum.
If the Hessian is negative definite, we have a maximum.
If the Hessian is neither, then we have a saddle point.
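As a quick numerical illustration (a sanity check, not a proof), the eigenvalues of \(2\boldsymbol{X}^T\boldsymbol{X}\) are positive for a simulated design matrix:
X <- cbind(1, runif(100))      # an example design matrix: intercept + one predictor
eigen(2 * t(X) %*% X)$values   # both eigenvalues positive: the Hessian is PD here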
Exercise on homework 1: show that the Hessian matrix \(\boldsymbol{X}^T \boldsymbol{X}\) is positive definite.
See handwritten notes.