Matrix algebra of regression

Author

Dr. Alexander Fisher

View libraries and data sets used in these notes
library(tidyverse) # data wrangling and visualization
library(tidymodels)  # modeling (includes broom, yardstick, and other packages)

Learning objectives

By the end of today you will be able to

  1. write down simple linear regression in matrix form
  2. understand the geometry of OLS regression
  3. explain the following vocabulary: design matrix, Hessian matrix, projection

Simple linear regression (p = 2)

We write simple linear regression in matrix form as

\[ \boldsymbol{y}= \boldsymbol{X}\beta + \boldsymbol{\varepsilon}, \]

where

  • \(\boldsymbol{y}\in \mathbb{R}^n\)
  • \(\boldsymbol{X}\in \mathbb{R}^{n \times p}\)
  • \(\beta \in \mathbb{R}^p\)
  • \(\boldsymbol{\varepsilon}\in \mathbb{R}^n\)

and \(p = 2\), i.e. there is an intercept and one slope. More explicitly:

\[ \mathbf{y} = \mathbf{X}\beta + \boldsymbol{\varepsilon}, \]

\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}. \]

Definition: design matrix

\(\boldsymbol{X}\) is called the “design matrix”, “covariate matrix”, “model matrix” or even sometimes the “data matrix”. It contains one column for each predictor, plus a column of ones for the intercept (if applicable).
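In R, the design matrix that a model formula builds can be inspected with model.matrix(). Here is a small sketch using the built-in cars data set (chosen purely for illustration, not one of the course data sets):

# design matrix for a simple linear regression of dist on speed
X = model.matrix(~ speed, data = cars)
head(X)   # first column: intercept (all ones); second column: the predictor
dim(X)    # n rows and p = 2 columns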

Linear algebra background

Identity Matrix

The identity matrix of size \(n \times n\), denoted \(\mathbf{I}_n\), is a square matrix with ones on the diagonal and zeros elsewhere:

\[ \mathbf{I}_n = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}. \]

For any vector \(\mathbf{v}\),
\[ \mathbf{I}_n \mathbf{v} = \mathbf{v}. \]
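In R, diag(n) constructs \(\mathbf{I}_n\); a quick sketch of the property above with an arbitrary vector:

I3 = diag(3)      # 3 x 3 identity matrix
v = c(2, -1, 5)
I3 %*% v          # returns v (as a column vector), unchanged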


Matrices as Linear Operators

An \(m \times n\) matrix \(\mathbf{A}\) represents a linear operator that maps vectors to vectors:

\[ \mathbf{A} : \mathbb{R}^n \to \mathbb{R}^m, \quad \mathbf{v} \mapsto \mathbf{A}\mathbf{v}. \]

Matrix multiplication distributes over addition:

\[ \mathbf{A}(\mathbf{u} + \mathbf{v}) = \mathbf{A}\mathbf{u} + \mathbf{A}\mathbf{v}, \]

and scalar multiplication:

\[ \mathbf{A}(c \mathbf{v}) = c (\mathbf{A}\mathbf{v}). \]
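A small numerical sketch of both properties, with a matrix and vectors made up purely for illustration:

A = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)     # a 2 x 3 matrix
u = c(1, 0, -1)
v = c(2, 2, 2)
all.equal(A %*% (u + v), A %*% u + A %*% v)   # distributes over addition
all.equal(A %*% (3 * v), 3 * (A %*% v))       # commutes with scalar multiplication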


Matrix Inverses

A square matrix \(\mathbf{A}\) of size \(n \times n\) has an inverse \(\mathbf{A}^{-1}\) if

\[ \mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n. \]

Only full-rank matrices (rank \(n\)) are invertible.
If \(\det(\mathbf{A}) = 0\), then \(\mathbf{A}\) is singular and has no inverse.

To invert a matrix in R, use solve(), for example:

(A = matrix(c(1,0,3,4), ncol = 2))
     [,1] [,2]
[1,]    1    3
[2,]    0    4
solve(A)
     [,1]  [,2]
[1,]    1 -0.75
[2,]    0  0.25

Check:

solve(A) %*% A
     [,1] [,2]
[1,]    1    0
[2,]    0    1

Symmetric Matrices

A square matrix \(\mathbf{A}\) is symmetric if

\[ \mathbf{A} = \mathbf{A}^T. \]

That is, \(\mathbf{A}_{ij} = \mathbf{A}_{ji}\) for all \(i,j\).

To take a transpose in R, use t(A), for example:

(A = matrix(c(1,2,3,4), ncol = 2))
     [,1] [,2]
[1,]    1    3
[2,]    2    4
t(A)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Positive Definite and Positive Semi-Definite Matrices

A symmetric matrix \(\mathbf{A}\) is

  • Positive Definite (PD) if
    \[ \mathbf{x}^T \mathbf{A} \mathbf{x} > 0 \quad \text{for all } \mathbf{x} \neq \mathbf{0}. \]

  • Positive Semi-Definite (PSD) if
    \[ \mathbf{x}^T \mathbf{A} \mathbf{x} \geq 0 \quad \text{for all } \mathbf{x}. \]

Equivalently, all eigenvalues of \(\mathbf{A}\) are positive (the matrix is PD) or nonnegative (the matrix is PSD).
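A sketch of checking definiteness numerically with eigen(), using a small symmetric matrix made up for illustration:

A = matrix(c(2, 1, 1, 2), ncol = 2)   # symmetric
eigen(A)$values                       # 3 and 1: all positive, so A is positive definite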


Vector Orthogonal to a Matrix

A vector \(\mathbf{v} \in \mathbb{R}^n\) is said to be orthogonal to a matrix \(\mathbf{A} \in \mathbb{R}^{n \times p}\) if it is orthogonal to every column of \(\mathbf{A}\).

That is,

\[ \mathbf{v}^T \mathbf{a}_j = 0 \quad \text{for each column } \mathbf{a}_j \text{ of } \mathbf{A}. \]

Equivalently,

\[ \mathbf{A}^T \mathbf{v} = \mathbf{0}. \]

In this case, \(\mathbf{v}\) lies in the orthogonal complement of the column space of \(\mathbf{A}\), often written as

\[ \mathbf{v} \in \text{Col}(\mathbf{A})^\perp. \]
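A quick numerical sketch (the matrix and vector below are made up for illustration):

A = matrix(c(1, 0, 1, 0, 1, 0), ncol = 2)   # columns (1, 0, 1) and (0, 1, 0)
v = c(1, 0, -1)
t(A) %*% v                                  # both entries are 0, so v is orthogonal to A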

Matrix calculus

Let \(\mathbf{x} \in \mathbb{R}^n\) be a column vector and \(\mathbf{A} \in \mathbb{R}^{n \times n}\) a matrix.

Gradients with respect to vectors are written as

\[ \frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}. \]


1. Derivative of an Inner Product

Consider the scalar function

\[ f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}, \]

where \(\mathbf{a} \in \mathbb{R}^n\).

  • Derivative with respect to \(\mathbf{x}\):

\[ \frac{\partial}{\partial \mathbf{x}} \, (\mathbf{a}^T \mathbf{x}) = \mathbf{a}. \]

Equivalently,

\[ \frac{\partial}{\partial \mathbf{x}} \, (\mathbf{x}^T \mathbf{a}) = \mathbf{a}. \]
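One way to convince yourself of this identity is a central-difference check in R; the vectors below are chosen arbitrarily for illustration:

a = c(1, -2, 3)
f = function(x) sum(a * x)            # f(x) = a^T x
x0 = c(0.5, 1, -1)
h = 1e-6
grad = sapply(1:3, function(j) {
  e = rep(0, 3); e[j] = h
  (f(x0 + e) - f(x0 - e)) / (2 * h)   # central difference in coordinate j
})
grad                                  # approximately (1, -2, 3), i.e. the vector a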


2. Derivative of a Quadratic Form

Consider the quadratic form

\[ f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}. \]

  • If \(\mathbf{A}\) is not assumed symmetric:

\[ \frac{\partial}{\partial \mathbf{x}} \, (\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}. \]

  • If \(\mathbf{A}\) is symmetric (\(\mathbf{A} = \mathbf{A}^T\)):

\[ \frac{\partial}{\partial \mathbf{x}} \, (\mathbf{x}^T \mathbf{A} \mathbf{x}) = 2 \mathbf{A} \mathbf{x}. \]
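The same style of finite-difference check works for the quadratic form; the symmetric matrix and evaluation point below are made up for illustration:

A = matrix(c(2, 1, 1, 3), ncol = 2)   # symmetric
f = function(x) as.numeric(t(x) %*% A %*% x)
x0 = c(1, -1)
h = 1e-6
grad = sapply(1:2, function(j) {
  e = rep(0, 2); e[j] = h
  (f(x0 + e) - f(x0 - e)) / (2 * h)
})
grad                                  # approximately (2, -4)
2 * A %*% x0                          # the formula 2 A x gives the same values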

For everything else, see The Matrix Cookbook.

OLS in Matrix form

The ordinary least squares (OLS) estimator is defined as

\[ \hat{\beta}_{\text{OLS}} = \arg \min_{\beta} \; \boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon}, \]

where

\[ \boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{X}\beta. \]

Notice this is equivalent to the sum of squares we saw before:

\[ \boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 & \varepsilon_2 & \cdots & \varepsilon_n \end{bmatrix} \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \sum_{i=1}^n \varepsilon_i^2. \]
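A quick numerical sanity check of this identity (residual values made up for illustration):

eps = c(1.2, -0.5, 0.3, -2.0)
t(eps) %*% eps   # 1 x 1 matrix containing the sum of squares
sum(eps^2)       # the same number as a scalar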

To optimize, we need to take the derivative with respect to the vector \(\beta\) and set it equal to zero. To begin, we differentiate,

\[ \begin{aligned} \frac{\partial}{\partial{\beta}} \boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon} &= \frac{\partial}{\partial{\beta}} (\boldsymbol{y}- \boldsymbol{X}\beta)^T(\boldsymbol{y}- \boldsymbol{X}\beta)\\ &= \frac{\partial}{\partial{\beta}} \left( \boldsymbol{y}^T \boldsymbol{y}- \boldsymbol{y}^T \boldsymbol{X}\beta - \beta^T \boldsymbol{X}^T \boldsymbol{y}+ \beta^T \boldsymbol{X}^T \boldsymbol{X}\beta \right)\\ &= \frac{\partial}{\partial{\beta}} \left( \boldsymbol{y}^T \boldsymbol{y} -2 \beta^T \boldsymbol{X}^T \boldsymbol{y}+ \beta^T \boldsymbol{X}^T \boldsymbol{X}\beta \right)\\ &= -2 \boldsymbol{X}^T \boldsymbol{y}+ 2 \boldsymbol{X}^T \boldsymbol{X}\beta. \end{aligned} \]

In the third line we used the fact that \(\boldsymbol{y}^T \boldsymbol{X}\beta\) is a scalar, so it equals its own transpose \(\beta^T \boldsymbol{X}^T \boldsymbol{y}\), which lets us combine the two middle terms.

When we set the derivative equal to zero and solve for \(\hat{\beta}\),

\[ \begin{aligned} -2 \boldsymbol{X}^T \boldsymbol{y}+ 2 \boldsymbol{X}^T \boldsymbol{X}\hat{\beta} &= 0 \implies\\ \hat{\beta} = \left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1} \boldsymbol{X}^T \boldsymbol{y}, \end{aligned} \]

where the last step assumes \(\boldsymbol{X}^T \boldsymbol{X}\) is invertible, i.e. \(\boldsymbol{X}\) has full column rank.
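A sketch of computing \(\hat{\beta}\) directly from this formula and comparing it to lm(), using simulated data (the sample size and true coefficients below are made up for illustration):

set.seed(8)
n = 100
x = runif(n)
y = 1 + 2 * x + rnorm(n, sd = 0.5)   # true intercept 1, true slope 2

X = cbind(1, x)                      # design matrix: intercept column and predictor
beta_hat = solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

coef(lm(y ~ x))                      # matches the closed-form estimates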

Did we find a minimum?

Question: this may be some optima, but did we find the value that minimizes the sum of squared residuals?

Answer: check the Hessian matrix, the square matrix of second partial derivatives with respect to \(\beta\).

\[ \begin{aligned} \frac{\partial^2}{\partial{\beta}^2} \boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon} &= \frac{\partial}{\partial{\beta}} \left( -2 \boldsymbol{X}^T \boldsymbol{y} + 2 \boldsymbol{X}^T \boldsymbol{X}\beta \right)\\ &= 2 \boldsymbol{X}^T \boldsymbol{X} \end{aligned} \]

  • If the Hessian is positive definite, we have a minimum.

  • If the Hessian is negative definite, we have a maximum.

  • If the Hessian is neither, then we have a saddle point.

Exercise on homework 1: show that the Hessian matrix \(\boldsymbol{X}^T \boldsymbol{X}\) is positive definite.

Geometry

See handwritten notes.