LU decomposition (Gaussian elimination) is not used in statistics as often because most of the time statisticians deal with positive (semi)definite matrices.
That is why we often criticize the use of the solve() function in base R code, which inverts a matrix using LU decomposition. First, we most likely do not need the matrix inverse at all. Second, we are most likely dealing with a pd/psd matrix and should use the Cholesky decomposition instead.
For example, in the normal equation \[
\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}
\] for linear regression, the coefficient matrix \(\mathbf{X}^T \mathbf{X}\) is symmetric and positive semidefinite. How to exploit this structure?
2 Cholesky decomposition
Theorem: Let \(\mathbf{A} \in \mathbb{R}^{n \times n}\) be symmetric and positive definite. Then \(\mathbf{A} = \mathbf{L} \mathbf{L}^T\), where \(\mathbf{L}\) is lower triangular with positive diagonal entries and is unique.

Proof (by induction):
If \(n=1\), then \(\ell = \sqrt{a}\). For \(n>1\), the block equation \[
\begin{eqnarray*}
\begin{pmatrix}
a_{11} & \mathbf{a}^T \\ \mathbf{a} & \mathbf{A}_{22}
\end{pmatrix} =
\begin{pmatrix}
\ell_{11} & \mathbf{0}_{n-1}^T \\ \mathbf{l} & \mathbf{L}_{22}
\end{pmatrix}
\begin{pmatrix}
\ell_{11} & \mathbf{l}^T \\ \mathbf{0}_{n-1} & \mathbf{L}_{22}^T
\end{pmatrix}
\end{eqnarray*}
\] has solution \[
\begin{eqnarray*}
\ell_{11} &=& \sqrt{a_{11}} \\
\mathbf{l} &=& \ell_{11}^{-1} \mathbf{a} \\
\mathbf{L}_{22} \mathbf{L}_{22}^T &=& \mathbf{A}_{22} - \mathbf{l} \mathbf{l}^T = \mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T.
\end{eqnarray*}
\]
Now \(a_{11}>0\) (why?), so \(\ell_{11}\) and \(\mathbf{l}\) are uniquely determined. \(\mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T\) is positive definite because \(\mathbf{A}\) is positive definite (why?). By induction hypothesis, \(\mathbf{L}_{22}\) exists and is unique.
The constructive proof completely specifies the algorithm:
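A minimal Julia sketch of this recursion (ours, written for clarity rather than speed; in practice one calls `cholesky` from `LinearAlgebra`, which wraps the tuned LAPACK routine) might look like:

```julia
using LinearAlgebra

# Unblocked Cholesky following the proof: take ℓ_kk = √a_kk, scale the column
# below the diagonal by 1/ℓ_kk, then update the Schur complement A22 - l*l'.
# Overwrites the lower triangle of A with L.
function chol_unblocked!(A::Matrix{Float64})
    n = size(A, 1)
    for k in 1:n
        A[k, k] > 0 || error("matrix is not positive definite")
        A[k, k] = sqrt(A[k, k])
        for i in (k + 1):n
            A[i, k] /= A[k, k]
        end
        for j in (k + 1):n, i in j:n
            A[i, j] -= A[i, k] * A[j, k]
        end
    end
    LowerTriangular(A)
end

A = [4.0 12 -16; 12 37 -43; -16 -43 98]   # a classic pd example
L = chol_unblocked!(copy(A))
L * L' ≈ A                                # true
```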
Computational cost: \[
\frac{1}{2} [2(n-1)^2 + 2(n-2)^2 + \cdots + 2 \cdot 1^2] \approx \frac{1}{3} n^3 \quad \text{flops}
\] plus \(n\) square roots. This is about half the cost of LU decomposition (\(\approx \frac{2}{3} n^3\) flops), obtained by exploiting symmetry.
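The approximation comes from the sum-of-squares identity \[
\sum_{k=1}^{n-1} 2k^2 = \frac{(n-1)n(2n-1)}{3} \approx \frac{2}{3} n^3,
\qquad \text{so} \qquad
\frac{1}{2} \sum_{k=1}^{n-1} 2k^2 \approx \frac{1}{3} n^3.
\]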
In general, Cholesky decomposition is very stable. Failure of the decomposition simply means that \(\mathbf{A}\) is not positive definite; it is thus an efficient way to test positive definiteness.
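For instance, in Julia one can attempt the factorization without throwing on failure (a small illustration of ours; `isposdef` in `LinearAlgebra` does essentially this internally):

```julia
using LinearAlgebra

A = [4.0 12 -16; 12 37 -43; -16 -43 98]   # positive definite
B = [1.0 2; 2 1]                          # indefinite: eigenvalues 3 and -1

issuccess(cholesky(Symmetric(A), check = false))   # true
issuccess(cholesky(Symmetric(B), check = false))   # false, factorization broke down
isposdef(B)                                        # false
```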
3 Pivoting
When \(\mathbf{A}\) does not have full rank, e.g., \(\mathbf{X}^T \mathbf{X}\) with an \(\mathbf{X}\) that does not have full column rank, we encounter \(a_{kk} = 0\) for some \(k\) during the procedure.
Symmetric pivoting. At each stage \(k\), we permute both row and column such that \(\max_{k \le i \le n} a_{ii}\) becomes the pivot. If we encounter \(\max_{k \le i \le n} a_{ii} = 0\), then \(\mathbf{A}[k:n,k:n] = \mathbf{0}\) (why?) and the algorithm terminates.
With symmetric pivoting: \[
\mathbf{P} \mathbf{A} \mathbf{P}^T = \mathbf{L} \mathbf{L}^T,
\] where \(\mathbf{P}\) is a permutation matrix and \(\mathbf{L} \in \mathbb{R}^{n \times r}\), \(r = \text{rank}(\mathbf{A})\).
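The snippet below verifies this identity in Julia. The definitions of `A` and `Achol` do not appear in this excerpt, so the setup shown here is only a plausible reconstruction (a rank-deficient psd matrix factored with pivoted Cholesky; `RowMaximum()` requires Julia ≥ 1.8, older versions use `Val(true)`):

```julia
using LinearAlgebra

# A rank-deficient psd matrix: X has two columns, so A = X * X' is 3×3 with rank 2.
X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
A = X * X'

# Pivoted Cholesky; check = false so rank deficiency does not throw,
# tol is the threshold below which remaining pivots are treated as zero.
Achol = cholesky(Symmetric(A), RowMaximum(), check = false, tol = 1e-10)
Achol.rank   # 2, the numerical rank
Achol.p      # permutation vector encoding P
```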
```julia
# P A P' = L U
A[Achol.p, Achol.p] ≈ Achol.L * Achol.U
```
true
5 Applications
No inversion mentality: Whenever we see a matrix inverse, we should think in terms of solving linear equations instead of actually forming the inverse. If the matrix is positive (semi)definite, use the Cholesky decomposition, which is about twice as cheap as LU decomposition.
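A small Julia illustration of this mentality (the matrix and right-hand side here are our own example):

```julia
using LinearAlgebra

A = [4.0 12 -16; 12 37 -43; -16 -43 98]   # symmetric positive definite
b = [1.0, 2.0, 3.0]

x1 = inv(A) * b                       # inversion mentality: forms A⁻¹ explicitly
x2 = cholesky(Symmetric(A)) \ b       # solves A x = b directly from the Cholesky factor
x1 ≈ x2                               # true, but x2 is cheaper and numerically preferable
```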
5.1 Multivariate normal density
The log-density of the multivariate normal \(\text{MVN}(\mathbf{0}, \Sigma)\), where \(\Sigma\) is p.d., evaluated at \(\mathbf{y} \in \mathbb{R}^n\), is \[
- \frac{n}{2} \log (2\pi) - \frac{1}{2} \log \det \Sigma - \frac{1}{2} \mathbf{y}^T \Sigma^{-1} \mathbf{y}.
\]
Method 1: (a) form \(\Sigma^{-1}\) explicitly (\(\approx 2n^3\) flops), (b) compute the quadratic form \(\mathbf{y}^T \Sigma^{-1} \mathbf{y}\) (\(2n^2\) flops), and (c) compute \(\log \det \Sigma\) via a separate factorization (\(O(n^3)\) flops).

Method 2: (a) Cholesky decomposition \(\Sigma = \mathbf{L} \mathbf{L}^T\) (\(n^3/3\) flops), (b) solve \(\mathbf{L} \mathbf{x} = \mathbf{y}\) by forward substitution (\(n^2\) flops), (c) compute the quadratic form \(\mathbf{x}^T \mathbf{x} = \mathbf{y}^T \Sigma^{-1} \mathbf{y}\) (\(2n\) flops), and (d) compute the determinant from the Cholesky factor (\(n\) flops).
Which method is better?
```julia
using LinearAlgebra

# this is a person w/o numerical analysis training
function logpdf_mvn_1(y::Vector, Σ::Matrix)
    n = length(y)
    - (n//2) * log(2π) - (1//2) * logdet(Symmetric(Σ)) - (1//2) * transpose(y) * inv(Σ) * y
end

# this is a computing-savvy person
function logpdf_mvn_2(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky(Symmetric(Σ))
    - (n//2) * log(2π) - (1//2) * logdet(Σchol) - (1//2) * abs2(norm(Σchol.L \ y))
end

# better memory efficiency - input Σ is overwritten
function logpdf_mvn_3(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky!(Symmetric(Σ)) # Σ is overwritten
    - (n//2) * log(2π) - (1//2) * logdet(Σchol) - (1//2) * dot(y, Σchol \ y)
end
```
logpdf_mvn_3 (generic function with 1 method)
```julia
using BenchmarkTools, Distributions, Random

Random.seed!(257) # seed
n = 1000

# a pd matrix
Σ = convert(Matrix{Float64}, Symmetric([i * (n - j + 1) for i in 1:n, j in 1:n]))
y = rand(MvNormal(Σ)) # one random sample from N(0, Σ)

# at least they should give the same answer
@show logpdf_mvn_1(y, Σ)
@show logpdf_mvn_2(y, Σ)
Σc = copy(Σ)
@show logpdf_mvn_3(y, Σc);
```
```
BenchmarkTools.Trial: 2279 samples with 1 evaluation.
 Range (min … max):  2.031 ms … 5.622 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.046 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.052 ms ± 123.650 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram: frequency by time, 2.03 ms – 2.09 ms]

 Memory estimate: 7.94 KiB, allocs estimate: 1.
```
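The `@benchmark` call that produced this report is not shown in this excerpt (nor which of the three methods it timed); a typical invocation with BenchmarkTools looks like the following, where `$` interpolation avoids timing global-variable lookups:

```julia
@benchmark logpdf_mvn_2($y, $Σ)

# for the mutating version, give each evaluation a fresh copy of Σ
@benchmark logpdf_mvn_3($y, Σc) setup = (Σc = copy($Σ)) evals = 1
```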
To evaluate the same multivariate normal density at many observations \(\mathbf{y}_1, \mathbf{y}_2, \ldots\), we pre-compute the Cholesky decomposition once (\(n^3/3\) flops); each subsequent evaluation then costs only \(n^2\) flops.
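A sketch of this pattern (the closure-returning helper `mvn_logpdf_factory` is our own name, not from the original):

```julia
using LinearAlgebra

# Factor Σ once (n³/3 flops) and return a closure whose evaluations
# cost one triangular solve (n² flops) each.
function mvn_logpdf_factory(Σ::Matrix)
    Σchol = cholesky(Symmetric(Σ))
    n = size(Σ, 1)
    c = - (n / 2) * log(2π) - (1 / 2) * logdet(Σchol)   # y-independent part
    y -> c - (1 / 2) * abs2(norm(Σchol.L \ y))
end

Σ = [4.0 12 -16; 12 37 -43; -16 -43 98]
logpdf_Σ = mvn_logpdf_factory(Σ)
logpdf_Σ([1.0, 0.0, -1.0])   # first observation
logpdf_Σ([0.5, 0.5, 0.5])    # second observation, no re-factorization
```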
5.2 Linear regression
Cholesky decomposition is one approach to solving the linear regression (least squares) problem \[
\widehat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.
\]
Assume \(\mathbf{X} \in \mathbb{R}^{n \times p}\) and \(\mathbf{y} \in \mathbb{R}^n\).
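A minimal Julia sketch of the Cholesky approach (form the normal equations, factor \(\mathbf{X}^T \mathbf{X}\), and solve by two triangular solves); the data here are simulated purely for illustration:

```julia
using LinearAlgebra, Random

Random.seed!(257)
n, p = 100, 5
X = randn(n, p)
y = randn(n)

# form the normal equations and solve them via Cholesky
xtx = X' * X                       # p×p, symmetric pd when X has full column rank
xty = X' * y
β̂ = cholesky(Symmetric(xtx)) \ xty

β̂ ≈ X \ y                          # agrees with the QR-based backslash solution
```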